Received 21 June 2023, accepted 10 July 2023, date of publication 21 July 2023, date of current version 28 July 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3297651

Analysis of Facial Expressions to Estimate the Level of Engagement in Online Lectures

RENJUN MIAO1, HARUKA KATO1, YASUHIRO HATORI1,2, YOSHIYUKI SATO2,3, AND SATOSHI SHIOIRI1,2,3
1 Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 980-8577, Japan
2 Research Institute of Electrical Communication, Tohoku University, Sendai, Miyagi 980-8577, Japan
3 Advanced Institute for Yotta Informatics, Tohoku University, Sendai, Miyagi 980-8577, Japan

Corresponding author: Renjun Miao ([Link].s1@[Link])


This work was supported in part by the Research Project Program of Research Center for 21st Century Information Technology (IT-21
Center), Research Institute of Electrical Communication (RIEC), Tohoku University; and in part by the Yotta Informatics Project by
Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan. The work of Satoshi Shioiri was supported by the
Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant 19H01111.

ABSTRACT The present study aimed to develop a method for estimating students' attentional state from facial expressions during online lectures. We estimated the level of attention while students watched a video lecture by measuring the reaction time (RT) to detect a target sound that was irrelevant to the lecture. We assumed that RT to such a stimulus would be longer when participants were focusing on the lecture than when they were not, and we sought to estimate how much learners focus on a lecture using this RT measurement. In the experiment, each learner's face was recorded by a video camera while they watched a video lecture. Facial features were analyzed to predict RT to the task-irrelevant stimulus, which was assumed to be an index of the level of attention. We applied a machine learning method, the Light Gradient Boosting Machine (LightGBM), to estimate RTs from facial features extracted as action units (AUs), which correspond to facial muscle movements, using open-source software (OpenFace). The model obtained with LightGBM indicated that RTs to the irrelevant stimuli can be estimated from AUs, suggesting that facial expressions are useful for predicting attentional states while watching lectures. We re-analyzed the data while excluding RT data associated with sleepy faces to test whether decreased general arousal caused by sleepiness was a significant factor in the RT lengthening observed in the experiment. The results were similar regardless of the inclusion of RTs with sleepy faces, indicating that facial expressions can be used to predict learners' level of attention to video lectures.

INDEX TERMS Attention, affective computing, engagement, facial features, online lecture.

I. INTRODUCTION

Understanding students' engagement levels while studying is important for improving learning outcomes. To improve the quality of education, it is crucial to estimate learners' level of engagement with their studies. However, it is difficult for teachers to pay attention to all students, particularly in online classes. Automated measurement of engagement levels may be helpful for improving learning conditions. For online learning, webcams can be used to capture learners' facial expressions, which can be used to estimate their mental states [1], [2], [3]. For example, Shioiri et al. conducted image preference estimation from facial expressions and found that this information was useful for estimating subjective judgments of image preference. In education-related studies, Thomas and Jayagopi recorded students' face images in a classroom while they were studying with video material on a screen and estimated the level of engagement from students' facial expressions [4], [5]. The authors succeeded in predicting engagement, suggesting the usefulness of facial expressions for estimating the level of engagement. Heart rate has also been used to estimate mental states during learning. Darnell and Krieg showed that changes in heart rate are related to students' activity during a class [6]. Although previous studies have focused on engagement, which is assessed externally, this research has also been extended to the measurement of internal states.

In these studies, the mental state used as ground truth is based on subjective judgments [4], [7]. However, mental states involve factors other than those that can be evaluated subjectively. Unconscious processes, which cannot be estimated subjectively, may play more important roles than conscious processes. Thus, it is unlikely that subjective judgments are suitable for use as indexes of mental states. For example, heart rate change is reported to be a useful index of students' activity, and is not necessarily related to the subjective estimation of attention and engagement [6]. As such, it is important to develop methods involving objective measures for estimating the level of engagement. A previous study showed that facial features could be useful for estimating reaction time (RT) for mental calculations [8]. This result suggests that RT could be a good index of attention if it varies depending on focus on the task, as typically assumed in attention studies for simple detection, discrimination, or identification of visual stimuli. However, this type of measure is not available for lectures. Therefore, we attempted to use RT for task-irrelevant stimuli.

Although engagement is a term used with different meanings in different contexts [7], [9], it is often used in relation to attention [10], [11], [12], [13]. Attention to lectures, classes, and tasks is thought to be closely related to engagement. Here we use the term attention to refer to the facilitation of sensory processing by endogenous intention or salient exogenous stimulation, and consider it to be a major factor for engagement. It should be noted that engagement has also been used to indicate mental states of a longer duration in some previous studies, such as a whole lecture [6], [7], [14], [15], [16]. We measured levels of attention as an index of engagement during lectures in this study.

We designed an experiment in which participants were asked to detect an auditory target while watching a lecture video. The primary task of the experiment was to understand the lecture, and the secondary task was to detect the target. RT to the auditory target was used as an objective measure of the attention level on the lecture videos. Here, we assumed that the time required to detect a target that was irrelevant to the primary task would be longer when the participant focused more on the primary task (i.e., watching video lectures in this experiment). Face images of participants were recorded while they watched the videos, and facial expressions were analyzed after the experiment. The purpose of the study was to estimate RT from facial expressions in order to develop a method for estimating engagement level from learners' face images.

Some of the results in this study, obtained with a smaller number of participants, were published in a post-conference book as a preliminary report [17]. Here, we report analyses of facial expressions in more detail with data from a larger number of participants, to consider the contributions of specific facial features, the effect of individual variation, and the effect of general arousal level or sleepiness.

II. EXPERIMENT

We conducted an experiment to investigate the relationship between attention level and facial expression while watching video lectures. To estimate the level of attention in video lectures, we measured RT to an auditory target that was irrelevant to the lecture. We assumed that RT to an irrelevant stimulus would be longer when participants were focusing on the lecture compared with when they were not. Brain responses to irrelevant stimuli have been suggested to be usable for estimating attention to a primary task. For example, Kramer et al. conducted electroencephalography (EEG) measurements and reported that the event-related potential (ERP) to a task-irrelevant stimulus changes with the difficulty of a primary task [18]. Similar changes were expected with RT measurements because both ERP and RT have been used to estimate attention in general [19]. In the current study, we attempted to use recorded face images to predict RT.

The auditory target we used was the disappearance of continuous white noise instead of the appearance of a sound stimulus, whereas previous experiments to measure attention have typically used a pulse stimulus [20], [21]. The reason for using the disappearance of sound was to avoid the influence of bottom-up attention to a salient stimulus, such as an auditory pulse. Bottom-up attention to a salient stimulus could be strong enough to mask the effect of attention to the lecture. Indeed, the effect of top-down attention cannot be detected when there is only one transient stimulation, whereas a target is discriminated by top-down attention among many transient stimuli [22], [23].

Fifteen participants (average age, 23.1 years) took part in the experiment. Participants had normal or corrected-to-normal vision and normal audition. Participants were instructed to watch a series of nine video lectures and to answer questions at the end of each video (Fig. 1). Participants were also instructed to press a key when they noticed the auditory target (the sudden disappearance of white noise) while watching the video lecture. Participants were instructed that the lecture was the primary task of the experiment while the detection of the target was a secondary task, and they were required to answer questions at the end of the experiment. RT to the target was measured to estimate participants' attention level at the time of the target presentation.

The learning materials were from an introductory course about a computer language, PHP, which was posted on YouTube [24]. The videos were shown on a computer display (MacBook Pro, Apple, California) with headphones (MDR-7506, Sony, Tokyo) in a room with office lighting (483.2 lx on the desk on which the computer was placed, and 211.8 lx at the location of the participant's face). The average loudness of the lecturer's voice was 70 dB, and that of the white noise was 0.66 dB.


FIGURE 1. Experimental design. While participants watched a video lecture of an introductory PHP course in a session, a white-noise signal was added to the original audio track of the lecture video. At the end of each session, after watching the video, participants answered several quiz questions about the lecture.

The white noise occasionally disappeared, and this disappearance was the target for the secondary task. The interval between two targets, which was a period of white noise presentation, was randomly selected between 25 and 35 seconds. The white noise started again immediately after the key press indicating detection, or after a period of 10 seconds if no key press had been made. Each lecture lasted between 10 and 20 minutes, depending on the content. At the end of each lecture, eight questions were provided in a Google Form, and for each question, participants selected one of four choices as their answer. Watching one lecture video constituted one session of the experiment, and there were nine lecture sessions.

In addition to the lecture sessions, there were two control sessions to measure RT for the detection task without paying attention to the lecture, so that the total number of sessions was eleven. In the control sessions, two videos from the same set of video lectures were used, so that the participants already knew the content and had little or no reason to be attracted to it. Participants were asked to focus on the white noise and were told that they did not have to pay attention to the content of the video on the display. The first control session was conducted as the 6th session with the first lecture video, and the second control session was conducted as the 11th session with the 6th lecture video, which had been used in the 7th session. The experiment was conducted over 2 days. Five lecture sessions and the first control session were performed on the first day, and the rest of the sessions (four lectures and one control session) were performed on the second day. The interval between the first and second days was within 1 week. The total duration of the experiment, 11 sessions, was approximately 130 minutes.
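The RT analysis described above reduces, for each session, to pairing each white-noise offset with the first key press that follows it within 10 seconds. A minimal sketch of that bookkeeping is shown below; it assumes hypothetical event logs (lists of target-offset times and key-press times in seconds) rather than the authors' actual acquisition software.

```python
# Sketch: compute RTs for the secondary task from hypothetical event logs.
# target_offsets: times (s) at which the white noise disappeared.
# key_presses: times (s) of participant key presses.
# Targets with no key press within 10 s are marked as misses (RT = None).
from bisect import bisect_right

def compute_rts(target_offsets, key_presses, timeout=10.0):
    key_presses = sorted(key_presses)
    rts = []
    for t in target_offsets:
        i = bisect_right(key_presses, t)          # first key press after the target
        if i < len(key_presses) and key_presses[i] - t <= timeout:
            rts.append(key_presses[i] - t)        # reaction time in seconds
        else:
            rts.append(None)                      # miss: excluded from RT analysis
    return rts

# Example with made-up numbers: targets at 30 s and 62 s, one detected after 1.2 s.
print(compute_rts([30.0, 62.0], [31.2]))          # [1.2, None]
```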


III. FACIAL FEATURE ANALYSIS

Participants' faces were recorded while they watched the lecture videos, and their facial features were analyzed after the experiment. We analyzed face images recorded in the 3 seconds before each target presentation (disappearance of white noise), using OpenFace [25] to extract the facial features. To perform facial expression analysis using OpenFace, the first step is to gather facial images or video data. From each video frame, OpenFace detects a face (multiple faces can be detected, although there was only one face in our experiment) and locates it in the frame. It then estimates facial appearance, such as face orientation, and detects facial landmarks such as the boundaries of the eyes, eyebrows, and mouth. By analyzing the position changes of the facial landmarks and the facial appearance, OpenFace evaluates the degree of facial muscle activity as action units (AUs). AUs are assigned to muscle movements related to facial expressions based on the Facial Action Coding System (FACS) [26]. For example, AU1 indicates the raising of the inner eyebrows, AU4 indicates the lowering of the eyebrows, and AU5 indicates the raising of the upper lids (Table 1).

TABLE 1. AUs used in the analysis; the meanings of the AUs are also listed.

OpenFace offers several research advantages for facial analysis. Firstly, leveraging deep learning techniques, particularly convolutional neural networks (CNNs), OpenFace achieves high accuracy in facial recognition and feature extraction tasks. This is crucial for research projects that require precise identification and comparison of facial features. Secondly, OpenFace not only enables facial recognition but also facilitates the extraction of facial features such as expressions and poses. This broadens its applications in research areas such as facial emotion recognition, facial tracking, and facial attribute analysis. Thirdly, being an open-source toolkit, OpenFace allows researchers to modify and customize it according to their specific needs. This flexibility enables adjustments and improvements tailored to individual research objectives and various application scenarios. Fourthly, OpenFace supports processing large datasets of facial images and videos. This is particularly valuable for research projects that involve handling extensive data, such as facial recognition in video surveillance systems or the establishment of facial image databases. All of these advantages are important to us, particularly when applying research achievements in practical settings. We arbitrarily chose the period between 3 sec and 0 sec before target presentation as the time window during which the effect of attention might be reflected in target detection, but using 1- or 5-second windows showed similar results (see Fig. 5).

The features of the facial expressions were extracted as AUs from the video recorded for each target presentation using OpenFace, together with the positions and angles of the head and eyes. The meanings of the AUs are shown in Table 1. Two types of AU indexes are available from OpenFace: a continuous value between 0 and 5 for 17 AUs (referred to as AUr) and a binary value of 0 or 1 (absence or presence) for 18 AUs (referred to as AUc), which are the same 17 AUs plus AU28 (lip suck). Because we collected data for a 3-sec period for each target, we used statistical features of the time-varying values: minimum, maximum, mean, standard deviation, and three percentiles (25%, 50%, and 75%) for AUr, and mean and standard deviation for AUc. The number of parameters was 155 variables in total. There may be better statistical features of sequential data than those used here; however, they were sufficient to show the usefulness of the AUs for predicting RT (as shown later). For better prediction in the future, more complex temporal features could be investigated.

To investigate the relationship of facial expressions with the RT of target detection, we attempted to predict RT from the AUs using a machine learning method called LightGBM [27]. LightGBM is a gradient boosting model, which operates quickly and exhibits relatively accurate performance in general. LightGBM is a decision tree model with gradient boosting, in which the tree nodes grow to minimize the residuals. Because training data with large residuals are used preferentially, learning proceeds efficiently. It is a powerful machine learning technique that can be used for both regression and classification tasks, and it works by combining multiple weak learners (simple decision trees) into a strong learner that is able to make accurate predictions on new data. In this study, two different methods were tested for predicting RT from AUs. One method was to train a model with the pooled data of all participants (pooled data model), and the other was to train a model with all but one participant and test with the remaining participant (across-individual test model). The latter method was used to investigate individual differences. If individual differences are small, a model built with other participants should be able to predict the RTs of the participant tested. However, individual variations may prevent the building of a general model that can be used for anyone whose data are not used to build the model.

For the evaluation of the models, a 15-fold cross-validation method was used. For the pooled data model, all data were divided randomly into 15 groups; 14 of them were used for training and the remaining group was used for testing. The process was repeated 15 times, one test for each group, and the average was used as the model performance. For the across-individual test model, data for 14 of the 15 participants were used for training, and data for the remaining participant were used for testing. The process was repeated 15 times, with one test for each participant. The average of the 15 test scores was used as the model performance. Prediction performance was assessed by the root mean square error (RMSE) of the prediction against the data and by the Pearson's correlation coefficient between the data and the prediction (Fig. 2).
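To make the feature construction concrete, the sketch below shows how the per-target summary statistics described above could be computed from an OpenFace output CSV. It is only an illustration under assumptions: the file and column names (e.g., AU01_r, AU45_c, a timestamp column) follow OpenFace's usual output format, which may differ slightly across versions, and the 3-second window is located using a hypothetical list of target times.

```python
# Sketch: summary statistics of AU time series in the 3 s before each target.
# Assumes a per-session CSV produced by OpenFace's FeatureExtraction tool,
# with a "timestamp" column and AU columns such as "AU01_r" ... "AU45_c".
import pandas as pd

def window_features(csv_path, target_times, window=3.0):
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]           # OpenFace pads column names
    aur_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
    auc_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_c")]
    rows = []
    for t in target_times:
        win = df[(df["timestamp"] >= t - window) & (df["timestamp"] < t)]
        feats = {}
        for c in aur_cols:                                  # intensity AUs: 7 statistics each
            s = win[c]
            feats.update({f"{c}_min": s.min(), f"{c}_max": s.max(),
                          f"{c}_mean": s.mean(), f"{c}_std": s.std(),
                          f"{c}_p25": s.quantile(0.25), f"{c}_p50": s.quantile(0.50),
                          f"{c}_p75": s.quantile(0.75)})
        for c in auc_cols:                                  # binary AUs: mean and std only
            feats.update({f"{c}_mean": win[c].mean(), f"{c}_std": win[c].std()})
        rows.append(feats)
    return pd.DataFrame(rows)                               # 17*7 + 18*2 = 155 columns
```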


FIGURE 2. The framework of the analysis. (a) Video recording of participants' faces while watching online lectures, and recording of the reaction time of target detection, measured as the time from the target presentation (disappearance of the white noise) to the key press indicating detection. (b) Orientation information at each location (facial appearance) and landmarks on the face, such as the eyes, nose, mouth, and chin, are detected for all video frames throughout each session by OpenFace. Facial appearance and landmarks are used to obtain AUs based on the Facial Action Coding System (FACS). Head pose and eye gaze are also detected; head pose is an important factor for analyzing face images in a normalized fashion. (c) OpenFace extracts action units (AUs) from facial landmarks and appearance for each frame. (d) We used several statistical measures of the sequential AU values from a time window (3 s for the main analysis; 1-s and 5-s windows were also used) before each target presentation. The statistical measures used were the average, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles for the intensity indexes; only the average and standard deviation were used for the binary indexes. (e) The statistical measures from all AUs were used to predict reaction time using a machine learning method, LightGBM. LightGBM constructs a tree-type model with leaf-wise tree growth, choosing the leaf with the maximum delta loss to grow. (f) We compared predicted RTs with measured RTs and computed their correlation. A higher correlation indicates that the LightGBM model can predict RTs to the task-irrelevant stimulus well, so that the model can predict the attention level at the time of target presentation, under the assumption that higher attention to the lecture makes RT to the task-irrelevant target longer. We also analyzed the strength of contribution of each feature using a method called Shapley additive explanations (SHAP). SHAP shows the relationship between contribution values (the strength of each feature's contribution to the prediction) and each feature index.

IV. RESULTS

Target presentations without responses within 10 sec were excluded from the reaction time (RT) analysis. Such target presentations occurred on 5.5% of trials on average across all participants. The average RT over all sessions of all participants was 1.1 sec, with a standard deviation of 2.3 sec. Because the average RT varied among participants, we normalized RT as Z-scores after taking the logarithm. We took the logarithm of RT to minimize the effects of the asymmetrical distribution (usually a heavy tail for longer RTs). We also used normalized values of AUs, by Z-scoring, to avoid the effects of individual variations of facial features. We expected that variations of AUs after normalization were related to changes in mental processes, whereas the absolute AU values include facial differences among individuals. We then applied LightGBM to model the relationship between RT and facial expressions, and tested the model using a 15-fold cross-validation method. Fig. 3a shows the prediction results of the pooled data model. The horizontal axis shows RT measured in the experiment and the vertical axis shows the prediction from LightGBM. Each point represents one target presentation from all sessions of all participants, and different colors indicate different training-test combinations (15 different combinations with different colors). The RMSE of the data deviation from the predictions (or the deviation of the predictions from the data) was 0.75. By definition, the average of the RT data is zero, with a unit standard deviation after Z-scoring. Thus, the RMSE of the model prediction (0.75, which is smaller than 1) indicates that the model can at least partially explain the data variation (25% in this case). The Pearson's correlation coefficient between data and prediction was 0.66, and a statistical test showed that the correlation was statistically significant (p < 0.001, t(2412) = 11); that is, the null hypothesis that the Pearson's correlation coefficient does not differ from zero was rejected at the 5% level.
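A minimal sketch of the pooled-data evaluation described above is given below. It assumes the 155-column feature matrix and the normalized (log-transformed, Z-scored) RTs are already available as NumPy arrays X and y; the hyperparameters are illustrative, not the authors' settings.

```python
# Sketch: 15-fold cross-validation of a LightGBM regressor on pooled data,
# scored by RMSE and Pearson's correlation between measured and predicted RTs.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold
from scipy.stats import pearsonr, ttest_1samp

def evaluate_pooled(X, y, n_splits=15, seed=0):
    rmses, y_true_all, y_pred_all = [], [], []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = LGBMRegressor(n_estimators=300, learning_rate=0.05)  # illustrative settings
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmses.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
        y_true_all.append(y[test_idx]); y_pred_all.append(pred)
    y_true = np.concatenate(y_true_all); y_pred = np.concatenate(y_pred_all)
    r, p_r = pearsonr(y_true, y_pred)
    # Is the per-fold RMSE reliably below 1 (the SD of the Z-scored RTs)?
    t, p_t = ttest_1samp(np.array(rmses) - 1.0, 0.0)
    return np.mean(rmses), r, p_r, t, p_t
```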


FIGURE 3. (a) Correlation between the measured reaction time (RT) and the RT predicted by the model. Each point represents the RT of one target presentation from all sessions of all participants. Different colors indicate different training-test combinations (15 combinations). (b) Indexes arranged according to their level of contribution to the prediction, obtained using the Shapley additive explanations (SHAP) method. Each point corresponds to one RT, as in the correlation plot of (a), and the color (red or blue) indicates a high or low value of the feature index. The horizontal axis indicates the level of contribution to the model's RT prediction. (c) The absolute value corresponding to the contribution of each index to the prediction, estimated by SHAP.

In addition to the statistical significance of the correlation coefficient, we used a statistical test of the RMSE to show that our prediction is better than chance. We compared the RMSE of the model prediction with that of the data, which is one after Z-scoring, using a t-test (p < 0.001, t(14) = 16.62). The present analysis successfully predicted RT to task-irrelevant targets, which we assumed to vary depending on attention states. This prediction of RT, in turn, predicts the attention state during learning in the few seconds before the target presentation. We conclude that facial features and movements of the head and eyes contain information about attention.

Further analysis revealed the level of contribution of each index to the prediction (i.e., the importance of each index for the prediction) using a method called Shapley additive explanations (SHAP) [28]. SHAP provides a value that corresponds to the contribution of each input feature to the prediction (Fig. 3c). AU9 (nose wrinkler), AU45 (blink), AU15 (lip corner depressor), and AU7 (lid tightener) were among the best five contributors among all AUs. The analysis also provides the degree of contribution of each input feature for each individual target-detection event, as shown by the dots in Fig. 3(b). Red dots indicate high values of the facial feature indexes and blue dots indicate low values. The distribution patterns of the red and blue data points show, for example, that AU9 contributes negatively to RT: higher values (red dots) were distributed toward the negative direction of the horizontal axis, indicating that shorter RT was associated with more nose wrinkling, which, in turn, suggests that less attention was paid to the lecture when more nose wrinkling was exhibited. We will discuss the effect of these AUs in more detail in the Discussion section.
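The SHAP-based contribution analysis described above can be reproduced in outline as follows; this is a sketch assuming a LightGBM model already trained on the feature matrix from the earlier steps, and the feature_names list is hypothetical.

```python
# Sketch: per-feature contribution analysis with SHAP for a trained LightGBM model.
import numpy as np
import shap

def shap_contributions(model, X, feature_names):
    explainer = shap.TreeExplainer(model)          # efficient for tree ensembles
    shap_values = explainer.shap_values(X)         # one contribution per feature and sample
    mean_abs = np.abs(shap_values).mean(axis=0)    # global importance, as in Fig. 3(c)
    ranking = sorted(zip(feature_names, mean_abs), key=lambda x: -x[1])
    shap.summary_plot(shap_values, X, feature_names=feature_names)  # beeswarm, as in Fig. 3(b)
    return ranking
```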


FIGURE 4. Comparison of four different models: Support Vector Regression (SVR), Multilayer Perceptron (MLP), Linear Regression, and LightGBM.
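Figure 4 compares LightGBM with three baseline regressors. The comparison reported in the text below could be sketched as in the following snippet, where the models and their near-default settings are illustrative assumptions rather than the authors' exact configurations.

```python
# Sketch: compare LightGBM with SVR, MLP, and linear regression by cross-validated RMSE.
from lightgbm import LGBMRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

models = {
    "LightGBM": LGBMRegressor(),
    "SVR": SVR(),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000),
    "Linear Regression": LinearRegression(),
}

def compare_models(X, y, cv=15):
    for name, model in models.items():
        # "neg_root_mean_squared_error" is negated, so flip the sign to get RMSE per fold.
        scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
        print(f"{name}: mean RMSE = {scores.mean():.3f} (+/- {scores.std():.3f})")
```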

We performed control sessions to confirm that watching a lecture video influences RT to the auditory target. In the control condition, participants were asked to detect the target without paying any attention to the video lecture. RTs in this condition are considered to reflect full attention to the auditory target. The average RT for the two control sessions across all participants was 0.7 sec. This RT is clearly shorter than the average RT in the lecture sessions, which was 1.1 sec, and the Pearson's correlation coefficient between the experiment and control sessions was statistically significant (p < 0.05, t(14) = 2.74), indicating that RT to the target was an appropriate measure of attention to the lectures. We attempted to predict the RTs of the control conditions with the same procedure used for the lecture sessions. The results revealed that the RMSE of the predictions was 0.89, and the Pearson's correlation coefficient between data and prediction was 0.45, which was not statistically significant (p = 0.092, t(517) = 3.1).

There are three issues to examine before accepting the results: first, whether the results depend on the choice of machine learning method; second, whether they depend on the selection of the time window; and third, whether they depend on individual variations. First, we used three different models other than LightGBM as a comparison: Support Vector Regression (SVR), Multilayer Perceptron (MLP), and Linear Regression. The results showed that the accuracy of LightGBM is similar to that of SVR, which is better than MLP and Linear Regression (Fig. 4), and that the time required for the analysis was the shortest for LightGBM among the four methods.

Second, there is no theoretical reason to select a particular period of time for facial feature extraction to estimate RTs. We used 1- and 5-second windows in addition to the 3-second window to see the effect of window length on the analysis. The results are similar for the three cases (Fig. 5). A t-test of the RMSE of the model prediction against one showed statistical significance for both the 1- and 5-second windows (p < 0.001, t(14) = 15.32 for 1 s and p < 0.001, t(14) = 18.08 for 5 s).

Third, we tested whether a model built with other individuals' data (across-individual models) can predict the data of another individual. Figure 6 shows the results of the predictions. Surprisingly, the results revealed no successful prediction across participants. Thus, a model that was based on a group of individuals could not be used to predict the attention level of an individual outside the group. The face information related to attention appeared to vary from participant to participant.

V. DISCUSSION

In the current study, we measured RT to task-irrelevant targets as an index of attentional level. With the RTs, we developed a method for predicting engagement with video lectures using a machine learning technique. Our approach was to predict the response time under the assumption that the response time would become longer when more attention was paid to the lecture, reducing attention to a target that was irrelevant to the lecture. The model built for the prediction provided information about the facial features that contributed most to the prediction, which were as follows: AU9 (nose wrinkler), AU45 (blink), AU15 (lip corner depressor), and AU7 (lid tightener). Here, we discuss possible explanations for the importance of these factors in predicting RTs. AU9 was negatively related to RT. Longer RT was associated with more attention to the lecture, suggesting that AU9 was negatively related to the amount of attention paid to the lecture: increased nose wrinkling was associated with deviation of attention from the lecture. On the other hand, the results suggest that AU45 (blink), AU15 (lip corner depressor), and AU7 (lid tightener) were positively related to the level of attention paid to the lecture. Thus, more depression of the lip corners, more frequent blinking, and more tightening of the eyelids are expected when a person pays more attention to lectures. Lip corner depression may be related to situations in which a learner has difficulty understanding the lecture; this may lead the learner to try to attend more to the lecture and to exhibit a serious facial expression. Tightening the eyelids and blinking are similar facial actions, and both may be related to making an effort to understand the content of the lecture by opening the eyes wider. However, more blinking and eyelid tightening may also be related to sleepiness. When a person is sleepy, they would be likely to attend neither to the lecture nor to any task-irrelevant stimuli, which would result in longer RT to the target even without a high level of attention being paid to the lecture. Although the present experiment was designed assuming only two attention states, attending to the lecture or to the task-irrelevant target, attention level could potentially be reduced by sleepiness, resulting in longer RTs for the target with decreased attention to the lecture. We attempted to estimate the effect of sleepiness during lectures and re-analyzed the data.

FIGURE 5. We applied 1-, 3-, and 5-second time windows to examine the effect of window length on the facial feature analysis.

FIGURE 6. Correlation between measured and predicted RTs, using the across-individual test model. Configurations are the same as in Fig. 3(a).

To exclude the possible influence of sleepiness on the results, we re-analyzed the data after removing data with sleepy faces. To identify times at which a participant appeared to be sleepy, we used eye movement data and subjective evaluation of sleepiness in the videos. A previous study reported that the eyes become stationary when sleepy [29]. We attempted to detect when learners were sleepy using the gaze data. We calculated the standard deviation of the gaze positions, obtained through the OpenFace analysis, for the 3 sec before each target presentation. The histogram in Fig. 7(a) shows the distribution of the standard deviation of gaze locations, which reflects eye movement activity. The horizontal axis shows the visual angle in radians on a logarithmic scale, which shows data with small values clearly. The distribution can be described as a single-peaked distribution of standard deviation values with a peak at approximately -0.75 in log deg; however, there appeared to be an additional peak at very small values, at approximately -1.34 in log deg. The eye movements for the video images that were judged subjectively as sleepy exhibited a standard deviation of less than -1.13 log deg. Thus, we defined video faces with a standard deviation of gaze location smaller than -1.13 log deg as faces that reflected sleepiness. Note that this analysis is not based on accurate eye movement measurements, but on rough estimation by image processing using OpenFace, for which we estimated that the spatial resolution was higher than 2 radians. Despite the low precision of this method, gaze stability could be evaluated on the basis of the distribution shown in Fig. 7.

We re-analyzed the data after removing the data associated with sleepiness, using a threshold of the standard deviation of gaze location lower than -1.13 log deg. The results without sleepy faces revealed that the RMSE of the predictions was 0.77 (see Fig. 7b), which was smaller than the baseline RMSE of 1.0. The Pearson's correlation coefficient between data and prediction was 0.67, and the correlation was statistically significant (p < 0.001, t(2298) = 11). We also used a statistical test of the RMSE to show that our prediction is better than chance; we compared the RMSE of the model prediction with that of the data, which is one after Z-scoring, using a t-test (p < 0.001, t(14) = 15.07). These results confirm that facial expressions can be used to predict attention states while watching a lecture. Figure 7(c) shows the contribution level of each AU to the prediction by SHAP. Similar to the original analysis (Fig. 3), AU9 (nose wrinkler) was found to be the largest contributor, and AU45 (blink) and AU15 (lip corner depressor) were the second and third largest contributors, respectively. However, AU7 (lid tightener), which was in the top five contributors in the original analysis, was no longer included in the top five. These results indicate that wrinkling the nose, blinking, and depressing the lip corners are major factors in predicting attention to lectures.
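A sketch of the gaze-stability screening described above is shown below. It assumes the OpenFace output provides gaze direction columns (e.g., gaze_angle_x and gaze_angle_y, in radians), reuses a hypothetical per-target window extraction, and glosses over unit conversion (radians vs. degrees) and the exact way the two gaze angles are pooled; the -1.13 log-unit threshold is the one reported in the text.

```python
# Sketch: flag targets preceded by unusually stable gaze (possible sleepiness).
# Assumes an OpenFace CSV with "timestamp", "gaze_angle_x", "gaze_angle_y" columns.
import numpy as np
import pandas as pd

LOG_SD_THRESHOLD = -1.13   # threshold on the log of the gaze-position SD (from the text)

def sleepy_flags(csv_path, target_times, window=3.0):
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]
    flags = []
    for t in target_times:
        win = df[(df["timestamp"] >= t - window) & (df["timestamp"] < t)]
        # One possible way to pool the horizontal and vertical gaze variability.
        sd = np.sqrt(win["gaze_angle_x"].var() + win["gaze_angle_y"].var())
        flags.append(np.log10(sd) < LOG_SD_THRESHOLD)   # True -> treated as sleepy
    return flags
```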


FIGURE 7. (a) Histogram of the standard deviation of gaze locations (gaze SD). The gaze SDs before target presentations with subjectively judged sleepy faces were smaller than the red line, and we assumed that RTs with gaze SDs larger than the red line were not influenced by sleepiness. (b) Correlation between measured and predicted RTs for the data without the influence of sleepiness. Configurations are the same as those in Fig. 3(a). (c) Indexes arranged according to their level of contribution to the prediction, obtained using SHAP. Configurations are the same as those in Fig. 3(b).

Individual differences in the relationship between the internal concentration state and facial expression, which have not been captured in previous studies that used subjective ratings [14], [15], [16], were found in the present study. We consider several possible reasons for these results. One possibility is that individual identification affected the findings. Because AUs themselves may contain information about the facial features of individual participants, the AU analysis might identify individuals. If there is substantial individual variation in RTs in the present experiment, identification of individuals by facial features could potentially predict RT results with some level of accuracy, because there is a correlation between facial features and RT for individuals. However, because we used normalized values of RTs and AUs for each participant, the averages of each parameter did not exhibit any correlations among the parameters. In other words, the individual differences we found could be explained by individual differences in the contributions of facial features to RT estimation.

To investigate the effects of individual variability, we first conducted the same analyses for data from each participant. Because the amount of data for each participant is relatively small, we performed a 5-fold (instead of 15-fold) cross-validation analysis on each participant's own data. The average RMSE of the prediction against the data for all participants was 1.01, very close to the baseline (Fig. 8c). The RMSE is as poor as that of the across-individual models (shown in Fig. 6), likely because of the small amount of data used for each model even with 5-fold cross-validation. We then examined the effect of the size of the data set on prediction accuracy, and found that approximately 20% of all data were required to obtain a training effect with an RMSE of about 0.8 (Fig. 8f). To keep the proportion of the data set larger than 20%, we compared the predictions within and across participants using data sets of three or five groups of participants, instead of datasets of individual participants. Better predictions in the within-group analysis than in the across-group analysis were expected if there were large individual variations in the feature expressions related to engagement with the lecture. In the case of the three-group division, two of the three groups were used for training and the third group was used for testing in the across-group analysis, while four groups were used for training with the fifth one as a test in the case of the five-group division. For the within-group analysis, data were divided into three or five sets, selecting an equal number of data points from each group (in the case of three groups, each data set had one third of the first group, one third of the second group, and one third of the third group). These three or five datasets were used for three- or five-fold validation testing, as sketched in the code below. Figure 8 shows the results of both within- and across-group analyses for the three and five groups, in addition to the averaged individual predictions.
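The within-group versus across-group comparison just described can be sketched as follows, assuming a per-sample array of group labels; GroupKFold keeps whole groups out of training for the across-group case, while an ordinary shuffled KFold approximates the within-group splits that mix all groups in every fold.

```python
# Sketch: across-group vs. within-group cross-validation with LightGBM.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import GroupKFold, KFold

def mean_rmse(X, y, splits):
    scores = []
    for tr, te in splits:
        model = LGBMRegressor().fit(X[tr], y[tr])
        pred = model.predict(X[te])
        scores.append(np.sqrt(np.mean((y[te] - pred) ** 2)))
    return np.mean(scores)

def compare_group_analyses(X, y, groups, n_groups=3):
    # Across-group: each fold tests on one whole group never seen in training.
    across = GroupKFold(n_splits=n_groups).split(X, y, groups)
    # Within-group: folds are random mixtures containing data from every group.
    within = KFold(n_splits=n_groups, shuffle=True, random_state=0).split(X)
    return {"across": mean_rmse(X, y, across), "within": mean_rmse(X, y, within)}
```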


FIGURE 8. (a) Results of the within-group analysis with three groups, (b) the within-group analysis with five groups, (c) each individual, (d) the across-group analysis with three groups, and (e) the across-group analysis with five groups. (f) Prediction performance as a function of data-set size.

The prediction accuracy was better for the within-group analyses than for the across-group analyses. The RMSE values were 0.86 and 0.78 for the three- and five-group within-group analyses, respectively, and 1.10 and 1.08 for the three- and five-group across-group analyses, respectively. A t-test of the RMSE of the model prediction against one showed statistical significance for both the 3-group and 5-group cases (p < 0.001, t(14) = 7.11 for 3 groups and p < 0.001, t(14) = 11.19 for 5 groups). These results indicate nontrivial individual variations in the relationship between facial expressions and engagement. These variations do not mean that there is no common factor shared by some individuals, because pooling data from many participants was shown to improve the prediction (compare Figs. 7 and 8). The SHAP values for the three- and five-group analyses showed that AU9 and AU2 were among the best five features in both groupings; AU9 was also among the top features in the original analysis with all data. This result suggests that these features are important for all individuals, while other features that differ substantially across individuals could impair the across-participant predictions. Although individual variation undoubtedly limits the use of the model, it is possible to construct a model for a group of individuals with similar properties.

The results suggest that individual variation is substantial, and it appears to be a disadvantage in general when the present technique is applied to a supporting system using a model trained with different individuals. However, the model can be customized to each individual, and models constructed for particular individuals may be more precise. Although individual variation should be investigated further to understand the essential factors, the technique developed here can be used for applications in actual educational settings.

Although psychophysical studies have used sound stimuli as a probe to measure attention level [30], [31], [32], such an approach is not practical in actual learning situations. Therefore, we investigated whether facial images are sufficient to provide indexes of attention level. The model performance depends on the OpenFace performance. Although Baltrusaitis et al. [33] reported that the accuracy of OpenFace is better than that of other methods, its performance is obviously not perfect, and it depends on the recording conditions of the faces. Our estimation of RT from AUs, therefore, includes estimation errors of the facial features to a certain degree. We nevertheless believe that this analysis is useful for obtaining information about a learner's condition (mental state) at each moment in order to provide appropriate feedback. For example, 70% correct detection of reduced attention to a lecture should be useful for providing a warning signal to the learners and/or the lecturer. Three erroneous warnings out of ten should not be a problem if the warning signal does not disturb the class much. Also, the detection rate becomes higher than 99% if there are more than five learners who lose attention to the class, even if it is 70% for a single learner.
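The detection-rate claim above is simple arithmetic: if a single inattentive learner is detected with probability 0.7, the chance that at least one of five independent inattentive learners is detected is 1 - 0.3^5. A one-line check:

```python
# Probability that at least one of five inattentive learners is detected (70% each).
print(1 - 0.3 ** 5)   # 0.99757
```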


VI. CONCLUSION

In conclusion, we revealed that facial expressions can be used to predict learners' level of attention to video lectures, which serves as an index of student engagement. Facial features captured by a video camera can predict reaction times (RTs), which are assumed to be indicative of attentional states. Specific facial features, such as nose wrinkling, blinking, and lip corner depression, appear to be associated with attention during video lectures. The application of facial expression technology has the potential to enhance the quality of teaching. However, before implementing it in actual teaching conditions, a few considerations should be taken into account. Firstly, the underlying mechanisms behind the contributions of these features are not yet understood, which is essential for generalization. Secondly, significant individual differences have been observed; customizing the model may be one possible solution. In future research, we will focus on exploring individual differences and the physiological relationship between engagement and facial expressions during learning.

REFERENCES
[1] S. Shioiri, Y. Sato, Y. Horaguchi, H. Muraoka, and M. Nihei, "Quali-informatics in the society with Yotta scale data," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–4.
[2] Y. Sato, Y. Horaguchi, L. Vanel, and S. Shioiri, "Prediction of image preferences from spontaneous facial expressions," Interdiscipl. Inf. Sci., vol. 28, no. 1, pp. 45–53, 2022.
[3] Y. Horaguchi, Y. Sato, and S. Shioiri, "Estimation of preferences to images by facial expression analysis," IEICE Tech. Rep., vol. 120, no. 306, pp. 71–76, 2020.
[4] C. Thomas and D. B. Jayagopi, "Predicting student engagement in classrooms using facial behavioral cues," in Proc. 1st ACM SIGCHI Int. Workshop Multimodal Interact. Educ., Glasgow, U.K., Nov. 2017, pp. 33–40.
[5] N. K. Mehta, S. S. Prasad, S. Saurav, R. Saini, and S. Singh, "Three-dimensional DenseNet self-attention neural network for automatic detection of student's engagement," Appl. Intell., vol. 52, no. 12, pp. 13803–13823, 2022, doi: 10.1007/s10489-022-03200-4.
[6] D. K. Darnell and P. A. Krieg, "Student engagement, assessed using heart rate, shows no reset following active learning sessions in lectures," PLoS ONE, vol. 14, no. 12, Dec. 2019, Art. no. e0225709, doi: 10.1371/[Link].0225709.
[7] D. M. Bunce, E. A. Flens, and K. Y. Neiles, "How long can students pay attention in class? A study of student attention decline using clickers," J. Chem. Educ., vol. 87, no. 12, pp. 1438–1443, Dec. 2010, doi: 10.1021/ed100409p.
[8] H. Kato, K. Takahashi, Y. Hatori, Y. Sato, and S. Shioiri, "Prediction of engagement from temporal changes in facial expression," in Proc. World Conf. Comput. Educ., Hiroshima, Japan, Aug. 2022.
[9] H. L. O'Brien and E. G. Toms, "The development and evaluation of a survey to measure user engagement," J. Amer. Soc. Inf. Sci. Technol., vol. 61, no. 1, pp. 50–69, Jan. 2010.
[10] A. M. Leiker, A. T. Bruzi, M. W. Miller, M. Nelson, R. Wegman, and K. R. Lohse, "The effects of autonomous difficulty selection on engagement, motivation, and learning in a motion-controlled video game task," Hum. Movement Sci., vol. 49, pp. 326–335, Oct. 2016, doi: 10.1016/[Link].2016.08.005.
[11] L. S. Pagani, C. Fitzpatrick, and S. Parent, "Relating kindergarten attention to subsequent developmental pathways of classroom engagement in elementary school," J. Abnormal Child Psychol., vol. 40, no. 5, pp. 715–725, Jul. 2012, doi: 10.1007/s10802-011-9605-4.
[12] M. N. Nguyen, S. Watanabe-Galloway, J. L. Hill, M. Siahpush, M. K. Tibbits, and C. Wichman, "Ecological model of school engagement and attention-deficit/hyperactivity disorder in school-aged children," Eur. Child Adolescent Psychiatry, vol. 28, no. 6, pp. 795–805, Jun. 2019, doi: 10.1007/s00787-018-1248-3.
[13] M. Kinnealey, B. Pfeiffer, J. Miller, C. Roan, R. Shoener, and M. L. Ellner, "Effect of classroom modification on attention and engagement of students with autism or dyspraxia," Amer. J. Occupational Therapy, vol. 66, no. 5, pp. 511–519, 2012, doi: 10.5014/ajot.2012.004010.
[14] Ö. Sümer, P. Goldberg, S. D'Mello, P. Gerjets, U. Trautwein, and E. Kasneci, "Multimodal engagement analysis from facial videos in the classroom," IEEE Trans. Affect. Comput., vol. 14, no. 2, pp. 1012–1027, Apr./Jun. 2023, doi: 10.1109/TAFFC.2021.3127692.
[15] H. Monkaresi, N. Bosch, R. A. Calvo, and S. K. D'Mello, "Automated detection of engagement using video-based estimation of facial expressions and heart rate," IEEE Trans. Affect. Comput., vol. 8, no. 1, pp. 15–28, Jan. 2017, doi: 10.1109/TAFFC.2016.2515084.
[16] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster, and J. R. Movellan, "The faces of engagement: Automatic recognition of student engagement from facial expressions," IEEE Trans. Affect. Comput., vol. 5, no. 1, pp. 86–98, Jan. 2014, doi: 10.1109/TAFFC.2014.2316163.
[17] R. Miao, H. Kato, Y. Hatori, Y. Sato, and S. Shioiri, "Analysis of facial expressions for the estimation of concentration on online lectures," in Proc. World Conf. Comput. Educ., Hiroshima, Japan, Aug. 2022.
[18] A. F. Kramer, L. J. Trejo, and D. Humphrey, "Assessment of mental workload with task-irrelevant auditory probes," Biol. Psychol., vol. 40, nos. 1–2, pp. 83–100, May 1995.
[19] A. Pfefferbaum, J. M. Ford, W. T. Roth, and B. S. Kopell, "Age differences in P3-reaction time associations," Electroencephalogr. Clin. Neurophysiol., vol. 49, pp. 257–265, Aug. 1980, doi: 10.1016/0013-4694(80)90220-5.
[20] M. I. Posner, "Orienting of attention," Quart. J. Exp. Psychol., vol. 32, pp. 3–25, Feb. 1980.
[21] S. A. Hillyard, R. F. Hink, V. L. Schwent, and T. W. Picton, "Electrical signs of selective attention in the human brain," Science, vol. 182, no. 4108, pp. 177–180, Oct. 1973, doi: 10.1126/science.182.4108.177.
[22] S. Shioiri, M. Ogawa, H. Yaguchi, and P. Cavanagh, "Attentional facilitation of detection of flicker on moving objects," J. Vis., vol. 15, no. 14, p. 3, Oct. 2015, doi: 10.1167/15.14.3.
[23] S. Shioiri, H. Honjyo, Y. Kashiwase, K. Matsumiya, and I. Kuriki, "Visual attention spreads broadly but selects information locally," Sci. Rep., vol. 6, no. 1, p. 35513, Oct. 2016, doi: 10.1038/srep35513.
[24] (2023). @Fuku-Programming. Accessed: Feb. 14, 2023. [Online]. Available: [Link]
[25] T. Baltrušaitis, P. Robinson, and L.-P. Morency, "OpenFace: An open source facial behavior analysis toolkit," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2016, pp. 1–10.
[26] P. Ekman and W. V. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement. San Francisco, CA, USA: Consulting Psychologists Press, 1978.
[27] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: A highly efficient gradient boosting decision tree," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1–9.
[28] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1–10.
[29] T. Abe, T. Nonomura, Y. Komada, S. Asaoka, T. Sasai, A. Ueno, and Y. Inoue, "Detecting deteriorated vigilance using percentage of eyelid closure time during behavioral maintenance of wakefulness tests," Int. J. Psychophysiol., vol. 82, no. 3, pp. 269–274, Dec. 2011, doi: 10.1016/[Link].2011.09.012.
[30] M. A. Bedard, F. El Massioui, B. Pillon, and J. L. Nandrino, "Time for reorienting of attention: A premotor hypothesis of the underlying mechanism," Neuropsychologia, vol. 31, no. 3, pp. 241–249, Mar. 1993, doi: 10.1016/0028-3932(93)90088-h.
[31] G. Rhodes, "Auditory attention and the representation of spatial information," Perception Psychophys., vol. 42, no. 1, pp. 1–14, Jan. 1987, doi: 10.3758/bf03211508.
[32] J. R. Simon, E. Acosta, and S. P. Mewaldt, "Effect of locus of warning tone on auditory choice reaction time," Memory Cognition, vol. 3, no. 2, pp. 70–167, Mar. 1975, doi: 10.3758/BF03212893.
[33] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, "OpenFace 2.0: Facial behavior analysis toolkit," in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 59–66, doi: 10.1109/FG.2018.00019.


RENJUN MIAO was born in Wenzhou, Zhejiang, in 1986. He received the bachelor's degree in mechanical automation engineering from the Zhejiang University City College, in 2008, and the master's degree in information engineering from Tohoku University, Japan, in 2012, where he is currently pursuing the Ph.D. degree, with a focus on affective computing, mainly on detecting the quality of students' online education through changes in facial expressions. From 2010 to 2012, his main research focus was on signal processing of color and shape in brain visual neurology. He has been an Engineer since graduation and an Education SaaS Development Supervisor since 2017.

HARUKA KATO received the B.S. degree in engineering and the M.S. degree in information engineering from Tohoku University, in 2021 and 2023, respectively. From 2021 to 2023, her main research was affective computing, mainly on detecting the engagement of students while studying through changes in electroencephalography and facial expressions. Her research interests include engagement, attention while studying, and facial expressions.

YASUHIRO HATORI received the B.S. degree in information engineering and the M.S. and Ph.D. degrees in engineering from the University of Tsukuba, in 2007, 2009, and 2014, respectively. From 2014 to 2016, he was a Postdoctoral Fellow with the Research Institute of Electrical Communication, Tohoku University. From 2016 to 2018, he was a Postdoctoral Fellow with the National Institute of Advanced Science and Technology. Since 2018, he has been an Assistant Professor with Tohoku University. His research interests include eye movement, visual attention, and multisensory integration.

YOSHIYUKI SATO received the B.S. degree from Kyoto University, in 2004, and the M.S. and Ph.D. degrees from The University of Tokyo, Japan, in 2006 and 2009, respectively. From 2010 to 2016, he was an Assistant Professor with the University of Electro-Communications, Japan. From 2012 to 2013, he was a Visiting Professor with Northwestern University, USA. From 2016 to 2018, he was a Project Researcher with The University of Tokyo. Since 2018, he has been a specially appointed Assistant Professor with Tohoku University, Japan. His research interests include mathematical and machine learning modeling of human behaviors, including perception, cognition, attention, motor functions, and communications.

SATOSHI SHIOIRI received the B.S. degree in mechanical engineering and the M.S. and Ph.D. degrees in physical information engineering from the Tokyo Institute of Technology, in 1981, 1983, and 1986, respectively. From 1986 to 1989, he was a Postdoctoral Fellow with the University of Montreal. From 1989 to 1990, he was a Postdoctoral Fellow with the Advanced Telecommunications Research Institute International, Kyoto. He was an Assistant Professor, an Associate Professor, and a Professor with Chiba University, from 1990 to 2004. He has been a Professor with Tohoku University, since 2004. His research interests include motion perception, depth perception, color perception, visual attention, eye movement, and vision for action.
