ABSTRACT The present study aimed to develop a method for estimating students’ attentional state from
facial expressions during online lectures. We estimated the level of attention while students watched a
video lecture by measuring reaction time (RT) to detect a target sound that was irrelevant to the lecture.
We assumed that RT to such a stimulus would be longer when participants were focusing on the lecture
compared with when they were not. We sought to estimate how much learners focus on a lecture using
RT measurement. In the experiment, the learner’s face was recorded by a video camera while watching a
video lecture. Facial features were analyzed to predict RT to a task-irrelevant stimulus, which was assumed
to be an index of the level of attention. We applied a machine learning method, the Light Gradient Boosting Machine (LightGBM), to estimate RTs from facial features extracted as action units (AUs), which correspond to facial muscle movements, by the open-source software OpenFace. The model obtained using LightGBM
indicated that RTs to the irrelevant stimuli can be estimated from AUs, suggesting that facial expressions are
useful for predicting attentional states while watching lectures. We re-analyzed the data while excluding RT data from trials in which students showed sleepy faces, to test whether decreased general arousal caused by sleepiness was a significant factor in the RT lengthening observed in the experiment. The results were similar regardless of the inclusion of RTs with sleepy faces, indicating that facial expressions can be used to predict learners' level
of attention to video lectures.
INDEX TERMS Attention, affective computing, engagement, facial features, online lecture.
FIGURE 1. Experimental design. While participants watched a video lecture from an introductory PHP course in each session, auditory signals of white noise were added to the original audio track of the lecture video. At the end of each session, after watching the video, participants answered several quiz questions about the lecture.
targets, which was a period of white noise presentation, was randomly selected between 25 and 35 seconds. The white noise started again immediately after the key press to indicate detection, or after a period of 10 seconds if no key press had been performed. Each lecture lasted between 10 and 20 minutes, depending on the content. At the end of each lecture, eight questions were provided in a Google Form. For each question, participants selected one of four choices as their answer. Watching one lecture video constituted one session of the experiment. There were nine lecture sessions.

In addition to the lecture sessions, there were two control sessions to measure RT for the detection task without paying attention to the lecture, so that the total number of sessions was eleven. In the control sessions, two videos from the same video lectures were used, so that the participants knew the content and had little or no reason to be attracted to it. Participants were asked to focus on the white noise and were told that they did not have to pay attention to the content of the video on the display. The first control session was conducted as the 6th session with the first lecture video, and the second control session was conducted as the 11th session with the 6th lecture video, which had been used in the 7th session. The experiment was conducted over 2 days. Five lecture sessions and the first control session were performed on the first day, and the rest of the sessions (four lectures and one control session) were performed on the second day. The interval between the first and second days was within 1 week. The total duration of the experiment, 11 sessions, was approximately 130 minutes.

III. FACIAL FEATURE ANALYSIS
Participants' faces were recorded while they watched the lecture videos, and their facial features were analyzed after the experiment. We analyzed the face images recorded in the 3 seconds before each target presentation (the disappearance of the white noise), using OpenFace [25] to extract the facial features. To perform facial expression analysis with OpenFace, the first step is to gather facial images or video data. From each video frame, OpenFace detects a face (multiple faces can be detected, although there was only one face in our experiment) and locates it in the frame. It then extracts facial appearance, such as face orientation, and facial landmarks, such as the boundaries of the eyes, eyebrows, and mouth. By analyzing the position changes of the facial landmarks and the facial appearance, OpenFace evaluates the degree of facial muscle activity as action units (AUs). AUs are assigned to muscle movements related to facial expressions based on the Facial Action Coding System (FACS) [26]. For example, AU1 indicates the raising of the inner eyebrows, AU4 indicates the lowering of the eyebrows, and AU5 indicates the raising of the upper lids (Table 1). OpenFace offers several research advantages for facial analysis. Firstly, leveraging deep learning techniques, particularly convolutional neural networks (CNNs), OpenFace achieves high accuracy in facial recognition and feature extraction tasks. This is crucial for research projects that require precise identification and comparison of facial features. Secondly, OpenFace not only enables facial recognition but also facilitates the extraction of facial features such as expressions and poses. This broadens its applications in research areas such as facial emotion recognition, facial tracking, and facial attribute analysis. Thirdly, being an open-source toolkit, OpenFace allows researchers to modify and customize it according to their specific needs. This flexibility enables adjustments and improvements tailored to individual research objectives and various application scenarios. Fourthly, OpenFace supports processing large datasets of facial images and videos. This is particularly valuable for research projects that involve handling extensive data, such as facial recognition in video surveillance systems or the establishment of facial image databases. All of these advantages are important to us, particularly when applying research achievements in practical settings.
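A minimal sketch of how the per-frame AU outputs described above might be read back for analysis, assuming OpenFace's usual CSV layout (intensity columns named AU01_r, ..., presence columns named AU01_c, ..., plus confidence and success fields); the file name, threshold, and column names are illustrative assumptions, not details taken from the study.

import pandas as pd

def load_openface_aus(csv_path, min_confidence=0.8):
    """Load per-frame AU estimates from an OpenFace output CSV (path and threshold are assumptions)."""
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]  # some OpenFace builds pad column names with spaces
    if "success" in df.columns and "confidence" in df.columns:
        df = df[(df["success"] == 1) & (df["confidence"] >= min_confidence)]  # keep well-tracked frames
    au_r = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]  # AUr: intensity, 0-5
    au_c = [c for c in df.columns if c.startswith("AU") and c.endswith("_c")]  # AUc: presence, 0/1
    return df[["frame", "timestamp"] + au_r + au_c]

# Example with a hypothetical file name:
# aus = load_openface_aus("participant01_session03.csv")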
TABLE 1. Meanings of AUs are also listed.

We arbitrarily chose the period between 3 sec and 0 sec before target presentation as the time window during which the effect of attention might be reflected in target detection, but using 1 or 5 seconds gave similar results (see Fig. 5).

The features of the facial expressions were extracted as AUs from the video taken for each target presentation using OpenFace, as well as the positions and angles of the head and eyes. The meanings of the AUs are shown in Table 1. Two types of AU indexes are available from OpenFace: a continuous value between 0 and 5 for 17 AUs (referred to as AUr) and a binary value of 0 or 1 (absence or presence) for 18 AUs (referred to as AUc), which are the same 17 AUs plus AU28 (lip suck). Because we collected data for a 3-sec period for each target, we used statistical features of the time-varying values: the minimum, maximum, mean, standard deviation, and three percentiles (25%, 50%, and 75%) for AUr, and the mean and standard deviation for AUc. There were 155 parameters (variables) in total. There are perhaps better statistical features of sequential data than the ones used here; however, they were sufficient to show the usefulness of the AUs for predicting RT (see below). For better prediction in the future, we could investigate more complex temporal features.
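To make the feature construction concrete (7 statistics for each of the 17 AUr indexes plus 2 statistics for each of the 18 AUc indexes, i.e., 119 + 36 = 155 values per target), the following sketch computes the window statistics from one block of per-frame AU values; the column-naming convention and windowing are illustrative assumptions, not the authors' code.

import numpy as np
import pandas as pd

def window_features(window: pd.DataFrame) -> dict:
    """Summarize one 3-s pre-target window of per-frame AU values into 155 statistics."""
    feats = {}
    au_r = [c for c in window.columns if c.endswith("_r")]  # 17 intensity AUs
    au_c = [c for c in window.columns if c.endswith("_c")]  # 18 presence AUs
    for col in au_r:
        x = window[col].to_numpy(dtype=float)
        feats[f"{col}_min"] = float(np.min(x))
        feats[f"{col}_max"] = float(np.max(x))
        feats[f"{col}_mean"] = float(np.mean(x))
        feats[f"{col}_std"] = float(np.std(x))
        for q in (25, 50, 75):
            feats[f"{col}_p{q}"] = float(np.percentile(x, q))
    for col in au_c:
        x = window[col].to_numpy(dtype=float)
        feats[f"{col}_mean"] = float(np.mean(x))
        feats[f"{col}_std"] = float(np.std(x))
    return feats  # 17 * 7 + 18 * 2 = 155 features per target presentation

# Usage (hypothetical): take the frames recorded in the 3 s before a target at time t_target.
# window = aus[(aus["timestamp"] >= t_target - 3.0) & (aus["timestamp"] < t_target)]
# features = window_features(window)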
To investigate the relationship of facial expressions with the RT of target detection, we attempted to predict RT from AUs using a machine learning method called LightGBM [27]. LightGBM is a gradient boosting model that operates quickly and exhibits relatively accurate performance in general. It is a decision tree model with gradient boosting, in which the tree nodes grow so as to minimize the residuals. Because training data with large residuals are used preferentially, learning proceeds efficiently. Gradient boosting is a powerful machine learning technique that can be used for both regression and classification tasks. It works by combining multiple weak learners (simple decision trees) into a strong learner that is able to make accurate predictions on new data. In this study, two different methods were tested for predicting RT from AUs. One method was to train a model with the pooled data of all participants (pooled data model), and the other was to train a model with all but one participant and test it on the remaining participant (across-individual test models). The latter method was used to investigate individual differences. If individual differences are small, a model built with other participants should be able to predict the RTs of the participant being tested. However, individual variations may prevent the building of a general model that can be used for anyone whose data are not used to build the model.

For the evaluation of the models, a 15-fold cross-validation method was used. For the pooled data model, all data were divided randomly into 15 groups, 14 of which were used for training while the remaining group was used for testing. The process was repeated 15 times, one test for each group, and the average was used as the model performance. For the across-individual test model, data from 14 of the 15 participants were used for training, and data from the remaining participant were used for testing. The process was repeated 15 times, with one test for each participant, and the average of the 15 test scores was used as the model performance. Prediction performance was assessed by the root mean square error (RMSE) of the prediction against the data and by the Pearson's correlation coefficient between the data and the prediction (Fig. 2).
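A minimal sketch of the evaluation scheme described above: a random 15-fold split for the pooled data model, a leave-one-participant-out split for the across-individual test, and RMSE plus Pearson's correlation as scores. The variable names (X, y, participant_ids) and the use of default LightGBM hyperparameters are assumptions.

import numpy as np
import lightgbm as lgb
from scipy.stats import pearsonr
from sklearn.model_selection import KFold, LeaveOneGroupOut

def evaluate(X, y, groups=None, n_splits=15):
    """Average RMSE and Pearson r over cross-validation folds.

    groups=None            -> pooled data model (random 15-fold CV)
    groups=participant_ids -> across-individual test (leave one participant out)
    """
    if groups is None:
        splitter = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    else:
        splitter = LeaveOneGroupOut()
    rmses, rs = [], []
    for train_idx, test_idx in splitter.split(X, y, groups):
        model = lgb.LGBMRegressor()  # default hyperparameters (an assumption; not reported here)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmses.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
        rs.append(pearsonr(y[test_idx], pred)[0])
    return float(np.mean(rmses)), float(np.mean(rs))

# Pooled data model:        rmse, r = evaluate(X, y)
# Across-individual model:  rmse, r = evaluate(X, y, groups=participant_ids)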
FIGURE 2. The framework of the analysis: (a) Video recording of participants' faces while watching online lectures, and recording of the reaction time of target detection, measured as the time from target presentation (the disappearance of the white noise) to the key press indicating detection. (b) OpenFace detects facial appearance (orientation information at each location) and facial landmarks, such as the eyes, nose, mouth, and chin, for all video frames in each session. Facial appearance and landmarks are used to obtain AUs based on the Facial Action Coding System (FACS). Head pose and eye gaze are also detected; head pose is an important factor for analyzing face images in a normalized fashion. (c) OpenFace extracts action units (AUs) from the facial landmarks and appearance for each frame. (d) We used several statistical measures of the sequential AU values from a time window (3 s for the main analysis; 1 s and 5 s were also used) before each target presentation. The statistical measures used were the average, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles for the intensity indexes; only the average and standard deviation were used for the binary indexes. (e) The statistical measures from all AUs were used to predict reaction time with a machine learning method, LightGBM. LightGBM constructs a tree-type model with leaf-wise tree growth, choosing the leaf with the maximum delta loss to grow. (f) We compared predicted RTs with measured RTs, showing their correlation. A higher correlation indicates that the LightGBM model can predict RTs to the task-irrelevant stimulus well, so that the model can predict the attention level at the time of target presentation under the assumption that higher attention to the lecture makes RT to the task-irrelevant target longer. We also analyzed the strength of contribution of each feature using a method called Shapley additive explanations (SHAP), which shows the relationship between contribution values (strength of contribution to the prediction) and each of the feature indexes.
IV. RESULTS
Target presentations without responses within 10 sec were excluded from the reaction time (RT) analysis. Such target presentations occurred on 5.5% of trials on average across all participants. The average RT over all sessions of all participants was 1.1 sec, with a standard deviation of 2.3 sec. Because the average RT varied among participants, we normalized RT as Z-scores after taking the logarithm. We took the logarithm of RT to minimize the effects of its asymmetrical distribution (usually a heavy tail for longer RTs). We also used AU values normalized by Z-scoring to avoid the effects of individual variations in facial features. We expected the variations of the AUs after normalization to be related to changes in mental processes, whereas the absolute AU values include facial differences among individuals. We then applied LightGBM to model the relationship between RT and facial expressions, and tested the model using the 15-fold cross-validation method. Fig. 3a shows the prediction results of the pooled data model. The horizontal axis shows the RT measured in the experiment and the vertical axis shows the prediction from LightGBM. Each point represents one target presentation from all sessions of all participants, and different colors indicate different training-test combinations (15 combinations). The RMSE of the data's deviation from the predictions (or of the predictions from the data) was 0.75. By definition, the RT data have a mean of zero and unit standard deviation after Z-scoring. Thus, the RMSE of the model prediction (0.75, which is smaller than 1) indicates that the model can at least partially explain the data variation (25% in this case). The Pearson's correlation coefficient between the data and the prediction was 0.66, and a test of no correlation showed that this correlation was statistically significant (p < 0.001, t(2412) = 11); that is, the hypothesis that the correlation coefficient does not differ from zero was rejected at the 5% level.
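The normalization just described (a log-transform of RT followed by Z-scoring within each participant, with the AU features Z-scored analogously) can be written compactly as below; the column names are hypothetical.

import numpy as np
import pandas as pd

def normalize_rt(df: pd.DataFrame) -> pd.Series:
    """Log-transform RT, then Z-score it within each participant."""
    log_rt = np.log(df["rt_sec"])                    # tame the heavy right tail of the RT distribution
    return log_rt.groupby(df["participant"]).transform(
        lambda x: (x - x.mean()) / x.std(ddof=0)     # per-participant Z-score
    )

# df["rt_z"] = normalize_rt(df)   # df has columns "participant" and "rt_sec" (assumed names)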
FIGURE 3. (a) Correlation between the measured reaction time (RT) and the RT predicted by the model. Each point represents the RT of one target presentation from all sessions of all participants. Different colors indicate different training-test combinations (15 combinations). (b) Indexes arranged according to their level of contribution to the prediction, obtained using the Shapley additive explanations (SHAP) method. Each point is from one RT, as in (a), and the color (red or blue) indicates a high or low value of the feature index. The horizontal axis indicates the level of contribution to the RT predicted by the model. (c) The absolute value corresponding to the contribution of each index to the prediction, as estimated by SHAP.
In addition to the statistical significance of the correlation coefficient, we also used a statistical test of the RMSE to show that our prediction is better than chance. We compared the RMSE of the model prediction with that of the data, which is one after Z-scoring, using a t-test (p < 0.001, t(14) = 16.62). The present analysis successfully predicted RT to task-irrelevant targets, which we assumed to vary depending on the attention state. This prediction of RT, in turn, predicts the attention state during learning a few seconds before the target presentation. We concluded that facial features and movements of the head and eyes contain information about attention.

Further analysis revealed the level of contribution of each index to the prediction (i.e., the importance of each index for the prediction) using a method called Shapley additive explanations (SHAP) [28]. SHAP provides the value that corresponds to the contribution of each input feature to the prediction (Fig. 3c). AU9 (nose wrinkler), AU45 (blink), AU15 (lip corner depressor), and AU7 (lid tightener) were among the best five contributors of all AUs. The analysis also provides the degree of contribution of each input feature for predicting each event of target detection, as shown by the dots in Fig. 3(b). Red dots indicate high values of the facial feature indexes and blue dots indicate low values. The patterns of the red and blue dot distributions show, for example, that AU9 contributes negatively to RT: higher values (red dots) were distributed toward the negative direction of the horizontal axis, indicating that shorter RT was associated with more nose wrinkling, which, in turn, suggests that less attention was paid to the lecture when more nose wrinkling was exhibited. We will discuss the effect of these AUs in more detail in the Discussion section.
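The SHAP analysis of a trained LightGBM model is typically run with the shap package's tree explainer; the sketch below illustrates calls of the kind that yield per-target contribution values as in Fig. 3(b) and a ranking by mean absolute contribution as in Fig. 3(c). The argument names are assumptions, and the plotting call is only one possible visualization.

import numpy as np
import shap

def explain_model(model, X, feature_names, top_k=5):
    """SHAP contributions of a fitted LightGBM regressor (argument names are assumptions)."""
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)        # one contribution per feature per target presentation
    # Beeswarm-style summary, analogous to Fig. 3(b): each dot is one target presentation,
    # colored by the feature value and positioned by its contribution to the predicted RT.
    shap.summary_plot(shap_values, X, feature_names=feature_names)
    # Ranking by mean absolute contribution, analogous to Fig. 3(c).
    mean_abs = np.abs(shap_values).mean(axis=0)
    return sorted(zip(feature_names, mean_abs), key=lambda t: t[1], reverse=True)[:top_k]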
We performed control sessions to confirm that watching a lecture video influences RT to the auditory target. In the control condition, participants were asked to detect the target without paying any attention to the video lecture. RTs in this condition are considered to reflect full attention to the auditory target. The average RT for the two control sessions across all participants was 0.7 sec. This is clearly shorter than the average RT in the lecture sessions, which was 1.1 sec, and the difference in RT between the experiment and control sessions was statistically significant (p < 0.05, t(14) = 2.74), indicating that RT to the target was an appropriate measure of attention to the lectures.
FIGURE 4. Comparison of four different models: Support Vector Regression (SVR), Multilayer Perceptron (MLP), Linear Regression, and LightGBM.
We attempted to predict the RTs of the control conditions with the same procedure used for the lecture sessions. The results revealed that the RMSE of the predictions was 0.89, and the Pearson's correlation coefficient between the data and the prediction was 0.45, which was not statistically significant (p = 0.092, t(517) = 3.1).

There are three issues to examine before accepting the results. The first is whether the results depend on the choice of machine learning method, the second is whether they depend on the selection of the time window, and the third is whether they depend on individual variations. First, we used three models other than LightGBM as a comparison: Support Vector Regression (SVR), Multilayer Perceptron (MLP), and Linear Regression. The results showed that the accuracy of LightGBM is similar to that of SVR, which is better than MLP and Linear Regression (Fig. 4), and that the analysis time was shortest for LightGBM among the four methods.
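The model comparison reported here (Fig. 4) can be reproduced in outline by swapping the regressor inside the same cross-validation loop. The sketch below is schematic: it assumes an evaluate-style helper like the one sketched earlier (here called evaluate_with and taking the model as an argument) and default hyperparameters, which the paper does not specify.

import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# SVR and MLP are scale-sensitive, so they are wrapped with a standardizer.
candidates = {
    "LightGBM": lgb.LGBMRegressor(),
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "MLP": make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=0)),
    "LinearRegression": LinearRegression(),
}

# for name, model in candidates.items():
#     rmse, r = evaluate_with(model, X, y)   # same 15-fold CV as before (hypothetical helper)
#     print(f"{name}: RMSE={rmse:.2f}, r={r:.2f}")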
Second, there is no theoretical reason to select a particular period of time for the facial feature extraction used to estimate RTs. We therefore used 1- and 5-second windows in addition to the 3-second window to see the effect of window length on the analysis. The results were similar for the three cases (Fig. 5). A t-test of the RMSE of the model prediction against one showed statistical significance for both the 1- and 5-second windows (p < 0.001, t(14) = 15.32 for 1 s and p < 0.001, t(14) = 18.08 for 5 s).

Third, we tested whether a model built with other individuals' data (across-individual models) can predict the data of another individual. Figure 6 shows the results of the predictions. Surprisingly, the results revealed no successful prediction across participants. Thus, a model that was based on a group of individuals could not be used to predict the attention level of an individual who was not in that group. The face information related to attention appeared to vary from participant to participant.

V. DISCUSSION
In the current study, we measured RT to task-irrelevant targets as an index of attentional level. With the RTs, we developed a method for predicting engagement with video lectures using a machine learning technique. Our approach was to predict the response time under the assumption that the response time would become longer when more attention was paid to the lecture, reducing attention to a target that was irrelevant to the lecture. The model built for the prediction provided information about the facial features that contributed most to the prediction, which were as follows: AU9 (nose wrinkler), AU45 (blink), AU15 (lip corner depressor), and AU7 (lid tightener). Here, we discuss possible explanations for the importance of these factors in predicting RTs. AU9 was negatively related to RT. Longer RT was associated with more attention to the lecture, suggesting that AU9 was negatively related to the amount of attention paid to the lecture; increased nose wrinkling was associated with deviation of attention from the lecture. On the other hand, the results suggest that AU45 (blink), AU15 (lip corner depressor), and AU7 (lid tightener) were positively related to the level of attention paid to the lecture. Thus, more depression of the lip corners, more frequent blinking, and more tightening of the eyelids are expected when a person pays more attention to lectures. Lip corner depression may be related to situations in which a learner has difficulty understanding the lecture. This may lead the learner to try to attend more to the lecture and to exhibit a serious facial expression. Tightening the eyelids and blinking are similar facial actions, and both may be related to making an effort to understand the content of the lecture by opening the eyes wider. However, more blinking and eyelid tightening may also be related to sleepiness. When a person is sleepy, they would be likely to attend neither to the lecture nor to any task-irrelevant stimuli, which would result in longer RT to the target even without a high level of attention being paid to the lecture. Although the present experiment was designed assuming only two attention states, attending to the lecture or to the task-irrelevant target, the attention level could potentially be reduced by sleepiness, resulting in longer RTs to the target along with decreased attention to the lecture. We attempted to estimate the effect of sleepiness during the lectures and re-analyzed the data.

To exclude the possible influence of sleepiness on the results, we re-analyzed the data after removing data with sleepy faces. To identify times at which a participant appeared
FIGURE 5. We applied 1-, 3-, and 5-second time windows to examine the effect of window length on the facial feature analysis.
FIGURE 7. (a) Histogram of the standard deviation of gaze movements (gaze SD). The gaze SDs before target presentations with subjectively identified sleepy faces were smaller than the red line, and we assumed that RTs with gaze SDs larger than the red line were not influenced by sleepiness. (b) Correlation between measured and predicted RTs for the data without the influence of sleepiness. Configurations are the same as in Fig. 3(a). (c) Indexes arranged according to their level of contribution to the prediction, obtained using SHAP. Configurations are the same as in Fig. 3(b).
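Following the criterion described in the Fig. 7 caption, trials can be screened by the standard deviation of gaze direction in the pre-target window. The sketch below is illustrative only: the threshold value and the gaze column names are placeholders, since the paper specifies the cutoff only graphically (the red line).

import numpy as np
import pandas as pd

GAZE_SD_THRESHOLD = 0.05  # placeholder value; the paper gives the cutoff only as the red line in Fig. 7(a)

def gaze_sd(window: pd.DataFrame) -> float:
    """Pooled standard deviation of the gaze-angle signals in one pre-target window."""
    return float(np.std(window[["gaze_angle_x", "gaze_angle_y"]].to_numpy()))  # column names assumed

def drop_sleepy_trials(windows, rts):
    """Keep only trials whose gaze SD exceeds the threshold (assumed not influenced by sleepiness)."""
    keep = np.array([gaze_sd(w) > GAZE_SD_THRESHOLD for w in windows])
    return [w for w, k in zip(windows, keep) if k], np.asarray(rts)[keep]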
findings. Because the AUs themselves may contain information about the facial features of individual participants, the AU analysis might identify individuals. If there were substantial individual variation in RTs in the present experiment, identification of individuals by facial features could potentially predict the RT results with some level of accuracy, because there is a correlation between facial features and RT for individuals. However, because we used normalized values of RTs and AUs for each participant, the averages of each parameter did not exhibit any correlations among the parameters. In other words, the individual differences we found could be explained by individual differences in the contributions of facial features to RT estimation.

To investigate the effects of individual variability, we first conducted the same analyses on the data from each participant separately. Because the amount of data for each participant is relatively small, we performed a 5-fold (instead of 15-fold) cross-validation analysis on each participant's own data. The average RMSE of the prediction against the data for all participants was 1.01, very close to the baseline (Fig. 8c). This RMSE is as poor as that of the across-individual models (shown in Fig. 6), likely because of the small amount of data used for each model even with 5-fold cross-validation. We then examined the effect of the size of the data set on prediction accuracy, and found that approximately 20% of all data were required to obtain a training effect with an RMSE of about 0.8 (Fig. 8f). To keep the proportion of the data set larger than 20%, we compared the predictions between within- and across-participant analyses using data sets of three or five groups of participants, instead of data sets of individual participants. Better predictions in the within-group analysis compared with the across-group analysis were expected if there were large individual variations in the feature expressions related to engagement with the lecture. In the case of the three-group division, two of the three groups were used for training and the third group was used for testing in the across-group analysis, while four groups were used for training and the fifth for testing in the case of the five-group division. For the within-group analysis, data were divided into three or five sets, selecting an equal number of data points from each group (in the three-group case, each data set contained one third of the first group, one third of the second group, and one third of the third group). These three or five data sets were then used for three- or five-fold cross-validation testing.
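The across-group and within-group evaluations described above map naturally onto grouped versus group-balanced cross-validation splits; the sketch below shows one way to construct them, with the participant-to-group assignment and variable names as assumptions.

from sklearn.model_selection import GroupKFold, StratifiedKFold

def across_group_splits(X, y, group_ids, n_groups=3):
    """Across-group analysis: train on n_groups - 1 participant groups, test on the held-out group."""
    return GroupKFold(n_splits=n_groups).split(X, y, groups=group_ids)

def within_group_splits(X, group_ids, n_groups=3):
    """Within-group analysis: every fold gets an (approximately) equal share of each group."""
    return StratifiedKFold(n_splits=n_groups, shuffle=True, random_state=0).split(X, group_ids)

# group_ids assigns each target presentation to one of the 3 (or 5) participant groups (assumed).
# Each split can then be trained and scored with the same LightGBM loop sketched earlier.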
FIGURE 8. (a) Results of the three within-group analysis, (b) the five within-group analysis, (c) each individual participant, (d) the three across-group analysis, and (e) the five across-group analysis. (f) Prediction performance as a function of data-set size.
Figure 8 shows the results of both the within- and across-group analyses for the three and five groups, in addition to the averaged individual predictions. The prediction accuracy was better for the within-group analyses than for the across-group analyses. RMSE values were 0.86 and 0.78 for the three- and five-group within-group analyses, respectively, and 1.10 and 1.08 for the three- and five-group across-group analyses, respectively. A t-test of the RMSE of the model prediction against one showed statistical significance for both the 3- and 5-group cases (p < 0.001, t(14) = 7.11 for 3 groups and p < 0.001, t(14) = 11.19 for 5 groups). These results indicate nontrivial individual variations in the relationship between facial expressions and engagement. These variations do not mean that there is no common factor shared by some individuals, because pooling data from many participants was shown to improve the prediction (compare Figs. 7 and 8). SHAP values for the three- and five-group analyses showed that AU9 and AU2 were among the best five features in both cases. AU9 was also included in the original analysis with all data. This result suggests that these features are important for all individuals, while other features that differ substantially across individuals could impair the across-participant predictions. Although individual variation without doubt limits the use of the model, it is possible to construct a model for a group of individuals with similar properties.

The results suggest that individual variation is substantial, and in general appears to be a disadvantage when the present technique is applied to a support system using a model trained with different individuals. However, the model can be customized to each individual, and models constructed for particular individuals may be more precise. Although individual variation should be investigated further to understand the essential factors, the technique developed here can be used for applications in actual educational settings.

Although psychophysical studies have used sound stimuli as a probe to measure attention level [30], [31], [32], such an approach is not practical in actual learning situations. Therefore, we investigated whether facial images are sufficient to provide indexes of the attention level. The model performance depends on the performance of OpenFace. Although Baltrusaitis et al. [33] reported that the accuracy of OpenFace is better than that of other methods, its performance is obviously not perfect, and it depends on the recording conditions of the faces. Our estimation of RT from AUs therefore includes a certain amount of estimation error in the facial features. We believe that this analysis is useful for obtaining information about a learner's condition (mental state) at each moment in order to provide appropriate feedback. For example, 70% correct detection of reduced attention to a lecture should be useful for providing a warning signal to the learners and/or the lecturer. Three erroneous warnings out of ten should not be a problem if the warning signal does not disturb the class much. Moreover, the detection rate becomes higher than 99% if there are more than five learners who lose attention to the class, even when it is 70% for one learner.
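The closing claim about warning reliability follows from treating each learner's detection as an independent event with a 70% hit rate; a two-line check (independence is our added assumption):

# Probability that reduced attention is flagged for at least one of n inattentive learners,
# assuming an independent 70%-accurate detection per learner (independence is our assumption).
p_single = 0.7
for n in (1, 5):
    p_at_least_one = 1 - (1 - p_single) ** n
    print(n, round(p_at_least_one, 4))  # n=1 -> 0.7, n=5 -> 0.9976 (above 99%)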
RENJUN MIAO was born in Wenzhou, Zhejiang, in 1986. He received the bachelor's degree in mechanical automation engineering from the Zhejiang University City College, in 2008, and the master's degree in information engineering from Tohoku University, Japan, in 2012, where he is currently pursuing the Ph.D. degree, with a focus on affective computing, mainly on detecting the quality of students' online education through the change of facial expressions. From 2010 to 2012, his main research focus was on signal processing of color and shape in brain visual neurology. He has been an Engineer since graduation and an Education SAAS Development Supervisor, in 2017.

YOSHIYUKI SATO received the B.S. degree from Kyoto University, in 2004, and the M.S. and Ph.D. degrees from The University of Tokyo, Japan, in 2006 and 2009, respectively. From 2010 to 2016, he was an Assistant Professor with the University of Electro-Communications, Japan. From 2012 to 2013, he was a Visiting Professor with Northwestern University, USA. From 2016 to 2018, he was a Project Researcher with The University of Tokyo. Since 2018, he has been a specially-appointed Assistant Professor with Tohoku University, Japan. His research interests include mathematical and machine learning modeling of human behaviors, including perception, cognition, attention, motor functions, and communications.