
Arabian Journal for Science and Engineering (2022) 47:10225–10243

https://doi.org/10.1007/s13369-021-06548-w

RESEARCH ARTICLE-COMPUTER ENGINEERING AND COMPUTER SCIENCE

Predicting Student Performance from Online Engagement Activities Using Novel Statistical Features

Ghassen Ben Brahim1

Received: 25 August 2021 / Accepted: 26 December 2021 / Published online: 17 January 2022
© King Fahd University of Petroleum & Minerals 2022

Abstract
Predicting students' performance during their years of academic study has been investigated extensively. It offers important insights that can help and guide institutions to make timely decisions and changes leading to better student outcome achievements. In the post-COVID-19 pandemic era, the adoption of e-learning has gained momentum and has increased the availability of online learning data. This has encouraged researchers to develop machine learning (ML)-based models to predict students' performance during online classes. The study presented in this paper focuses on predicting student performance during a series of online interactive sessions by considering a dataset collected using the Digital Electronics Education and Design Suite. The dataset tracks the interaction of students during online lab work in terms of text editing, number of keystrokes, time spent in each activity, etc., along with the exam score achieved per session. Our proposed prediction model consists of extracting a total of 86 novel statistical features, which were semantically grouped into three broad categories based on different criteria: (1) activity type, (2) timing statistics, and (3) peripheral activity count. This set of features was further reduced during the feature selection phase, and only influential features were retained for training purposes. Our proposed ML model aims to predict whether a student's performance will be low or high. Five popular classifiers were used in our study, namely: random forest (RF), support vector machine, Naïve Bayes, logistic regression, and multilayer perceptron. We evaluated our model under three different scenarios: (1) an 80:20 random data split for training and testing, (2) fivefold cross-validation, and (3) training the model on all sessions but one, which is used for testing. Results showed that our model achieved the best classification accuracy of 97.4% with the RF classifier. We also demonstrated that, under a similar experimental setup, our model outperformed other existing studies.

Keywords Machine learning · Random forest · Student performance prediction · Feature extraction · Binary classification

1 Introduction

The prediction of student performance has been the focus of many educational institutions due to the insight that it provides in terms of dropouts and learning outcome attainment. Such insights may support institutions in making learning adjustments or adopting new learning management strategies to create novel learning opportunities for the students. With the application of technology in the learning process through the wide deployment of learning management systems (LMS) and course learning platforms, the amount of collected data is becoming significantly large, warranting sophisticated and intelligent techniques to manage and analyze such data.

Student performance prediction is considered an important area of educational data mining (EDM) and learning analytics (LA) [1]. It is becoming more challenging in the presence of ever-increasing data volumes. EDM and LA are considered closely related disciplines, and they aim to support the process of analyzing related educational data. Together, they provide tools and techniques to collect, process, and analyze educational data. In most cases, the analysis is based on data mining, machine learning, and statistical analysis techniques to explore hitherto unknown patterns in different types of historical data.

Researchers have focused on different data features to predict student performance: these include previous exam scores, background, demographic data, and activity in and outside the classroom, to name but a few [2].

Correspondence: Ghassen Ben Brahim, [email protected]

1 Department of Computer Science, College of Computer Engineering and Science, Prince Mohammad Bin Fahd University, Al-Khobar 31952, Saudi Arabia


Though most of the performance prediction work tends to focus on previous exam scores, very little work seems to target the analysis of data wherein student interactions with online systems are logged and analyzed [3]. Using past scores has two disadvantages: first, it predicts performance in the long term, such as predicting a student's performance in junior or sophomore year courses based on exam and course grades from the freshman or sophomore year; second, even if the grades of major1 and major2 exams are used to predict performance in the course, it is then too far into the semester for any corrective actions to be taken to help the student succeed. Therefore, in the work introduced herein, we propose focusing on performance prediction problems while exploring data describing students' online interactions during online exam sessions. We experiment with a dataset that has been collected using the Digital Electronics Education and Design Suite (DEEDS)—a simulation software used for e-learning in a Computer Engineering course (Digital Electronics) [1]. The DEEDS platform logs student activities and actions during exams (such as viewing/studying exam content, working on a specific exercise, using a text editor or IDE, or viewing exam-related material). Our literature survey indicates that this DEEDS dataset has been explored twice, in [1, 4], wherein the authors attempted to predict student performance based on exam complexity and to predict exam difficulty based on student activities from prior sessions, respectively.

The aim of this research work is to build a prediction model based on newly extracted statistical features aimed at predicting students' performance from their online activities. To build and refine the model, we proceeded as follows. Initially, we proposed new features that were categorized into three broad categories, based on different criteria: (1) activity-type count-based, (2) timing statistics-based, and (3) peripheral activity count-based, resulting in a total of 86 features. We also proposed a further improvement of the model by reducing the set of features and eventually keeping the most influential (significant) ones using the entropy-based feature selection method [35, 36]. The proposed model was then evaluated and compared with existing work addressing the same problem and using the same DEEDS dataset. We have shown that our proposed model outperforms existing ones in terms of classification accuracy.

The key contributions of this research work can be summarized as follows.

• Categorization of student academic performance-related features in existing datasets.
• Statistical analysis of the DEEDS dataset, which supported the feature extraction process.
• The design of a student performance prediction model based on the extraction of a set of statistical features, categorized into three broad categories: (1) activity-type based, (2) timing statistics-based, and (3) peripheral activity count-based. These features go through an extraction phase followed by a feature selection phase using an entropy-based selection method.
• Performance evaluation of the proposed model by considering the following classifiers: random forest (RF), support vector machine (SVM), Naïve Bayes (NB), logistic regression (LR), and multilayer perceptron (MLP).
• Comparative performance analysis between our proposed model and some of the existing published research proposing students' performance prediction models using the DEEDS dataset [4, 30, 31].

The rest of the paper is organized as follows. Section 2 presents the background on the student performance prediction domain: its importance, applicable prediction metrics, dataset categorization, and an overview of prediction models. Section 3 describes our proposed approach considering student engagement data, wherein the DEEDS dataset is presented along with the feature extraction process. Section 4 details the model performance in terms of prediction accuracy, and then a comparative study is presented in relation to the existing work. Finally, in Sect. 5, conclusions are drawn and future research directions are suggested.

2 Student Performance Prediction: Overview

This section starts with a brief background on the importance of student performance prediction. Then it overviews the performance prediction targets in terms of prediction goals. Next, it surveys and categorizes the set of features commonly considered in most of the datasets used during the prediction process. Finally, it overviews the prominent approaches and models being used, along with their achieved performance.

2.1 Importance of Student Performance Prediction

The problem of predicting student performance has been extensively studied by the research community as part of the learning analytics topic, due to its importance for many academic disciplines [2, 3, 5–7]. Based on the goal of the performance prediction model, the benefits may include the following: (1) improved planning and accurate adjustments in education management strategies to yield enhanced attainment rates in program learning outcomes [8]; (2) identifying, tracking, and improving student learning outcomes and their impact on classroom activities—for instance, prediction models could be tuned to classify student performance as low, average, or high.


Based on the classification results, concerted measures may be taken by education managers to support the low-performing students [7]; (3) proposing new, formative learning approaches for students based on their predicted performance [8]—for instance, students may be advised to adopt different learning strategies, such as emphasizing the practical aspects of the course material; (4) allocating resources to students based on their predicted performance—for instance, the identification and prediction of high-performing students will help institutions estimate the number of awarded scholarships [9]; and (5) minimizing student dropout rates, which are considered a resource black hole that impacts graduation rates, quality, and even institutional ranking [10].

2.2 What to Predict?

Student performance prediction models have targeted several metrics, both quantitative and qualitative in nature. The amount of research work predicting quantitative metrics outweighs that for qualitative metrics [8]. Qualitative metrics have mainly focused on Pass/Fail or letter-grade classifications of students in particular courses [11]; overall student assessment prediction in terms of high/average/low, where this type of assessment could be performed per course, major, topic, etc. [3]; student knowledge accomplishment levels (First/Second/Third/Fail) [4, 12]; or classifying students into low/medium/high risk [7]. By contrast, quantitative metrics have mainly attempted to predict scores or course/exam/assignment grades [5], ranges of course/exam/assignment grades [6], major dropout/retention rates [10], the time needed for exam completion, on-time/delayed graduation, and student engagement as well [13].

2.3 Dataset Features Categorizations

Most of the datasets that have been used to machine-learn student performance have considered historic data that can be categorized into three broad categories based on the attribute types [2]: (1) student historic performance attributes, (2) student demographic attributes, and (3) student learning platform interaction (engagement) attributes. These categorizations were further extended into a more comprehensive classification of features to include two more categories [3], namely: (4) personality—to better describe the subject's capability and ability (such as efficacy, commitment, efficiency, etc.), and (5) institutional—to better describe the teaching methods, strategies, and qualities [14]. We have surveyed several datasets that have been considered for student performance prediction studies and have summarized them in Table 1, where we list the five categories along with most of the common and relevant features being used in each of these five categories. Table 1 also includes the stated aim of the prediction study per category.

2.4 Related Work

Several methods and approaches have been considered in predicting student performance; most of these approaches are statistical in nature and designed for machine learning (ML) models. The models attempt to estimate an inherent correlation between input variables and identify patterns within the input data. Following our review of most of the existing datasets, these attributes can be classified under any of five categories, namely: (1) student historic performance attributes, (2) student demographic attributes, (3) student learning platform interaction attributes, (4) personality attributes, and (5) institutional attributes, as detailed in Table 1.

Among the two existing types of ML models, supervised learning techniques are a better fit for handling classification and regression problems and have been more widely used for the student prediction problem than unsupervised learning techniques. Classification approaches attempt to classify entities into known classes, which are two in the case of binary classification (for example, classifying students into passing or failing classes) or more in the case of multinomial classification. On the other hand, in regression-based approaches, the model attempts to predict a continuous value (for instance, the final exam score, which could be a real number between 0 and 100). This makes regression techniques more challenging compared to classification problems.

In the context of student performance prediction, several supervised learning models have been used, considering datasets with each of the five feature categories as well as their combinations, and targeting the prediction of specific performance features (as specified in Table 1). For instance, the authors in [2] studied the impact of each of three categories of features (student engagement, demographics, and performance data) on predicting student performance, using binary classification-based models to predict at-risk students and regression techniques to predict student scores. They also studied the prediction performance at different time instances before the final exam. The analysis was performed on a public open university learning analytics dataset (OULAD) using support vector machine (SVM), decision tree (DT), artificial neural network (ANN), Naïve Bayes (NB), K-nearest neighbor (K-NN), and logistic regression (LR) models for classification, and SVM, ANN, DT, Bayesian regression (BN), K-NN, and linear regression models for regression analysis. For the classification task, better performance results were obtained with ANN while considering engagement and exam score data (F1-score ~ 96%). The same performance was also obtained for the regression analysis task, where ANN outperformed other algorithms when the model was fed more historic assessment scores (RMSE ~ 14.59).


Table 1 Summary of feature categories vs. targeted attributes

Historic performance [2]
- Feature sub-categories: pre-course performance; course performance; prior school performance
- Features: exam/assignment/project/lab/quiz/seminar results, previous semester performance, admission score, cumulative GPA, topic course performance, average course grade, course Pass/Fail
- Targeted/predicted attributes: course/program dropout & retention, assessment grades, GPA range, actual GPA, course Pass/Fail, course letter grade, course grade, assignment grade

Demographic [27]
- Feature sub-categories: family background; working conditions; population
- Features: gender, age, marital status, children, occupation, computer literacy, skills, location, income, caretaker, support, ethnicity, international, minority, distance to school, daily commute time, internet access
- Targeted/predicted attributes: program dropout & retention, GPA range, course Pass/Fail, course letter grade, assignment grade

Learning platform interactions (engagement) [1]
- Feature sub-categories: time-based performance; activity count interaction performance
- Features: duration/number of visited materials, search activity, face-to-face meetings, discussion participation, frequency of deck comments, number of attempted exams, number of clicks, attendance, chat/forum activity, time on task, system interaction duration
- Targeted/predicted attributes: assessment grades, course Pass/Fail, course letter grade, course grade, assignment grade

Personality [3]
- Feature sub-categories: studying strategy-based; feeling-based
- Features: efficacy, motivation, study habits, anxiety, positive working strategy, negative working strategy, goal-oriented, emotional, learning style
- Targeted/predicted attributes: GPA range, course Pass/Fail, course letter grade, program dropout & retention

Institutional [14]
- Feature sub-categories: teaching strategy; institution type
- Features: university education, high school, pedagogical mode, teaching mode, formal/informal education, training mode, type of material available, deep/shallow learning approach
- Targeted/predicted attributes: course/program dropout & retention, at-risk students

The authors in [5] focused on predicting student performance based on historic exam grades and in-progress course exam grades. The goal of the study is to identify, per area and per subject, the "at-risk" students and hence provide real-time feedback about their current status. Such feedback would drive appropriate remedial strategies to support these students and eventually help improve retention rates. The authors conducted their studies on 335 students and 6358 assessment grades while considering 68 subjects categorized into 7 different knowledge areas. The prediction model used the decision tree algorithm to classify students into passing and failing categories. In their effort to identify the most influential variable, the authors ran their model while considering all possible combinations of final grades and their weighted partial (1 to 3) grades. The best model accuracy (96.5%) was reported when all partial grades were included in the prediction.

In a different study, Huang and Fang [7] focused on predicting student performance in a specific field: engineering dynamics. The authors conducted their research considering four predictive models (multiple linear regression (MLR), multilayer perceptron network, radial basis function network, and a support vector machine-based model). Their study aimed to identify the best prediction model and the most influential variables among a set of six considered ones, leading to a more accurate prediction model. The dataset considered in this study included only the historic performance type of data. It was a collection of 323 students over 4 semesters and included nine different dynamics grades and pre-requisite dynamics courses. Results showed that the type of mathematical model being used had little impact on the prediction accuracy. Regarding the most influential variables, results showed that they vary based on the goal of the instructor in terms of what to predict (average performance of the entire dynamics class or individual student performance). The best accuracy results were achieved with MLR (reaching 89.7%). Similar to the proposed model in [5], this model only considered one category of student data (historic performance-related data) and did not include other categories such as engagement or demographic data.

As online teaching gains more and more popularity, it is becoming necessary for all schools to provide e-learning options to their students, especially after the COVID-19 pandemic. Many research works have studied and evaluated the performance of such learning in a virtual environment. For instance, Hussain et al. [15] focused on studying the impact of student engagement in a virtual learning environment on their performance in terms of attained exam scores. This study considered various variables including demographics, assessment scores, and student–system engagement, such as the number of clicks that a student executes to access certain URLs, resources, homepages, etc. The authors considered several machine learning classification algorithms to predict low-engagement instances, such as decision trees (DT), JRIP, J48, gradient-boosted trees (GBT), CART, and Naïve Bayes. The best performance in terms of predicting students with low engagement was obtained with the first four algorithms (topping 88.5%). These results also identified the best predictive variables, in terms of the number of clicks students executed per activity, for the identification of low-engagement students.

In the same context, Vahdat et al. [1] aimed to study the impact of student behavior during online assessments on the scores obtained. The authors used complexity matrix and process mining techniques to identify any correlation between attained scores and student engagement. The dataset was collected using the Digital Electronics Education and Design Suite (DEEDS), a simulation software used for e-learning in a Computer Engineering course, namely Digital Electronics. The analysis showed that: (1) the complexity matrix and student scores are positively correlated, and (2) the complexity matrix and session difficulties are negatively correlated. Additionally, the authors demonstrated that the proposed process discovery could provide useful information about student learning practices.

In a different study, Elbadrawy et al. [16] built a model to accurately predict student grades based on several types of features (past grade performance, course characteristics, and student interaction with the online learning management system, aka student engagement). The built model relies on a weighted sum of collaborative multi-regression-based models, which were able to improve the prediction accuracy by over 20%.

Along the same directions as [16], Liu and d'Aquin [17] attempted to predict student performance based on two categories of features: demographics and student engagement with the online learning system. They applied supervised learning-based algorithms in their model on the Open University Learning Analytics dataset [18] and investigated the relationship between demographic features and the achieved performance. The analysis showed that the best-performing students were those who had acquired a higher education level and were residing in the most privileged areas.

Hussain et al. [4] investigated predicting difficulties that students may face during Digital Electronics lab sessions based on previous student activities. They identified the best predicting model, as well as the most influential features. The authors only considered the engagement type of data, using the Digital Electronics Education and Design Suite (DEEDS) simulator [1]. They conducted their study considering the following five features: average time, average idle time, the total number of activities, total related activities, and the average number of keystrokes.


Five classification algorithms were explored: support vector machines (SVMs), artificial neural networks (ANNs), Naïve Bayes classifiers, logistic regression, and decision trees. While considering fivefold cross-validation and random data division, the best accuracy results were obtained with the ANN- and SVM-based models (75%). This performance was later improved and reached 80% when applying the Alpha Investing technique to the SVM-based model.

The DEEDS dataset has also been used in other research work where researchers have attempted to predict the performance of students, and the difficulties they face, by analyzing students' behavior during interactive online learning sessions. For instance, in [30], the authors attempted to predict student performance using the DEEDS dataset by considering regression models. The input variables of the model were the students' study data for all sessions, while the model output consists of the student's grade for a specific session. Among the three models used (linear regression, artificial neural networks, and support vector machine (SVM)), SVM performed best and achieved an accuracy of 95%.

In a different research work, the authors in [31] also considered the DEEDS dataset to perform a comparative analysis using various machine learning models such as artificial neural network, logistic regression, decision tree, support vector machines, and Naïve Bayes. In their study, the authors extracted and considered five different types of features, including average time, total activities, average mouse clicks, related activities in an exercise, average keystrokes, and average idle time per exercise. In this study, SVM performed best compared to the rest of the models and achieved an accuracy of 94%.

In [32], the authors considered different datasets in their attempt to predict the performance of students in two different courses (namely Mathematics and Portuguese language). The datasets have a total of 33 attributes each and a total of 396 and 649 records, respectively. The authors applied two models, support vector machines and random forest, to classify passing students from failing ones. Both datasets comprise 16 features, including historic performance, demographic data, and personality types of features. Experimental results showed an accuracy of more than 91%, and the historic features were the most influential during classification. It is also worth noting that random forest performed better in the case of the larger dataset (Portuguese language).

The authors in [33] proposed a model to predict students' performance in higher educational institutions using video learning analytics and data mining techniques. A sample of 722 students was considered in the collection of the dataset. Among the five categories highlighted in Table 1, only the historic performance, engagement, personality, and institutional categories were used. Out of the eight algorithms used, RF performed best and achieved an accuracy of 88% following the feature reduction step.

In another study [34], the authors used artificial neural networks (ANNs) to predict the performance of students in an online environment. The dataset was collected with the participation of 3518 students. Out of the five categories of features described in Table 1, only the historic performance and learning platform interaction categories were considered. Results showed that the ANN-based model was able to predict the students' performance with an accuracy of 80%.

A summary of student performance prediction using machine learning-based models is presented in Table 2. It is worth noting that the last entry in Table 2 captures the performance achieved by the proposed model in predicting students' performance while considering student engagement and historic types of features.

Our proposed research explores the area of predicting student performance in an e-learning environment, which is gaining more popularity, especially after the COVID-19 pandemic. We propose exploring the DEEDS dataset, which is, to the best of our knowledge, studied twice, by Vahdat et al. [1] and Hussain et al. [4]. DEEDS is a technology learning environment platform that logs the real-time interactions of students during classwork as well as exam performance in terms of grades; these logs were collected in six different interactive sessions. We intend to conduct a statistical study on the DEEDS dataset, followed by the design of a new prediction model based on a new set of statistical features to predict student performance using their interaction logs registered by DEEDS. We propose assessing our model's performance using five different types of classifiers in different experimental setups. We will also compare the achieved performance in terms of accuracy and F1-score to that proposed by Hussain et al. [4].

3 Proposed Methodology

The proposed method aiming to classify student performance follows a typical machine learning classification approach. Initially, we start with a statistical analysis and feature engineering of the DEEDS dataset, which resulted in the reduction of DEEDS activity features from 15 to 9 activities. This reduction (as will be described in Sect. 3.2) consists of aggregating semantically similar activities into a single category. The next step of this process consists of the data pre-processing phase, where entries showing some discrepancy were discarded, and DEEDS logs with no corresponding lab or final exam grades were also excluded. This is further explained in Sect. 3.3. The next step is a features extraction phase. This phase resulted in extracting three broad categories of features: (1) activity-type count-based, (2) timing statistics-based, and (3) peripheral activity count-based, resulting in a total of 86 features.

Table 2 Summary of student performance prediction using machine learning-based models

References | Dataset source/description | Feature categories | Models used | Top performance
[2] | Open University Learning Analytics (OULAD) dataset, 3166 students | Student engagement, demographics, historic performance | SVM, Decision Tree (DT), ANN, Naïve Bayes (NB), K-NN, Logistic Regression (LR) | ANN (96% F1-score)
[5] | Ecuador University, 335 students | Historic performance | Decision Tree | 96.5% accuracy
[7] | Engineering Dynamics course, 323 students | Historic performance | Multiple Linear Regression (MLR), MLP network, RBF network, SVM | MLR (89.7% accuracy)
[15] | OU University, United Kingdom, 384 students | Demographics, assessment scores, student-system engagement | DT, JRIP, J48, Gradient-Boosted Tree (GBT), CART, Naïve Bayes | J48 (88.5% accuracy)
[16] | University of Minnesota's Moodle, 11,556 students | Student engagement, historic performance | Multi-regression | 0.145 RMSE
[4] | DEEDS, 115 students | Student engagement, historic performance | SVM, ANN, Naïve Bayes, Logistic Regression, Decision Trees | SVM (80% accuracy)
[30] | DEEDS, 115 students | Student engagement, historic performance | Linear Regression, ANN, SVM | SVM (95% accuracy)
[31] | DEEDS, 115 students | Student engagement, historic performance | ANN, Logistic Regression, Decision Tree, SVM, Naïve Bayes | SVM (94.8% accuracy)
[32] | Mathematics dataset, 696 records | Historic performance, demographic data, personality | SVM, Random Forest | SVM (92% accuracy)
[32] | Portuguese dataset, 649 records | Historic performance, demographic data, personality | SVM, Random Forest | RF (94% accuracy)
[33] | 722 students | Historic performance, engagement, personality, institutional | Neural Network, KNN, LR, CN2 Rule Inducer, Naïve Bayes, SVM | RF (88% accuracy)
[34] | 3518 students | Historic performance, student engagement | ANN | 80% accuracy
Proposed work | DEEDS, 115 students | Student engagement, historic performance | Random Forest (RF), SVM, Naïve Bayes (NB), Logistic Regression (LR), Multilayer Perceptron (MLP) | 97.4% accuracy


This phase is described in Sect. 3.4. The final phase consists of applying classification algorithms, including: random forest (RF), support vector machines (SVM), multilayer perceptron (MLP), Naïve Bayes (NB), and logistic regression (LR). Our proposed model attempts to predict weak versus good performing students based on their interaction with DEEDS, through the use of a binary classification model. The proposed model's performance was also compared to that proposed in [4], where the authors considered a set of 30 extracted features on the same DEEDS dataset. Figure 1 shows a workflow diagram of the entire process that we followed to build our prediction model. The process involves the typical five machine learning steps: (1) DEEDS raw dataset collection, (2) DEEDS data log pre-processing, (3) feature extraction, (4) feature selection, and (5) classification and analysis.

Fig. 1 Workflow diagram of the proposed model for student performance prediction

3.1 Dataset Description

In this research work, the DEEDS dataset was used. DEEDS is considered among the very few existing datasets that capture real-time, in-class student interactions and behavior while using a proprietary enhanced technology learning (ETL) platform called DEEDS (Digital Electronics Education and Design Suite). This dataset was considered in this study due to its richness in terms of the variety of student interaction information logged by the system, such as: the average time spent by students on various types of problems as well as learning environments, the types of activities, the total number of activities, the average idle time, the average number of keystrokes and total related activities for various exercises during individual sessions, the grade performance achieved by students in each session, etc. Also, we claim that there is enough room for opportunity to develop a classification model which achieves better performance compared to the existing ones, which is the main objective of this research work. General information about DEEDS is presented in Table 3.

DEEDS logs the real-time interaction of students during classwork as well as exam performance in terms of grades. The DEEDS dataset was collected over six working lab sessions. Students initially learn a specific topic during each of these sessions, then take a set of exercises. The number of exercises in each of the 6 sessions ranges from 4 to 6. For instance, in sessions 1, 3, and 5, four exercises were planned; in sessions 2 and 6, six exercises were planned; and in session 4, five exercises were planned. Student performance (in terms of attained grades) is subsequently recorded, with the maximum session grade varying between 4 and 6. During a study session phase, DEEDS records all student activities in each session, such as using a text editor, using a simulation timing diagram tool, reviewing study material, using Aulaweb, etc. After attending all lab sessions, students take an exam where questions are chosen to cover each of the six topics studied in the six lab sessions, and student performance per topic is also noted. DEEDS creates a separate file for each student attending a session and adds a new entry every one-second interval. Each new entry corresponds to a new row in these log files describing student activities during each session. Each row is a collection of 13 comma-separated features arranged in the order indicated in Table 4.


Table 3 DEEDS dataset information

Owner: University of Genoa, Italy
Year: 2015
Topic: Digital Electronics—Computer Engineering
Performance metric: Session grades, final exam grades
Dataset location: https://sites.google.com/site/learninganalyticsforall/data-sets/epm-dataset

Table 4 Dataset features description

Order | Feature | Description
1 | Session | Session identifier (1–6)
2 | Student | Student identifier (1–115)
3 | Exercise | Exercise identifier (1–6)
4 | Activity | Type of activity
5 | Time begin | Activity start time
6 | Time end | Activity end time
7 | Idle time | Idle time duration
8 | Mouse wheel | Amount of mouse wheel movement during activity
9 | Mouse wheel click | Number of mouse wheel clicks during activity
10 | Mouse click left | Number of left mouse clicks during activity
11 | Mouse click right | Number of right mouse clicks during activity
12 | Mouse movement | Distance covered by the mouse
13 | Keystrokes | Keystrokes hit during activity

Table 5 Statistical analysis of the DEEDS dataset

DEEDS details | Statistics
Number of features | 13
Number of lab sessions | 6
Dataset size | 230,318 entries
Number of activities | 15
Number of students | 115
Average number of students per session | 86
Number of students not taking any test | 7
Freq. (%) of non-zero entries for "mouse wheel" | 21,460 (9.3%)
Freq. of non-zero entries for "mouse wheel click" | 376 (0.1%)
Freq. of non-zero entries for "mouse click right" | 12,997 (5%)
Freq. of non-zero entries for "mouse click left" | 206,457 (90%)
Freq. of non-zero entries for "mouse movement" | 215,576 (93%)
Freq. of non-zero entries for "keystrokes" | 33,706 (14%)
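Given the row layout in Table 4, a session log can be loaded into a tabular structure for the analyses that follow. The snippet below is a minimal sketch, not the authors' original tooling: the column labels, function name, and file path are our own assumptions, based on one comma-separated log file per student and session as described above.

```python
import pandas as pd

# Column layout following Table 4 (names are our own labels).
DEEDS_COLUMNS = [
    "session", "student", "exercise", "activity",
    "time_begin", "time_end", "idle_time",
    "mouse_wheel", "mouse_wheel_click",
    "mouse_click_left", "mouse_click_right",
    "mouse_movement", "keystrokes",
]

def load_deeds_log(path: str) -> pd.DataFrame:
    """Load one per-student, per-session DEEDS log file."""
    df = pd.read_csv(path, header=None, names=DEEDS_COLUMNS,
                     skipinitialspace=True)
    # Timestamps use the "day.month.year hours:minutes:seconds"
    # layout visible in Fig. 2.
    for col in ("time_begin", "time_end"):
        df[col] = pd.to_datetime(df[col], format="%d.%m.%Y %H:%M:%S")
    return df

# Example (hypothetical file name): log of student 21 in session 4.
# log = load_deeds_log("Session_4/logs_student_21.csv")
```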
Figure 2 shows a snapshot of the comma-separated log file corresponding to student ID 21, collected during the 4th session.

Fig. 2 Raw data snapshot:
4, 21, Es_4_1, Study_Es_4_1, 13.11.2014 11:8:17, 13.11.2014 11:12:51, 35205806, 5, 0, 2, 0, 94, 0
4, 21, Es_4_5, TextEditor_Es_4_5, 13.11.2014 11:50:49, 13.11.2014 11:50:57, 5986, 0, 0, 8, 0, 257, 6
4, 21, Es_4_5, Diagram, 13.11.2014 11:51:15, 13.11.2014 11:54:8, 89410, 0, 0, 149, 0, 6088, 0
4, 21, Es_4_5, Aulaweb, 13.11.2014 12:44:51, 13.11.2014 12:47:27, 10947165, 103, 0, 2, 0, 138, 0

3.2 Statistical Analysis of the DEEDS Dataset

As a result of the preliminary data pre-processing and metadata analysis, we are able to better describe the dataset, as shown in Table 5. Though 115 students were expected to participate in this experiment and eventually take exams, 7 students ended up not taking any test, and an average of 86 students were registered per lab session. Our dataset included more than 230,000 entries, which were roughly uniformly distributed across all sessions, as shown in Fig. 3: sessions 1 through 5 each included 13–17% of the entire dataset, while the last session included 23% of all data.

Statistical analysis has also shown that some dataset features logged fewer activities, and most of their registered values were "0". For the "mouse wheel" feature, 90.7% of the logged values were zeros; for the "mouse wheel click" feature, 99.9% of the logged values were zeros; for the "mouse click right" feature, 95% of the logged values were zeros; and 86% of the logged "keystrokes" were zeros. On the other hand, the "mouse click left" and "mouse movement" features were well distributed across ranges of 0 to 1096 and 0 to 85,945, respectively.

Our next analysis focused on the representation of each of the activities in the entire dataset. Some activities did not have sufficient representation throughout the entire dataset and were represented with less than 1% occurrence, such as "Text Editor no exercise" (0.02%), "Deeds no exercise" (0.2%), "Deeds other activity" (0.4%), and "FSM related" (0.1%). This is in comparison with the rest of the activities, which had representation rates between 7.5 and 16.3% of the entire dataset.
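The per-feature sparsity figures reported in Table 5 can be reproduced with a simple aggregation over the concatenated logs. This is a minimal sketch, assuming all session logs have been loaded into one DataFrame `logs` with the columns named earlier; it is not the authors' original analysis script.

```python
import pandas as pd

def zero_value_report(logs: pd.DataFrame) -> pd.DataFrame:
    """Share of non-zero entries and value range per peripheral counter."""
    peripheral_cols = [
        "mouse_wheel", "mouse_wheel_click", "mouse_click_left",
        "mouse_click_right", "mouse_movement", "keystrokes",
    ]
    rows = []
    for col in peripheral_cols:
        nonzero = int((logs[col] != 0).sum())
        rows.append({
            "feature": col,
            "nonzero_entries": nonzero,
            "nonzero_pct": round(100 * nonzero / len(logs), 1),
            "value_range": (int(logs[col].min()), int(logs[col].max())),
        })
    return pd.DataFrame(rows)

# Activity representation: share of each activity label in the logs.
# activity_share = logs["activity"].value_counts(normalize=True) * 100
```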


Proportional activity frequency results are depicted in Fig. 4. To avoid dropping any of the least frequent activity data entries, we propose aggregating semantically equivalent activities under new activity categories. These are represented in Fig. 5, along with their representation frequency. In this figure, the activity "Editing" includes the similar editing activities "Text Editor exercise", "Text Editor no exercise", and "Text Editor other activity", with an 18% frequency. The same applies to the new category "Deeds activity", which includes "Deeds exercise", "Deeds no exercise", and "Deeds other activity", with a 17% overall frequency. Likewise, the new category "Study" includes "Study exercise" and "Study material", with a 10% frequency, and the new category "FSM" includes the activities "FSM exercise" and "FSM related", with a 9% frequency.

Fig. 3 DEEDS log distribution across sessions (pie chart of per-session log shares)

Fig. 4 Frequency of each activity in the DEEDS dataset (pre-aggregation; 15 activities, ranging from 16.33% down to 0.02%)

3.3 Data Pre-processing

Pre-processing is a typical first step in the data classification and pattern recognition process. During this phase, data transformation is performed, irrelevant information is discarded from the original raw dataset, and only relevant data is kept for further processing. In the current work, the following steps have been executed on the original raw data.

(a) Data discrepancy removal. We noticed that the original dataset had some inconsistencies between the pairs (Session, Exercise) and (Exercise, Activity), as indicated in Fig. 6. The DEEDS platform seems to have wrongly logged the session ID with its corresponding exercise ID (Fig. 6a) and the exercise ID with its corresponding activity ID (Fig. 6b). The frequency of such inconsistencies was minimal, accounting for less than 0.5% of the entire dataset, and was proportionally equivalent across all sessions and exercises; therefore, these entries were discarded from the dataset. During this phase, we also excluded the entries where students had logged on but did not start working on the exercises; these are the logs with the exercise ID field matching "ES".

(b) Session 1 data removal. During this step, the session 1 data was filtered out. This was necessary since session 1 did not have any corresponding intermediate or final exam grades related to the topics covered in this session. It is worth mentioning that student grades (intermediate or final exam) will be used to label our data for classification purposes.

(c) Time format conversion. The DEEDS platform logs the time using the format "hours:minutes:seconds". These times were converted to seconds to facilitate operations on them.

(d) Exercises 5 and 6 exclusion. This phase restricts our analysis to the first 4 exercises of each of the 5 sessions (2 through 6); the rest have been discarded. This step is necessary since not all sessions have the same number of exercises.

3.4 Features Extraction

This step is deemed pivotal during classification problems. In this work, we propose using an augmented set of numeric features compared to those used by Hussain et al. (2019), where the authors considered 30 features (5 features per exercise) for each student–session combination. We extracted a set of 86 features for each student per session and categorized these features into 3 broad categories: (1) activity-type count-based features, (2) timing statistics-based features, and (3) peripheral activity count-based features, which are explained below.

3.4.1 Activity-Type Count-Based Features

As stated in our dataset description, DEEDS defines 15 types of activities. Based on our statistical analysis of the log count distribution of each of these activities (Fig. 3), along with their significance, we propose reducing these lists by combining semantically similar, non-proportional logs into a single group.
Fig. 5 Frequency of each activity in the DEEDS dataset (post-aggregation; 9 aggregated categories: Editing, Aulaweb, Deeds activity, FSM, Study, Blank, Other, Diagram, Properties)

Fig. 6 Example of DEEDS log discrepancy:

(a) Session ID and Exercise ID
Session Student Exercise Activity Start_T End_T Idle_T mouse_wheel mouse_wheel_click mouse_click_left mouse_click_right mouse_movement keystroke
4 91 Es_5_2 TextEditor_Es_5_2 13.11.2014 11:37:41 13.11.2014 11:37:53 27673 6 0 2 0 252 0
4 91 Es_5_2 Other 13.11.2014 11:37:54 13.11.2014 11:37:59 343 0 0 4 0 171 0
6 82 Es_5_1 Other 11.12.2014 11:9:24 11.12.2014 11:9:28 96 0 0 8 0 509 0

(b) Exercise ID and Activity ID
Session Student Exercise Activity Start_T End_T Idle_T mouse_wheel mouse_wheel_click mouse_click_left mouse_click_right mouse_movement keystroke
2 87 Es_2_2 TextEditor_Es_3_2 16.10.2014 12:31:39 16.10.2014 12:31:41 814 0 0 3 0 127 0
2 99 Es_2_2 TextEditor_Es_4_2 16.10.2014 12:7:23 16.10.2014 12:7:25 124 0 0 5 0 59 0
2 99 Es_2_2 TextEditor_Es_4_1 16.10.2014 12:8:16 16.10.2014 12:13:56 41624944 0 0 17 0 479 513
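The pre-processing steps (a)–(d) from Sect. 3.3 can be expressed as a short filtering pipeline. The sketch below is an assumed implementation: the consistency checks rely on the naming pattern visible in Fig. 6 (exercise IDs like Es_4_1 embedding the session number, and exercise-bound activities embedding the exercise ID), which is our reading of the log format rather than the authors' published code.

```python
import pandas as pd

def preprocess(logs: pd.DataFrame) -> pd.DataFrame:
    df = logs.copy()

    # (a) Data discrepancy removal: the session number embedded in the
    # exercise ID (e.g. "Es_4_1" -> 4) must match the session column, and
    # exercise-bound activities must reference the same exercise.
    parts = df["exercise"].str.extract(r"Es_(\d+)_(\d+)")
    session_ok = parts[0].astype(float) == df["session"]
    act_parts = df["activity"].str.extract(r"_Es_(\d+)_(\d+)$")
    exercise_ok = act_parts[0].isna() | (
        (act_parts[0] == parts[0]) & (act_parts[1] == parts[1])
    )
    # Also drop logged-on-but-idle entries whose exercise field is "ES".
    df = df[session_ok & exercise_ok & (df["exercise"] != "ES")]

    # (b) Session 1 removal: no matching intermediate/final grades.
    df = df[df["session"] != 1]

    # (c) Time format conversion: timestamps to seconds.
    df["start_s"] = df["time_begin"].astype("int64") // 10**9
    df["end_s"] = df["time_end"].astype("int64") // 10**9

    # (d) Keep only the first four exercises of each session.
    df = df[parts.loc[df.index, 1].astype(float) <= 4]
    return df
```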

In other words, the activities {Deeds_Es_#, Deeds_Es, Deeds} are reduced into {Deeds Activity}, the activities {TextEditor_Es_#, TextEditor_Es, TextEditor} are reduced into {Editing}, and the activities {FSM_Es_#, FSM_Related} are reduced into {FSM}. This has resulted in the following Activity_type set of 9 categories: {Editing, Aulaweb, Deeds Activity, FSM, Study, Blank, Other, Diagram, Properties}. Each of these 9 activities will be referenced by its order of appearance in the Activity_type set. For instance, activity 3 represents "Study" and activity 9 represents "Properties".

Before we can mathematically express our activity-based features, we introduce the following activity matrix, which captures statistics about student activity occurrence counts in all 4 exercises per session:

$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,9} \\ a_{2,1} & a_{2,2} & a_{2,3} & \cdots & a_{2,9} \\ a_{3,1} & a_{3,2} & a_{3,3} & \cdots & a_{3,9} \\ a_{4,1} & a_{4,2} & a_{4,3} & \cdots & a_{4,9} \end{bmatrix}$$

In matrix A, which is a 4-by-9 matrix, element $a_{i,j}$ represents the occurrence count of activity j in exercise i. For instance, $a_{1,3}$ represents the occurrence count of activity 3 ("Study") in exercise 1.

Following the definition of matrix A, we define the first 36 features as depicted in Eq. 1:

$$F_{j \in \{1 \to 36\}} = a_{\lceil j/9 \rceil,\; j - 9\left(\lceil j/9 \rceil - 1\right)} \qquad (1)$$

The 36 features track the occurrence count of the 9 activities in each of the 4 exercises (indexed by e). $F_1$, for instance, equals $a_{1,1}$ and represents the occurrence count of the "Editing" activity in exercise 1.

For the next 9 features ($F_{37}$ through $F_{45}$), we track the occurrence count of each of the 9 activities across all exercises. These are captured in Equation set 2. $F_{37}$, for instance, represents the occurrence count of the "Editing" activity over all 4 exercises.

$$F_{37} = \sum_{e=1}^{4} a_{e,1};\quad F_{38} = \sum_{e=1}^{4} a_{e,2};\quad \ldots;\quad F_{45} = \sum_{e=1}^{4} a_{e,9} \qquad (2)$$

The next 4 features track the occurrence count of all 9 activities together in each of the 4 exercises. These are captured in Equation set 3. $F_{46}$, for instance, represents the aggregated occurrence count of all 9 activities in exercise 1.

$$F_{46} = \sum_{i=1}^{9} a_{1,i};\quad F_{47} = \sum_{i=1}^{9} a_{2,i};\quad F_{48} = \sum_{i=1}^{9} a_{3,i};\quad F_{49} = \sum_{i=1}^{9} a_{4,i} \qquad (3)$$

The final feature in this activity count-based category tracks the aggregated sum of occurrences of all activities across all exercises, as captured in Eq. 4.

$$F_{50} = \sum_{i=1}^{9} \sum_{e=1}^{4} a_{e,i} \qquad (4)$$

3.4.2 Timing Statistics-Based Features

This category of features captures the timing performance of a student while working on exercises. We used 2 sets of timing features of 4 features each. The first set of 4 features {$F_{51}$, $F_{52}$, $F_{53}$, $F_{54}$} describes the time spent in each exercise. This is calculated by taking the difference between the maximum end time and the minimum start time for each of the 4 exercises, as indicated in Eq. 5. $F_{51}$, for instance, corresponds to the period of time spent by a student in exercise 1.

$$F_{j \in \{51 \to 54\}} = T_{\mathrm{max},e} - T_{\mathrm{min},e}, \quad e = 1 \to 4 \qquad (5)$$

The next set of 4 features ($F_{55}$, $F_{56}$, $F_{57}$, $F_{58}$) describes the total idle time registered in each of the 4 exercises. $F_{55}$, for instance, describes the total idle time registered in exercise 1. This is captured in Eq. 6, where the sum runs over the n log entries of exercise e.

$$F_{j \in \{55 \to 58\}} = \sum_{k=1}^{n} Idle\_Time_{e,k}, \quad e = 1 \to 4 \qquad (6)$$

3.4.3 Peripheral Activity Count-Based Features

The final set of features deals with tracking the count of student interactions with the computer input peripherals (mouse and keyboard). DEEDS logs 5 different types of mouse activities, {MouseWheelCount, MouseWheelClickCount, MouseClickLeftCount, MouseClickRightCount, MouseMovementCount}, along with one single keyboard activity, KeystrokeCount. We used 24 new features to track the total occurrence count of each of the aforementioned 6 peripheral-related activities in each of the 4 exercises. Each of these 6 peripheral activities is mapped to its order of appearance in the Peripheral_Activity_type set. For instance, peripheral activity 3 represents "Mouse Click Left Count" and activity 6 represents "Keystroke Count".

Similar to the case of activity-based features, we introduce the following peripheral activity matrix, which captures statistics about student interactions using the computer peripherals in all 4 exercises per session:

$$P = \begin{bmatrix} p_{1,1} & p_{1,2} & p_{1,3} & \cdots & p_{1,6} \\ p_{2,1} & p_{2,2} & p_{2,3} & \cdots & p_{2,6} \\ p_{3,1} & p_{3,2} & p_{3,3} & \cdots & p_{3,6} \\ p_{4,1} & p_{4,2} & p_{4,3} & \cdots & p_{4,6} \end{bmatrix}$$

In matrix P, which is a 4-by-6 matrix, element $p_{i,j}$ represents the occurrence count of peripheral activity j in exercise i. For instance, $p_{1,3}$ represents the occurrence count of peripheral activity 3 ("Mouse Click Left Count") in exercise 1.

Each element of the peripheral activity matrix maps to a different feature, which will be considered during the classification phase. These constitute a set of 24 new features ($F_{59}$ through $F_{82}$). $F_{59}$, for instance, describes the total occurrence count of mouse wheel events in exercise 1.

The next set of features within this category defines 4 more features to describe the level of utilization of the computer peripherals in each of the 4 exercises. These are described in Equation set 7. $F_{83}$, for instance, specifies the total occurrence count of the 5 mouse activities and the keystroke activity in exercise 1.

$$F_{83} = \sum_{k=1}^{6} p_{1,k};\quad F_{84} = \sum_{k=1}^{6} p_{2,k};\quad F_{85} = \sum_{k=1}^{6} p_{3,k};\quad F_{86} = \sum_{k=1}^{6} p_{4,k} \qquad (7)$$
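The timing and peripheral features (Eqs. 5–7) follow the same per-exercise aggregation pattern. A minimal sketch, under the same DataFrame assumptions as the previous snippet (columns `ex`, `start_s`, `end_s`, `idle_time`, and the six peripheral counters; names are ours):

```python
import numpy as np
import pandas as pd

PERIPHERALS = ["mouse_wheel", "mouse_wheel_click", "mouse_click_left",
               "mouse_click_right", "mouse_movement", "keystrokes"]

def timing_and_peripheral_features(group: pd.DataFrame) -> np.ndarray:
    """F51..F86 for one (student, session) group (Eqs. 5-7)."""
    f_time, f_idle, P = np.zeros(4), np.zeros(4), np.zeros((4, 6))
    for e in range(1, 5):
        ex_rows = group[group["ex"] == e]
        if len(ex_rows):
            # Eq. 5: time spent = max end time - min start time.
            f_time[e - 1] = ex_rows["end_s"].max() - ex_rows["start_s"].min()
            # Eq. 6: total idle time registered in the exercise.
            f_idle[e - 1] = ex_rows["idle_time"].sum()
            # F59..F82: per-exercise totals of the six peripheral counters.
            P[e - 1] = ex_rows[PERIPHERALS].sum().to_numpy()
    # Eq. 7 (F83..F86): overall peripheral utilization per exercise.
    f_util = P.sum(axis=1)
    return np.concatenate([f_time, f_idle, P.flatten(), f_util])
```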
3.5 Feature Selection

Our feature selection phase consists of ranking all features by applying entropy-based techniques grounded in Shannon's information theory [35, 36]. This approach assigns ranks based on the level of uncertainty, or disorder, within the data. Under this scheme, features with low entropy levels indicate a high level of uncertainty and will therefore be assigned low rank values. The entropy-based technique is powerful in determining the level of correlation between variables, as it captures both linear and non-linear associations between variables.
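As an illustration of entropy-based ranking, the mutual information between each feature and the class label (an entropy-derived quantity that captures non-linear association) can be used to order the 86 features. This is a stand-in sketch using scikit-learn's estimator, not the authors' exact ranking procedure; the cutoff `keep` is an illustrative parameter.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X: np.ndarray, y: np.ndarray, keep: int = 30):
    """Rank features by mutual information with the label and return
    the indices of the `keep` most informative ones."""
    scores = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(scores)[::-1]  # highest-scoring features first
    return order[:keep], scores

# Usage: X_reduced = X[:, rank_features(X, y)[0]]
```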


We propose assessing the performance of our prediction model with the full set of extracted features and with a reduced set obtained after eliminating the non-influential (low-ranked) features.
during the class classification process.
3.6 Classification Algorithms
3.6.2 Multilayer Perceptron (MLP)
Among the many existing classification models, we have
evaluated our model using five different classifiers. There MLP is a supervised learning-based approach [21]. It is based
are many factors influencing the suitability of the machine on the concept of perceptron in Neural Networks, which
learning-based regression models, which may differ from one is capable of generating a single output based on a multi-
problem to another. These factors include the type of the data, dimensional data input through their linear (non-linear in
the dataset size, the number of features, data distribution, etc. some occasions) combination along with their correspond-
Also, models may work best for problems but not others. For ing weight as follows:
instance, SVM is known to perform well in the case of rela-
tively small dataset size, which is the case of DEEDS. In this 
n
yα wi xi + β (10)
work, models were chosen based on two criteria: (1) current i1
existing work dealing with the same problems and using the
same dataset DEEDS. For instance SVM, LR, NB. (2) cover where wi , xi , β, and α. are the weights, input variable, bias,
the broad spectrum of the different categories of classifiers. non-linear activation function, respectively. The MLP is com-
For instance, RF was considered as a representative of the posed of three or more node layers, including the input/output
ensemble type of classifiers, NB as a probabilistic type of layer and one or many hidden layers. The training phase in
models, MLP as a Neural Network-based model. The model the case of MLP consists of adjusting the model parameters
selection process also included a step where models which (biases and weights) through a back and forth mechanism
have experienced high error rates were eliminated. (Feed-forward pass followed by Back-forward pass) with
Next, a brief description of each of the Classification mod- respect to the prediction error.
els being used in this work is presented.
3.6.3 Support Vector Machine (SVM)
3.6.1 Random Forest (RF)
SVM is a supervised ML model to solve classification and
RF is an ensemble of Decision Tree bundled together [19]. regression problems; it has demonstrated efficiency in solv-
The training of these bundles of trees consists of executing ing a variety of linear and non-linear problems. The idea of
the bagging process on a dataset of N entities. This process SVM lies in creating a hyperplane that distinctly categorizes
consists of sampling a set of N training samples with replace- the data into classes [22]. SVM works well for multi-domain
ment. Then using these samples to train a decision tree. This applications with a large dataset; however, the model has a
process needs to be repeated T times. The prediction of the high computational cost.
unseen entity is eventually made through a majority vote of
each of the T trees in the case of classification trees or the 3.6.4 Naïve Bayes (NB)
average value in the case of regression as given by Eq. 8.
NB is a probabilistic algorithm based on Bayes’ theorem. It
1 
T
is naïve in the sense that each feature makes an equal and
y fi x  (8)
T independent contribution in determining the probability of
i1
the target class. NP has the advantage of noise immunity
where y is the predicted value, x’ is the unseen sample, fi is [23]. It is proven to perform well in the case of large high-
the trained decision tree on data sample i, and T is a constant dimensional datasets. It is fast (computational complexity-
number of iterations to repeat the process. wise) and relatively easy to implement.

y  mode f 1 x  , f 2 x  , . . . , f i x  , . . . f T x  (9) 3.6.5 Logistic Regression (LR)

where f i (x ) is the prediction class of unseen entity x’ using LR is an ML algorithm based on probability concepts used
3.6.3 Support Vector Machine (SVM)

SVM is a supervised ML model for solving classification and regression problems; it has demonstrated efficiency in solving a variety of linear and non-linear problems. The idea of SVM lies in creating a hyperplane that distinctly separates the data into classes [22]. SVM works well for multi-domain applications with large datasets; however, the model has a high computational cost.

3.6.4 Naïve Bayes (NB)

NB is a probabilistic algorithm based on Bayes' theorem. It is naïve in the sense that each feature is assumed to make an equal and independent contribution to the probability of the target class. NB has the advantage of noise immunity [23]. It is proven to perform well on large, high-dimensional datasets; it is also computationally fast and relatively easy to implement.

3.6.5 Logistic Regression (LR)

LR is an ML algorithm based on probability concepts and used for classification, i.e., finding the success and failure events. LR can be considered a linear regression model with a more complex cost function, defined by the sigmoid function (compared to the linear function used in linear regression) [24]. LR has the advantage of being computationally efficient and relatively simple to implement, with good performance on various types of problems. However, it has the main disadvantage of assuming linearity between the independent and dependent variables [25].
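For illustration, the five models could be instantiated as follows with scikit-learn; this is a sketch only, and the hyperparameter values shown are common defaults rather than the configuration reported in Sect. 4.1.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

models = {
    "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0),
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
    "LR":  LogisticRegression(max_iter=1000),
    "NB":  GaussianNB(),
}
```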


4 Results and Analysis

4.1 Experiment Setup

Three sets of experiments, aiming at three different goals, were conducted during this study: (1) Goal_1: evaluate the performance of the proposed model while considering the full set of extracted features; (2) Goal_2: study the importance of each of the extracted features in the classification process using the entropy-based ranking approach; and (3) Goal_3: compare the performance of our model to that of the models proposed in [4, 30, 31].
proposed in [4, 30, 31]. Table 7 Model average performance (experiment1—random distribu-
The general experiment setup consists of using DEEDS tion case)
log data from five different sessions (sessions 2 through 6) Classifier Accuracy Precision Recall F1-Score ROC
along with their corresponding intermediate grades that were
MLP 0.957 0.959 0.957 0.957 0.98
attained by all 115 students. Only the first 4 exercises from
RF 0.974 0.974 0.974 0.974 0.988
each session were considered. This has resulted in a total
SVM 0.948 0.948 0.948 0.946 0.908
dataset size of 575 entries. We have proceeded in the same
direction as in [4] and [31] in terms of data labeling where LR 92.1 0.934 0.922 0.924 0.918
a student achieving a grade higher than 2 is labeled as class NB 0.826 0.887 0.826 0.837 0.982
“A” student—(student with “no difficulty”); otherwise, the
student is labeled as class “B” student—(student with “diffi-
culty”). By adopting this labeling strategy, 74% of our dataset metrics: (1) accuracy, (2) precision, (3) recall (aka. sensitiv-
included students with category A and 26% of students falling ity), and (4) F1-score. These are briefly described in Table 6,
under category B. where Tp , TN , FP , and FN represent the True Positive, True
For the training and evaluation phase, we have considered Negative, False Positive, and False Negative testing cases,
three sets of experiments. The first experiment consists of a respectively. Along with these four metrics, we also consid-
random distribution-based model where we randomly chose ered the Receiver Operator Characteristic (ROC) metric to
80% of the data (resulting in 460 records ~ 4 data sessions) analyze the proposed model’s ability to distinguish between
for training and 20% for testing (115 entries ~ 1 data session). classes by looking at the True Positive rate versus the False
The second experiment is a more generic approach and con- Positive rates under different settings.
sists of the classic fivefold cross-validation (resulting in 80%
of the data for training and 20% for testing). The third experi- 4.2 Model Performance Evaluation
ment consists of independently assessing the performance of
our model per session. In this setup, 4 session data were used The proposed models were evaluated based on the metrics
for training (equivalent to 80% of the data) and the remain- mentioned in Sect. 4.1. Initially, we have tested our model
ing one (equivalent to 20%) for testing. In the classification with the full set of extracted features (86), next we have stud-
phase, we have used five of the well-known classifiers then ied the level of influence of each of the 86 features in the
selected the most accurate one. We have considered MLP, RF, overall model accuracy performance through the application
SVM LR, and NB models to classify student performance. of entropy-based ranking approach.
Our classifiers’ configuration parameters tuning phase has
led to running all classifiers with a Batch size equal to 100, 4.2.1 Model Performance Analysis
Learning rate equal to 0.3, and Loss equal to 0.1.
We evaluated the effectiveness of the proposed classi- Table 7 shows a summary of the obtained results in terms of
fication model through the analysis of the following four averages of Accuracy, Precision, Recall, F1-score, and ROC


Fig. 7 Overall model classification performance (random distribution of training vs. testing sets): per-class Precision, Recall, and F1-score for classes A and B under the MLP, RF, SVM, LR, and NB classifiers

These results were collected following our first set of experiments (randomly choosing 80% of the data for training and the remaining 20% for testing), in which the five classifiers (MLP, RF, SVM, LR, and NB) were applied. The results show that the RF classifier, an ensemble of tree predictors, achieved the best performance, with 97.4% accuracy and high Recall and Precision values resulting in a high F1-score (97.4%). MLP and SVM (which are known to make good predictions on binary classification problems) also performed well, with accuracies of 95.7% and 94.8% and F1-scores of 95.7% and 94.6%, respectively. On the other hand, NB, as expected, did not perform well due to the nature of our dataset, wherein the features (described in Sect. 3.4) are not completely independent of each other (as a case in point, there is a dependency between the activity-type features and the corresponding timing statistics, in terms of duration). NB achieved a relatively low Accuracy and F1-score (82.6% and 83.7%, respectively).
In line with Table 7, Fig. 7 shows a breakdown of the average performance of all five classifiers in terms of their effectiveness in predicting the correct classes (A: students with no difficulty; B: students with difficulty). The results demonstrate a pattern of better prediction rates for class A entries than for class B entries across all classifiers. For example, RF achieved a 98.8% F1-score for class A entries compared to 95% for class B entries. By contrast, NB consistently showed low performance, especially for class A entries, where a 73.7% F1-score was attained.
4.2.2 Features Relevance Analysis

In this set of experiments, we studied the relevance and level of influence of the extracted features through the application of the entropy-based ranking approach. The feature ranking results are captured in Table 8. They show that about 20% of the total features received a low ranking (less than 0.21), indicating that these features may not influence the classification model's performance. In fact, re-running our prediction model after excluding these 18 features from the original dataset resulted in a very similar accuracy performance for all classifiers. For instance, for our best-performing classifier (the RF algorithm), and in the case of the random distribution of training and test data (randomly choosing 80% of the data for training and the remaining 20% for testing), we achieved an accuracy of 96.7% compared to 97.4% when the full list of features was considered. This insignificant variation in accuracy could be attributed to the size of our dataset. It is worth highlighting that, although the achieved accuracy was nearly unchanged, running the model with a reduced number of features reduced the overall complexity of the model.
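One way to reproduce an entropy-based ranking of this kind is through the information gain of each feature with respect to the class label. The sketch below uses scikit-learn's mutual-information estimator as a stand-in for the exact ranking procedure of [35, 36], which is an assumption on our part, as is the synthetic data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.random.rand(575, 86)        # placeholder feature matrix
y = np.random.randint(0, 2, 575)   # placeholder class labels

scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]       # feature indices, most informative first
low_ranked = np.where(scores < 0.21)[0]  # removal candidates (threshold as in Table 8)
```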


Table 8 Ranking of features using the entropy-based ranking algorithm

Feature   Rank     Feature   Rank     Feature   Rank     Feature   Rank
Feat-9    0.5090   Feat-6    0.4146   Feat-25   0.3789   Feat-70   0.2263
Feat-31   0.5066   Feat-28   0.4137   Feat-62   0.3785   Feat-45   0.2238
Feat-49   0.4970   Feat-3    0.4093   Feat-32   0.3718   Feat-40   0.2039
Feat-85   0.4845   Feat-34   0.4060   Feat-74   0.3714   Feat-69   0.2032
Feat-77   0.4700   Feat-10   0.4058   Feat-35   0.3683   Feat-43   0.1864
Feat-24   0.4561   Feat-33   0.4041   Feat-2    0.3674   Feat-68   0.1819
Feat-81   0.4439   Feat-22   0.4030   Feat-23   0.3636   Feat-67   0.1792
Feat-83   0.4372   Feat-71   0.4013   Feat-19   0.3349   Feat-42   0.1758
Feat-59   0.4371   Feat-8    0.3980   Feat-53   0.3293   Feat-64   0.1661
Feat-29   0.4331   Feat-46   0.3979   Feat-20   0.3243   Feat-37   0.1124
Feat-50   0.4331   Feat-26   0.3979   Feat-52   0.3186   Feat-38   0.0865
Feat-1    0.4299   Feat-4    0.3975   Feat-54   0.3140   Feat-14   0.0301
Feat-84   0.4283   Feat-78   0.3950   Feat-16   0.2918   Feat-12   0.0188
Feat-79   0.4283   Feat-5    0.3947   Feat-18   0.2891   Feat-15   0.0168
Feat-61   0.4282   Feat-60   0.3937   Feat-17   0.2812   Feat-57   0.0139
Feat-47   0.4268   Feat-76   0.3929   Feat-65   0.2803   Feat-11   0
Feat-48   0.4252   Feat-21   0.3922   Feat-66   0.2579   Feat-58   0
Feat-73   0.4239   Feat-7    0.3868   Feat-39   0.2526   Feat-56   0
Feat-27   0.4238   Feat-82   0.3835   Feat-36   0.2510   Feat-55   0
Feat-80   0.4176   Feat-72   0.3826   Feat-63   0.2465   Feat-13   0
Feat-86   0.4171   Feat-75   0.3816   Feat-41   0.2300
Feat-51   0.4151   Feat-30   0.3794   Feat-44   0.2275
Our next results are depicted in Table 9, which shows the corresponding confusion matrices following the execution of the MLP, RF, SVM, LR, and NB classifiers. The results are in line with those presented in Fig. 7. For class A entries, and when considering the SVM classifier, Table 9 shows that 85 out of 86 entries were classified properly, slightly better than RF, for which 84 out of 86 were classified correctly. However, RF performed better than SVM when classifying class B entries (28 out of 29 versus 24 out of 29). It is noteworthy that Table 9 also reflects our unbalanced data, where class A entries outnumber class B entries (86 versus 29, respectively).

Table 9 Class prediction confusion matrices (experiment 1: random distribution case)

                      MLP           RF            SVM           LR            NB
Actual \ Predicted    A     B       A     B       A     B       A     B       A     B
A                     82    4       84    2       85    1       78    8       67    19
                      (95%) (5%)    (98%) (2%)    (99%) (1%)    (91%) (9%)    (78%) (22%)
B                     1     28      1     28      5     24      1     28      1     28
                      (3%)  (97%)   (3%)  (97%)   (17%) (83%)   (3%)  (97%)   (3%)  (97%)
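Confusion matrices such as those in Table 9 follow directly from the held-out predictions. The sketch below reconstructs RF's column of Table 9 from label vectors; the vectors themselves are synthetic stand-ins for the real test-set predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# 115 test records: 86 of class A followed by 29 of class B.
y_true = ["A"] * 86 + ["B"] * 29
# RF's outcome per Table 9: 84 of the A's kept, 2 flipped; 1 B flipped, 28 kept.
y_pred = ["A"] * 84 + ["B"] * 2 + ["A"] * 1 + ["B"] * 28

print(confusion_matrix(y_true, y_pred, labels=["A", "B"]))  # [[84  2] [ 1 28]]
print(classification_report(y_true, y_pred))
```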

In contrast to the results reported in Table 7, where 80% of the data was chosen randomly for training and the remaining 20% for testing, Table 10 shows the performance achieved by all five classifiers under fivefold cross-validation, a more conservative approach. The results reported in Table 10 are generally in line with those of Table 7 in terms of relative classifier performance: the RF classifier again performed best in terms of Accuracy and F1-score (93.37% and 95.4%, respectively), while NB continues to show the lowest performance. It is also noticeable that there was a slight decrease of about 4% in the overall performance reported in Table 10 compared to Table 7, which could be attributed to the random nature of the fivefold cross-validation technique.

Table 10 Model average performance (experiment 2: fivefold cross-validation)

Classifier   Accuracy (%)   Precision   Recall   F1-score   ROC
MLP          93.04          0.954       0.948    0.950      0.935
RF           93.37          0.950       0.958    0.954      0.966
SVM          89.90          0.896       0.970    0.932      0.847
LR           89.36          0.943       0.904    0.923      0.898
NB           82.22          0.946       0.796    0.863      0.920
Our next set of results studies the performance of our model in a setup where the data of four sessions are used for training and the data of the remaining, unseen session are used for testing. These results are captured in Table 11, which reports five different experiments. For instance, the set of results for "Session ID for testing" equal to 2 represents the performance of the model trained on sessions 1, 3, 4, and 5 and then tested on session 2. NB had the poorest performance (between 85 and 88% F1-score), followed by LR, then MLP, then SVM, then RF (between 94 and 97% F1-score). However, the difference in individual metric performance varied from one session to another; this was most visible for the Recall and Accuracy metrics, where the variations reached 20% in session 1 and 14% in sessions 1 and 5, respectively.

4.2.3 Comparative Analysis

The final set of results compares the performance of our proposed model with that of [4, 30, 31], where the DEEDS dataset was also used to predict students' academic performance under the same experimental setup. Figure 8 illustrates the performance comparison of all four sets of results (ours and three others from the recent literature, labeled Hussain, Sriram, and Maksud in Fig. 8) in terms of the Accuracy, Precision, Recall, and F1-score achieved by the best-performing classifiers (RF in the case of our model, ANN in the case of the model of Hussain et al. [4], and SVM in the case of the models by Sriram et al. [30] and Maksud et al. [31]).


Table 11 Per-session model classification performance (4 data sessions for training and 1 for testing)

Session ID for testing   Classifier   Accuracy   Precision   Recall   F1-score   ROC
1                        MLP          0.930      0.951       0.951    0.951      0.928
                         RF           0.887      0.897       0.951    0.923      0.964
                         SVM          0.887      0.897       0.951    0.923      0.839
                         LR           0.904      0.961       0.902    0.931      0.897
                         NB           0.800      0.940       0.768    0.846      0.880
2                        MLP          0.939      0.952       0.963    0.958      0.955
                         RF           0.957      0.943       1.000    0.970      0.988
                         SVM          0.922      0.901       1.000    0.948      0.864
                         LR           0.913      0.962       0.915    0.938      0.901
                         NB           0.852      0.922       0.866    0.893      0.926
3                        MLP          0.939      0.987       0.926    0.955      0.948
                         RF           0.922      0.962       0.926    0.943      0.961
                         SVM          0.904      0.907       0.963    0.934      0.864
                         LR           0.904      0.961       0.901    0.930      0.895
                         NB           0.817      0.984       0.753    0.853      0.949
4                        MLP          0.913      0.973       0.901    0.936      0.917
                         RF           0.957      0.975       0.963    0.969      0.964
                         SVM          0.887      0.878       0.975    0.924      0.826
                         LR           0.878      0.947       0.877    0.910      0.853
                         NB           0.843      0.957       0.815    0.880      0.929
5                        MLP          0.930      0.940       0.963    0.951      0.934
                         RF           0.930      0.940       0.963    0.951      0.945
                         SVM          0.895      0.897       0.963    0.929      0.845
                         LR           0.895      0.948       0.901    0.924      0.874
                         NB           0.807      0.928       0.790    0.853      0.906

Fig. 8 Proposed model performance compared with the results from the recent literature [4, 30, 31]: Accuracy, Precision, Recall, and F1-score for RF with the proposed features versus Hussain [4], Sriram [30], and Maksud [31]
The results show that our proposed model outperformed all three existing models in terms of accuracy, with an improvement ranging from 2 to 22% over that achieved in [4, 30, 31]. The F1-score was 12% higher than that achieved in [4] using the ANN classifier and 2% higher than that of the SVM classifiers used in [30, 31]. We believe that this improvement is attributable to the extended set of features introduced by our model, compared to the reduced and abstract list of five features per exercise proposed in [4]. While the authors of [4] did not differentiate between the types of activities within a single exercise, our model provisions for the different activity types, resulting in 9 distinct features along with the total activity occurrence count, i.e., a total of 10 activity-related features per exercise.

123
10242 Arabian Journal for Science and Engineering (2022) 47:10225–10243

Also, contrary to the model in [4], where the interaction of students with DEEDS was captured by only a single feature counting the number of keystrokes, our model took into account the different types of student interaction with DEEDS through the input peripherals (mouse and keyboard), leading to a total of 6 such features per exercise. The extra information provided to the prediction model explains the significant classification performance improvement captured in Fig. 8.

5 Conclusion

In this article, we demonstrated the ability to predict student performance by analyzing the interaction logs of students in the DEEDS dataset. We extracted a total of 86 statistical features, categorized into three main categories based on different criteria: (1) activity-type-based, (2) timing-statistics-based, and (3) peripheral-activity-count-based features. This set of features was further reduced during the feature selection phase, where we applied the entropy-based selection technique and retained only the influential features for training purposes. We trained our model under three different scenarios: (1) an 80:20 random data split for training and testing, (2) fivefold cross-validation, and (3) training the model on all sessions but one, which was then used for testing. We collected performance results in terms of Accuracy, Precision, Recall, F1-score, and ROC using five prominent classifiers (RF, SVM, MLP, LR, and NB). The results showed that the best performance was obtained using the RF classifier, with a classification accuracy of 97% and an F1-score of 97%, whereas the poorest results were achieved with NB, whose independence assumption does not hold for the proposed features. When comparing our model with the benchmark models proposed by Hussain et al. [4], Sriram et al. [30], and Maksud et al. [31], we were able to demonstrate that, under a similar experimental setup, our model outperformed the existing models in terms of classification accuracy and F1-score.

Future work We propose exploring various research directions, as follows:

1. Modify and compare the proposed model against models that consider more sophisticated machine learning algorithms for feature extraction and classification, such as decision trees, fuzzy entropy-based analysis, and transfer learning.
2. Extend the prediction model to a multi-label problem aimed at classifying students into four broad categories: (1) very weak, (2) weak, (3) average, and (4) good.
3. Propose a regression model that predicts exam grades alongside classifying students' performance, rather than relying on the binary classification approach alone.

Funding Not applicable.

Availability of Data and Materials The dataset used in this study is publicly published at https://sites.google.com/site/learninganalyticsforall/data-sets/epm-dataset.

Declarations

Conflict of interest The author has no conflicts of interest.

References

1. Vahdat, M.; Oneto, L.; Anguita, D.; Funk, M.; Rauterberg, M.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: Design for Teaching and Learning in a Networked World, pp. 352–366. Springer, Cham (2015)
2. Tomasevic, N.; Gvozdenovic, N.; Vranes, S.: An overview and comparison of supervised data mining techniques for student exam performance prediction. Comput. Educ. 143, 103676 (2020)
3. Hellas, A.; Ihantola, P.; Petersen, A.; Ajanovski, V.; Gutica, M.; Hynninen, T.; Liao, S.N.: Predicting academic performance: a systematic literature review. In: Proceedings Companion of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education, pp. 175–199 (2018)
4. Hussain, M.; Zhu, W.; Zhang, W.; Abidi, S.M.R.; Ali, S.: Using machine learning to predict student difficulties from learning session data. Artif. Intell. Rev. 52(1), 381–407 (2019)
5. Buenaño-Fernández, D.; Gil, D.; Luján-Mora, S.: Application of machine learning in predicting performance for computer engineering students: a case study. Sustainability 11(10), 2833 (2019)
6. Ofori, F.; Maina, E.; Gitonga, R.: Using machine learning algorithms to predict students' performance and improve learning outcome: a literature based review. J. Inf. Technol. 4(1), 33–55 (2020)
7. Huang, S.; Fang, N.: Predicting student academic performance in an engineering dynamics course: a comparison of four types of predictive mathematical models. Comput. Educ. 61, 133–145 (2013)
8. Rastrollo-Guerrero, J.L.; Gomez-Pulido, J.A.; Duran-Dominguez, A.: Analyzing and predicting students' performance by means of machine learning: a review. Appl. Sci. 10(3), 1042 (2020)
9. Sundar, P.P.: A comparative study for predicting students' academic performance using Bayesian network classifiers. IOSR J. Eng. (IOSRJEN) e-ISSN 2250-3021 (2013)
10. Burgos, C.; Campanario, M.L.; de la Peña, D.; Lara, J.A.; Lizcano, D.; Martínez, M.A.: Data mining for modeling students' performance: a tutoring action plan to prevent academic dropout. Comput. Electr. Eng. 66, 541–556 (2018)
11. Ma, X.; Zhou, Z.: Student pass rates prediction using optimized support vector machine and decision tree. In: 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), pp. 209–215. IEEE (2018)
12. Masci, C.; Johnes, G.; Agasisti, T.: Student and school performance across countries: a machine learning approach. Eur. J. Oper. Res. 269(3), 1072–1085 (2018)
13. Pardo, A.; Han, F.; Ellis, R.A.: Combining university student self-regulated learning indicators and engagement with online learning events to predict academic performance. IEEE Trans. Learn. Technol. 10(1), 82–92 (2016)
14. Gray, G.; McGuinness, C.; Owende, P.: An application of classification models to predict learner progression in tertiary education. In: 2014 IEEE International Advance Computing Conference (IACC), pp. 549–554. IEEE (2014)


15. Hussain, M.; Zhu, W.; Zhang, W.; Abidi, S.M.R.: Student engagement predictions in an e-learning system and their impact on student course assessment scores. Comput. Intell. Neurosci. (2018)
16. Elbadrawy, A.; Studham, R.S.; Karypis, G.: Collaborative multi-regression models for predicting students' performance in course activities. In: Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, pp. 103–107 (2015)
17. Liu, S.; d'Aquin, M.: Unsupervised learning for understanding student achievement in a distance learning setting. In: 2017 IEEE Global Engineering Education Conference (EDUCON), pp. 1373–1377. IEEE (2017)
18. Kuzilek, J.; Hlosta, M.; Herrmannova, D.; Zdrahal, Z.; Vaclavek, J.; Wolff, A.: OU Analyse: analysing at-risk students at The Open University. Learn. Analyt. Rev. 1–16 (2015)
19. Ho, T.K.: Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
20. Bauer, E.; Kohavi, R.: An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Mach. Learn. 36(1), 105–139 (1999)
21. Latif, G.; Iskandar, D.A.; Alghazo, J.M.; Mohammad, N.: Enhanced MR image classification using hybrid statistical and wavelets features. IEEE Access 7, 9634–9644 (2018)
22. Suthaharan, S.: Machine learning models and algorithms for big data classification. Integr. Ser. Inf. Syst. 36, 1–12 (2016)
23. Misra, S.; Li, H.; He, J.: Machine Learning for Subsurface Characterization. Gulf Professional Publishing, Oxford (2019)
24. Bewick, V.; Cheek, L.; Ball, J.: Statistics review 14: logistic regression. Crit. Care 9(1), 1–7 (2005)
25. Meurer, W.J.; Tolles, J.: Logistic regression diagnostics: understanding how well a model predicts outcomes. JAMA 317(10), 1068–1069 (2017)
26. Rehman, A.; Naz, S.; Razzak, M.I.; Hameed, I.A.: Automatic visual features for writer identification: a deep learning approach. IEEE Access 7, 17149–17157 (2019)
27. Trstenjak, B.; Ðonko, D.: Determining the impact of demographic features in predicting student success in Croatia. In: 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1222–1227. IEEE (2014)
28. Kursa, M.B.; Rudnicki, W.R.: Feature selection with the Boruta package. J. Stat. Softw. 36(11), 1–13 (2010)
29. Shaw, R.G.; Mitchell-Olds, T.: ANOVA for unbalanced data: an overview. Ecology 74(6), 1638–1645 (1993)
30. Sriram, K.; Chakravarthy, T.; Anastraj, K.: A comparative analysis of student performance prediction using machine learning techniques with DEEDS lab. J. Compos. Theory XII(VIII) (2019)
31. Maksud, M.; Nesar, A.: Machine learning approaches to digital learning performance analysis. Int. J. Comput. Digit. Syst. 10, 2–9 (2020)
32. Leena, H.A.; Ranim, S.A.; Mona, S.A.; Dana, K.A.; Irfan, U.K.; Nida, A.: Predicting student academic performance using Support Vector Machine and Random Forest. In: 3rd International Conference on Education Technology Management, pp. 100–107 (2020)
33. Hasan, R.; Sellappan, P.; Salman, M.; Ali, A.; Kamal, U.S.; Mian, U.S.: Predicting student performance in higher educational institutions using video learning analytics and data mining techniques. Appl. Sci. 10(11), 3894 (2020)
34. Aydoğdu, Ş.: Predicting student final performance using artificial neural networks in online learning environments. Educ. Inf. Technol. 25(3), 1913–1927 (2020)
35. Biesiada, J.; Włodzisław, D.; Adam, K.; Krystian, M.; Sebastian, P.: Feature ranking methods based on information entropy with Parzen windows. In: International Conference on Research in Electrotechnology and Applied Informatics, vol. 1, p. 1 (2005)
36. Horino, H.; Hirofumi, N.; Elisa, C.A.C.; Toru, H.: Development of an entropy-based feature selection method and analysis of online reviews on real estate. In: IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 2351–2355. IEEE (2017)
