Clustering Student Learning Behaviors
Journal of Advanced Technology and Multidiscipline (JATM)
Vol. 03, No. 01, 2024, pp. 13-20
e-ISSN: 2964-6162

Abstract—Online learning generates data about learners' learning behaviors, which are valuable for knowledge discovery. Existing research has focused mainly on understanding students' performance and achievement from learning log data. In this study, we present data-driven learning behavior clustering in an authentic learning context to understand students' behavior while they participate in the learning process. The objective of the study is to distinguish students according to their learning behavior characteristics and to identify clusters of students at risk of unsuccessful learning achievement. Learning log data were collected from ubiquitous learning applications before conducting Exploratory Data Analysis (EDA) and cluster analysis. We used partitional clustering with the K-means algorithm and hierarchical clustering based on the agglomerative method to improve the clustering strategy. The results revealed three different clusters of students, supported by data visualization techniques. Cluster 1 comprised more students with active learning behavior based on the total logs, total problems posed, and total attempts at fraction operation and simplification. Students in clusters 2 and 3 made more attempts at problem-solving than at problem-posing. Both clusters also focused on conceptual understanding of fractions. The knowledge discovery in this study used real data generated from a ubiquitous learning application named U-Fraction. We combined two different types of clustering methods to deliver a more accurate portrait of students' hidden learning behaviors. The outcome of this study can be a basis for educational stakeholders to provide preventive learning strategies tailored to different clusters of students.

Keywords—Learning analytics, behavior clustering, unsupervised learning, learning log-data, education research, educational policies.

I. INTRODUCTION
For more than ten years, starting in 2011, learning analytics (LA) with data-driven analysis has grown by exploiting machine learning in the educational field [1], [2]. Several research studies in educational data mining and artificial intelligence have attempted to distinguish the LA movement in an educational context [3], [4], [5], [6]. LA uses educational data for knowledge discovery and transforms data into meaningful insights. It leverages educational data to support the teaching and learning process. The main purpose of LA is to use models to improve learning and to evaluate the process through instrumental investigation. According to the first definition of LA, it is an approach to collecting, analyzing, and reporting educational data related to learning, learners, and their context [7]. Several techniques are used to analyze learning-related data, such as supervised and unsupervised learning methods. The main difference between the two approaches is the use of labeled data (Nafis and Biswas, 2022; Shakarami, Shahidinejad and Ghobaei-Arani, 2021). In unsupervised learning, learning patterns from data does not require labeled input and output data. It is generally used for clustering and segmentation tasks. The algorithm performs natural clustering over the dataset to identify similar patterns and characteristics. Learning about user behavior from log data typically involves partitioning the data into meaningful subsets, called partitions, and comparing the different partitions.
In an educational context, cluster analysis can be used to gain insight into structured data, for example by grouping student behavior, finding similar learning patterns, and clustering student performance [10], [11]. However, despite the potential of unsupervised learning or cluster analysis for LA, it is seldom used to support teaching and learning analysis in ubiquitous learning contexts based on students' learning log data [12]. Log data are automatically produced files and timestamps relevant to a system or software application [13]. Log data can provide a portrait of a student's hidden learning behavior and give a more complete or accurate picture of all behaviors. Yet log data generated by a learning application server are prone to noise, and mining and reducing noise in log data is considered a challenging task. Given this, the present study applies an unsupervised learning method to student behavior based on learning log data generated from ubiquitous learning applications. In this study, log data refers to all student activity while using the learning system named ubiquitous fraction (U-Fraction) [14], [15]. This learning application is installed on a tablet device with an Android operating system. By analyzing learning log data produced from the application,
educational stakeholders can identify learning problems at the earliest possible time. Additionally, it can enable them to resolve learning issues in a timelier fashion. Most importantly, the large amount of data from learning systems and applications can be analyzed using machine-learning techniques to support decision-making in the educational field. This paper is structured as follows. Section 1 presents an overview of LA, especially cluster analysis in an educational context. Section 2 presents a literature review of related studies. Section 3 describes the methodology of the research. Section 4 explains the results of the study, followed by a discussion. Finally, Section 5 provides the conclusion of the study.

II. LITERATURE REVIEW
A. Learning log data in educational context
Learning log data are an important source that provides a powerful portrait of students' learning patterns and their hidden behaviors during participation in the learning process. Log data are commonly collected from online learning platforms such as virtual learning environments, e-learning systems, or mobile learning applications. Accessing and analyzing learning log data is challenging because of privacy issues and the need for proper storage management. Effective management of learning log data takes time because the huge amount of information collected from online servers needs complex treatment, such as understanding application usage, preprocessing, data engineering, and data architecture provision. Some past studies focused on how to interpret learning log data to understand the student learning process in flipped classrooms [16], [17], [18]. Commonly, learning analytics researchers have used learning log data from Learning Management Systems (e.g., Moodle, Canvas) or Massive Open Online Courses (e.g., Coursera, Udemy). The learning analytics goals emphasized teaching and learning processes in asynchronous learning networks, for example by collecting data on the number of posts, the number of posts read, the number of posts replied to, and the content viewed.
A limitation of previous studies is that they focus on student performance and student satisfaction, which typically rely on self-reporting and may be inaccurate [19], [20]. Therefore, more studies based on log-file data are needed to add a further level of research validity to the understanding of students' behavior in relation to the authentic learning approach. It has been argued that log-file data may be more genuine and authentic than survey data, which are prone to bias from students' interpretations [21]. In contrast, learning analytics can reflect real and uninterrupted user behavior [17]. Therefore, rather than relying on student perceptions, this study examines learning log data from a ubiquitous application to see how students interact in an authentic learning context and how they access learning material. However, learning log data from mobile applications, particularly in authentic learning contexts, have not yet been fully exploited to unveil students' learning behaviors, even though the adoption and acceptance of mobile learning has led to a dramatic increase in available learner data. Furthermore, students' learning and social interaction with real-world situations is critical to study and analyze. Currently, few studies examine students' actions, such as their interaction behaviors, rather than their perceptions and performance. The present study takes a further step in this direction by proposing an approach to interpreting students' learning log data to understand how students learn in authentic situations over time. These observations suggest that log data could be an important source for identifying behavioral interaction in authentic learning contexts.

B. Cluster analysis used in educational purpose
To support the call for LA in education, several cluster analyses have been reported in the literature. While some research studies have collected log-file data from virtual learning environments, the data were frequently evaluated using more conventional statistical techniques such as regression, correlation, and t-tests rather than analytics algorithms [22]. In contrast, cluster analysis serves as an exploratory method that aims to identify naturally occurring homogeneous groups that were either unclear or previously unknown [23]. With the rapid increase in available learner data, cluster analysis has the potential to help understand and unveil hidden information about students in educational settings [24]. Studies by Yadav [25], [26] proposed a new approach known as hybrid clustering to assess students' academic performance. The clusters are formed based on the intelligence level of students. Walsh and Risquez [22] used cluster analysis to explore the engagement of native and non-native English-speaking management students in a flipped classroom. They used log file data to identify hidden patterns in student behavior, paying particular attention to proficiency in the institution's native language.
Research shows the exploratory potential of cluster analysis on log file data in other contexts such as peer tutoring [27], [28]. However, despite its potential, cluster analysis is still underutilized in education. Moreover, cluster analysis has rarely been applied to student learning behavior in ubiquitous learning contexts, and the results of those applications remain unclear. The present study is adapted from the work of Jovanovic et al. [29]. However, in the present study, we applied cluster analysis to log-file data to identify patterns in how students access online resources over time while engaging with a ubiquitous learning application. This paper attempts to address the lack of research using learning analytics in the ubiquitous learning context by using students' learning log data from a mobile application and cluster analysis algorithms based on hierarchical and partitional methods.

III. METHOD
In the present study, we employed EDA as an initial technique for understanding the dataset. EDA is used to discover unseen patterns and data anomalies and to summarize the data [30]. Two important EDA practices, descriptive statistics and data visualization, were used to gather insight from the data [31]. Before conducting EDA, we accessed the data from the online repository and organized it using Structured Query Language (SQL) operations such as data selection, data join, and data aggregation. We used
learning log data generated from a ubiquitous learning application named U-Fraction. The dataset relates to student learning activity while using the application, such as problem-solving activities and peer assessment. It was adapted from an experimental study conducted by Hwang et al. in 2018 [32]. The structure of the log data before preprocessing is represented by the database design (Figure 2). Ten variables were selected for cluster analysis after the data preprocessing and feature selection stages. The attributes of the dataset are presented in Table 1.

TABLE 1
THE ATTRIBUTE OF THE DATASET
No   Attribute name   Description
1    Operation        Total attempts of fraction operation
2    Success_oper     Total of successful fraction operation
3    Simplification   Total attempts at fraction simplification
4    Success_simp     Total of successful fraction simplification
5    Asking           Problem posing
6    Answer           Problem-solving
7    Comment          Peer assessment
8    Understanding    Fraction understanding
9    Log1             Total data logging 1
10   Log2             Total data logging 2

After the data preprocessing step with EDA, we followed a two-step cluster analysis using a K-means algorithm and an agglomerative method. The K-means algorithm is a partition-based clustering method, while the agglomerative method is a hierarchical clustering method [27], [28]. K-means is best suited for a small-to-medium number of clusters, as is the case for the student behavior clustering in this study [29]. The K-means clustering process starts by defining the number of clusters k [30], [31]. Each of the k clusters is represented by a cluster center, namely the centroid, and each data point is assigned to the nearest cluster center. The algorithm groups data with similar characteristics into the same cluster, while data with different characteristics are grouped into other clusters [32], [33]. Typically, the Euclidean distance is used as the distance measure. The Euclidean distance formula (equation 1), with its symbols described in Table 2, is as follows:

d(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}    (1)

TABLE 2
THE FORMULA DESCRIPTION
Symbol   Description
d        Distance from an object to the center of the cluster
x        Coordinates of the object
y        Coordinates of the centroid
n        Number of coordinates (attributes) over which the squared differences (y_i - x_i)^2 are summed, with the index i starting at 1
x_i      The i-th coordinate of the object
y_i      The i-th coordinate of the centroid

In the next step, new cluster centers are defined as the center of mass of each cluster candidate. This process is repeated until the termination criterion is met: the algorithm terminates if the last iteration did not change the assignment of any data point to the current cluster centers [25]. The pseudocode is given in Table 3. Besides the K-means algorithm, we also performed the agglomerative method as a bottom-up approach to hierarchical clustering. Each observation starts in its own cluster, and pairs of clusters are merged recursively as one moves up the hierarchy. This method works from the dissimilarities between the objects to be grouped. The type of dissimilarity can be chosen to suit the subject studied and the nature of the data. Overall, the research methodology process is presented in Figure 1.

TABLE 3
K-MEANS PSEUDOCODE
Algorithm 1. K-Means Algorithm
Data: number of clusters k, dataset X
Result: cluster centres C = {c_1, ..., c_k}
Start
  Randomly select k data points as initial cluster centres;
  Repeat
    Reinitialize all partition subsets S as empty: S_1 = S_2 = ··· = S_k = {};
    Compute the distance of each data point to each cluster centre;
    Assign each data point to the closest cluster centre:
    for i ∈ {1, ..., N} do
      respective label l = argmin_{j ∈ {1, ..., k}} ‖x_i - c_j‖_2;
      S_l = S_l ∪ {x_i};
    End
    Define new cluster centres based on the current partition:
    for j ∈ {1, ..., k} do
      c_j = (∑_{x_i ∈ S_j} x_i) / |S_j|
    End
  until the cluster assignment converges;
End

IV. RESULT AND DISCUSSION
In this section, we explain the results of the present study. The results are organized into two sub-sections as follows:
A. Exploratory data analysis
In this step, we performed data pre-processing using EDA, including data cleaning (i.e., missing value computation and data noise treatment), data transformation, and data reduction. EDA is an important step in a data analytics task because it performs an initial investigation of the data to discover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations. In the present study, we employed several Python libraries such as Pandas, NumPy, Matplotlib, and ScikitLearn to perform the EDA operations and cluster analysis. The learning log dataset comprises 4202 observations and 11 characteristics. We used the data.head(10) function to show the first ten rows of the dataset (see Table 4). Furthermore, dataset information including the summary and the missing value check is presented in Figure 3. From the dataset summary, we can identify the total number of columns and the data type of each column. The data contain only non-null integer values. In addition, missing value analysis was used to check whether the dataset contains null values after the data pre-processing step. From the result, we concluded that no column has missing values.
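As a rough illustration of the EDA checks described above, the following sketch loads the ten rows listed in Table 4 into a pandas DataFrame and runs the same kind of inspection: `head(10)`, `info()`, and a per-column missing-value count. The variable name `data` follows the text; building the frame from an inline string (the hypothetical constant `TABLE_4`) instead of the study's repository is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# The ten rows printed in Table 4 (the full 4202-row dataset is not reproduced here).
TABLE_4 = """User Operation Success_oper Simplify Success_simp Asking Answer Comment Understanding Log1 Log2
1 24 22 162 16 243 69 20 292 2621 518
2 45 44 128 20 242 74 14 287 2928 512
3 148 18 177 18 74 70 21 290 2601 541
4 9 38 138 44 97 73 12 258 2300 560
5 62 29 137 35 98 81 12 264 2382 1429
6 109 33 381 12 203 87 19 241 2290 3401
7 33 22 181 15 192 85 12 255 2892 506
8 26 18 239 12 107 93 12 291 2334 943
9 39 49 144 53 161 89 12 254 2284 567
10 48 20 155 16 153 135 12 244 2212 1418
"""

# Parse the whitespace-separated text; every column is read as an integer.
data = pd.read_csv(io.StringIO(TABLE_4), sep=r"\s+")

print(data.head(10))            # first ten rows, as in Table 4
data.info()                     # number of columns, dtypes, non-null counts
missing = data.isnull().sum()   # count of missing values per column
print(missing)

# Descriptive statistics for each attribute (rows of the transposed summary).
print(data.describe().T[["mean", "min", "max"]])
```

On the real dataset the same calls would be applied to the frame loaded from the repository; only the loading step would differ.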
TABLE 4
THE DATASET WITH THE TOP 10 ROWS
User Operation Success_oper Simplify Success_simp Asking Answer Comment Understanding Log1 Log2
1 24 22 162 16 243 69 20 292 2621 518
2 45 44 128 20 242 74 14 287 2928 512
3 148 18 177 18 74 70 21 290 2601 541
4 9 38 138 44 97 73 12 258 2300 560
5 62 29 137 35 98 81 12 264 2382 1429
6 109 33 381 12 203 87 19 241 2290 3401
7 33 22 181 15 192 85 12 255 2892 506
8 26 18 239 12 107 93 12 291 2334 943
9 39 49 144 53 161 89 12 254 2284 567
10 48 20 155 16 153 135 12 244 2212 1418
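To make the K-means procedure of Table 3 concrete, the sketch below implements it in plain Python and applies it to three attributes (Asking, Answer, Log2) of the ten rows in Table 4. It is an illustration only: the study itself reports using scikit-learn, and the helper names (`euclidean`, `kmeans`, `inertia`), the choice of attributes, the fixed random seed, and the absence of feature scaling are all assumptions of this example. The final loop prints the within-cluster sum of squares for several values of k, which is the quantity an elbow plot displays.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length points (equation 1)."""
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means following Table 3; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # k distinct data points as initial centres
    labels = None
    for _ in range(max_iter):
        # Assign each point to the index of its closest centre.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centres[j])) for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Recompute each centre as the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, labels


def inertia(points, centres, labels):
    """Within-cluster sum of squared distances to the assigned centre."""
    return sum(euclidean(p, centres[lab]) ** 2 for p, lab in zip(points, labels))


# (Asking, Answer, Log2) for the ten users in Table 4.
points = [
    (243, 69, 518), (242, 74, 512), (74, 70, 541), (97, 73, 560),
    (98, 81, 1429), (203, 87, 3401), (192, 85, 506), (107, 93, 943),
    (161, 89, 567), (153, 135, 1418),
]

# Elbow-style scan: inertia for k = 1..5.
for k in range(1, 6):
    centres, labels = kmeans(points, k)
    print(k, round(inertia(points, centres, labels), 1))
```

Because Log2 spans a much larger range than the other two attributes, it dominates the Euclidean distance in this example. Standardizing the features before clustering is a common remedy, though the source does not state whether this was done.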
Fig. 4. Elbow method

V. CONCLUSIONS
This research used an unsupervised machine learning method to discover similar patterns in students' learning log data and performed cluster analysis to obtain clusters of student behavior. The dataset was collected from students' learning activity while using the ubiquitous fraction app called U-Fraction. The data were first processed using EDA for data cleaning and transformation. A partition-based clustering method using the K-means algorithm and a hierarchical clustering method using an agglomerative approach were then used to create clusters of students. The results showed three clusters of students with different learning behavior characteristics. Cluster 1 comprised more students with active learning behavior based on the total logs, total problems posed, and total attempts at fraction operation and simplification. Students in clusters 2 and 3 made more attempts at problem-solving than at problem-posing. Both clusters also focused on fraction understanding. However, there was no significant difference in peer assessment activity among the groups. The outcome of this study can help educational
stakeholders to provide preventive learning strategies tailored
to different clusters of students.