Predicting Academic Success in Higher Education
Literature Review and Best Practices
A range of open-source data mining tools, including machine learning toolkits, supports
researchers in analyzing a dataset with several algorithms.
Such tools are widely used for predictive analysis, visualization, and statistical modeling.
WEKA is the most widely used tool for predictive modeling (Jayaprakash, 2018).
This can be explained by its many pre-built tools for data pre-processing, classification,
association rules, regression, and visualization, as well as its user-friendliness, and
accessibility even to a novice in programming or data mining.
Educational Data Mining (EDM) plays a significant role in discovering patterns of knowledge
about educational phenomena and the learning process, including understanding performance.
EDM has been used for predicting a variety of crucial educational outcomes, like performance,
retention, success, satisfaction, achievement, and dropout rate. Student success is a crucial
concern for higher education institutions because it is considered an essential criterion for
assessing the quality of educational institutions. There are several definitions of student
success in the literature. Despite reports calling for a more detailed view of the term, the bulk
of published research measures academic success narrowly as academic achievement.
Academic achievement itself is mainly based on grades and Grade Point Average (GPA), or
Cumulative Grade Point Average (CGPA). Academic success has also been defined in relation to
students' persistence, also called academic resilience. Several studies have been published on
using data mining methods to predict students' academic success.
Prior academic achievement, student demographics, e-learning activity, psychological attributes,
and environments are the most commonly reported factors. Among demographic factors, gender, age,
race/ethnicity, socioeconomic status, and father's and mother's background have been shown to be
important.
The psychological attributes are defined as the interests and personal behavior of the student.
Several studies have indicated their impact on students' success. The design of a prediction model
using data mining techniques requires the instantiation of many characteristics, such as the type
of model to build and the methods and techniques to apply.
This section defines these attributes, provides some of their instances, and reports the statistics
of their occurrence among the reviewed papers, grouped by the target variable of the student
success prediction. Predicting success at the course level can yield higher accuracy than at the
degree or year level. The best reported accuracy, 93%, was obtained at the course level; the target
course was an advanced programming course, and the influential factor was a previous programming
course that was also a prerequisite. All decisions that need to be taken at the various stages are
explained, along with a shortlist of best practices collected from the literature.
Data sources tend to be inconsistent, contain noise, and usually suffer from missing values. This
is why the raw data needs to go through an initial preparation consisting of 1) selection, 2)
cleaning, and 3) derivation of new variables. Data selection, also called "Dimensionality
Reduction", consists of vertical selection and horizontal selection.
There are two strategies to deal with missing data: listwise deletion (dropping records that
contain missing values) and imputation (filling the missing values in). Outliers, also known as
anomalies, can easily be identified by visual means. Once identified, outliers can be removed from
the modeling data. New variables can be derived from existing variables by combining them.
For example, GPA is a common variable that can be obtained from a student information system
(SIS). Preliminary statistical analysis, especially through visualization, makes it possible to
better understand the data before moving to more sophisticated data mining tasks and algorithms.
Dedicated tools like STATISTICA and SPSS can also provide considerable insight.
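The preparation steps described above (handling missing values, removing outliers, and deriving
new variables) can be sketched in a few lines of Python. This is a minimal illustration under
stated assumptions, not the procedure of any reviewed study: the record fields, the sample values,
and the helper names are hypothetical, and the interquartile-range rule is swapped in as a simple
numeric stand-in for the visual outlier inspection mentioned in the text.

```python
from statistics import mean, quantiles

# Hypothetical student records; None marks a missing value.
records = [
    {"id": 1, "grade_math": 3.2, "grade_prog": 3.6},
    {"id": 2, "grade_math": None, "grade_prog": 2.8},
    {"id": 3, "grade_math": 2.4, "grade_prog": 2.0},
    {"id": 4, "grade_math": 3.8, "grade_prog": 4.0},
    {"id": 5, "grade_math": 3.0, "grade_prog": 3.1},
    {"id": 6, "grade_math": 2.7, "grade_prog": 2.5},
    {"id": 7, "grade_math": 3.5, "grade_prog": 3.3},
    {"id": 8, "grade_math": 3.1, "grade_prog": 40.0},  # likely data-entry error
]

def listwise_delete(rows, fields):
    """Drop every row that has a missing value in any of the given fields."""
    return [r for r in rows if all(r[f] is not None for f in fields)]

def impute_mean(rows, field):
    """Replace missing values of one field with the mean of its observed values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def remove_outliers_iqr(rows, field, k=1.5):
    """Keep rows whose value lies within k * IQR of the first and third quartiles.

    Assumption: the IQR rule replaces the visual inspection described in the text.
    """
    q1, _, q3 = quantiles([r[field] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[field] <= high]

def add_mean_grade(rows, fields, name="mean_grade"):
    """Derive a new variable by combining (here: averaging) existing ones."""
    return [{**r, name: mean(r[f] for f in fields)} for r in rows]

# Strategy 1: listwise deletion drops the record with the missing grade.
complete_only = listwise_delete(records, ["grade_math", "grade_prog"])

# Strategy 2: imputation keeps the record, then outliers are removed
# and a combined variable is derived.
cleaned = impute_mean(records, "grade_math")
cleaned = remove_outliers_iqr(cleaned, "grade_prog")
cleaned = add_mean_grade(cleaned, ["grade_math", "grade_prog"])
```

Listwise deletion is the simpler choice but discards whole records, whereas imputation keeps
every student at the cost of introducing estimated values.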
Data transformation is a necessary step to eliminate dissimilarities in the dataset. Normalizing
the data may improve the accuracy and efficiency of the mining algorithms. Discretization can also
increase the accuracy of the models by mitigating noisy data and by identifying outlier
values. Finally, discrete features are easier to understand, handle, and explain. It is common in
EDM applications for the dataset to be imbalanced, meaning that the number of samples from one
class is significantly smaller than the number of samples from the other classes.
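The transformation and balancing ideas above can be illustrated with a short sketch. The specific
choices are assumptions made for illustration only: min-max scaling stands in for normalization,
fixed cut points stand in for discretization, and random oversampling of the minority class is one
simple form of re-sampling. The field names and sample values are hypothetical.

```python
import random
from collections import Counter

def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, cut_points, labels):
    """Return the label of the first bin whose upper bound exceeds the value.

    labels must contain exactly one more entry than cut_points.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

def random_oversample(rows, label_key, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical GPA values on a 4-point scale.
gpas = [1.8, 2.4, 3.1, 3.9]
scaled = min_max_normalize(gpas)
bands = [discretize(g, [2.0, 3.0], ["low", "medium", "high"]) for g in gpas]

# Imbalanced toy dataset: three successful students, one unsuccessful.
students = [{"gpa": g, "passed": g >= 2.0} for g in gpas]
balanced = random_oversample(students, "passed")
class_sizes = Counter(row["passed"] for row in balanced)
```

Oversampling should be applied to the training data only, so that duplicated records do not leak
into the data used for evaluation.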
Re-sampling is the solution of choice for imbalanced data. Feature selection aims to choose a
subset of attributes from the input data, reducing the effects of unrelated variables while
preserving sufficient prediction performance. Two types of data mining models are commonly used in EDM applications for
success prediction. Descriptive models are used to produce patterns that describe the
fundamental structure, relations, and interconnectedness of the mined data. Predictive models
apply supervised learning functions to estimate the expected values of dependent
variables. Table 13 shows the recurrence of specific algorithms in the literature review
that we performed.
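To make the notion of a predictive model concrete, the sketch below fits a one-feature decision
stump, which is about the simplest supervised classifier possible. It is an illustration only and
is not taken from the reviewed papers: the prerequisite grades, the pass/fail labels, and the
function names are hypothetical.

```python
def fit_stump(xs, ys):
    """Learn the threshold on one numeric feature that best separates the labels.

    xs: feature values (e.g. grade in a prerequisite course)
    ys: known outcomes (True = passed the target course)
    The rule learned is: predict True when x >= threshold.
    """
    best_threshold, best_correct = None, -1
    for t in sorted(set(xs)):
        correct = sum((x >= t) == y for x, y in zip(xs, ys))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold

def predict(threshold, x):
    """Apply the learned rule to a new student."""
    return x >= threshold

# Hypothetical training data: prerequisite grade and target-course outcome.
prereq_grades = [55, 62, 48, 71, 80, 66, 90, 58]
passed = [False, True, False, True, True, True, True, False]

threshold = fit_stump(prereq_grades, passed)
new_student_passes = predict(threshold, 75)
```

The dependent variable here is the pass/fail outcome; the model estimates it for unseen students
from the labeled examples, which is what distinguishes a predictive model from a descriptive one.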
There are various strategies for tuning the parameters of EDM algorithms in order to find the
best-performing parameter values. Different performance measures are used to evaluate the model
produced by each classifier; almost all of them are based on the confusion matrix and the counts in
it. By applying EDM techniques, it is possible to develop prediction models to improve student
success. Using data mining techniques can be daunting and challenging for nontechnical persons.
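As an illustration of the confusion-matrix-based performance measures mentioned above, the
following sketch computes accuracy, precision, recall, and F1 for a binary success prediction. The
actual and predicted labels are hypothetical, and the choice of these four measures is an
assumption; the reviewed studies may report others.

```python
def confusion_counts(actual, predicted):
    """Count the four cells of a binary confusion matrix (True = success)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)          # success predicted as success
    tn = sum(not a and not p for a, p in pairs)  # failure predicted as failure
    fp = sum(not a and p for a, p in pairs)      # failure predicted as success
    fn = sum(a and not p for a, p in pairs)      # success predicted as failure
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Derive common performance measures from the confusion matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical outcomes for eight students.
actual = [True, True, True, False, False, True, False, True]
predicted = [True, True, False, False, True, True, False, True]

counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
```

On imbalanced datasets, precision and recall are usually more informative than accuracy alone,
because a model that always predicts the majority class can still score a high accuracy.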
This study presents a clear set of guidelines to follow for using EDM for success prediction. The
study was limited to the undergraduate level; however, the same principles can be easily adapted to
the graduate level.