
2015 International Symposium on Educational Technology

Predicting Students Performance in Educational Data Mining
Bo Guo∗, Rui Zhang†, Guang Xu‡, Chuangming Shi§ and Li Yang§
School of Computer
Hubei University of Education, Wuhan, Hubei, China
Email: [Link]@[Link]∗
Email: zhangr2013@[Link]†
Email: xuguang@[Link]‡
Email: shichuangming@[Link]§
Email: yangli@[Link]§

Abstract—Predicting student academic performance has been an important research topic in Educational Data Mining (EDM), which uses machine learning and data mining techniques to explore data from educational settings. However, measuring the academic performance of students is challenging, since it hinges on diverse factors, and the variables and factors relevant to predicting performance interact in complicated nonlinear ways. Traditional data mining and machine learning techniques may not be applied directly to these types of data and problems. In this study we develop a classification model to predict student performance using Deep Learning, which automatically learns multiple levels of representation. We pre-train hidden layers of features layer by layer using an unsupervised learning algorithm, the sparse auto-encoder, on unlabeled data, and then use supervised training to fine-tune the parameters. We train the model on a relatively large real-world student dataset, and the experimental results show the effectiveness of the proposed method, which can be applied in an academic pre-warning mechanism.

Fig. 1. Students performance prediction system overview. We learn multi-level representations by unsupervised training of the hidden layers; the network is then fine-tuned using back-propagation. The input of the network is a flat vector of different kinds of student information; the output is a softmax multi-class prediction indicating the student's final examination grade.

I. INTRODUCTION
Applying data mining and machine learning methods in education is an emerging interdisciplinary field, which forms a new research area called Educational Data Mining (EDM) [1]. EDM uses machine learning and data mining techniques to explore data from educational settings and to find predictions and patterns that characterize students' behaviors and performance. Its goal is to better identify the settings in which students learn, to improve educational outcomes, and to gain insights into educational phenomena.

In EDM, predicting the performance of a student is of great concern to education managements. For example, it could give an appropriate warning to students who are at risk by forecasting their grades, and help them to avoid problems and overcome difficulties in their studies. However, measuring the academic performance of students is challenging, since it hinges on diverse factors and characteristics such as demographics, personal attributes, educational background, psychological traits, academic progress, and other environmental variables. The interrelationships between the variables participating in the complex and multi-faceted problem of academic performance are not clearly understood, and they are often related in complicated nonlinear ways.

Therefore, using machine learning techniques in EDM to explore educational data and discover latent meaningful patterns for predicting students' marks has prevailed recently. The performance of machine learning methods is heavily dependent on the choice of data representation. A variety of machine learning approaches have been proposed to predict student performance. Romero et al. [2] used a multiple linear regression model and support vector machines (SVM) to predict overall and individual student academic performance. Jia et al. [3] predicted student retention by combining an SVM and a shallow neural network to improve classification accuracy. Musso et al. [4] applied traditional artificial neural networks to predicting general academic performance. Kotsiantis [5] used a regression method to predict students' marks in a distance learning system. Wolff et al. [6] developed a predictive model using decision trees and SVM with data from several Open University modules to forecast student patterns. These methods are all based on shallow architectures [7], which implement only one or two layers of feature representation. Shallow models cannot capture all the relationships among factors, especially when the data is relatively large and correlated. Traditional data mining and machine learning techniques may not be applied directly to these types of data and problems.

In this study a prediction system, called the Students Performance Prediction Network (SPPN), is proposed to predict student performance using the emerging Deep Learning approach [8], which has been demonstrated to be a very effective method for predicting outcomes with a high level of accuracy, especially when large datasets are available. A deep learning algorithm automatically discovers abstractions, under the belief that more abstract representations of data tend to be more useful, and learns multiple levels of representation. SPPN involves millions of parameters to train, which requires massive computational power. The learning can be made more efficient by a layer-by-layer pretraining phase that initializes the weights sensibly; the pretraining also allows the variational inference to be initialized sensibly with a single bottom-up pass. A graphics processing unit (GPU) [9] is therefore used, owing to its parallel architecture, for fast execution and training. One of the chief advantages of the GPU over the central processing unit (CPU) is its lower cost of creating parallel threads on blocks, thanks to its efficient hardware implementation, whereas the CPU incurs an overhead to switch to another program. For this reason, the GPU hardware architecture improves computational performance in massive-data scenarios, and the GPU is a natural candidate for massive processing of educational data in forecasting applications. To the best of our knowledge, SPPN is the first GPU-implemented deep learning system in educational prediction.

In SPPN we use a six-layer neural network to implement the deep learning algorithm, as shown in Fig. 1. The network consists of 1 input layer, 4 hidden layers and 1 output layer. We learn multi-level representations by greedily pre-training the hidden layers of features, one layer at a time, with an unsupervised learning algorithm, the sparse auto-encoder [10], which learns good feature representations and better initializes the network parameters. After the pre-training, the neural network is fine-tuned using back-propagation of error derivatives. We train the proposed model on a 120,000-student dataset with two Tesla K40 12GB GPUs, and the experimental results show the effectiveness and efficiency of the proposed method, which can be applied in an academic pre-warning mechanism.

II. EDUCATIONAL DATA

Educational data can be collected from multiple sources, coming in different formats and granularities. We collected real-world data from 100 junior high schools in Hubei province. Each school sampled 1200 grade-9 students over the recent three years (400 grade-9 students per year). Grade-9 students take the high school Entrance Examination, so it is meaningful for management to predict entrance examination scores and help students at risk, in order to improve education quality. As shown in Fig. 1, the training data is a composite of different kinds of information:

Background and demographic data: gender, age, health status, family status, etc.
Past study data: junior high school entrance score, GPA of primary school, etc.
School assessment data: school type, school ranking, etc.
Study data: every course score in junior high school (middle-term exam, final-term exam, average).
Personal data: personality, attention, psychology-related data, etc.

All collected raw data is converted into numerical values; we then normalize and scale the data by subtracting the mean and dividing by the standard deviation of each feature, so that every value varies within the same range. After normalizing each input vector, the entire dataset is whitened [11] to make the input less redundant.
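For concreteness, the following is a minimal NumPy sketch of this pre-processing step: per-feature standardization followed by ZCA whitening, one common whitening recipe consistent with [11]. The function and variable names (preprocess, raw_data, epsilon) are our own illustrative assumptions, not code from the paper.

    import numpy as np

    def preprocess(raw_data, epsilon=1e-5):
        """Standardize each feature, then ZCA-whiten the whole dataset.

        raw_data: (m, n) array of m students by n numeric features.
        """
        # Zero mean, unit variance per feature, so all values share a range.
        x = (raw_data - raw_data.mean(axis=0)) / (raw_data.std(axis=0) + epsilon)

        # ZCA whitening: decorrelate the features to reduce redundancy.
        cov = np.cov(x, rowvar=False)              # (n, n) covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
        zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
        return x @ zca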
III. ALGORITHM

After pre-processing, we concatenate the student data, consisting of the different kinds of information described in Section II, into a flat vector x1. As shown in Fig. 1, x1 is the input of the network. An unsupervised learning algorithm, the sparse auto-encoder, is then used to discover features from the unlabeled data. We train an auto-encoder for the K hidden nodes in hidden layer l using the back-propagation algorithm, minimizing the squared reconstruction error over all m examples plus a penalty term that forces the units to output a low average activation:

\arg\min_{W^l, b^l} \frac{1}{m} \sum_{j=1}^{m} \left\| a_j^{(l-1)} - \mathrm{recons}(W^l, b^l, a_j^{(l-1)}) \right\|^2 + \lambda S(a^l)    (1)

\mathrm{recons}(W^l, b^l, a^{(l-1)}) = f\left( (W^l)^T f(W^l a^{(l-1)} + b^l) + b^l \right)    (2)

where S(a^l) is a sparsity cost function which penalizes the layer-l output a^l for being far from zero, W^l is the layer-l weight matrix, b^l is the bias vector, and recons is the reconstruction function, which uses W^l and b^l to reconstruct the previous layer's output a^{(l-1)}. The output a^l of hidden layer l is

a^l = f(W^l a^{(l-1)} + b^l)    (3)

A Rectified Linear Unit (ReLU) [12] function,

f(a) = \max(0, a)    (4)

is used as the hidden neurons' activation function. We apply this unsupervised training method to all hidden layers, layer by layer, to initialize the whole network's weights W to appropriate values.

In the output layer l_o we use softmax for multi-class classification. l_o has 5 output units; each unit corresponds to a final-exam grade, so the 5 units indicate the student's final-exam grade g ∈ {O, A, B, C, D} (O: 90%-100%, A: 80%-89%, B: 70%-79%, C: 60%-69%, D: below 60%; the final grade is expressed as a percentage mark).
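To make the layerwise pretraining concrete, here is a minimal NumPy sketch of one gradient step for a single tied-weight sparse auto-encoder layer in the spirit of Eqs. (1)-(4). The L1-style sparsity penalty, the plain gradient update, and all names are our illustrative assumptions; the paper does not publish its implementation.

    import numpy as np

    def relu(z):                       # Eq. (4)
        return np.maximum(0.0, z)

    def pretrain_step(W, b1, b2, A, lam=1e-3, lr=0.01):
        """One step of tied-weight sparse auto-encoder training (updates in place).

        W: (K, n) weights shared by encoder and decoder (Eq. (2) uses W and W^T).
        A: (n, m) minibatch, one column per example (previous layer's output).
        """
        m = A.shape[1]
        Z1 = W @ A + b1[:, None]       # encoder pre-activation
        H = relu(Z1)                   # hidden code a^l, Eq. (3)
        Z2 = W.T @ H + b2[:, None]     # decoder pre-activation
        R = relu(Z2)                   # reconstruction, Eq. (2)

        # Mean squared reconstruction error plus sparsity penalty, cf. Eq. (1).
        loss = np.sum((A - R) ** 2) / m + lam * np.abs(H).sum() / m

        # Back-propagate through decoder and encoder, sharing the tied weights.
        d2 = (2.0 / m) * (R - A) * (Z2 > 0)       # dL/dZ2
        dH = W @ d2 + (lam / m) * np.sign(H)      # dL/dH, incl. sparsity term
        d1 = dH * (Z1 > 0)                        # dL/dZ1
        W -= lr * (d1 @ A.T + H @ d2.T)           # encoder + decoder gradients
        b1 -= lr * d1.sum(axis=1)
        b2 -= lr * d2.sum(axis=1)
        return loss

Once a layer's auto-encoder has converged, its code H becomes the next layer's input A, which is the greedy layer-by-layer scheme described above.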

After the unsupervised pre-training, the whole deep network is subsequently fine-tuned using back-propagation of error derivatives. The recently introduced technique called dropout [13] is used in training SPPN. Dropout consists of setting the output of each hidden neuron to zero with probability 0.5. The neurons which are dropped out do not participate in the forward pass or in back-propagation, so every presentation of a minibatch effectively samples a different architecture, while all these architectures share the same weights. This technique prevents the units from co-adapting too much, since a neuron cannot rely on the presence of particular other neurons; dropout thus forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. Without the above techniques, SPPN would suffer greatly from overfitting and get stuck in poor local optima.
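As a concrete illustration, a minimal NumPy sketch of dropout on one hidden layer follows. We use the "inverted dropout" variant, which rescales the surviving units by 1/(1 - p) at training time so that no rescaling is needed at test time; this variant and the names are our assumptions, not code from the paper.

    import numpy as np

    def dropout_forward(h, p_drop=0.5, train=True, rng=None):
        """Zero each hidden activation in h with probability p_drop."""
        if not train:
            return h                              # test time: layer left intact
        rng = np.random.default_rng() if rng is None else rng
        mask = rng.random(h.shape) >= p_drop      # True for surviving units
        return h * mask / (1.0 - p_drop)          # inverted-dropout rescaling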
SPPN is trained on a GPU platform based on the Nvidia CUDA API. The GPU pipeline is well suited to parallelism, attaining high performance in matrix and vector operations. Unlike CPUs, which use the SISD (Single Instruction, Single Data) paradigm, GPUs are optimized to perform floating-point operations on large data sets using the SIMD (Single Instruction, Multiple Data) paradigm. The parallelism of a GPU is fully utilized by accumulating many input feature vectors and weight vectors and converting the many inner-product operations into a single matrix operation. The enormous computational potential of GPUs is particularly valuable for complex neural networks [9], which place high demands on memory and computing resources; CPUs are simply not powerful enough to solve them within a feasible running time. The whole processing pipeline is illustrated in Fig. 1.
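The batching idea in the previous paragraph can be stated in a few lines of NumPy. This is our own illustrative sketch, not the paper's CUDA code; the point is only that a loop of per-example inner products collapses into a single matrix product, which SIMD hardware executes in one call.

    import numpy as np

    W = np.random.randn(64, 200)    # (K, n): one weight vector per hidden unit
    X = np.random.randn(200, 512)   # (n, m): a minibatch of m input vectors

    # Naive view: K * m separate inner products.
    slow = np.array([[W[k] @ X[:, j] for j in range(X.shape[1])]
                     for k in range(W.shape[0])])

    # Batched view: one matrix multiplication covering all of them.
    fast = W @ X                    # (K, m), identical result
    assert np.allclose(slow, fast)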
IV. EXPERIMENT AND RESULTS

SPPN is implemented in ANSI C and Theano [14], a Python library that allows transparent use of the GPU, and runs on 2x Intel E5-2680 CPUs with 64GB RAM and 2x Nvidia Tesla K40 12GB GPUs. Although a single Tesla K40 GPU has 12GB of memory, that is still not enough to fit the whole net on one GPU over 120,000 training examples. We therefore spread the net across the two K40 GPUs, since current GPUs are particularly well suited to cross-GPU parallelization (SLI), being able to access one another's memory directly without going through the host computer's memory.

SPPN is trained on a dataset of about 120,000 labeled students with the training parameters listed in Table I.

TABLE I
TRAINING PARAMETERS

Parameter       Value
learning rate   0.00025
momentum        0.9
minibatch size  512
weight decay    0.0005
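The paper does not spell out its update rule, so as an assumption we read Table I as the hyperparameters of classical minibatch SGD with momentum and L2 weight decay; a minimal sketch of one such parameter update follows.

    import numpy as np

    def sgd_update(w, grad, velocity, lr=0.00025, momentum=0.9, decay=0.0005):
        """One SGD step with momentum and weight decay (values from Table I).

        velocity is per-parameter state carried between minibatches.
        """
        velocity[:] = momentum * velocity - lr * (grad + decay * w)
        w += velocity
        return w, velocity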
Note that no public benchmark or dataset exists in educational prediction, because of sensitivity and confidentiality, so a direct comparison of different methods is not feasible. We therefore implemented 3 other classification algorithms, namely Naive Bayes, a Multilayer Perceptron (MLP) and an SVM, and compared their results with SPPN on our own dataset. The overall accuracies of the classifiers, measured as the ratio of correctly classified examples on our dataset, are shown in Table II. SPPN achieves the best accuracy among these algorithms, with an average accuracy of 77.2%. The traditional MLP neural network suffers from substantial overfitting, and the other two shallow models, SVM and Naive Bayes, are not as discriminative as SPPN. The experimental results show that our approach is practical for use in educational settings for identifying particular events, such as pre-warning for students at risk.

TABLE II
COMPARISON OF ALGORITHM ACCURACY (%)

Algorithm    O     A     B     C     D     Average
Naive Bayes  4.7   15.5  20.8  26.9  35.4  20.7
MLP          33.4  35.6  53.8  40.3  44.7  41.6
SVM          34.7  39.8  42.5  55.8  69.3  48.4
SPPN         86.5  63.2  77.6  70.4  88.4  77.2

We also compared the training efficiency of GPU and CPU; the training times are shown in Table III. SPPN-g, the network trained on the GPU, trains about 9 times faster than SPPN-c, which was trained purely on the CPU. Although the result of SPPN-c is slightly better than that of SPPN-g, SPPN-c took almost two and a half days to train, while SPPN-g took just over 6 hours. If the training set grows larger in the future, the GPU parallel architecture will be necessary for convergence.

TABLE III
GPU VS. CPU TRAINING COMPARISON

                   SPPN-g       SPPN-c
training time      376 minutes  3382 minutes
average precision  77.2%        78.4%

SPPN-g is trained on the GPU; SPPN-c is trained purely on the CPU.

V. CONCLUSION

Data mining technologies have recently been used in education for predicting students' academic performance. However, measuring the academic performance of students is challenging, since diverse factors and variables correlate in complicated nonlinear ways. In this study we present a deep learning architecture for predicting students' performance, which takes advantage of unlabeled data by automatically learning multiple levels of representation. We pre-train hidden layers of features layer by layer using a sparse auto-encoder, and then use supervised training to fine-tune the parameters. We train the model on a relatively large real-world student dataset, and the experimental results show the effectiveness of the proposed method. Future work will aim at optimizing the network
architecture, gathering more training samples, and using temporal information in the sequential data.
ACKNOWLEDGMENT
The research was supported by the National Natural Science Foundation of China (NSFC) via Grant 61402155 and by the Natural Science Project of the Hubei Education Department via Grant B2015024.
REFERENCES

[1] R. Baker and G. Siemens. Educational data mining and learning analytics. In Cambridge Handbook of the Learning Sciences, 2014.
[2] C. Romero, M.-I. López, J.-M. Luna, and S. Ventura. Predicting students' final performance from participation in on-line discussion forums. Computers & Education, 68:458–472, October 2013.
[3] J.-W. Jia and M. Mareboyana. Machine learning algorithms and predictive models for undergraduate student retention. In Proceedings of the World Congress on Engineering and Computer Science, volume 1, 2013.
[4] M. F. Musso, E. Kyndt, E. C. Cascallar, and F. Dochy. Predicting general academic performance and identifying the differential contribution of participating variables using artificial neural networks. Frontline Learning Research, 1(1):42–71, 2013.
[5] S. B. Kotsiantis. Use of machine learning techniques for educational purposes: a decision support system for forecasting students' grades. Artificial Intelligence Review, 37(4):331–344, 2012.
[6] A. Wolff, Z. Zdrahal, D. Herrmannova, and P. Knoth. Predicting student performance from combined data sources. In Educational Data Mining, pages 175–202. Springer, 2014.
[7] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[8] Y. Bengio. Deep learning of representations: looking forward. In Statistical Language and Speech Processing, 2013.
[9] L. Bako, A. Kolcsar, S. Brassai, L. Marton, and L. Losonczi. Neuromorphic neural network parallelization on CUDA-compatible GPU for EEG signal classification. In 2012 Sixth UKSim/AMSS European Symposium on Computer Modeling and Simulation (EMS), pages 359–364. IEEE, 2012.
[10] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[11] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[12] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[13] N. Srivastava. Improving neural networks with dropout. Master's thesis, University of Toronto, 2013.
[14] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3, 2010.

