WS1UNR
Introduction
From Traditional Statistics to Data Mining

Definition of data mining (DM):
“…the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data…, and by using pattern recognition technologies, as well as statistical and mathematical techniques.” (The Gartner Group, Inc.)

DM uses algorithms (i.e., a finite set of well-defined instructions) to:
  Classify
  Categorize
  Estimate (predict)
  Visualize
events or outcomes of interest

Types of DM algorithms:
  Decision trees
  Artificial neural networks (ANNs)
  Cluster analysis
  Traditional techniques (e.g., regression, PCA)

Typically, DM projects follow steps in the Cross-Industry Standard Process for Data Mining (CRISP-DM):
  Data Understanding & Preparation
  Modeling
  Evaluation
  Deployment
Variables
Credit hours
  Total credits accumulated *
  Total transfer credits articulated *
  Total campus credits *
  Total math credits *
  Total upper-division science credits *
  Total credits transferred in *
  Earned/attempted credits (%) *
  Average credit load per enrolled term *

Financial aid
  Total aid received *
  Loans *
  Grants *
  Work study *
  Merit-based aid *
  Need-based aid *
  General fund aid *
  Outside aid *
  UNR Foundation aid *
  Acad. Dept-based aid *
  Grants-in-Aid *
  Millennium Scholarship *
  Pell Grant aid *

Faculty teaching courses taken
  Percent of females *
  Percent of ethnic/racial minority *
  Percent of part-time faculty *
  Percent of adjunct faculty *
  Percent at full-professor rank *
  Average age faculty *

Ratio of earned to attempted credits (%) derived via linear regression for 493 cases

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .837a   .701       .700                4.06135
a. Predictors: (Constant), Repeated a course, ACT Math (miss=mean), Number of WNCC courses taken, # of replacement grades, Number of TMCC courses taken, Number of diversity courses taken, % of I/W grades received, Number of math credits taken, Took honors courses, % of D/F grades received, ACT English (miss=mean), Number of upper div sci credits taken, ACT Composite (miss=mean)
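The slide reports that the earned/attempted ratio was derived via linear regression for some cases and that ACT scores were mean-filled (miss=mean). A minimal Python sketch of that general idea follows. It assumes a single predictor for brevity (the reported model used thirteen), and the function names and toy values are hypothetical, not taken from the study.

```python
from statistics import mean

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def regression_impute(x, y):
    """Fill None entries of y from a one-predictor least-squares line
    fitted on the cases where y is observed (illustrative only)."""
    x = mean_impute(x)  # predictor gaps are mean-filled first
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = mean(p[0] for p in pairs)
    y_bar = mean(p[1] for p in pairs)
    sxx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi if yi is None else yi
            for xi, yi in zip(x, y)]

# Hypothetical toy data: ACT Math scores and earned/attempted ratios (%)
act_math = [20.0, 22.0, None, 26.0, 24.0]
earned_ratio = [80.0, 84.0, 86.0, 92.0, None]
filled = regression_impute(act_math, earned_ratio)
```

Only the missing target value is replaced; observed values pass through unchanged, so the imputation never overwrites real data.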
Total campus credits derived via general linear model for … cases

Variable definitions
  Initial study major: declared/pre-major, undeclared, non-degree seeking, Intensive English
  Attempted registrations: registration attempt at time of fully subscribed class section during registration period
  Stopout time: number of fall/spring semesters not in attendance

Exploratory Analysis
[Figure: scatterplots of relationships during exploratory stage; one panel annotated "Declining GPA"]
Model Results
[Figure: Freshmen Retention Measured at End of Fall, Prediction Accuracy. Y-axis: % of cases correctly predicted. X-axis: Training (N=4,079), Validation (N=3,939). *continuous variables]
[Figure: Freshmen Retention Measured at End of Spring, Prediction Accuracy. Y-axis: % of cases correctly predicted. X-axis: Training (N=4,079), Validation (N=3,939). Annotation: Accuracy increases by almost 10 percentage points compared to fall.]
[Figure: Time to Degree Measured with All Students, Prediction Accuracy. Y-axis: % of cases correctly predicted.]
[Figure: Time to Degree Measured with New Students, Prediction Accuracy. Y-axis: % of cases correctly predicted.]
Legend labels across these panels: Neural (Quick, Quick-Cont*), Trees (CR&T, C5.0), Log Reg, LogReg baseline.
[Figure: Time to Degree Measured with New Students, Prediction Accuracy with Validation Data, 'Six Years Or More' Outcome. Y-axis: % of cases correctly predicted. X-axis: New and Transfers (N=7,598), New Students Only (N=4,564).]
[Figure: Freshmen Retention Measured at End of Fall, Confidence Level of Correctly Predicted Cases. Y-axis: mean confidence. X-axis: Training (N=4,079), Validation (N=3,939). *continuous variables]
[Figure: Freshmen Retention Measured at End of Spring, Confidence Level of Correctly Predicted Cases. Y-axis: mean confidence. X-axis: Training (N=4,079), Validation (N=3,939).]
[Figure: Time to Degree Measured with All Students, Confidence Level of Correctly Predicted Cases. Y-axis: mean confidence. X-axis: Training (N=7,859), Validation (N=7,598).]
Legend labels across these panels: Neural Nets (Quick, Quick-Cont*, Multi, Prune), Decision Trees (CHAID, CR&T, C5.0), LogReg baseline.
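The Model Results figures plot two measures: the percentage of cases correctly predicted and the mean confidence of correctly predicted cases. The Python sketch below shows one plausible way to compute them, assuming each case carries an actual outcome, a predicted outcome, and a confidence score. The function names and toy data are hypothetical, and the source does not state how Clementine computes its confidence values.

```python
def accuracy_pct(actual, predicted):
    """Percentage of cases whose predicted outcome matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

def mean_confidence_correct(actual, predicted, confidence):
    """Mean confidence score over the correctly predicted cases only."""
    hits = [c for a, p, c in zip(actual, predicted, confidence) if a == p]
    return sum(hits) / len(hits) if hits else float("nan")

# Hypothetical validation cases for a retention model
actual = ["retained", "retained", "left", "retained"]
predicted = ["retained", "left", "left", "retained"]
confidence = [0.9, 0.6, 0.7, 0.8]

val_accuracy = accuracy_pct(actual, predicted)
val_confidence = mean_confidence_correct(actual, predicted, confidence)
```

Computing both measures separately on the training and validation sets, as the figures do, shows how much a model's apparent accuracy depends on having seen the data before.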
Conclusion
Clementine®
Generated ANN model characteristics:
Generated CHAID ruleset (left) and coefficients for logistic regression model (right):
Equation For 5 years:
-0.397 * GPAfinal +
0.05718 * CAPST +
-0.03169 * MACRS +
0.0005202 * UDSCI +
-0.04871 * TMCC +
0.01273 * WNCC +
-0.02395 * DIVCL +
-0.1026 * AGE +
-0.0001391 * TOTAL +
0.000129 * LOANS +
0.00009678 * GRANT +
0.0002089 * WORKS +
0.0001765 * MERIT +
0.000006031 * NEEDB +
-0.00002301 * GENFN +
-0.00008531 * OUTSI +
-0.00007588 * UNRFN +
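The listed terms can be evaluated for a single student as in the Python sketch below. The equation as printed ends with a trailing "+" and shows no intercept, so the function returns only the partial sum of the visible terms, not a full logit or a probability. The variable codes are used as printed. Treating absent variables as zero and the example student's values are assumptions made for illustration.

```python
# Coefficients as printed for the "5 years" outcome (equation is truncated
# in the source; no intercept or further terms are available).
COEFFICIENTS_5_YEARS = {
    "GPAfinal": -0.397,
    "CAPST": 0.05718,
    "MACRS": -0.03169,
    "UDSCI": 0.0005202,
    "TMCC": -0.04871,
    "WNCC": 0.01273,
    "DIVCL": -0.02395,
    "AGE": -0.1026,
    "TOTAL": -0.0001391,
    "LOANS": 0.000129,
    "GRANT": 0.00009678,
    "WORKS": 0.0002089,
    "MERIT": 0.0001765,
    "NEEDB": 0.000006031,
    "GENFN": -0.00002301,
    "OUTSI": -0.00008531,
    "UNRFN": -0.00007588,
}

def partial_linear_predictor(student, coefficients=COEFFICIENTS_5_YEARS):
    """Sum coefficient * value over the visible terms.

    Variables missing from `student` are treated as 0.0 (an assumption).
    """
    return sum(coef * student.get(name, 0.0)
               for name, coef in coefficients.items())

# Hypothetical student: final GPA 3.0, age 20, all other inputs zero
score = partial_linear_predictor({"GPAfinal": 3.0, "AGE": 20.0})
```

In a multinomial logistic model, a complete sum of this kind (including the intercept) would be compared across outcome categories; the partial sum here only illustrates how each printed term contributes.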
Model Characteristics

Logistic Regression
  Main effects, simple entry; Cox & Snell R² = 0.728; run time 23 sec; strongest variables: average credit load, transfer credits, residency, stopout time, English 102 grade, English transfer
C5.0 decision tree
  Tree depth: 1; no boosting; rules for each outcome: 38, 23, 40, 53 (default: 4 years); run time 7 sec
C&R decision tree
  Tree depth: 5; run time 39 sec
CHAID decision tree
  Tree depth: 4; run time 21 sec
Neural net quick method
  Neurons per layer: 158 at input; 8 in hidden layer; 4 in output; run time 22 sec
Neural multi-topology net
  Neurons per layer: 158 at input; 5 in hidden layer; 4 in output; run time 29 min 8 sec; strongest variables: average credit load, age, transfer credits, stopout time, earned/attempted credits, English transfer, starting major, change of major; topology setting: 2 20 3; 2 25 5, 2 22
Neural net multilayer pruned
  Neurons per layer: 38 at input; 4 in 1st hidden layer; 2 in 2nd hidden layer; 2 in 3rd hidden layer; 4 in output; run time 45 min; strongest variables: average credit load, age, application status, stopout time, GPA trend, transfer credits, earned/attempted credits, residency, starting major, change of major, % irregular faculty

Link to presentation: [Link]

Acknowledgement:
Thanks to SPSS Inc. for providing a demo version of their Clementine® software used in this study.
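To put the three network sizes above in perspective, the Python sketch below counts trainable parameters for each layer layout. It assumes fully connected feedforward layers with one bias per non-input neuron. The source does not state how Clementine parameterizes its networks, so the counts are indicative only, and the function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected feedforward network.

    Assumes every neuron connects to all neurons in the next layer and
    each non-input neuron has one bias term (an illustrative assumption).
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# Layer layouts as reported under Model Characteristics
quick = count_parameters([158, 8, 4])
multi = count_parameters([158, 5, 4])
pruned = count_parameters([38, 4, 2, 2, 4])
```

Under these assumptions the pruned network has far fewer parameters than the other two, mainly because pruning cut the input layer from 158 to 38 neurons.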