Expanding Toolsets for Prediction:
Analysis Using Larger Datasets With Many Variables

Regression vs. Decision Trees vs. Neural Networks

Spring 2006

Serge Herzog, PhD
Director, Institutional Analysis
University of Nevada, Reno
Reno, NV 89557
Serge@[Link]

Sponsored by: Center for Research Design and Analysis & the Office of the Vice President for Research

Outline

- Introduction
- Models
- Data
- Variables
- Exploratory analysis
- Model results
- Conclusion
- Clementine® software

Introduction
From Traditional Statistics to Data Mining

Definition of data mining (DM):
"…the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data…, and by using pattern recognition technologies, as well as statistical and mathematical techniques." (The Gartner Group, Inc.)

DM uses algorithms (i.e., a finite set of well-defined instructions) to:
- Classify
- Categorize
- Estimate (predict)
- Visualize
events or outcomes of interest.

Types of DM algorithms:
- Decision trees
- Artificial neural networks (ANNs)
- Cluster analysis
- Traditional techniques (e.g., regression, PCA)

Typically, DM projects follow the steps of the Cross-Industry Standard Process for Data Mining (CRISP-DM): Data Understanding & Preparation → Modeling → Evaluation → Deployment.

Introduction
From Traditional Statistics to Data Mining

Unsupervised DM:
- Explores and examines as-yet unknown patterns in data via classification and grouping techniques (e.g., clustering).

Supervised DM:
- Predictive models are built, or trained, using data for which the response variable is already known (e.g., rule-induction decision trees, neural networks, regression).
- Generated models use the learned information (based on the initial test dataset) to predict outcomes with validation data (or a holdout sample) and to adjust for overprediction in the test data; once acceptable levels of prediction are attained, the model can be applied to predict outcomes with new data.

Introduction
Purpose of Study

- Evaluate the accuracy of predicting a multinomial outcome of a grouped (categorical) variable using:
  - Logistic regression
  - Decision trees
  - Neural networks
- Highlight potential operational benefits in the context of the case study used.

Introduction
Focus of the Analysis

- Weigh the comparative accuracy of each analytical approach in predicting retention and time to degree completion (TTD) of undergraduate students at UNR.
- Though most predictors (IVs) are correlated with each outcome (and informed by scholarship), the analytical focus is not on explaining retention/TTD (i.e., model fit), but merely on predicting it.

Introduction
Assumptions in Regression Analysis

- The dependent (outcome) variable is continuous in ordinary least-squares regression, dichotomous in logistic regression.
- Independent (predictor) variables are uncorrelated with each other (i.e., no multicollinearity), though this is less of an issue in prediction.
- Independent variables are uncorrelated with the error term (i.e., the actual-to-predicted difference).
- Especially with small samples: the error term has a mean of zero and constant variance, and errors are uncorrelated with each other. Violations occur when error variance is correlated with one or more predictors, generating heteroscedasticity in cross-sectional data or autocorrelation in time-series data. (A diagnostic sketch for checking two of these assumptions follows below.)
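Two of these assumptions (no multicollinearity, homoscedastic errors) are routinely checked with standard diagnostics; a minimal sketch using statsmodels follows. The dataset, file name, and column names are illustrative, not from the study.

```python
# Sketch: checking multicollinearity (VIF) and heteroscedasticity
# (Breusch-Pagan) for an OLS model; data and columns are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("students.csv")                  # hypothetical dataset
X = sm.add_constant(df[["hs_gpa", "act_english", "act_math"]])
ols = sm.OLS(df["gpa_fall"], X).fit()

# VIF above ~10 is a common rule of thumb for problematic multicollinearity
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))

# Breusch-Pagan: a small p-value suggests error variance tied to predictors
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(ols.resid, X)
print("Breusch-Pagan p-value:", lm_pval)
```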

Introduction
Definition: Decision Trees

1) "A way to represent rules underlying data with hierarchical, sequential structures that recursively partition the data." (Murthy, 2005)
2) Iterative splitting of data into discrete groups, with the goal of maximizing the 'distance' between groups at each split. (Two Crows Corp., 1999)

Advantages (a rule-induction sketch follows after this slide):
- Exploratory
- Non-parametric
- Transparent process of rule induction
- Handles continuous and categorical data
- Computational efficiency in classification via hierarchical decomposition

[Figure: types of tree generation]

Introduction
Artificial Neural Networks

Definition:
"System composed of many processing elements operating in parallel whose function is determined by network structure, connection strengths, and processing performed at computing elements or nodes." (DARPA, 1988)

Advantages:
- Handles both linear and non-linear complex, interactive relationships among many variables
- No data-distribution or variance-homogeneity assumptions required
- Adaptive learning based on initial training data
- Handles continuous and categorical data

[Figure: typical backpropagation network; types of ANN generation. Source: [Link]]
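Since the slide emphasizes that a fitted tree is a transparent set of induced rules, a minimal sketch with scikit-learn may help make that concrete (the study itself used SPSS Clementine; the dataset and column names below are made up):

```python
# Sketch: recursive partitioning and readable rule induction with a CART tree.
# Data, file name, and columns are illustrative, not from the study.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("retention.csv")                 # hypothetical dataset
X = df[["hs_gpa", "credit_load", "on_campus"]]    # predictors
y = df["retained"]                                # outcome: returned for 2nd year?

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50).fit(X, y)

# The fitted tree is a human-readable hierarchy of if/then split rules:
print(export_text(tree, feature_names=list(X.columns)))
```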

Introduction
Artificial Neural Networks

(Definition and advantages as on the previous slide.) Synaptic weights are derived from:

$$O_j = f\Big(\sum_{i=1}^{n} o_i\, w_{ji}\Big), \qquad f(x) = \frac{1}{1+e^{-x}}, \qquad f'(x_i) = -\big(1+e^{-x}\big)^{-2}\, e^{-x}\,(-1) = \frac{e^{-x}}{\big(1+e^{-x}\big)^{2}}$$

where $O_j$ is the outcome, $x_i$ is the input vector, and $w_{ji}$ is the synaptic weight (randomly set for the first record processed). Vectors consist of one or more input variables (predictors). (A numerical sketch of this activation follows after the next slide.)

[Figure: typical backpropagation network; types of ANN generation]

Introduction
Training Tree and ANN Models

Controlling tree size with settings for:
- Maximum depth
- Limit on the number of records in a node
- Pruning of the full-size tree
- Adjusting for the 'downstream' effect of upper-level splits (still experimental?)
- 'Binning', i.e., converting continuous predictors into categorical ones
- 'Boosting', i.e., re-sampling misclassified records and combining weak classification nodes (with low purity based on error rate, Gini index, or cross-entropy) to reduce the noise-to-signal ratio

Calculating connection weights and shaping ANN architecture:
- Choosing the weighting factor: backpropagation, radial basis function, quasi-Newton, Levenberg-Marquardt, genetic algorithms
- Number of hidden layers and networks of the ANN topology
- Parameter settings to avoid local optima via the Alpha momentum term
- Eta decay rate to control the learning rate of the model
- Running continuous data to avoid 'categorical explosion'
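To make the notation concrete, here is a minimal numerical sketch of one node's forward pass, assuming f is the logistic sigmoid implied by the derivative above; the input and weight values are made up:

```python
# Sketch: one node's forward pass O_j = f(sum_i o_i * w_ji) with the logistic
# activation f and its derivative f' (the slope used by backpropagation).
import numpy as np

def f(x):
    """Logistic sigmoid activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):
    """Derivative: e^-x / (1 + e^-x)^2, as in the formula above."""
    return np.exp(-x) / (1.0 + np.exp(-x)) ** 2

o = np.array([0.2, 0.7, 1.0])    # inputs o_i (made-up values)
w = np.array([0.5, -0.3, 0.8])   # synaptic weights w_ji (randomly initialized)

net = o @ w                      # weighted sum over the input vector
print("output:", f(net), "gradient:", f_prime(net))
```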

Models
Analytical Approach

Compare prediction accuracy based on:
- Logistic regression (LR): baseline
- Decision trees based on rule induction:
  - Classification and regression tree (C&RT)
  - Chi-squared Automatic Interaction Detector (CHAID)
  - C5.0 algorithm
- Backpropagation perceptron neural nets:
  - Neural net, simple topology (Quick)
  - Neural net, parallel topologies (Multi)
  - Neural net, 3-layer topology (Prune)

Test generated models on a 'holdout' sample after a randomized 50/50 data partition (see the comparison sketch after this slide).

Models
Model Description

- CHAID: Uses chi-squared statistics to identify optimal splits with two or more subgroups. Starts with the most significant predictor, determines multiple-group differences, and collapses groups with no significance; the merging process stops at the preset testing level.
- C&RT: Generates splits based on maximizing orthogonality between subgroups (measured via the impurity index); all splits are binary, and the outcome variable can be continuous or categorical.
- C5.0: Uses the C5.0 algorithm to generate a decision tree or ruleset based on the predictor that provides maximum information gain. The split process continues until the sample is exhausted. Lowest-level splits are removed if they don't contribute to model significance.
- Neural net, multiple topologies: Creates several networks in parallel based on the specified number of hidden layers and units (nodes) in each layer. The learning rate is a function of the specified number of cycles and the Eta decay rate.
- Neural net, prune method: Starts with a large network of layers and nodes as specified and removes the weakest units during training; the learning rate is determined via the specified Eta decay rate.
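As a rough analogue of this design, the sketch below fits one representative of each model family and scores it on a randomized 50/50 holdout partition. scikit-learn's CART tree and MLP stand in for Clementine's C&RT/CHAID/C5.0 trees and Quick/Multi/Prune nets; the dataset and columns are hypothetical.

```python
# Sketch: logistic regression baseline vs. decision tree vs. neural net,
# each scored on a randomized 50/50 holdout partition as in the study design.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("ttd.csv")                            # hypothetical data
X, y = df.drop(columns="ttd_group"), df["ttd_group"]   # multinomial outcome
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)               # 50/50 partition

models = {
    "logit (baseline)": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, min_samples_leaf=25),
    "neural net": MLPClassifier(hidden_layer_sizes=(8,), max_iter=500),
}
for name, m in models.items():
    m.fit(X_train, y_train)
    print(name, "holdout accuracy:", m.score(X_test, y_test))
```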

Models
Models Tested

Second-year retention (based on previously established models):
- New freshmen at end of fall term
- Spring-retained at end of spring term
- Both measured at end of second year
- Measured outcome: 1) returned for second year; 2) transferred out within 1 year; 3) dropped out/stopped out

Degree completion:
- All graduates (including transfers)
- Graduates who started as new freshmen
- Measured outcome: 1) 3 years or less; 2) 4 years at the institution; 3) 5 years; 4) 6 years or more

Data
Samples and Sources

Data samples drawn from the U. of Nevada-Reno (a land-grant, Carnegie Extensive Research Institution):
- Retention: 8,018 new full-time freshmen entering fall semesters 2000 through 2003 (96% of total cohort population*)
- Degree completion: 15,457 undergraduate degree recipients graduating in spring 1995 through summer 2005 (99% of total undergraduate-level graduates after listwise deletion of incomplete records; 85 multiple-degree holders counted once)

Data sources:
- Student Information System
- Human Resources System
- ACT Student Profile Section

* Excluding athletes, non-degree-seeking, and foreign students

Data
Data Types Used in Both Cases

- Flag/binomial, e.g., yes/no, 0/1
- Set/grouped, e.g., NV, other US, foreign
- Ordinal/rank, e.g., high/medium/low
- Range/scale, i.e., continuous numerical

(An encoding sketch follows this slide.)

Data
Number of Predictor Variables Used

- Retention: 40-50 (depending on model)
- Time to Degree: 80-100 (depending on model)
- In regression, more variables typically increase model complexity and the difficulty of estimating and interpreting partial effect sizes of individual predictors
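These four measurement levels map onto column types in a typical analysis stack; purely as an illustration (the columns are made up), a pandas encoding might look like:

```python
# Sketch: encoding the four data types above in pandas; columns are made up.
import pandas as pd

df = pd.DataFrame({
    "on_campus": [1, 0, 1],                      # flag/binomial (0/1)
    "residency": ["NV", "other US", "foreign"],  # set/grouped (unordered)
    "class_size": ["low", "high", "medium"],     # ordinal/rank
    "hs_gpa": [3.2, 2.8, 3.9],                   # range/scale (continuous)
})
df["on_campus"] = df["on_campus"].astype(bool)
df["residency"] = df["residency"].astype("category")
df["class_size"] = pd.Categorical(
    df["class_size"], categories=["low", "medium", "high"], ordered=True)
print(df.dtypes)
```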

Variables
Variables Examined: Retention
(*range/scale   ~flag/binomial   ^ordinal/rank   ``set/multinomial)

Student demographics:
- Gender ~
- Age *^
- Ethnicity/race ``
- Residency ``
- Parent income `` (incl. missing category)

Pre-collegiate experience:
- High school GPA *
- ACT English/math scores * (SAT converted)
- ACT/SAT test date *^
- Academic preparation index *^
- Pre-fall summer enrollment ~
- AP/IB credits ~
- Graduate degree aspiration ~

Campus experience:
- On-campus living ~
- Use of athletic facilities ~
- Dual enrollment w/ community colleges ~
- Program major type ``
- Fall entry term ``
- Attempted registrations ^
- Average class size ^

Academic experience:
- Academic peer challenge ^
- Fall/1st-year GPA *^
- Credit load (<15) ~
- Major requires Calc 1 ~
- Nat/Phys science courses ^
- Remedial math taken ~
- Remedial English taken ~
- Math credits earned ^
- All and math transfer credits ~
- Fall/spring math grades ``
- Math 'D'/'F' grades ~
- Math 'I'/'W' grades ~
- Passed 1st-year math ~
- English 101/102 grades ``

Financial aid:
- Fall/spring package by type of aid included ``: no aid; package with loans and/or work study; grants and/or scholarships only; Millennium Scholarship only
- 2nd-year package offer by type of aid included ``: as above
- Fall/spring institutional aid amount ($) *
- 2nd-year institutional aid amount offered ($) *
- Fall/spring Pell Grant aid ~
- Millennium aid status (fall/spring) ``: never had it; received it, maintains eligibility; lost eligibility, continues eligibility
- Financial aid need ($) *: fall and spring remaining; 1st-year total remaining; fall and spring total need before calculated aid offered

Variables
Variables Examined: Time-to-Degree
(*range/scale   ~flag/binomial   ^ordinal/rank   ``set/multinomial)

Student demographics:
- Gender ~
- Age *
- Ethnicity/race ``
- Residency ``

General experience:
- Initial status (new vs. transfer) ~
- Initial program major ``
- Graduating major ``
- Number of program major changes *
- Graduated with minor ~
- Completed a senior thesis ~
- Attempted registrations *
- Participated in varsity sports ~
- Stopout time since first enrollment (%) *^

Pre-collegiate experience:
- ACT English *
- ACT math *
- ACT Composite *
- Took remedial math/English ~

Outside course experiences:
- Took overseas courses (USAC) ~
- Took Continuing Education courses ~
- Took courses at TMCC *
- Took courses at WNCC *
- Took internships ^
- English courses transferred in ^
- Math courses transferred in ^
- English distance courses ~
- Math distance courses ~
- Core Humanities transferred in ^

Campus course experiences:
- Took honors courses ~
- Took independent studies ^
- Repeated a course ~
- Number of D/F grades (%) *
- Number of I/W grades (%) *
- Number of replacement grades *
- Capstone courses taken *
- 'Diversity' courses taken *
- Natural science courses in three core areas (3 variables) *
- Cumulative GPA *^
- GPA trend *

Course grades:
- Remedial math ``
- College algebra ``
- College general math ``
- College trigonometry ``
- Intro to statistics ``
- Business calculus ``
- Calculus 1 ``
- English 101 ``
- English 102 ``
- Core Humanities 201-203 ``
- General capstone ``
- Program major capstone ``

Variables
Variables Examined: Time-to-Degree (cont.)
(*range/scale)

Credit hours:
- Total credits accumulated *
- Total transfer credits articulated *
- Total campus credits *
- Total math credits *
- Total upper-division science credits *
- Total credits transferred in *
- Earned/attempted credits (%) *
- Average credit load per enrolled term *

Faculty teaching courses taken:
- Percent female *
- Percent ethnic/racial minority *
- Percent part-time faculty *
- Percent adjunct faculty *
- Percent at full-professor rank *
- Average faculty age *

Financial aid:
- Total aid received *
- Loans *
- Grants *
- Work study *
- Merit-based aid *
- Need-based aid *
- General fund aid *
- Outside aid *
- UNR Foundation aid *
- Academic department-based aid *
- Grants-in-Aid *
- Millennium Scholarship *
- Pell Grant aid *

Variables
Imputation of Missing Values

Ratio of earned to attempted credits (%) derived via linear regression for 493 cases (a sketch of this approach follows this slide).

Model Summary:

  Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
  1       .837a   .701       .700                4.06135

a. Predictors: (Constant), Repeated a course, ACT Math (miss=mean), Number of WNCC courses taken, Number of replacement grades, Number of TMCC courses taken, Number of diversity courses taken, % of I/W grades received, Number of math credits taken, Took honors courses, % of D/F grades received, ACT English (miss=mean), Number of upper-division science credits taken, ACT Composite (miss=mean)
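A rough analogue of this two-pronged imputation (regression-based for the credit ratio, mean substitution for ACT scores), with illustrative column names, follows:

```python
# Sketch: regression-based imputation of the earned/attempted credit ratio,
# plus mean substitution for missing ACT scores; columns are illustrative.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("graduates.csv")                 # hypothetical dataset
predictors = ["repeated_course", "math_credits", "honors", "pct_df_grades"]

# Fit on complete cases, then predict the ratio for cases missing it
known = df["earned_attempted_pct"].notna()
reg = LinearRegression().fit(df.loc[known, predictors],
                             df.loc[known, "earned_attempted_pct"])
df.loc[~known, "earned_attempted_pct"] = reg.predict(df.loc[~known, predictors])

# Mean value substitution for missing ACT scores
for col in ["act_english", "act_math", "act_composite"]:
    df[col] = df[col].fillna(df[col].mean())
```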

Variables
Imputation of Missing Values (cont.)

- Total campus credits derived via a general linear model (R² = 0.682; 1 factor, 18 covariates) for 804 cases
- Mean value substitution of missing ACT scores for 230 cases

Variable definitions:
- Initial study major: declared/pre-major, undeclared, non-degree seeking, Intensive English
- Attempted registrations: registration attempt at the time of a fully subscribed class section during the registration period
- Stopout time: number of fall/spring semesters not in attendance after first campus-based course enrollment
- Number of replacement grades: students may repeat up to 12 lower-division credits to replace original UNR grades
- GPA trend: ratio of 24-credit GPA to final cumulative GPA
- Natural sciences core course offerings geared to 3 groups of majors: a) social science, b) natural science, c) engineering

Exploratory Analysis

Learn about variable relationships during the exploratory stage.

[Scatterplot: merit-based aid vs. grades and completion time. Does merit-based aid go to fast completers with higher grades?]

Exploratory Analysis

[Charts: exploratory plots, including a pattern of declining GPA]

[Charts: additional exploratory plots]

[Charts: faster completion is associated with exposure to adjunct faculty]
Model Results
Freshmen Retention Measured at End of Fall: Prediction Accuracy

[Bar chart: % of cases correctly predicted by the Quick, Multi, and Prune neural nets, the CHAID, C&RT, and C5.0 decision trees, and the logistic regression baseline, for the training (N=4,079) and validation (N=3,939) partitions. Quick-Cont = Quick net run on continuous variables.]

Model Results
Freshmen Retention Measured at End of Spring: Prediction Accuracy

[Bar chart: same models and partitions; accuracy increases by almost 10 percentage points compared to the fall measure.]

(The accuracy computation behind these charts is sketched below.)
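The accuracy plotted here is simply the share of correctly classified cases in each partition; as a minimal sketch (assuming fitted scikit-learn-style classifiers and the partition arrays from the earlier 50/50 split sketch):

```python
# Sketch: percent of cases correctly predicted on the training vs. validation
# partitions for a fitted classifier; inputs follow the earlier 50/50 split.
from sklearn.metrics import accuracy_score

def partition_accuracy(model, X_train, y_train, X_test, y_test):
    """Return (training, validation) accuracy as fractions in [0, 1]."""
    return (accuracy_score(y_train, model.predict(X_train)),
            accuracy_score(y_test, model.predict(X_test)))
```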

Model Results
Time to Degree Measured with All Students: Prediction Accuracy

[Bar chart: % of cases correctly predicted, same models, for the training (N=7,859) and validation (N=7,598) partitions; roughly a 50% improvement over the logit model.]

Model Results
Time to Degree Measured with New Students: Prediction Accuracy

[Bar chart: training (N=4,727) and validation (N=4,564) partitions; accuracy improves when the sample is restricted to new students only.]
Model Results
Time to Degree Measured with New Students: Prediction Accuracy with Validation Data, 'Six Years or More' Outcome

[Bar chart: % of cases correctly predicted, same models, for new and transfer students (N=7,598) vs. new students only (N=4,564).]

Model Results
Freshmen Retention Measured at End of Fall: Confidence Level of Correctly Predicted Cases

[Bar chart: mean confidence of correctly predicted cases, same models plus Quick-Cont (continuous variables), for the training (N=4,079) and validation (N=3,939) partitions.]

Model Results
Freshmen Retention Measured at End of Spring: Confidence Level of Correctly Predicted Cases

[Bar chart: mean confidence, same models, for the training (N=4,079) and validation (N=3,939) partitions.]

Model Results
Time to Degree Measured with All Students: Confidence Level of Correctly Predicted Cases

[Bar chart: mean confidence, same models, for the training (N=7,859) and validation (N=7,598) partitions.]

(A sketch of this confidence measure follows.)
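The 'confidence level' in these charts appears to be the model's probability for the predicted class, averaged over correctly classified cases; under that reading (an interpretation of the charts, not Clementine's documented definition), a sketch is:

```python
# Sketch: mean confidence of correctly predicted cases, reading "confidence"
# as the predicted-class probability of a scikit-learn-style classifier.
import numpy as np

def mean_confidence_correct(model, X, y):
    """Average max class probability over the correctly classified cases."""
    proba = model.predict_proba(X)               # shape (n_cases, n_classes)
    pred = model.classes_[np.argmax(proba, axis=1)]
    correct = pred == np.asarray(y)
    return proba[correct].max(axis=1).mean()
```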

Model Results
Time to Degree Measured with New Students: Confidence Level of Correctly Predicted Cases

[Bar chart: mean confidence, same models, for the training (N=4,727) and validation (N=4,564) partitions.]

Conclusion
Comparison of Model Accuracy

Mean accuracy level:
- Marginal improvement over regression when using more 'matured' variables in the established retention model
- Significant improvement over regression when using a greater number of exploratory variables in the time-to-degree model, especially with the multi-layer, pruned neural net and the C5.0 decision tree

Confidence level of correct prediction:
- Better results for the decision trees and the regression model compared to the neural nets, except for the more complex model (i.e., time-to-degree with new and transfer students): high for the pruned neural net, low for C5.0

Conclusion
Potential Operational Benefits

- Fifteen-percentage-point improvement in correctly estimating time to degree
  - End-of-sophomore-year model: improved classification of 525 second-year students at the examined institution
- Likely enhancement of institutional enrollment projections, which are based on a class-standing flow model
- Better targeting of students 'at risk' prior to choosing/commencing a program major (mitigates the chance of subsequent changes in major)
- Accelerated degree completion for estimated 6-year-plus graduates
  - The net present cost of a four-year degree to the average student entering college in 2003 is ~$107,000 (opportunity cost minus total attendance cost) (Barrow and Rouse, 2005). Faster completion reduces the net cost and the time needed to recoup the investment.
  - Speeding up time to graduation by one year may save a student around $28,000 in forgone earnings (not counting the higher increment of tuition and fees for a 6-year graduate compared to a 5-year graduate).

Clementine®
SPSS Clementine® Data-Stream Pane

[Screenshot: Clementine data-stream pane]

Clementine®
SPSS Clementine® Data-Mining Application

Generated ANN model characteristics:

[Screenshot: ANN model browser]

Generated CHAID ruleset (left) and coefficients for the logistic regression model (right); a scoring sketch follows:

Equation for '5 years':
   -0.397       * GPAfinal +
    0.05718     * CAPST +
   -0.03169     * MACRS +
    0.0005202   * UDSCI +
   -0.04871     * TMCC +
    0.01273     * WNCC +
   -0.02395     * DIVCL +
   -0.1026      * AGE +
   -0.0001391   * TOTAL +
    0.000129    * LOANS +
    0.00009678  * GRANT +
    0.0002089   * WORKS +
    0.0001765   * MERIT +
    0.000006031 * NEEDB +
   -0.00002301  * GENFN +
   -0.00008531  * OUTSI +
   -0.00007588  * UNRFN + …
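These are multinomial-logit coefficients for one outcome category; scoring a case means taking the coefficient-weighted sum of its variable values. A sketch restricted to the coefficients visible above (the screenshot is truncated, so the intercept and any remaining terms are missing and the result is illustrative only):

```python
# Sketch: the '5 years' linear predictor from the visible coefficients.
# The screenshot is truncated: the intercept and later terms are not shown,
# so this reproduces only part of the equation.
coefs = {
    "GPAfinal": -0.397, "CAPST": 0.05718, "MACRS": -0.03169,
    "UDSCI": 0.0005202, "TMCC": -0.04871, "WNCC": 0.01273,
    "DIVCL": -0.02395, "AGE": -0.1026, "TOTAL": -0.0001391,
    "LOANS": 0.000129, "GRANT": 0.00009678, "WORKS": 0.0002089,
    "MERIT": 0.0001765, "NEEDB": 0.000006031, "GENFN": -0.00002301,
    "OUTSI": -0.00008531, "UNRFN": -0.00007588,
}

def linear_predictor(case):
    """Coefficient-weighted sum over one student's variable values."""
    return sum(w * case.get(name, 0.0) for name, w in coefs.items())
```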

Clementine®
SPSS Clementine® Data-Mining Application

Generated C&RT tree structure:

[Screenshot: C&RT tree]

Generated C&RT model characteristics:

[Screenshot: C&RT model browser]
Clementine®
Model Characteristics

- Logistic regression: main effects, simple entry; Cox & Snell = 0.728; run time 23 sec; strongest variables: average credit load, transfer credits, residency, stopout time, English 102 grade, English transfer
- C5.0 decision tree: tree depth: 1; no boosting; rules for each outcome: 38, 23, 40, 53 (default: 4 years); run time: 7 sec
- C&R decision tree: tree depth: 5; run time 39 sec
- CHAID decision tree: tree depth: 4; run time 21 sec
- Neural net, quick method: neurons per layer: 158 at input, 8 in hidden layer, 4 in output; run time 22 sec
- Neural net, multi-topology: neurons per layer: 158 at input, 5 in hidden layer, 4 in output; run time 29 min 8 sec; strongest variables: average credit load, age, transfer credits, stopout time, earned/attempted credits, English transfer, starting major, change of major; topology settings: 2 20 3; 2 25 5; 2 22
- Neural net, multilayer pruned: neurons per layer: 38 at input, 4 in 1st hidden layer, 2 in 2nd hidden layer, 2 in 3rd hidden layer, 4 in output; run time 45 min; strongest variables: average credit load, age, application status, stopout time, GPA trend, transfer credits, earned/attempted credits, residency, starting major, change of major, % irregular faculty (a topology sketch follows)

Link to presentation: [Link]

Acknowledgement: The author thanks SPSS Inc. for providing the demo version of its Clementine® software used in this study.
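For comparison, the 38-4-2-2-4 shape reported for the pruned net could be written in scikit-learn roughly as below; note this only mirrors the final topology, since Clementine's prune method starts large and removes weak units during training, while sklearn fixes the architecture up front:

```python
# Sketch: an MLP with the 38-4-2-2-4 shape reported for the pruned net.
# This mirrors only the final topology; Clementine's prune method arrives
# at it by removing weak units from a larger starting network.
from sklearn.neural_network import MLPClassifier

net = MLPClassifier(
    hidden_layer_sizes=(4, 2, 2),  # three hidden layers, as on the slide
    activation="logistic",         # sigmoid units, as in backpropagation nets
    solver="sgd",
    momentum=0.9,                  # loosely analogous to the Alpha momentum term
    learning_rate="adaptive",      # loosely analogous to the Eta decay rate
    max_iter=1000,
)
# net.fit(X, y) would expect 38 input features and 4 outcome classes.
```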

References

Ampazis, N. Introduction to neural networks. [Link] Downloaded 10/3/05.
Baker, B. D., and Richards, C. E. (1999). A comparison of conventional linear regression methods and neural networks for forecasting educational spending. Economics of Education Review 18: 405-415.
Barrow, L., and Rouse, C. E. (2005). Does college still pay? The Economists' Voice 2(4): 1-8.
Byers Gonzalez, J. M., and DesJardins, S. L. (April 2002). Artificial neural networks: A new approach for predicting application behavior. Research in Higher Education 43(2): 235-258.
Defense Advanced Research Projects Agency [DARPA] (1988). Neural Network Study. AFCEA International Press.
Eno, D., McLaughlin, G. W., Brozovsky, P., and Sheldon, P. (May 1998). Predicting freshman success based on high school record and other measures. Paper presented at the Association for Institutional Research Forum, Minneapolis, MN.
Everson, H. T., Chance, D., and Lykins, S. (April 1994). Using artificial neural networks in educational research: Some comparisons with linear statistical models. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Goodman, P. H., and Harrell, F. E. Neural networks: Advantages and limitations for biostatistical modeling. Washoe Medical Center, Reno, NV. Paper available at [Link]/nevprop.
Luan, J. (June 2002). Data mining and knowledge management in higher education: Potential applications. Paper presented at the Association for Institutional Research Forum, Toronto, Canada.
Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery 2: 345-389.
Porter, S. R. (June 1999). Viewing one-year retention as a continuum: The use of dichotomous logistic regression, ordered logit and multinomial logit. Paper presented at the Association for Institutional Research, Seattle, WA.
Song, Q., and Chissom, B. S. (April 1993). New models for forecasting enrollments: Fuzzy time series and neural network approaches. Paper presented at the American Educational Research Association, Atlanta, GA.
Stergiou, C. What is a neural network? [Link] Downloaded 10/3/05.
Thomas, E., Dawes, W., and Reznik, G. (2001). Using predictive modeling to target student recruitment: Theory and practice. AIR Professional File, Number 78.
Using data mining to detect fraud. SPSS white paper available at [Link].
Van Nelson, C., and Neff, K. J. (October 1990). Comparing and contrasting neural network solutions to classical statistical solutions. Paper presented at the Midwestern Educational Research Association, Chicago, IL.
Vijayaraman, B. S., and Osyk, B. (1997). A survey of neural network publications. ERIC ED-422942. (Original in: Proceedings of the International Academy for Information Management Annual Conference, Atlanta, GA, December 12-14, 1997.)
Wilkie, P., and Pugh, D. Understanding your financial customers with Clementine: Neural networks in Royal SunAlliance Life and Pensions. SPSS white paper available at [Link].