Cart Animation en Feb19 Final
Introduction
What?
A prediction model consisting of a series of if-else statements.
e.g., Vladimir Guerrero: 7 years, 200 hits. Predict his salary for next year?
Fig.: Regression tree. If Years < 4.5, predict 226 (n = 90); otherwise, if Hits < 118, predict 465 (n = 90), else 949 (n = 83). Salaries are in thousands of dollars.
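The tree above, written out as the if-else logic it represents (a minimal sketch in R; the function name is illustrative, and the leaf values are taken from the figure):
# The pruned tree as explicit if-else statements
predict_salary <- function(years, hits) {
  if (years < 4.5) return(226)
  if (hits < 118) return(465)
  949
}
predict_salary(7, 200)  # Guerrero: 7 years, 200 hits -> 949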
Background on CART
CART can model two types of response variable:
1. categorical (classification)
2. quantitative (regression)
Regression vs. Classification
Fig.: Regression (left) vs. Classification (right; class labels such as "died").
Terminology
Fig.: Tree terminology: an internal node C, the children of node C, and the terminal nodes (leaves).
A motivating example
Prediction of Major League Baseball Salaries
Major League Baseball (MLB) data from the 1986 and 1987 seasons. Available in the ISLR [4] R package:
library(ISLR)
data(Hitters)
Predictor variables:
1. X1 : Number of years in the major leagues
2. X2 : Number of hits in the previous season
Objective
Predict the annual salary (in thousands of dollars) at the start of the 1987 season using the predictor variables (years and hits).
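A minimal sketch of loading the data and keeping the complete cases (59 of the 322 players have no recorded 1987 salary; the object name hitters is illustrative):
library(ISLR)
data(Hitters)
# Drop players with a missing Salary, leaving 263 complete observations
hitters <- Hitters[!is.na(Hitters$Salary), ]
summary(hitters[, c("Years", "Hits", "Salary")])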
The Data
A Visual Representation of the Data
Fig.: Scatter plot of Hits versus Years for the 263 players; point size encodes Salary (legend: 500, 1000, 1500, 2000).
How does CART work?
First Split
Fig.: First split at Years < 4.5: region R1 (mean = 226) and region R2 (mean = 697), shown on the Hits vs. Years plane.
Second Split
Fig.: Second split: regions R1 (mean = 204) and R2 (mean = 454).
A Mistake in the Data
Second Split
Fig.: After the second split: regions R1 (mean = 204), R2 (mean = 454), R3 (mean = 465), and R4 (mean = 949).
Third Split
Fig.: Continuing to split: regions R1 (mean = 142), R2 (mean = 329), R3 (mean = 454), R4 (mean = 335), R5 (mean = 518), R6 (mean = 914), and R7 (mean = 1328).
And if we continue...
Fig.: The Hits vs. Years plane carved into many small regions as splitting continues.
Stop if the number of observations is less than 20
Fig.: The tree grown under this rule; the root split is Years < 4.5 (yes/no).
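A sketch of this stopping rule with rpart (minsplit = 20 happens to be rpart's default):
library(rpart)
# Do not attempt to split a node containing fewer than 20 observations
fit <- rpart(Salary ~ Years + Hits, data = Hitters,
             control = rpart.control(minsplit = 20))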
The Details
1. Selecting the Best Partition
Exhaustive Search for J = 4
1. A Top-Down “Greedy” Approach
Top-Down: It begins at the top of the tree, at which point all observations belong to a single region.
Binary Splits: Each split at the value s of the jth predictor creates exactly two children, R1 and R2, chosen to give the greatest possible reduction in the residual sum of squares:
$$R_1(j, s) = \{\, X \mid X_j < s \,\} \quad \text{and} \quad R_2(j, s) = \{\, X \mid X_j \ge s \,\}$$
The goal is to find the values j and s that minimize:
$$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \tag{2}$$
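A minimal sketch of this greedy search in R (the function best_split and the exhaustive double loop are illustrative, not from the slides; hitters is the complete-case data frame from earlier):
# Exhaustive search for the single split (j, s) minimizing the RSS in (2)
best_split <- function(X, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, j = names(X)[j], s = s)
    }
  }
  best
}
best_split(hitters[, c("Years", "Hits")], hitters$Salary)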
2. Stopping Rule
3. Pruning the Tree
We first grow the biggest possible tree, Tmax, and then prune it back to obtain the subtree that minimizes
$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha\,|T| \tag{3}$$
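In rpart, the role of α is played by the complexity parameter cp (roughly, α rescaled by the error of the root tree), and the pruning sequence can be inspected with printcp. A sketch:
library(rpart)
fit <- rpart(Salary ~ Years + Hits, data = Hitters)
printcp(fit)  # one row per candidate subtree: cp, tree size, CV error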
Background on Cross-Validation
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample
A B C D E F G H I J K L M N O P Q R S T | U V W X Y Z
Training set | Test set
Predictions on the test set (they match the first-split tree grown on the training set: 226 if Years < 4.5, else 697):

i   Years   yi     yi (pred)
U   5       373    697
V   3       277    226
W   15      1456   697
X   4       455    226
Y   1       235    226
Z   9       987    697
Background on Cross-Validation
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample
Fig.: 5-fold cross-validation: the sample is split into five folds; in each of five rounds one fold serves as the test set and the remaining four folds form the training set.
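A minimal sketch of k-fold CV for the tree (the fold assignment, the choice k = 5, and the RMSE summary are illustrative; hitters is the data frame from earlier):
set.seed(1)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(hitters)))
cv_mse <- sapply(1:k, function(f) {
  fit  <- rpart::rpart(Salary ~ Years + Hits, data = hitters[fold != f, ])
  pred <- predict(fit, newdata = hitters[fold == f, ])
  mean((hitters$Salary[fold == f] - pred)^2)
})
sqrt(mean(cv_mse))  # cross-validated RMSE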
Over-fitting
Fig.: Two curves of model performance (y-axis 0.4 to 0.8) against the number of splits (0 to 20), for the training data and held-out data, illustrating over-fitting.
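With rpart, this trade-off can be inspected directly (a sketch; cart_fit is the tree fitted on the next slide):
rpart::plotcp(cart_fit)  # cross-validated error against tree size / cp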
Fitting and Pruning with rpart
# Fit a regression tree of Salary on Years and Hits
cart_fit <- rpart::rpart(Salary ~ Years + Hits, data = Hitters)
# Find the complexity parameter with the smallest cross-validated error
min_ind <- which.min(cart_fit$cptable[, "xerror"])
min_cp <- cart_fit$cptable[min_ind, "CP"]
# Prune to that subtree and plot it
prune_fit <- rpart::prune(cart_fit, cp = min_cp)
rpart.plot::rpart.plot(prune_fit)
Fig.: The pruned tree: if Years < 4.5, predict 226 (n = 90); otherwise, if Hits < 118, predict 465 (n = 90), else 949 (n = 83).
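Using the pruned tree to answer the opening question (predict.rpart with a one-row data frame):
# Guerrero: 7 years, 200 hits
predict(prune_fit, newdata = data.frame(Years = 7, Hits = 200))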
Comparison with a Linear Model
Comparison: Linear Model vs. CART

                               Linear Model   CART
Distributional assumptions     ✓              ✗
Robust to multicollinearity    ✗              ✓
Linear Model
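The slide's exact model specification is not recoverable; a main-effects fit on the same two predictors is a reasonable sketch:
# Linear model with the same predictors as the tree
lm_fit <- lm(Salary ~ Years + Hits, data = Hitters)
summary(lm_fit)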
Regression Surface
Fig.: Predicted Salary as a surface over Years and Hits, one panel per model.
RMSE Performance: 10 times 10-fold CV
Fig.: RMSE distributions for CART without pruning, the linear model, and pruned CART.
R² Performance: 10 times 10-fold CV
Fig.: R² distributions for pruned CART, the linear model, and CART without pruning.
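One way to produce comparisons like these (a sketch assuming the caret package; the slides do not say which tooling was used, and hitters is the data frame from earlier):
library(caret)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
cart_cv <- train(Salary ~ Years + Hits, data = hitters, method = "rpart", trControl = ctrl)
lm_cv   <- train(Salary ~ Years + Hits, data = hitters, method = "lm", trControl = ctrl)
summary(resamples(list(CART = cart_cv, `linear model` = lm_cv)))  # RMSE and R²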
Advantages
Limitations
Exercise
Build a tree by hand, with no software!
Build a tree using the dataset provided below.
References I
[4] G. James, D. Witten, T. Hastie and R. Tibshirani. ISLR: Data for an Introduction to Statistical Learning with Applications in R. R package.
Session Info
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS