
Introduction to Regression Trees

Sahir Rai Bhatnagar, PhD Candidate (Biostatistics)

Department of Epidemiology, Biostatistics and Occupational Health

February 19, 2018

1
Introduction

2
What?
A prediction model consisting of a series of if-else statements.
e.g., Vladimir Guerrero: 7 years, 200 hits. What salary do we predict for him next year?

Fig.: Regression tree:
Years < 4.5?
  yes: salary = 226 (n = 90)
  no:  Hits < 118?
         yes: salary = 465 (n = 90)
         no:  salary = 949 (n = 83)
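Reading the tree for Guerrero: Years = 7 is not < 4.5 and Hits = 200 is not < 118, so the predicted salary is 949 (in thousands of dollars). A minimal sketch of the same rule written as R code, with the leaf values copied from the tree above:

# Prediction rule read off the tree above; salaries in thousands of dollars
predict_salary <- function(years, hits) {
  if (years < 4.5) {
    226
  } else if (hits < 118) {
    465
  } else {
    949
  }
}

predict_salary(years = 7, hits = 200)  # Vladimir Guerrero -> 949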

3
Background on CART

Recursive partitioning (or segmentation) methods were first introduced in the 1960s.

They were formalized by Breiman et al. (1984) [1] under the acronym CART: Classification and Regression Trees.

CART can be applied to both regression and classification problems, depending on the response (outcome) variable:
1. qualitative (classification)
2. quantitative (regression)

4
Regression vs. Classification

Fig.: Regression: the Hitters tree (Years < 4.5, then Hits < 118; leaf means 226, 465, 949).
Fig.: Classification: a tree predicting died/survived from sex, age and sibsp, with class proportions at each node.

Today’s class → regression

5
Terminology

Fig.: Tree terminology: the root of the tree at the top; an internal node C with its parent above and its children below; the terminal nodes are the leaves; tree depth = number of splits (= 2 in this example).
6
A motivating example

7
Prediction of Major League Baseball Salaries
Major League Baseball (MLB) data from the 1986 and 1987
seasons. Available in the ISLR [4] R package:
library(ISLR)
data(Hitters)

Response variable $y_i$, $i = 1, \ldots, 263$: 1987 annual salary on opening day, in thousands of dollars.

Predictor variables:
1. $X_1$: Number of years in the major leagues
2. $X_2$: Number of hits in 1986

Objective
Predict the annual salary at the start of the 1987 season using the predictor variables (Years and Hits).
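In the ISLR Hitters data, players with a missing 1987 salary are presumably dropped to obtain the 263 observations; a minimal preparation sketch (the object name hitters is ours and is reused in later sketches):

library(ISLR)
data(Hitters)

# Keep only players with a recorded 1987 salary and the two predictors
hitters <- Hitters[!is.na(Hitters$Salary), c("Years", "Hits", "Salary")]
nrow(hitters)  # 263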

8
The Data

A sample of what the data looks like:

                     Years   Hits   Salary
Andre Dawson            11    141      500
Andres Galarraga         2     87       92
Barry Bonds              1     92      100
Cal Ripken               6    177     1350
Gary Carter             13    125     1926
Joe Carter               4    200      250
Ken Griffey             14    150     1000
Mike Schmidt             2      1     2127
Tony Gwynn               5    211      740

9
A Visual Representation of the Data


Fig.: Scatter plot of Hits (y-axis, 0 to 250) against Years (x-axis, 0 to 25) for the 263 players; point size indicates Salary (legend: 500, 1000, 1500, 2000).
10
How does CART work?

Roughly speaking, there are two steps [3]:


1. We divide the predictor space, that is, the set of possible values for $X_1, X_2, \ldots, X_p$, into $J$ non-overlapping and exhaustive regions $R_1, R_2, \ldots, R_J$.

2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$ (see the sketch below).
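A quick sketch of step 2, using the hitters data frame prepared earlier and the first split at Years < 4.5 shown on the next slide (the region means are quoted from that slide, so treat the exact values as approximate):

# Predicted value in each region = mean training salary in that region
region <- ifelse(hitters$Years < 4.5, "R1", "R2")
tapply(hitters$Salary, region, mean)
#   R1    R2
#  ~226  ~697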

11
First Split

Fig.: First split of the Hits vs. Years plane at Years < 4.5: R1 (Years < 4.5), mean = 226; R2 (Years >= 4.5), mean = 697. Point size indicates Salary (500, 1000, 1500, 2000).

12
Second Split

Fig.: Second split, on Hits: the upper (high-Hits) region R1 has mean = 204 and the lower (low-Hits) region R2 has mean = 454.

13
A Mistake in the Data

                     Years   Hits    Salary
Andre Dawson            11    141     500.00
Andres Galarraga         2     87      91.50
Barry Bonds              1     92     100.00
Cal Ripken               6    177    1350.00
Gary Carter             13    125    1925.57
Joe Carter               4    200     250.00
Ken Griffey             14    150    1000.00
Mike Schmidt             2      1    2127.33
Tony Gwynn               5    211     740.00

Mike Schmidt started his career in 1972 and was inducted into the Baseball Hall of Fame in 1995, so the record above (2 years and 1 hit, yet a salary of 2127.33) is clearly a mistake.

14
Second Split
Fig.: After further splits on Hits: within Years < 4.5, R1 (mean = 204) and R2 (mean = 454); within Years >= 4.5, R3 (mean = 465, lower Hits) and R4 (mean = 949, higher Hits).

15
Third Split
Fig.: Third split: the Hits vs. Years plane is now partitioned into regions R1 to R7, with mean salaries of 142, 329, 335, 454, 518, 914 and 1328.

18
And if we continue...
Fig.: Continuing to split produces many small rectangular regions of the Hits vs. Years plane.
19
Stop if the number of observations is less than 20
Fig.: The tree grown until every node has fewer than 20 observations. The root split is Years < 4.5, followed by repeated splits on Hits and Years; the resulting leaves have mean salaries ranging from 76 (n = 11) to 1328 (n = 7).
20
The Details

21
The Details

The CART algorithm requires 3 components:


1. Defining a criterion to select the best partition among all predictors.

2. A rule to decide when a node is terminal, i.e., it becomes a leaf.

3. Pruning the tree to avoid over-fitting.

22
1. Selecting the Best Partition

The objective is to find the regions $R_1, \ldots, R_J$ that minimize the squared-error loss:

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left(y_i - \hat{y}_{R_j}\right)^2 \qquad (1)$$

$\hat{y}_{R_j}$: the mean response for the training observations within the $j$th box.

Finding the exact solution to (1) is computationally infeasible (NP-hard). Why?
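For concreteness, a minimal sketch of how the loss in (1) could be evaluated for a given assignment of observations to regions (the function name rss_total is ours):

# Total squared-error loss (1) for a given region assignment
rss_total <- function(y, region) {
  sum(tapply(y, region, function(yr) sum((yr - mean(yr))^2)))
}
# e.g. rss_total(hitters$Salary, ifelse(hitters$Years < 4.5, "R1", "R2"))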

23
Exhaustive Search for J = 4

24
1. A Top-Down “Greedy” Approach
Top-Down: It begins at the top of the tree, at which point all observations belong to a single region.

Binary Splits: Each split of the $j$th predictor at a value $s$ creates exactly two children,

$$R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \ge s\},$$

chosen to give the greatest possible reduction in the residual sum of squares. That is, the goal is to find the values $j$ and $s$ that minimize:

$$\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2 + \sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2 \qquad (2)$$

Greedy: at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that would lead to a better tree at some future step (a search sketch follows below).
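A minimal sketch of this exhaustive search over (j, s) pairs, not the actual rpart implementation (the function name best_split and the hitters data frame are ours):

# For each predictor j and candidate cut point s, compute the RSS of the two
# children and keep the (j, s) pair with the smallest total RSS, as in (2).
best_split <- function(X, y) {
  best <- list(rss = Inf, variable = NA, cutpoint = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) {
        best <- list(rss = rss, variable = names(X)[j], cutpoint = s)
      }
    }
  }
  best
}

# best_split(hitters[, c("Years", "Hits")], hitters$Salary)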
25
The Best Split Using a “Greedy” Approach

26
2. Stopping Rule

minsplit: to avoid creating splits that will lead to very small leaves, the minimum number of observations that must exist in a node in order for a split to be attempted (minsplit = 20 is the default in rpart).

minbucket: the minimum number of observations in any terminal leaf node (minbucket = minsplit/3 is the default in rpart).

Both can be set explicitly, as in the sketch below.
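A minimal sketch of passing these stopping parameters through rpart.control, reusing the Hitters formula that appears later in the slides:

library(rpart)
# minsplit = 20 and minbucket = 7 (roughly minsplit/3) are the defaults,
# shown explicitly here for illustration
fit <- rpart(Salary ~ Years + Hits, data = hitters,
             control = rpart.control(minsplit = 20, minbucket = 7))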

27
3. Pruning the Tree

The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test-set performance.

This is because the resulting tree, $T_{max}$ with $|T_{max}|$ leaves, might be too complex.

A smaller tree with fewer splits (that is, fewer regions $R_1, \ldots, R_J$) might lead to lower prediction variance and better interpretation, at the cost of a little bias. What is this phenomenon called?

28
3. Pruning the Tree
We first grow the biggest possible tree $T_{max}$ and then prune it back in order to obtain a subtree.

We consider adding a penalty to our loss function in order to penalize excessively large trees.

For each value of $\alpha$, there exists a subtree $T \subset T_{max}$ that minimizes:

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha |T| \qquad (3)$$

$|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the rectangle corresponding to the $m$th leaf, and $\hat{y}_{R_m}$ is the predicted response associated with $R_m$.

$\alpha$ is chosen using $v$-fold cross-validation ($v$ corresponds to xval in rpart; xval = 10 by default).
29
Background on Cross-Validation

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample

30
Background on Cross-Validation

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample

A B C D E F G H I J K L M N O P Q R S T    U V W X Y Z
Training set (A–T)                          Test set (U–Z)

31
Background on Cross-Validation

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample

A B C D E F G H I J K L M N O P Q R S T    U V W X Y Z
Training set (A–T)                          Test set (U–Z)

 i   Years    y_i    y_i (pred)
 U      5      373       697
 V      3      277       226
 W     15     1456       697
 X      4      455       226
 Y      1      235       226
 Z      9      987       697
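A sketch of how held-out predictions like these can be produced in R, using a single train/test split of the hitters data (the 80/20 split and the seed are our choices, not taken from the slides):

set.seed(1)
test_idx <- sample(nrow(hitters), size = round(0.2 * nrow(hitters)))
train <- hitters[-test_idx, ]
test  <- hitters[test_idx, ]

fit   <- rpart::rpart(Salary ~ Years + Hits, data = train)
preds <- predict(fit, newdata = test)
mean((test$Salary - preds)^2)  # test-set mean squared error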

34
Background on Cross-Validation
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample

Fig.: 5-fold cross-validation: the sample A–Z is divided into 5 folds; in each of the 5 rounds a different fold serves as the test set while the remaining folds form the training set.

36
Over-fitting

Fig.: Mean squared error as a function of the number of splits (0 to 20), computed on the complete data set and by 10-fold cross-validation.

37
Fitting and Pruning with rpart
# Fit the full tree, then prune at the cp value with the smallest
# cross-validated error (xerror) in the cp table
cart_fit <- rpart::rpart(Salary ~ Years + Hits, data = Hitters)
min_ind <- which.min(cart_fit$cptable[, "xerror"])
min_cp <- cart_fit$cptable[min_ind, "CP"]
prune_fit <- rpart::prune(cart_fit, cp = min_cp)
rpart.plot::rpart.plot(prune_fit)

Fig.: The pruned tree:
Years < 4.5?
  yes: salary = 226 (n = 90)
  no:  Hits < 118?
         yes: salary = 465 (n = 90)
         no:  salary = 949 (n = 83)

38
Comparison with a Linear Model

39
Comparison: Linear Model vs. CART

Characteristic                      Linear Model    CART
Linearity assumption                    yes          no
Distributional assumptions              yes          no
Robust to multicollinearity             no           yes
Handles complex interactions            no           yes
Allows for missing data                 no           yes
Confidence intervals, p-values          yes          no

40
Linear Model

lm(Salary ~ Years * Hits, data = Hitters)

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     159.55        95.65      1.67       0.10
Years           -16.08        11.38     -1.41       0.16
Hits              0.60         0.87      0.69       0.49
Years:Hits        0.54         0.11      5.08       0.00

Table: $R^2 = 0.41$

41
Regression Surface

Fig.: Linear Model: a smooth regression surface of Salary over Years and Hits.
Fig.: CART: a piecewise-constant (step) surface of Salary over the same predictors.

42
RMSE Performance: 10 times 10-fold CV

Fig.: RMSE from 10 repeats of 10-fold cross-validation (95% confidence intervals) for three models: CART without pruning, the linear model, and CART. RMSE values lie roughly between 320 and 380.
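A sketch of how a comparison like this could be run with the caret package (caret appears in the session info, but the exact settings used for these slides are not shown, so treat this as illustrative):

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

set.seed(42)  # same seed before each call so both models see the same folds
cart_cv <- train(Salary ~ Years + Hits, data = hitters, method = "rpart", trControl = ctrl)
set.seed(42)
lm_cv <- train(Salary ~ Years * Hits, data = hitters, method = "lm", trControl = ctrl)

resamps <- resamples(list(CART = cart_cv, `linear model` = lm_cv))
summary(resamps)                   # RMSE and Rsquared over the 100 resamples
dotplot(resamps, metric = "RMSE")  # dot plot with confidence intervals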

43
R2 Performance: 10 times 10-fold CV

Fig.: $R^2$ from 10 repeats of 10-fold cross-validation (95% confidence intervals) for CART, the linear model, and CART without pruning. Values lie roughly between 0.36 and 0.48.

44
Advantages

CART models are easy to interpret.

You don't need to pre-define relationships between variables.

Automatically handles higher-order interactions.

45
Limitations

CART models generally produce unstable predictions (next class → random forests).

46
Exercise

47
Build a tree by hand = no software!
Build a tree using the dataset provided below.

Use the parameters minsplit = 6 and minbucket = 2.

                     Years   Hits   Salary
Rey Quinones            1     68       70
Barry Bonds             1     92      100
Pete Incaviglia         1    135      172
Dan Gladden             4     97      210
Juan Samuel             4    157      640
Joe Carter              4    200      250
Tim Wallach             7    112      750
Rafael Ramirez          7    119      875
Harold Baines           7    169      950

48
References I

[1] Leo Breiman et al. Classification and Regression Trees. CRC Press, 1984.
[2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics, New York, 2001.
[3] Gareth James et al. An Introduction to Statistical Learning. Vol. 112. Springer, 2013.
[4] Gareth James et al. "Package 'ISLR'". 2017.
[5] Olivier Lopez, Xavier Milhaud, and Pierre-Emmanuel Thérond. "Arbres de régression et de classification (CART)". In: l'actuariel 15 (2015), pp. 42–44.

49
Session Info
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default


BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

attached base packages:


[1] methods stats graphics grDevices utils datasets base

other attached packages:


[1] dplyr_0.7.2 purrr_0.2.3 readr_1.1.1
[4] tidyr_0.7.1 tibble_1.4.2 tidyverse_1.1.1
[7] caret_6.0-77 lattice_0.20-35 plotmo_3.3.4
[10] TeachingDemos_2.10 plotrix_3.6-6 visreg_2.4-1
[13] sjmisc_2.6.1 sjPlot_2.3.3 cowplot_0.8.0.9000
[16] ggplot2_2.2.1.9000 xtable_1.8-2 rpart.plot_2.1.2
[19] rpart_4.1-11 data.table_1.10.4-3 ISLR_1.2
[22] knitr_1.19

loaded via a namespace (and not attached):


[1] TH.data_1.0-8 minqa_1.2.4 colorspace_1.3-2
[4] class_7.3-14 modeltools_0.2-21 sjlabelled_1.0.1
[7] glmmTMB_0.1.1 DRR_0.0.2 DT_0.2
[10] prodlim_1.6.1 mvtnorm_1.0-6 lubridate_1.6.0
[13] xml2_1.1.1 coin_1.2-1 RSkittleBrewer_1.1
[16] codetools_0.2-15 splines_3.4.1 mnormt_1.5-5
[19] robustbase_0.92-7 effects_3.1-2 RcppRoll_0.2.2
[22] jsonlite_1.5 nloptr_1.0.4 broom_0.4.2
[25] ddalpha_1.2.1 kernlab_0.9-25 shiny_1.0.5
[28] compiler_3.4.1 httr_1.3.1 sjstats_0.11.0
[31] assertthat_0.2.0 Matrix_1.2-11 lazyeval_0.2.1
[34] htmltools_0.3.6 tools_3.4.1 bindrcpp_0.2
[37] coda_0.19-1 gtable_0.2.0 glue_1.1.1

50
