
Introduction to Regression Trees

Sahir Rai Bhatnagar, PhD Candidate (Biostatistics)

Department of Epidemiology, Biostatistics and Occupational Health

February 19, 2018

1
Introduction

2
What?
A prediction model consisting of a series of if-else statements.
e.g., Vladimir Guerrero: 7 years, 200 hits. What salary do we predict for him next year?

Fig.: Regression tree:
Years < 4.5?
  yes: salary = 226 (n = 90)
  no:  Hits < 118?
         yes: salary = 465 (n = 90)
         no:  salary = 949 (n = 83)
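Reading the tree for Guerrero: Years = 7 is not < 4.5 and Hits = 200 is not < 118, so the predicted salary is 949 (in thousands of dollars). A minimal sketch of the same rule written as R code, with the leaf values copied from the tree above:

# Prediction rule read off the tree above; salaries in thousands of dollars
predict_salary <- function(years, hits) {
  if (years < 4.5) {
    226
  } else if (hits < 118) {
    465
  } else {
    949
  }
}

predict_salary(years = 7, hits = 200)  # Vladimir Guerrero -> 949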

3
Background on CART

Recursive partitioning (or segmentation) methods were first introduced in the 1960s.

They were formalized by Breiman et al. (1984) [1] under the acronym CART: Classification and Regression Trees.

CART can be applied to both regression and classification problems, depending on the response (outcome) variable:
1. qualitative (classification)
2. quantitative (regression)

4
Regression vs. Classification

Fig.: Regression: the Hitters tree (Years < 4.5, then Hits < 118; leaf means 226, 465, 949).
Fig.: Classification: a tree predicting died/survived from sex, age and sibsp, with class proportions at each node.

Today’s class → regression

5
Terminology

Fig.: Tree terminology: the root of the tree at the top; an internal node C with its parent above and its children below; the terminal nodes are the leaves; tree depth = number of splits (= 2 in this example).
6
A motivating example

7
Prediction of Major League Baseball Salaries
Major League Baseball (MLB) data from the 1986 and 1987
seasons. Available in the ISLR [4] R package:
library(ISLR)
data(Hitters)

Response variable $y_i$, $i = 1, \ldots, 263$: 1987 annual salary on opening day, in thousands of dollars.

Predictor variables:
1. $X_1$: Number of years in the major leagues
2. $X_2$: Number of hits in 1986

Objective
Predict the annual salary at the start of the 1987 season using the predictor variables (Years and Hits).
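In the ISLR Hitters data, players with a missing 1987 salary are presumably dropped to obtain the 263 observations; a minimal preparation sketch (the object name hitters is ours and is reused in later sketches):

library(ISLR)
data(Hitters)

# Keep only players with a recorded 1987 salary and the two predictors
hitters <- Hitters[!is.na(Hitters$Salary), c("Years", "Hits", "Salary")]
nrow(hitters)  # 263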

8
The Data

A sample of what the data looks like:

                     Years   Hits   Salary
Andre Dawson            11    141      500
Andres Galarraga         2     87       92
Barry Bonds              1     92      100
Cal Ripken               6    177     1350
Gary Carter             13    125     1926
Joe Carter               4    200      250
Ken Griffey             14    150     1000
Mike Schmidt             2      1     2127
Tony Gwynn               5    211      740

9
A Visual Representation of the Data


Fig.: Scatter plot of Hits (y-axis, 0 to 250) against Years (x-axis, 0 to 25) for the 263 players; point size indicates Salary (legend: 500, 1000, 1500, 2000).
10
How does CART work?

Roughly speaking, there are two steps [3]:


1. We divide the predictor space, that is, the set of possible values for $X_1, X_2, \ldots, X_p$, into $J$ non-overlapping and exhaustive regions $R_1, R_2, \ldots, R_J$.

2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$ (see the sketch below).
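A quick sketch of step 2, using the hitters data frame prepared earlier and the first split at Years < 4.5 shown on the next slide (the region means are quoted from that slide, so treat the exact values as approximate):

# Predicted value in each region = mean training salary in that region
region <- ifelse(hitters$Years < 4.5, "R1", "R2")
tapply(hitters$Salary, region, mean)
#   R1    R2
#  ~226  ~697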

11
First Split

Fig.: First split of the Hits vs. Years plane at Years < 4.5: R1 (Years < 4.5), mean = 226; R2 (Years >= 4.5), mean = 697. Point size indicates Salary (500, 1000, 1500, 2000).

12
Second Split

Fig.: Second split, on Hits: the upper (high-Hits) region R1 has mean = 204 and the lower (low-Hits) region R2 has mean = 454.

13
A Mistake in the Data

                     Years   Hits    Salary
Andre Dawson            11    141     500.00
Andres Galarraga         2     87      91.50
Barry Bonds              1     92     100.00
Cal Ripken               6    177    1350.00
Gary Carter             13    125    1925.57
Joe Carter               4    200     250.00
Ken Griffey             14    150    1000.00
Mike Schmidt             2      1    2127.33
Tony Gwynn               5    211     740.00

Mike Schmidt started his career in 1972 and was inducted into the Baseball Hall of Fame in 1995, so the record above (2 years and 1 hit, yet a salary of 2127.33) is clearly a mistake.

14
Second Split
Fig.: After further splits on Hits: within Years < 4.5, R1 (mean = 204) and R2 (mean = 454); within Years >= 4.5, R3 (mean = 465, lower Hits) and R4 (mean = 949, higher Hits).

15
Third Split
Fig.: Third split: the Hits vs. Years plane is now partitioned into regions R1 to R7, with mean salaries of 142, 329, 335, 454, 518, 914 and 1328.

18
And if we continue...
Fig.: Continuing to split produces many small rectangular regions of the Hits vs. Years plane.
19
Stop if the number of observations is less than 20
Fig.: The tree grown until every node has fewer than 20 observations. The root split is Years < 4.5, followed by repeated splits on Hits and Years; the resulting leaves have mean salaries ranging from 76 (n = 11) to 1328 (n = 7).
20
The Details

21
The Details

The CART algorithm requires 3 components:


1. Defining a criterion to select the best partition among all predictors.

2. A rule to decide when a node is terminal, i.e., it becomes a leaf.

3. Pruning the tree to avoid over-fitting.

22
1. Selecting the Best Partition

The objective is to find the regions $R_1, \ldots, R_J$ that minimize the squared-error loss:

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left(y_i - \hat{y}_{R_j}\right)^2 \qquad (1)$$

$\hat{y}_{R_j}$: the mean response for the training observations within the $j$th box.

Finding the exact solution to (1) is computationally infeasible (NP-hard). Why?
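For concreteness, a minimal sketch of how the loss in (1) could be evaluated for a given assignment of observations to regions (the function name rss_total is ours):

# Total squared-error loss (1) for a given region assignment
rss_total <- function(y, region) {
  sum(tapply(y, region, function(yr) sum((yr - mean(yr))^2)))
}
# e.g. rss_total(hitters$Salary, ifelse(hitters$Years < 4.5, "R1", "R2"))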

23
Exhaustive Search for J = 4

24
1. A Top-Down “Greedy” Approach
Top-Down: It begins at the top of the tree, at which point all observations belong to a single region.

Binary Splits: Each split of the $j$th predictor at a value $s$ creates exactly two children,

$$R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \ge s\},$$

chosen to give the greatest possible reduction in the residual sum of squares. That is, the goal is to find the values $j$ and $s$ that minimize:

$$\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2 + \sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2 \qquad (2)$$

Greedy: at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that would lead to a better tree at some future step (a search sketch follows below).
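A minimal sketch of this exhaustive search over (j, s) pairs, not the actual rpart implementation (the function name best_split and the hitters data frame are ours):

# For each predictor j and candidate cut point s, compute the RSS of the two
# children and keep the (j, s) pair with the smallest total RSS, as in (2).
best_split <- function(X, y) {
  best <- list(rss = Inf, variable = NA, cutpoint = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) {
        best <- list(rss = rss, variable = names(X)[j], cutpoint = s)
      }
    }
  }
  best
}

# best_split(hitters[, c("Years", "Hits")], hitters$Salary)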
25
The Best Split Using a “Greedy” Approach

26
2. Stopping Rule

minsplit: to avoid creating splits that will lead to very small leaves, the minimum number of observations that must exist in a node in order for a split to be attempted (minsplit = 20 is the default in rpart).

minbucket: the minimum number of observations in any terminal leaf node (minbucket = minsplit/3 is the default in rpart).

Both can be set explicitly, as in the sketch below.
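A minimal sketch of passing these stopping parameters through rpart.control, reusing the Hitters formula that appears later in the slides:

library(rpart)
# minsplit = 20 and minbucket = 7 (roughly minsplit/3) are the defaults,
# shown explicitly here for illustration
fit <- rpart(Salary ~ Years + Hits, data = hitters,
             control = rpart.control(minsplit = 20, minbucket = 7))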

27
3. Pruning the Tree

The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test-set performance.

This is because the resulting tree, $T_{max}$ with $|T_{max}|$ leaves, might be too complex.

A smaller tree with fewer splits (that is, fewer regions $R_1, \ldots, R_J$) might lead to lower prediction variance and better interpretation, at the cost of a little bias. What is this phenomenon called?

28
3. Pruning the Tree
We first grow the biggest possible tree $T_{max}$ and then prune it back in order to obtain a subtree.

We consider adding a penalty to our loss function in order to penalize excessively large trees.

For each value of $\alpha$, there exists a subtree $T \subset T_{max}$ that minimizes:

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha |T| \qquad (3)$$

$|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the rectangle corresponding to the $m$th leaf, and $\hat{y}_{R_m}$ is the predicted response associated with $R_m$.

$\alpha$ is chosen using $v$-fold cross-validation ($v$ corresponds to xval in rpart; xval = 10 by default).
29
Background on Cross-Validation

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample

30
Background on Cross-Validation

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample

A B C D E F G H I J K L M N O P Q R S T    U V W X Y Z
Training set (A–T)                          Test set (U–Z)

31
Background on Cross-Validation

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample

A B C D E F G H I J K L M N O P Q R S T    U V W X Y Z
Training set (A–T)                          Test set (U–Z)

 i   Years    y_i    y_i (pred)
 U      5      373       697
 V      3      277       226
 W     15     1456       697
 X      4      455       226
 Y      1      235       226
 Z      9      987       697
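A sketch of how held-out predictions like these can be produced in R, using a single train/test split of the hitters data (the 80/20 split and the seed are our choices, not taken from the slides):

set.seed(1)
test_idx <- sample(nrow(hitters), size = round(0.2 * nrow(hitters)))
train <- hitters[-test_idx, ]
test  <- hitters[test_idx, ]

fit   <- rpart::rpart(Salary ~ Years + Hits, data = train)
preds <- predict(fit, newdata = test)
mean((test$Salary - preds)^2)  # test-set mean squared error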

34
Background on Cross-Validation
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Complete sample

Fig.: 5-fold cross-validation: the sample A–Z is divided into 5 folds; in each of the 5 rounds a different fold serves as the test set while the remaining folds form the training set.

36
Over-fitting

Fig.: Mean squared error as a function of the number of splits (0 to 20), computed on the complete data set and by 10-fold cross-validation.

37
Fitting and Pruning with rpart
# Fit the full tree, then prune at the cp value with the smallest
# cross-validated error (xerror) in the cp table
cart_fit <- rpart::rpart(Salary ~ Years + Hits, data = Hitters)
min_ind <- which.min(cart_fit$cptable[, "xerror"])
min_cp <- cart_fit$cptable[min_ind, "CP"]
prune_fit <- rpart::prune(cart_fit, cp = min_cp)
rpart.plot::rpart.plot(prune_fit)

Fig.: The pruned tree:
Years < 4.5?
  yes: salary = 226 (n = 90)
  no:  Hits < 118?
         yes: salary = 465 (n = 90)
         no:  salary = 949 (n = 83)

38
Comparison with a Linear Model

39
Comparison: Linear Model vs. CART

Characteristic                      Linear Model    CART
Linearity assumption                    yes          no
Distributional assumptions              yes          no
Robust to multicollinearity             no           yes
Handles complex interactions            no           yes
Allows for missing data                 no           yes
Confidence intervals, p-values          yes          no

40
Linear Model

lm(Salary ~ Years * Hits, data = Hitters)

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     159.55        95.65      1.67       0.10
Years           -16.08        11.38     -1.41       0.16
Hits              0.60         0.87      0.69       0.49
Years:Hits        0.54         0.11      5.08       0.00

Table: $R^2 = 0.41$

41
Regression Surface

Fig.: Linear Model: a smooth regression surface of Salary over Years and Hits.
Fig.: CART: a piecewise-constant (step) surface of Salary over the same predictors.

42
RMSE Performance: 10 times 10-fold CV

Fig.: RMSE from 10 repeats of 10-fold cross-validation (95% confidence intervals) for three models: CART without pruning, the linear model, and CART. RMSE values lie roughly between 320 and 380.
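A sketch of how a comparison like this could be run with the caret package (caret appears in the session info, but the exact settings used for these slides are not shown, so treat this as illustrative):

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

set.seed(42)  # same seed before each call so both models see the same folds
cart_cv <- train(Salary ~ Years + Hits, data = hitters, method = "rpart", trControl = ctrl)
set.seed(42)
lm_cv <- train(Salary ~ Years * Hits, data = hitters, method = "lm", trControl = ctrl)

resamps <- resamples(list(CART = cart_cv, `linear model` = lm_cv))
summary(resamps)                   # RMSE and Rsquared over the 100 resamples
dotplot(resamps, metric = "RMSE")  # dot plot with confidence intervals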

43
R2 Performance: 10 times 10-fold CV

Fig.: $R^2$ from 10 repeats of 10-fold cross-validation (95% confidence intervals) for CART, the linear model, and CART without pruning. Values lie roughly between 0.36 and 0.48.

44
Advantages

CART models are easy to interpret.

You don't need to pre-define relationships between variables.

Automatically handles higher-order interactions.

45
Limitations

CART models generally produce unstable predictions (next class → random forests).

46
Exercise

47
Build a tree by hand = no software!
Build a tree using the dataset provided below.

Use the parameters minsplit = 6 and minbucket = 2.

                     Years   Hits   Salary
Rey Quinones            1     68       70
Barry Bonds             1     92      100
Pete Incaviglia         1    135      172
Dan Gladden             4     97      210
Juan Samuel             4    157      640
Joe Carter              4    200      250
Tim Wallach             7    112      750
Rafael Ramirez          7    119      875
Harold Baines           7    169      950

48
References I

[1] Leo Breiman et al. Classification and Regression Trees. CRC Press, 1984.
[2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics, New York, 2001.
[3] Gareth James et al. An Introduction to Statistical Learning. Vol. 112. Springer, 2013.
[4] Gareth James et al. "Package 'ISLR'". 2017.
[5] Olivier Lopez, Xavier Milhaud, and Pierre-Emmanuel Thérond. "Arbres de régression et de classification (CART)". In: l'actuariel 15 (2015), pp. 42–44.

49
Session Info
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default


BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

attached base packages:


[1] methods stats graphics grDevices utils datasets base

other attached packages:


[1] dplyr_0.7.2 purrr_0.2.3 readr_1.1.1
[4] tidyr_0.7.1 tibble_1.4.2 tidyverse_1.1.1
[7] caret_6.0-77 lattice_0.20-35 plotmo_3.3.4
[10] TeachingDemos_2.10 plotrix_3.6-6 visreg_2.4-1
[13] sjmisc_2.6.1 sjPlot_2.3.3 cowplot_0.8.0.9000
[16] ggplot2_2.2.1.9000 xtable_1.8-2 rpart.plot_2.1.2
[19] rpart_4.1-11 data.table_1.10.4-3 ISLR_1.2
[22] knitr_1.19

loaded via a namespace (and not attached):


[1] TH.data_1.0-8 minqa_1.2.4 colorspace_1.3-2
[4] class_7.3-14 modeltools_0.2-21 sjlabelled_1.0.1
[7] glmmTMB_0.1.1 DRR_0.0.2 DT_0.2
[10] prodlim_1.6.1 mvtnorm_1.0-6 lubridate_1.6.0
[13] xml2_1.1.1 coin_1.2-1 RSkittleBrewer_1.1
[16] codetools_0.2-15 splines_3.4.1 mnormt_1.5-5
[19] robustbase_0.92-7 effects_3.1-2 RcppRoll_0.2.2
[22] jsonlite_1.5 nloptr_1.0.4 broom_0.4.2
[25] ddalpha_1.2.1 kernlab_0.9-25 shiny_1.0.5
[28] compiler_3.4.1 httr_1.3.1 sjstats_0.11.0
[31] assertthat_0.2.0 Matrix_1.2-11 lazyeval_0.2.1
[34] htmltools_0.3.6 tools_3.4.1 bindrcpp_0.2
[37] coda_0.19-1 gtable_0.2.0 glue_1.1.1

50
