Boosting
Instructor: Justin Domke
1 Additive Models
The basic idea of boosting is to greedily build additive models. Let b_m(x) be some predictor (a tree, a neural network, whatever), which we will call a base learner. In boosting, we will build a model that is the sum of base learners,

    f(x) = ∑_{m=1}^{M} b_m(x).
The obvious way to fit an additive model is greedily. Namely, we start with the simple function f_0(x) = 0, then iteratively add base learners to minimize the risk of f_{m−1}(x) + b_m(x).
Forward stagewise additive modeling
• f0 (x) ← 0
• For m = 1, ..., M
  – b_m ← arg min_b ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)
  – f_m(x) ← f_{m−1}(x) + b_m(x)
Notice here that once we have fit a particular base learner, it is “frozen in”, and is not further
changed. One can also create procedures for additive modeling that re-fit the base learners
several times.
Suppose we use the least-squares loss. Writing r(x̂, ŷ) = ŷ − f_{m−1}(x̂) for the current residuals, fitting the next base learner reduces to an ordinary least-squares regression problem:

    arg min_b ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ) = arg min_b ∑_{(x̂,ŷ)} ( f_{m−1}(x̂) + b(x̂) − ŷ )²
                                                  = arg min_b ∑_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )².    (2.1)

That is, b_m is found by regressing on the residuals of the current model.
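To make this concrete, here is a minimal sketch of least-squares forward stagewise modeling with regression trees as the base learners. The tree depth, number of rounds, function names, and use of a scikit-learn regressor are illustrative choices, not prescribed by these notes.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def forward_stagewise_least_squares(X, y, M=100, max_depth=2):
        """Greedy additive modeling: each round fits a tree to the current residuals."""
        f = np.zeros(len(y))              # f_0(x) = 0 on the training points
        learners = []
        for m in range(M):
            r = y - f                     # residuals r(x, y) = y - f_{m-1}(x)
            b = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
            learners.append(b)            # b_m is "frozen in" and never re-fit
            f += b.predict(X)             # f_m = f_{m-1} + b_m
        return learners

    def predict(learners, X):
        return sum(b.predict(X) for b in learners)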
[Figures: the data, the train/test risk R over 100 boosting iterations, and the fitted function after 1, 2, and 3 iterations.]
3 Boosting
Suppose we want to minimize some other loss function. Unfortunately, if we aren’t using the
least-squares loss, the optimization
    arg min_b ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)    (3.1)
will not reduce to a least-squares problem. One alternative would be to use a different base learner that fits a different loss. With boosting, however, we don't want to change our base learner.
This could be because it is simply inconvenient to create a new algorithm for a different loss
function. The general idea of boosting is to fit a new base learner b that, if it does not minimize the risk for f_m, at least decreases it. As we will see, we can do this for a large range of loss functions,
all based on a least-squares base learner.
4 Gradient Boosting
Gradient boosting is a very clever algorithm. In each iteration, we set as target values the
negative gradient of the loss with respect to f . So, if the base learner closely matches the
target values, when we add some multiple v of the base learner to our additive model, it
should decrease the loss.
Gradient Boosting
• f0 (x) ← 0
• For m = 1, ..., M
  – r(x̂, ŷ) ← −dL(f_{m−1}(x̂), ŷ)/df for each training pair (x̂, ŷ)
  – b_m ← arg min_b ∑_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )²
  – f_m(x) ← f_{m−1}(x) + v b_m(x)
Consider changing f(x) to f(x) + v b(x) for some small v. We want to minimize the loss, i.e., to make the first-order change

    ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + v b(x̂), ŷ) − ∑_{(x̂,ŷ)} L(f_{m−1}(x̂), ŷ) ≈ v ∑_{(x̂,ŷ)} b(x̂) g(x̂, ŷ)    (4.1)

as negative as possible, where g(x̂, ŷ) = dL(f_{m−1}(x̂), ŷ)/df. However, it doesn't make much sense to simply minimize the right-hand side of Eq. 4.1, since we can always make the decrease bigger by multiplying b by a larger constant. Instead, we choose to minimize

    ∑_{(x̂,ŷ)} ( b(x̂) − (−g(x̂, ŷ)) )²,

i.e., we fit b by least squares to the negative gradient, and then take a small step of size v in the direction of b.

Fun fact: gradient boosting with a step size of v = 1/2 on the least-squares loss gives exactly the forward stagewise modeling algorithm we devised above.
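As an illustration, a minimal sketch of gradient boosting with the logistic loss L(f, y) = log(1 + e^{−yf}), y ∈ {−1, +1}, might look as follows; the step size, tree depth, function names, and use of scikit-learn trees are assumptions for the example, not part of these notes.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_logistic(X, y, M=100, v=0.1, max_depth=2):
        """Each round fits a tree to the negative gradient of the logistic loss."""
        f = np.zeros(len(y))
        learners = []
        for m in range(M):
            r = y / (1.0 + np.exp(y * f))     # -dL/df for L = log(1 + exp(-y f))
            b = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
            learners.append(b)
            f += v * b.predict(X)             # f_m = f_{m-1} + v * b_m
        return learners

    def predict_class(learners, X, v=0.1):
        f = v * sum(b.predict(X) for b in learners)
        return np.sign(f)                     # predicted class is the sign of f(x)

Note that replacing the targets with the negative gradient 2(ŷ − f) of the squared loss and setting v = 1/2 recovers the forward stagewise procedure above.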
Here are some examples of gradient boosting on classification losses (logistic and exponen-
tial). The figures show the function f (x). As usual, we would get a predicted class by taking
the sign of f (x).
[Figures: the data, and the train/test risk R over 100 iterations of gradient boosting on the two classification losses.]
5 Newton Boosting
The basic idea of Newton boosting is, instead of just taking the gradient, to build a local
quadratic model, and then to minimize it. We can again do this using any least-squares
regressor as a base learner.
(Note that “Newton boosting” is not generally accepted terminology. The general idea of fitting to a local quadratic approximation is used in several algorithms for specific loss functions (e.g. LogitBoost), but doesn’t seem to have been given a name. The class voted to use “Newton Boosting” in preference to “Quadratic Boosting”.)
Newton Boosting
• f0 (x) ← 0
• For m = 1, ..., M
  – g(x̂, ŷ) ← dL(f_{m−1}(x̂), ŷ)/df and h(x̂, ŷ) ← d²L(f_{m−1}(x̂), ŷ)/df² for each training pair (x̂, ŷ)
  – b_m ← arg min_b ∑_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )², where r(x̂, ŷ) = −g(x̂, ŷ)/h(x̂, ŷ)
  – f_m(x) ← f_{m−1}(x) + v b_m(x)
To understand this, consider searching for the b that minimizes the empirical risk for f_m. To start with, we make a local second-order approximation. (Here, g(x̂, ŷ) = dL(f_{m−1}(x̂), ŷ)/df and h(x̂, ŷ) = d²L(f_{m−1}(x̂), ŷ)/df².)
    arg min_b ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)
      ≈ arg min_b ∑_{(x̂,ŷ)} [ L(f_{m−1}(x̂), ŷ) + b(x̂) g(x̂, ŷ) + ½ b(x̂)² h(x̂, ŷ) ]
      = arg min_b ½ ∑_{(x̂,ŷ)} h(x̂, ŷ) [ 2 b(x̂) g(x̂, ŷ)/h(x̂, ŷ) + b(x̂)² ]
      = arg min_b ∑_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − (−g(x̂, ŷ)/h(x̂, ŷ)) )²
      = arg min_b ∑_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )²
This is a weighted least-squares regression, with weights h(x̂, ŷ) and targets r(x̂, ŷ) = −g(x̂, ŷ)/h(x̂, ŷ).
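A minimal sketch of Newton boosting with the logistic loss, using a tree's sample weights for the weighted least-squares fit; the settings, function names, and scikit-learn regressor are illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def newton_boost_logistic(X, y, M=100, v=0.1, max_depth=2):
        """Each round: weighted least-squares fit with weights h and targets -g/h."""
        f = np.zeros(len(y))
        learners = []
        for m in range(M):
            p = 1.0 / (1.0 + np.exp(-y * f))        # sigma(y f)
            g = -y * (1.0 - p)                      # dL/df for L = log(1 + exp(-y f))
            h = np.maximum(p * (1.0 - p), 1e-6)     # d^2 L / df^2, clipped for stability
            b = DecisionTreeRegressor(max_depth=max_depth)
            b.fit(X, -g / h, sample_weight=h)       # weighted least-squares regression
            learners.append(b)
            f += v * b.predict(X)                   # f_m = f_{m-1} + v * b_m
        return learners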
Here, we collect in a table all of the loss functions and derivatives we might use.
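For the losses that appear in these notes (taking y ∈ {−1, +1} for the two classification losses), the relevant derivatives with respect to f are:

    Loss            L(f, y)             dL/df                  d²L/df²
    Least squares   (f − y)²            2(f − y)               2
    Logistic        log(1 + e^{−yf})    −y / (1 + e^{yf})      e^{yf} / (1 + e^{yf})²
    Exponential     e^{−yf}             −y e^{−yf}             e^{−yf}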
Here are some experiments using Newton boosting on the same dataset as above. Generally,
we see that Newton boosting is more effective than gradient boosting in quickly reducing
the training error. However, this also means it begins to overfit after a smaller number
of iterations. Thus, if running time is an issue (either at training or test time), Newton
boosting may be preferable, since it more quickly reduces training error.
[Figures: the data, and the train/test risk R over 100 iterations of Newton boosting on the two classification losses.]
[Figures: train/test risk R over 1000 iterations of gradient boosting with the least-squares loss, for step sizes v = 1/2, v = 1/20, and v = 1/200.]
In the following figures, the left column shows f (x) after 100 iterations, while the middle
column shows sign(f (x)).
[Figures: f(x), sign(f(x)), and train/test risk R over 1000 iterations of Newton boosting with the logistic loss, for step sizes v = 1/10 and v = 1/100.]
[Figures: f(x), sign(f(x)), and train/test risk R over 1000 iterations of gradient boosting with the logistic loss, for step sizes v = 1/10 and v = 1/100.]
7 Discussion
Note that while our experiments here have used trees, any least-squares regressor could be
used as the base learner. If so inclined, we could even use a boosted regression tree as our
base learner.
Most of these loss functions can be extended to multi-class classification. This works by, in each iteration, fitting not a single tree but C trees, where C is the number of classes. For Newton boosting, this is made tractable by taking a diagonal approximation of the Hessian.
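A hedged sketch of that multi-class scheme (softmax cross-entropy, one tree per class per round, the diagonal of the Hessian as sample weights); the label encoding, settings, and scikit-learn regressor are assumptions for illustration, not part of these notes.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def multiclass_newton_boost(X, y, C, M=100, v=0.1, max_depth=2):
        """Softmax cross-entropy; each round fits one tree per class, using the
        diagonal Hessian h_c = p_c (1 - p_c) as weights and -g_c / h_c as targets."""
        n = len(y)
        F = np.zeros((n, C))                       # one score function per class
        Y = np.eye(C)[y]                           # one-hot labels, y in {0, ..., C-1}
        learners = []                              # list of length-M lists of C trees
        for m in range(M):
            P = np.exp(F - F.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)      # softmax probabilities
            round_trees = []
            for c in range(C):
                g = P[:, c] - Y[:, c]              # gradient for class c
                h = np.maximum(P[:, c] * (1 - P[:, c]), 1e-6)   # diagonal Hessian, clipped
                b = DecisionTreeRegressor(max_depth=max_depth)
                b.fit(X, -g / h, sample_weight=h)
                F[:, c] += v * b.predict(X)
                round_trees.append(b)
            learners.append(round_trees)
        return learners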
Our presentation here has been quite ahistorical. The first boosting algorithms were pre-
sented from a very different and more theoretical perspective.