Boosting
Instructor: Justin Domke
1 Additive Models
The basic idea of boosting is to greedily build additive models. Let b_m(x) be some predictor (a tree, a neural network, whatever), which we will call a base learner. In boosting, we will build a model that is the sum of base learners,

    f(x) = ∑_{m=1}^{M} b_m(x).
The obvious way to fit an additive model is greedily. Namely, we start with the simple function f_0(x) = 0, then iteratively add base learners to minimize the risk of f_{m−1}(x) + b_m(x).
Forward stagewise additive modeling
• f0 (x) ← 0
• For m = 1, ..., M
  – b_m ← arg min_b ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)
  – f_m(x) ← f_{m−1}(x) + b_m(x)
Notice here that once we have fit a particular base learner, it is “frozen in”, and is not further
changed. One can also create procedures for additive modeling that re-fit the base learners
several times.
Suppose we use the least-squares loss. Writing r(x̂, ŷ) = ŷ − f_{m−1}(x̂) for the current residuals, fitting the next base learner reduces to an ordinary least-squares regression problem:

    arg min_b ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ) = arg min_b ∑_{(x̂,ŷ)} ( f_{m−1}(x̂) + b(x̂) − ŷ )²
                                                  = arg min_b ∑_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )².    (2.1)

That is, b_m is found by regressing on the residuals of the current model.
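To make this concrete, here is a minimal sketch of least-squares forward stagewise modeling with regression trees as the base learners. The tree depth, number of rounds, function names, and use of a scikit-learn regressor are illustrative choices, not prescribed by these notes.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def forward_stagewise_least_squares(X, y, M=100, max_depth=2):
        """Greedy additive modeling: each round fits a tree to the current residuals."""
        f = np.zeros(len(y))              # f_0(x) = 0 on the training points
        learners = []
        for m in range(M):
            r = y - f                     # residuals r(x, y) = y - f_{m-1}(x)
            b = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
            learners.append(b)            # b_m is "frozen in" and never re-fit
            f += b.predict(X)             # f_m = f_{m-1} + b_m
        return learners

    def predict(learners, X):
        return sum(b.predict(X) for b in learners)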
[Figures: the data, the train/test risk R over 100 boosting iterations, and the fitted function after 1, 2, and 3 iterations.]
3 Boosting
Suppose we want to minimize some other loss function. Unfortunately, if we aren’t using the
least-squares loss, the optimization
    arg min_b ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)    (3.1)
will not reduce to a least-squares problem. One alternative would be to use a different base learner that fits a different loss. With boosting, however, we don't want to change our base learner.
This could be because it is simply inconvenient to create a new algorithm for a different loss
function. The general idea of boosting is to fit a new base learner b that, if it does not minimize the risk for f_m, at least decreases it. As we will see, we can do this for a large range of loss functions,
all based on a least-squares base learner.
4 Gradient Boosting
Gradient boosting is a very clever algorithm. In each iteration, we set as target values the
negative gradient of the loss with respect to f . So, if the base learner closely matches the
target values, when we add some multiple v of the base learner to our additive model, it
should decrease the loss.
Gradient Boosting
• f0 (x) ← 0
• For m = 1, ..., M
  – r(x̂, ŷ) ← −dL(f_{m−1}(x̂), ŷ)/df for each training pair (x̂, ŷ)
  – b_m ← arg min_b ∑_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )²
  – f_m(x) ← f_{m−1}(x) + v b_m(x)
Consider changing f(x) to f(x) + v b(x) for some small v. We want to minimize the loss, i.e., to make the first-order change

    ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + v b(x̂), ŷ) − ∑_{(x̂,ŷ)} L(f_{m−1}(x̂), ŷ) ≈ v ∑_{(x̂,ŷ)} b(x̂) g(x̂, ŷ)    (4.1)

as negative as possible, where g(x̂, ŷ) = dL(f_{m−1}(x̂), ŷ)/df. However, it doesn't make much sense to simply minimize the right-hand side of Eq. 4.1, since we can always make the decrease bigger by multiplying b by a larger constant. Instead, we choose to minimize

    ∑_{(x̂,ŷ)} ( b(x̂) − (−g(x̂, ŷ)) )²,

i.e., we fit b by least squares to the negative gradient, and then take a small step of size v in the direction of b.

Fun fact: gradient boosting with a step size of v = 1/2 on the least-squares loss gives exactly the forward stagewise modeling algorithm we devised above.
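As an illustration, a minimal sketch of gradient boosting with the logistic loss L(f, y) = log(1 + e^{−yf}), y ∈ {−1, +1}, might look as follows; the step size, tree depth, function names, and use of scikit-learn trees are assumptions for the example, not part of these notes.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_logistic(X, y, M=100, v=0.1, max_depth=2):
        """Each round fits a tree to the negative gradient of the logistic loss."""
        f = np.zeros(len(y))
        learners = []
        for m in range(M):
            r = y / (1.0 + np.exp(y * f))     # -dL/df for L = log(1 + exp(-y f))
            b = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
            learners.append(b)
            f += v * b.predict(X)             # f_m = f_{m-1} + v * b_m
        return learners

    def predict_class(learners, X, v=0.1):
        f = v * sum(b.predict(X) for b in learners)
        return np.sign(f)                     # predicted class is the sign of f(x)

Note that replacing the targets with the negative gradient 2(ŷ − f) of the squared loss and setting v = 1/2 recovers the forward stagewise procedure above.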
Here are some examples of gradient boosting on classification losses (logistic and exponen-
tial). The figures show the function f (x). As usual, we would get a predicted class by taking
the sign of f (x).
[Figures: the data, and the train/test risk R over 100 iterations of gradient boosting on the two classification losses.]
5 Newton Boosting
The basic idea of Newton boosting is, instead of just taking the gradient, to build a local
quadratic model, and then to minimize it. We can again do this using any least-squares
regressor as a base learner.
(Note that “Newton boosting” is not generally accepted terminology. The general idea of fitting to a local quadratic approximation is used in several algorithms for specific loss functions (e.g. LogitBoost), but doesn’t seem to have been given a name. The class voted to use “Newton Boosting” in preference to “Quadratic Boosting”.)
Newton Boosting
• f0 (x) ← 0
• For m = 1, ..., M
  – g(x̂, ŷ) ← dL(f_{m−1}(x̂), ŷ)/df and h(x̂, ŷ) ← d²L(f_{m−1}(x̂), ŷ)/df² for each training pair (x̂, ŷ)
  – b_m ← arg min_b ∑_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )², where r(x̂, ŷ) = −g(x̂, ŷ)/h(x̂, ŷ)
  – f_m(x) ← f_{m−1}(x) + v b_m(x)
To understand this, consider searching for the b that minimizes the empirical risk for f_m. To start with, we make a local second-order approximation. (Here, g(x̂, ŷ) = dL(f_{m−1}(x̂), ŷ)/df and h(x̂, ŷ) = d²L(f_{m−1}(x̂), ŷ)/df².)
    arg min_b ∑_{(x̂,ŷ)} L(f_{m−1}(x̂) + b(x̂), ŷ)
      ≈ arg min_b ∑_{(x̂,ŷ)} [ L(f_{m−1}(x̂), ŷ) + b(x̂) g(x̂, ŷ) + ½ b(x̂)² h(x̂, ŷ) ]
      = arg min_b ½ ∑_{(x̂,ŷ)} h(x̂, ŷ) [ 2 b(x̂) g(x̂, ŷ)/h(x̂, ŷ) + b(x̂)² ]
      = arg min_b ∑_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − (−g(x̂, ŷ)/h(x̂, ŷ)) )²
      = arg min_b ∑_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )²
This is a weighted least-squares regression, with weights h(x̂, ŷ) and targets r(x̂, ŷ) = −g(x̂, ŷ)/h(x̂, ŷ).
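A minimal sketch of Newton boosting with the logistic loss, using a tree's sample weights for the weighted least-squares fit; the settings, function names, and scikit-learn regressor are illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def newton_boost_logistic(X, y, M=100, v=0.1, max_depth=2):
        """Each round: weighted least-squares fit with weights h and targets -g/h."""
        f = np.zeros(len(y))
        learners = []
        for m in range(M):
            p = 1.0 / (1.0 + np.exp(-y * f))        # sigma(y f)
            g = -y * (1.0 - p)                      # dL/df for L = log(1 + exp(-y f))
            h = np.maximum(p * (1.0 - p), 1e-6)     # d^2 L / df^2, clipped for stability
            b = DecisionTreeRegressor(max_depth=max_depth)
            b.fit(X, -g / h, sample_weight=h)       # weighted least-squares regression
            learners.append(b)
            f += v * b.predict(X)                   # f_m = f_{m-1} + v * b_m
        return learners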
Here, we collect in a table all of the loss functions and derivatives we might use.
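For the losses that appear in these notes (taking y ∈ {−1, +1} for the two classification losses), the relevant derivatives with respect to f are:

    Loss            L(f, y)             dL/df                  d²L/df²
    Least squares   (f − y)²            2(f − y)               2
    Logistic        log(1 + e^{−yf})    −y / (1 + e^{yf})      e^{yf} / (1 + e^{yf})²
    Exponential     e^{−yf}             −y e^{−yf}             e^{−yf}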
Here are some experiments using Newton boosting on the same dataset as above. Generally,
we see that Newton boosting is more effective than gradient boosting in quickly reducing
the training error. However, this also means it begins to overfit after a smaller number
of iterations. Thus, if running time is an issue (either at training or test time), Newton
boosting may be preferable, since it more quickly reduces training error.
[Figures: the data, and the train/test risk R over 100 iterations of Newton boosting on the two classification losses.]
[Figures: train/test risk R over 1000 iterations of gradient boosting with the least-squares loss, for step sizes v = 1/2, v = 1/20, and v = 1/200.]
In the following figures, the left column shows f (x) after 100 iterations, while the middle
column shows sign(f (x)).
[Figures: f(x), sign(f(x)), and train/test risk R over 1000 iterations of Newton boosting with the logistic loss, for step sizes v = 1/10 and v = 1/100.]
[Figures: f(x), sign(f(x)), and train/test risk R over 1000 iterations of gradient boosting with the logistic loss, for step sizes v = 1/10 and v = 1/100.]
7 Discussion
Note that while our experiments here have used trees, any least-squares regressor could be
used as the base learner. If so inclined, we could even use a boosted regression tree as our
base learner.
Most of these loss functions can be extended to multi-class classification. This works by, in each iteration, fitting not a single tree but C trees, where C is the number of classes. For Newton boosting, this is made tractable by taking a diagonal approximation of the Hessian.
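A hedged sketch of that multi-class scheme (softmax cross-entropy, one tree per class per round, the diagonal of the Hessian as sample weights); the label encoding, settings, and scikit-learn regressor are assumptions for illustration, not part of these notes.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def multiclass_newton_boost(X, y, C, M=100, v=0.1, max_depth=2):
        """Softmax cross-entropy; each round fits one tree per class, using the
        diagonal Hessian h_c = p_c (1 - p_c) as weights and -g_c / h_c as targets."""
        n = len(y)
        F = np.zeros((n, C))                       # one score function per class
        Y = np.eye(C)[y]                           # one-hot labels, y in {0, ..., C-1}
        learners = []                              # list of length-M lists of C trees
        for m in range(M):
            P = np.exp(F - F.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)      # softmax probabilities
            round_trees = []
            for c in range(C):
                g = P[:, c] - Y[:, c]              # gradient for class c
                h = np.maximum(P[:, c] * (1 - P[:, c]), 1e-6)   # diagonal Hessian, clipped
                b = DecisionTreeRegressor(max_depth=max_depth)
                b.fit(X, -g / h, sample_weight=h)
                F[:, c] += v * b.predict(X)
                round_trees.append(b)
            learners.append(round_trees)
        return learners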
Our presentation here has been quite ahistorical. The first boosting algorithms were pre-
sented from a very different and more theoretical perspective.