12 Classification with Support Vector Machines
Figure 12.1 Example 2D data, illustrating the intuition of data where we can find a linear classifier that separates red crosses from blue dots (axes $x^{(1)}$ and $x^{(2)}$).
SVMs. First, the SVM allows for a geometric way to think about supervised machine learning. While in Chapter 9 we considered the machine learning problem in terms of probabilistic models and attacked it using maximum likelihood estimation and Bayesian inference, here we will consider an alternative approach where we reason geometrically about the machine learning task. It relies heavily on concepts such as inner products and projections, which we discussed in Chapter 3. The second reason why we find SVMs instructive is that, in contrast to Chapter 9, the optimization problem for the SVM does not admit an analytic solution, so that we need to resort to a variety of optimization tools introduced in Chapter 7.
The SVM view of machine learning is subtly different from the maximum likelihood view of Chapter 9. The maximum likelihood view proposes a model based on a probabilistic view of the data distribution, from which an optimization problem is derived. In contrast, the SVM view starts by designing a particular function that is to be optimized during training, based on geometric intuitions. We have seen something similar already in Chapter 10, where we derived PCA from geometric principles. In the SVM case, we start by designing an objective function that is to be minimized on training data, following the principles of empirical risk minimization (Section 8.1). This can also be understood as designing a particular loss function.
Let us derive the optimization problem corresponding to training an SVM on example-label pairs. Intuitively, we imagine binary classification data, which can be separated by a hyperplane as illustrated in Figure 12.1. Here, every example $x_n$ (a vector of dimension 2) is a two-dimensional location ($x_n^{(1)}$ and $x_n^{(2)}$), and the corresponding binary label $y_n$ is one of two different symbols (red cross or blue disc). "Hyperplane" is a word that is commonly used in machine learning, and we encountered hyperplanes already in Section 2.8. A hyperplane is an affine subspace of dimension $D-1$ (if the corresponding vector space is of dimension $D$). The examples consist of two classes (there are two possible labels) that have features (the components of the vector representing the example) arranged in such a way as to allow us to separate/classify them by drawing a straight line.
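As a concrete (and entirely illustrative) counterpart to Figure 12.1, the following Python sketch generates toy two-dimensional data of this kind: two clusters that a straight line can separate. The cluster centers, the number of points, and the use of NumPy are our own choices and are not part of the book.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two hypothetical, linearly separable clusters in 2D, in the spirit of Figure 12.1:
# label +1 ("red crosses") and label -1 ("blue dots").
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))

X = np.vstack([X_pos, X_neg])                 # examples x_n, one per row
y = np.hstack([np.ones(20), -np.ones(20)])    # binary labels y_n in {+1, -1}
```

We reuse this toy X and y in the later sketches of this chapter.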
12.1 Separating Hyperplanes
Figure 12.2 Equation of a separating hyperplane (12.3). (left) The standard way of representing the equation in 3D. (right) For ease of drawing, we look at the hyperplane edge on.
Recall that the separating hyperplane is defined via the function $f(x) = \langle w, x \rangle + b$ as the set of points where $f(x) = 0$ (12.3). Consider two examples $x_a$ and $x_b$ that lie on the hyperplane, and consider the difference of their function values,

$$
\begin{aligned}
f(x_a) - f(x_b) &= \langle w, x_a \rangle + b - \big( \langle w, x_b \rangle + b \big) \\
&= \langle w, x_a - x_b \rangle ,
\end{aligned}
$$

where the second line is obtained by the linearity of the inner product (Section 3.2). Since we have chosen $x_a$ and $x_b$ to be on the hyperplane, this implies that $f(x_a) = 0$ and $f(x_b) = 0$ and hence $\langle w, x_a - x_b \rangle = 0$. Recall that two vectors are orthogonal when their inner product is zero. Therefore, we obtain that $w$ is orthogonal to any vector on the hyperplane.
Remark. Recall from Chapter 2 that we can think of vectors in different ways. In this chapter, we think of the parameter vector $w$ as an arrow indicating a direction, i.e., we consider $w$ to be a geometric vector. In contrast, we think of the example vector $x$ as a data point (as indicated by its coordinates), i.e., we consider $x$ to be the coordinates of a vector with respect to the standard basis. ♦
When presented with a test example, we classify the example as positive or negative depending on which side of the hyperplane it occurs. Note that (12.3) not only defines a hyperplane; it additionally defines a direction. In other words, it defines the positive and negative side of the hyperplane. Therefore, to classify a test example $x_{\text{test}}$, we calculate the value of the function $f(x_{\text{test}})$ and classify the example as $+1$ if $f(x_{\text{test}}) \geqslant 0$ and $-1$ otherwise. Thinking geometrically, the positive examples lie "above" the hyperplane and the negative examples "below" the hyperplane.
When training the classifier, we want to ensure that the examples with positive labels are on the positive side of the hyperplane, i.e.,

$$
\langle w, x_n \rangle + b > 0 \quad \text{when} \quad y_n = +1 , \tag{12.5}
$$

and the examples with negative labels are on the negative side, i.e.,

$$
\langle w, x_n \rangle + b < 0 \quad \text{when} \quad y_n = -1 . \tag{12.6}
$$

These two conditions are often combined into the single equation

$$
y_n \big( \langle w, x_n \rangle + b \big) > 0 . \tag{12.7}
$$

Equation (12.7) is equivalent to (12.5) and (12.6) when we multiply both sides of (12.5) and (12.6) with $y_n = 1$ and $y_n = -1$, respectively.
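The classification rule and the training condition (12.7) translate directly into a few lines of code. The sketch below is our own illustration; it assumes a hypothetical parameter vector w and offset b together with the toy data X, y generated above.

```python
import numpy as np

w = np.array([1.0, 1.0])   # hypothetical parameters of a separating hyperplane
b = 0.0

def predict(w, b, x):
    """Classify a single example by the side of the hyperplane it falls on."""
    f_x = w @ x + b                      # f(x) = <w, x> + b
    return 1 if f_x >= 0 else -1

predictions = np.array([predict(w, b, x_n) for x_n in X])

# Condition (12.7): an example is correctly classified iff y_n * f(x_n) > 0.
correctly_classified = y * (X @ w + b) > 0
print(predictions[:5], correctly_classified.all())
```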
Figure 12.3 Possible separating hyperplanes. There are many linear classifiers (green lines) that separate red crosses from blue dots (axes $x^{(1)}$ and $x^{(2)}$).
12.2 Primal Support Vector Machine
In other words, we combine the requirements that examples are at least $r$ away from the hyperplane (in the positive and negative direction) into one single inequality.

Since we are interested only in the direction, we add an assumption to our model that the parameter vector $w$ is of unit length, i.e., $\|w\| = 1$, where we use the Euclidean norm $\|w\| = \sqrt{w^{\top} w}$ (Section 3.1). (We will see other choices of inner products (Section 3.2) in Section 12.4.) This assumption also allows a more intuitive interpretation of the distance $r$ in (12.8), since it is the scaling factor of a vector of length 1.

Remark. A reader familiar with other presentations of the margin would notice that our definition of $\|w\| = 1$ is different from the standard presentation of the SVM provided by Schölkopf and Smola (2002), for example. In Section 12.2.3, we will show the equivalence of both approaches. ♦
Figure 12.5 Derivation of the margin: $r = \frac{1}{\|w\|}$ (the hyperplanes $\langle w, x \rangle + b = 0$ and $\langle w, x \rangle + b = 1$, with a point $x_a$ and its projection $x_a'$).

Collecting the three requirements into a single constrained optimization problem, we obtain the objective

$$
\max_{w, b, r} \; \underbrace{r}_{\text{margin}} \quad \text{subject to} \quad \underbrace{y_n \big( \langle w, x_n \rangle + b \big) \geqslant r}_{\text{data fitting}} , \quad \underbrace{\|w\| = 1}_{\text{normalization}} , \quad r > 0 , \tag{12.10}
$$
which says that we want to maximize the margin $r$, while ensuring that the data lies on the correct side of the hyperplane.
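To make the quantity being maximized in (12.10) concrete, here is a small helper (ours, not from the book) that computes the margin of a given hyperplane (w, b) on correctly classified data; with ‖w‖ = 1 it is exactly the smallest value of y_n(⟨w, x_n⟩ + b) over the training set.

```python
import numpy as np

def margin(w, b, X, y):
    """Smallest signed distance of any example to the hyperplane (w, b).

    Assumes every example is correctly classified. Dividing by ||w|| makes the
    value independent of the scale of w; for ||w|| = 1 it equals r in (12.10).
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

w_unit = np.array([1.0, 1.0]) / np.sqrt(2.0)   # a unit-length candidate direction
print(margin(w_unit, 0.0, X, y))               # X, y: the toy data from earlier
```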
Remark. The concept of the margin turns out to be highly pervasive in machine learning. It was used by Vladimir Vapnik and Alexey Chervonenkis to show that when the margin is large, the "complexity" of the function class is low, and, hence, learning is possible (Vapnik, 2000). It turns out that the concept is useful for various different approaches for theoretically analyzing generalization error (Shalev-Shwartz and Ben-David, 2014; Steinwart and Christmann, 2008). ♦
$$
\max_{w', b, r} \; r^2 \quad \text{subject to} \quad y_n \left( \left\langle \frac{w'}{\|w'\|}, x_n \right\rangle + b \right) \geqslant r , \quad r > 0 . \tag{12.22}
$$

Equation (12.22) explicitly states that the distance $r$ is positive. (Note that $r > 0$ because we assumed linear separability, and hence there is no issue dividing by $r$.) Therefore, we can divide the first constraint by $r$, which yields

$$
\max_{w', b, r} \; r^2 \quad \text{subject to} \quad y_n \left( \left\langle \underbrace{\frac{w'}{\|w'\| r}}_{w''}, x_n \right\rangle + \underbrace{\frac{b}{r}}_{b''} \right) \geqslant 1 , \quad r > 0 , \tag{12.23}
$$
renaming the parameters to $w''$ and $b''$. Since $w'' = \frac{w'}{\|w'\| r}$, rearranging for $r$ gives

$$
\|w''\| = \left\| \frac{w'}{\|w'\| r} \right\| = \frac{1}{r} \cdot \left\| \frac{w'}{\|w'\|} \right\| = \frac{1}{r} . \tag{12.24}
$$

By substituting this result into (12.23), we obtain

$$
\max_{w'', b''} \; \frac{1}{\|w''\|^2} \quad \text{subject to} \quad y_n \big( \langle w'', x_n \rangle + b'' \big) \geqslant 1 . \tag{12.25}
$$

The final step is to observe that maximizing $\frac{1}{\|w''\|^2}$ yields the same solution as minimizing $\frac{1}{2} \|w''\|^2$, which concludes the proof of Theorem 12.1.
Figure: The slack variable $\xi$ measures the distance of a positive example $x_{+}$ to the positive margin hyperplane $\langle w, x \rangle + b = 1$ when $x_{+}$ is on the wrong side.
We will see in this section that the margin corresponds to the regularization term. The remaining question is: what is the loss function? In contrast to Chapter 9, where we consider regression problems (the output of the predictor is a real number), in this chapter, we consider binary classification problems (the output of the predictor is one of two labels $\{+1, -1\}$). Therefore, the error/loss function for each single example-label pair needs to be appropriate for binary classification. For example, the squared loss that is used for regression (9.10b) is not suitable for binary classification.

Remark. The ideal loss function between binary labels is to count the number of mismatches between the prediction and the label. This means that for a predictor $f$ applied to an example $x_n$, we compare the output $f(x_n)$ with the label $y_n$. We define the loss to be zero if they match, and one if they do not match. This is denoted by $\mathbf{1}(f(x_n) \neq y_n)$ and is called the zero-one loss. Unfortunately, the zero-one loss results in a combinatorial optimization problem for finding the best parameters $w, b$. Combinatorial optimization problems (in contrast to continuous optimization problems discussed in Chapter 7) are in general more challenging to solve. ♦
What is the loss function corresponding to the SVM? Consider the error between the output of a predictor $f(x_n)$ and the label $y_n$. The loss describes the error that is made on the training data. An equivalent way to derive (12.26a) is to use the hinge loss

$$
\ell(t) = \max\{0, 1 - t\} , \quad \text{where} \quad t = y f(x) = y \big( \langle w, x \rangle + b \big) . \tag{12.28}
$$
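For concreteness (our own sketch, not from the text), the zero-one loss from the preceding remark and the hinge loss (12.28) can be written as two small functions of the raw prediction f(x) and the label y:

```python
import numpy as np

def zero_one_loss(f_x, y):
    """1 if the predicted label disagrees with y, 0 otherwise; combinatorial to optimize."""
    return float(np.sign(f_x) != y)

def hinge_loss(f_x, y):
    """max{0, 1 - y*f(x)}: a convex surrogate that upper-bounds the zero-one loss."""
    return max(0.0, 1.0 - y * f_x)

print(zero_one_loss(0.3, -1), hinge_loss(0.3, -1))   # a misclassified example: 1.0 and 1.3
```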
Figure: The hinge loss $\max\{0, 1 - t\}$ as a function of $t$; it is a convex upper bound of the zero-one loss.
For the whole training set, we seek to minimize the total loss, while regularizing the objective with $\ell_2$-regularization (see Section 8.1.3). Using the hinge loss (12.28) gives us the unconstrained optimization problem

$$
\min_{w, b} \; \underbrace{\frac{1}{2} \|w\|^2}_{\text{regularizer}} + \underbrace{C \sum_{n=1}^{N} \max\{0, 1 - y_n (\langle w, x_n \rangle + b)\}}_{\text{error term}} . \tag{12.31}
$$
The first term in (12.31) is called the regularization term or the regularizer (see Section 9.2.3), and the second term is called the loss term or the error term. Recall from Section 12.2.4 that the term $\frac{1}{2} \|w\|^2$ arises directly from the margin. In other words, margin maximization can be interpreted as regularization.
In principle, the unconstrained optimization problem in (12.31) can be directly solved with (sub-)gradient descent methods as described in Section 7.1. To see that (12.31) and (12.26a) are equivalent, observe that the hinge loss (12.28) essentially consists of two linear parts, as expressed in (12.29). Consider the hinge loss for a single example-label pair (12.28). We can equivalently replace minimization of the hinge loss over $t$ with a minimization of a slack variable $\xi$ with two constraints. In equation form,

$$
\min_{t} \; \max\{0, 1 - t\} \tag{12.32}
$$

is equivalent to

$$
\min_{\xi, t} \; \xi \quad \text{subject to} \quad \xi \geqslant 0 , \quad \xi \geqslant 1 - t . \tag{12.33}
$$

By substituting this expression into (12.31) and rearranging one of the constraints, we obtain exactly the soft margin SVM (12.26a).
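To illustrate the first route, here is a minimal (sub-)gradient descent sketch for the unconstrained problem (12.31). It is our own illustration under the assumptions that the whole dataset fits in memory and that a fixed step size is acceptable; Section 7.1 treats these methods properly.

```python
import numpy as np

def soft_margin_svm_subgradient(X, y, C=1.0, step_size=0.01, num_steps=2000):
    """Minimize 0.5*||w||^2 + C * sum_n max{0, 1 - y_n(<w, x_n> + b)} by subgradient descent."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_steps):
        margins = y * (X @ w + b)
        active = margins < 1                    # examples whose hinge loss is non-zero
        # Subgradient: w from the regularizer, -y_n x_n (and -y_n for b) from each active term.
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b

w_sgd, b_sgd = soft_margin_svm_subgradient(X, y)   # X, y: the toy data from earlier
```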
Remark. Let us contrast our choice of the loss function in this section to the loss function for linear regression in Chapter 9. Recall from Section 9.2.1 that for finding maximum likelihood estimators, we usually minimize the negative log-likelihood. Furthermore, since the likelihood term for linear regression with Gaussian noise is Gaussian, the negative log-likelihood for each example is a squared error function. The squared error function is the loss function that is minimized when looking for the maximum likelihood solution. ♦

12.3 Dual Support Vector Machine
$$
\frac{\partial L}{\partial b} = \sum_{n=1}^{N} \alpha_n y_n , \tag{12.36}
$$

$$
\frac{\partial L}{\partial \xi_n} = C - \alpha_n - \gamma_n . \tag{12.37}
$$

$$
w = \sum_{n=1}^{N} \alpha_n y_n x_n , \tag{12.38}
$$
which is a particular instance of the representer theorem (Kimeldorf and Wahba, 1970). (The representer theorem is actually a collection of theorems saying that the solution of minimizing empirical risk lies in the subspace (Section 2.4.3) defined by the examples.) Equation (12.38) states that the optimal weight vector in the primal is a linear combination of the examples $x_n$. Recall from Section 2.6.1 that this means that the solution of the optimization problem lies in the span of the training data. Additionally, the constraint obtained by setting (12.36) to zero implies that the optimal weight vector is an affine combination of the examples. The representer theorem turns out to hold for very general settings of regularized empirical risk minimization (Hofmann et al., 2008; Argyriou and Dinuzzo, 2014). The theorem has more general versions (Schölkopf et al., 2001), and necessary and sufficient conditions on its existence can be found in Yu et al. (2013).

Remark. The representer theorem (12.38) also provides an explanation of the name Support Vector Machine. The examples $x_n$, for which the corresponding parameters $\alpha_n = 0$, do not contribute to the solution $w$ at all. The other examples, where $\alpha_n > 0$, are called support vectors since they "support" the hyperplane. ♦
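A short sketch (ours) makes the remark concrete: given dual variables α for a training set, the primal w is assembled from the examples via (12.38), and only the examples with α_n > 0 contribute. The variables alpha here are hypothetical, e.g., produced by a dual solver such as the one sketched after (12.41) below.

```python
import numpy as np

def recover_primal_from_dual(alpha, X, y, tol=1e-6):
    """Recover w via the representer theorem (12.38) and flag the support vectors."""
    support = alpha > tol            # support vectors: examples with alpha_n > 0
    w = (alpha * y) @ X              # w = sum_n alpha_n y_n x_n
    return w, support
```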
By substituting the expression for $w$ into the Lagrangian (12.34), we obtain the dual

$$
D(\xi, \alpha, \gamma) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle
- \sum_{i=1}^{N} y_i \alpha_i \left\langle \sum_{j=1}^{N} y_j \alpha_j x_j , \, x_i \right\rangle
+ C \sum_{i=1}^{N} \xi_i - b \sum_{i=1}^{N} y_i \alpha_i + \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N} \alpha_i \xi_i - \sum_{i=1}^{N} \gamma_i \xi_i . \tag{12.39}
$$
$$
\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} \alpha_i
\quad \text{subject to} \quad \sum_{i=1}^{N} y_i \alpha_i = 0 , \quad 0 \leqslant \alpha_i \leqslant C \;\; \text{for all} \;\; i = 1, \ldots, N . \tag{12.41}
$$
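The dual (12.41) is a quadratic program in α and can be handed to a generic QP solver. The sketch below is our own and assumes the cvxopt package and the dot product as the kernel; the excerpt above does not prescribe any particular solver.

```python
import numpy as np
from cvxopt import matrix, solvers

def dual_soft_margin_svm(X, y, C=1.0):
    """Solve (12.41): min_a 0.5 a^T (y y^T * K) a - 1^T a, s.t. y^T a = 0, 0 <= a_i <= C."""
    N = X.shape[0]
    K = X @ X.T                                        # Gram matrix for the dot product kernel
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))     # encodes 0 <= alpha_i and alpha_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    solution = solvers.qp(P, q, G, h, A, b)
    return np.array(solution["x"]).ravel()             # the optimal alpha

alpha = dual_soft_margin_svm(X, y, C=1.0)              # X, y: the toy data from earlier
```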
Figure 12.9 Convex hulls. (a) Convex hull of points, some of which lie within the boundary. (b) Convex hulls around positive (blue) and negative (red) examples; the distance between the two convex sets is the length of the difference vector $c - d$.
We pick a point $c$ that is in the convex hull of the set of positive examples and is closest to the negative class distribution. Similarly, we pick a point $d$ in the convex hull of the set of negative examples that is closest to the positive class distribution; see Figure 12.9(b). We define a difference vector between $d$ and $c$ as

$$
w := c - d . \tag{12.44}
$$

Picking the points $c$ and $d$ as above, and requiring them to be closest to each other, is the same as saying that we want to minimize the length/norm of $w$, so that we end up with the corresponding optimization problem

$$
\arg\min_{w} \|w\| = \arg\min_{w} \frac{1}{2} \|w\|^2 . \tag{12.45}
$$
$$
= \sum_{n : y_n = +1} \alpha_n^{+} - \sum_{n : y_n = -1} \alpha_n^{-} = 1 - 1 = 0 . \tag{12.51b}
$$
The objective function (12.48) and the constraint (12.50), along with the assumption that $\alpha \geqslant 0$, give us a constrained (convex) optimization problem. This optimization problem can be shown to be the same as that of the dual hard margin SVM (Bennett and Bredensteiner, 2000a).

Remark. To obtain the soft margin dual, we consider the reduced hull. The reduced hull is similar to the convex hull but has an upper bound on the size of the coefficients $\alpha$. The maximum possible value of the elements of $\alpha$ restricts the size that the convex hull can take. In other words, the bound on $\alpha$ shrinks the convex hull to a smaller volume (Bennett and Bredensteiner, 2000b). ♦
12.4 Kernels
This is known as the kernel trick (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004), as it hides away the explicit non-linear feature map.
The matrix $K \in \mathbb{R}^{N \times N}$, resulting from the inner products or the application of $k(\cdot, \cdot)$ to a dataset, is called the Gram matrix, and is often just referred to as the kernel matrix. Kernels must be symmetric and positive semi-definite functions, so that every kernel matrix $K$ is symmetric and positive semi-definite (Section 3.2.3):

$$
\forall z \in \mathbb{R}^{N} : \quad z^{\top} K z \geqslant 0 .
$$
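The sketch below (ours) builds a Gram matrix for a small random dataset and checks the two properties numerically. The Gaussian (RBF) kernel used here is just one common choice of kernel and is our own pick for the illustration.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """One common kernel choice: k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(X, kernel):
    """Gram (kernel) matrix with entries K_ij = k(x_i, x_j)."""
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

X_demo = np.random.default_rng(1).normal(size=(5, 2))
K = gram_matrix(X_demo, rbf_kernel)
assert np.allclose(K, K.T)                          # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)      # positive semi-definite up to round-off
```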
12.5 Numerical Solution
Using this subgradient above, we can apply the optimization methods presented in Section 7.1.

Both the primal and the dual SVM result in a convex quadratic programming problem (constrained optimization). Note that the primal SVM in (12.26a) has optimization variables that have the size of the dimension $D$ of the input examples. The dual SVM in (12.41) has optimization variables that have the size of the number $N$ of examples.
To express the primal SVM in the standard form (7.45) for quadratic programming, let us assume that we use the dot product (3.5) as the inner product. (Recall from Section 3.2 that in this book, we use the phrase dot product to mean the inner product on Euclidean vector space.) We rearrange the equation for the primal SVM (12.26a), such that the optimization variables are all on the right and the inequality of the constraint matches the standard form. This yields the optimization

$$
\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n
\quad \text{subject to} \quad
- y_n x_n^{\top} w - y_n b - \xi_n \leqslant -1 , \quad -\xi_n \leqslant 0 , \tag{12.55}
$$
for $n = 1, \ldots, N$. By concatenating the variables $w, b, \xi_n$ into a single vector, and carefully collecting the terms, we obtain the following matrix form of the soft margin SVM:

$$
\min_{w, b, \xi} \; \frac{1}{2}
\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}^{\top}
\begin{bmatrix} I_D & 0_{D, N+1} \\ 0_{N+1, D} & 0_{N+1, N+1} \end{bmatrix}
\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}
+
\begin{bmatrix} 0_{D+1, 1} \\ C 1_{N, 1} \end{bmatrix}^{\top}
\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}
$$

$$
\text{subject to} \quad
\begin{bmatrix} -Y X & -y & -I_N \\ 0_{N, D} & 0_{N, 1} & -I_N \end{bmatrix}
\begin{bmatrix} w \\ b \\ \xi \end{bmatrix}
\leqslant
\begin{bmatrix} -1_{N, 1} \\ 0_{N, 1} \end{bmatrix} . \tag{12.56}
$$
In the above optimization problem, the minimization is over $[w^{\top}, b, \xi^{\top}]^{\top} \in \mathbb{R}^{D+1+N}$, and we use the notation: $I_m$ to represent the identity matrix of size $m \times m$, $0_{m,n}$ to represent the matrix of zeros of size $m \times n$, and $1_{m,n}$ to represent the matrix of ones of size $m \times n$. In addition, $y$ is the vector of labels $[y_1, \ldots, y_N]^{\top}$, $Y = \operatorname{diag}(y)$ is an $N$ by $N$ matrix where the elements of the diagonal are from $y$, and $X \in \mathbb{R}^{N \times D}$ is the matrix obtained by concatenating all the examples.
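As with the dual, (12.56) can be handed to a generic QP solver once the block matrices are assembled. The sketch below is our own and again assumes cvxopt; it simply builds the matrices of (12.56) over θ = [w, b, ξ] and reads w, b, and ξ back out of the solution.

```python
import numpy as np
from cvxopt import matrix, solvers

def primal_soft_margin_svm_qp(X, y, C=1.0):
    """Assemble the matrices of (12.56) over theta = [w, b, xi] and solve the QP (sketch)."""
    N, D = X.shape
    Y = np.diag(y)
    P = np.zeros((D + 1 + N, D + 1 + N))
    P[:D, :D] = np.eye(D)                                # quadratic term acts on w only
    q = np.hstack([np.zeros(D + 1), C * np.ones(N)])     # linear term: C * sum_n xi_n
    G = np.block([
        [-Y @ X, -y[:, None], -np.eye(N)],               # -y_n x_n^T w - y_n b - xi_n <= -1
        [np.zeros((N, D + 1)), -np.eye(N)],              # -xi_n <= 0
    ])
    h = np.hstack([-np.ones(N), np.zeros(N)])
    solvers.options["show_progress"] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    theta = np.array(sol["x"]).ravel()
    return theta[:D], theta[D], theta[D + 1:]            # w, b, xi

w_qp, b_qp, xi_qp = primal_soft_margin_svm_qp(X, y, C=1.0)   # X, y: the toy data from earlier
```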
We can similarly perform a collection of terms for the dual version of the SVM (12.41). To express the dual SVM in standard form, we first have to express the kernel matrix $K$ such that each entry is $K_{ij} = k(x_i, x_j)$.
12.6 Further Reading
separately and calculates their convex conjugates (Rifkin and Lippert, 2007). Readers interested in the functional analysis view (also the regularization methods view) of SVMs are referred to the work by Wahba (1990). Theoretical exposition of kernels (Manton and Amblard, 2015; Aronszajn, 1950; Schwartz, 1964; Saitoh, 1988) requires a basic grounding in linear operators (Akhiezer and Glazman, 1993). The idea of kernels has been generalized to Banach spaces (Zhang et al., 2009) and Kreĭn spaces (Ong et al., 2004; Loosli et al., 2016).

Observe that the hinge loss has three equivalent representations, as shown in (12.28) and (12.29), as well as the constrained optimization problem in (12.33). The formulation (12.28) is often used when comparing the SVM loss function with other loss functions (Steinwart, 2007). The two-piece formulation (12.29) is convenient for computing subgradients, as each piece is linear. The third formulation (12.33), as seen in Section 12.5, enables the use of convex quadratic programming (Section 7.3.2) tools.
Since binary classification is a well-studied task in machine learning, other words are also sometimes used, such as discrimination, separation, or decision. Furthermore, there are three quantities that can be the output of a binary classifier. First is the output of the linear function itself (often called the score), which can take any real value. This output can be used for ranking the examples, and binary classification can be thought of as picking a threshold on the ranked examples (Shawe-Taylor and Cristianini, 2004). The second quantity that is often considered the output of a binary classifier is obtained by passing the score through a non-linear function that constrains its value to a bounded range, for example the interval $[0, 1]$. A common non-linear function is the sigmoid function (Bishop, 2006). When the non-linearity results in well-calibrated probabilities (Gneiting and Raftery, 2007; Reid and Williamson, 2011), this is called class probability estimation. The third output of a binary classifier is the final binary decision $\{+1, -1\}$, which is the one most commonly assumed to be the output of the classifier.
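A tiny sketch (ours) of these three quantities for a linear classifier; the sigmoid is merely the common choice mentioned above, and its output is not calibrated unless an additional calibration step is performed.

```python
import numpy as np

def classifier_outputs(w, b, x):
    """Return the three common outputs of a binary linear classifier for one example x."""
    score = w @ x + b                       # real-valued score, usable for ranking examples
    prob = 1.0 / (1.0 + np.exp(-score))     # score squashed into [0, 1] by the sigmoid
    decision = 1 if score >= 0 else -1      # final binary decision in {+1, -1}
    return score, prob, decision

print(classifier_outputs(np.array([1.0, -2.0]), 0.5, np.array([0.3, 0.1])))
```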
The SVM is a binary classifier that does not naturally lend itself to a probabilistic interpretation. There are several approaches for converting the raw output of the linear function (the score) into a calibrated class probability estimate $P(Y = 1 \mid X = x)$, which involve an additional calibration step (Platt, 2000; Lin et al., 2007; Zadrozny and Elkan, 2001). From the training perspective, there are many related probabilistic approaches. We mentioned at the end of Section 12.2.5 that there is a relationship between the loss function and the likelihood (also compare Section 8.1 and Section 8.2). The maximum likelihood approach corresponding to a well-calibrated transformation during training is called logistic regression, which comes from a class of methods called generalized linear models. Details of logistic regression from this point of view can be found in Agresti (2002, Chapter 5) and McCullagh and Nelder (1989, Chapter 4). Naturally, one could take a more Bayesian view of the classifier output by estimating a posterior distribution using Bayesian logistic regression. The Bayesian view also includes the specification of the prior, which involves design choices such as conjugacy (Section 6.6.1) with the likelihood. Additionally, one could consider latent functions as priors, which results in Gaussian process classification (Rasmussen and Williams, 2006, Chapter 3).