12 Classification with Support Vector Machines
In many situations, we want our machine learning algorithm to predict one of a number of (discrete) outcomes. For example, an email client that sorts mail into personal mail and junk mail has two outcomes. Another example is a telescope that identifies whether an object in the night sky is a galaxy, star or planet. There are usually a small number of outcomes, and more importantly there is usually no additional structure on these outcomes. (An example of structure would be if the outcomes were ordered, as in the case of small, medium and large t-shirts.) In this chapter, we consider predictors that output binary values, i.e., there are only two possible outcomes. This machine learning task is called binary classification. This is in contrast to Chapter 9, where we considered a prediction problem with continuous-valued outputs.

For binary classification the set of possible values that the label/output can attain is binary, and for this chapter we denote them by {+1, −1}. In other words, we consider predictors of the form

f : R^D → {+1, −1} .    (12.1)

Recall from Chapter 8 that we represent each example (data point) x_n as a feature vector of D real numbers. (An input example x_n may also be referred to as an input, data point, feature or instance.) The labels are often referred to as the positive and negative classes, respectively. One should be careful not to infer intuitive attributes of positiveness of the +1 class. For example, in a cancer detection task, a patient with cancer is often labelled +1. In principle, any two distinct values can be used, e.g., {True, False}, {0, 1} or {red, blue}. (For probabilistic models, it is mathematically convenient to use {0, 1} as a binary representation; see the remark after Example 6.12.) The problem of binary classification is well studied, and we defer a survey of other approaches to Section 12.6.

We present an approach known as the Support Vector Machine (SVM), which solves the binary classification task. As in regression, we have a supervised learning task, where we have a set of examples x_n ∈ R^D along with their corresponding (binary) labels y_n ∈ {+1, −1}. Given a training data set consisting of example-label pairs {(x_1, y_1), . . . , (x_N, y_N)}, we would like to estimate parameters of the model that will give the smallest classification error. Similar to Chapter 9, we consider a linear model, and hide away the nonlinearity in a transformation φ of the examples (9.13). We will revisit φ in Section 12.4.

The SVM provides state-of-the-art results in many applications, with sound theoretical guarantees (Steinwart and Christmann, 2008). There are two main reasons why we chose to illustrate binary classification using SVMs. First, the SVM allows for a geometric way to think about supervised machine learning. While in Chapter 9 we considered the machine learning problem in terms of probabilistic models and attacked it using maximum likelihood estimation and Bayesian inference, here we will consider an alternative approach where we reason geometrically about the machine learning task. It relies heavily on concepts, such as inner products and projections, which we discussed in Chapter 3. The second reason why we find SVMs instructive is that, in contrast to Chapter 9, the optimization problem for the SVM does not admit an analytic solution, so that we need to resort to a variety of optimization tools introduced in Chapter 7.

Figure 12.1 Example 2D data (axes x^(1) and x^(2)), illustrating the intuition of data where we can find a linear classifier that separates red crosses from blue dots.

The SVM view of machine learning is subtly different from the maximum likelihood view of Chapter 9. The maximum likelihood view proposes a model based on a probabilistic view of the data distribution, from which an optimization problem is derived. In contrast, the SVM view starts by designing a particular function that is to be optimized during training, based on geometric intuitions. We have seen something similar already in Chapter 10, where we derived PCA from geometric principles. In the SVM case, we start by designing an objective function that is to be minimized on training data, following the principles of empirical risk minimization (Section 8.1). This can also be understood as designing a particular loss function.

Let us derive the optimization problem corresponding to training an SVM on example-label pairs. Intuitively, we imagine binary classification data which can be separated by a hyperplane, as illustrated in Figure 12.1. Here, every example x_n (a vector of dimension 2) is a two-dimensional location (x_n^(1) and x_n^(2)), and the corresponding binary label y_n is one of two different symbols (red cross or blue disc). “Hyperplane” is a word that is commonly used in machine learning, and we encountered hyperplanes already in Section 2.8. A hyperplane is an affine subspace of dimension D − 1 (if the corresponding vector space is of dimension D). The examples consist of two classes (there are two possible labels) that have features (the components of the vector representing the example) arranged in such a way as to allow us to separate/classify them by drawing a straight line.


In the following, we formalize the idea of finding a linear separator of the two classes. We introduce the idea of the margin and then extend linear separators to allow for examples to fall on the “wrong” side, incurring a classification error. We present two equivalent ways of formalizing the SVM: the geometric view (Section 12.2.4) and the loss function view (Section 12.2.5). We derive the dual version of the SVM using Lagrange multipliers (Section 7.2). The dual SVM allows us to observe a third way of formalizing the SVM: in terms of the convex hulls of the examples of each class (Section 12.3.2). We conclude by briefly describing kernels and how to numerically solve the nonlinear kernel-SVM optimization problem.

12.1 Separating Hyperplanes

Given two examples represented as vectors x_i and x_j, one way to compute the similarity between them is using an inner product ⟨x_i, x_j⟩. Recall from Section 3.2 that inner products are closely related to the angle between two vectors. The value of the inner product between two vectors depends on the length (norm) of each vector. Furthermore, inner products allow us to rigorously define geometric concepts such as orthogonality and projections.
The main idea behind many classification algorithms is to represent data in R^D and then partition this space, ideally in a way that examples with the same label are in the same partition (and no other examples). In the case of binary classification, the space would be divided into two parts corresponding to the positive and negative classes, respectively. We consider a particularly convenient partition, which is to (linearly) split the space into two halves using a hyperplane. Let example x ∈ R^D be an element of the data space. Consider a function

f : R^D → R    (12.2a)
x ↦ ⟨w, x⟩ + b ,    (12.2b)

parametrized by w ∈ R^D and b ∈ R. Recall from Section 2.8 that hyperplanes are affine subspaces. Therefore, we define the hyperplane that separates the two classes in our binary classification problem as

{x ∈ R^D : f(x) = 0} .    (12.3)

Figure 12.2 Equation of a separating hyperplane (12.3). (left) The standard way of representing the equation in 3D. (right) For ease of drawing, we look at the hyperplane edge on.

An illustration of the hyperplane is shown in Figure 12.2, where the vector w is a vector normal to the hyperplane and b the intercept. We can derive that w is a normal vector to the hyperplane in (12.3) by choosing any two examples x_a and x_b on the hyperplane and showing that the vector between them is orthogonal to w. In the form of an equation,

f(x_a) − f(x_b) = ⟨w, x_a⟩ + b − (⟨w, x_b⟩ + b)    (12.4a)
               = ⟨w, x_a − x_b⟩ ,    (12.4b)

where the second line is obtained by the linearity of the inner product (Section 3.2). Since we have chosen x_a and x_b to be on the hyperplane, this implies that f(x_a) = 0 and f(x_b) = 0 and hence ⟨w, x_a − x_b⟩ = 0. Recall that two vectors are orthogonal when their inner product is zero. Therefore, we obtain that w is orthogonal to any vector on the hyperplane.

Remark. Recall from Chapter 2 that we can think of vectors in different ways. In this chapter, we think of the parameter vector w as an arrow indicating a direction, i.e., we consider w to be a geometric vector. In contrast, we think of the example vector x as a data point (as indicated by its coordinates), i.e., we consider x to be the coordinates of a vector with respect to the standard basis. ♦

When presented with a test example, we classify the example as positive or negative depending on which side of the hyperplane it occurs. Note that (12.3) not only defines a hyperplane; it additionally defines a direction. In other words, it defines the positive and negative side of the hyperplane. Therefore, to classify a test example x_test, we calculate the value of the function f(x_test) and classify the example as +1 if f(x_test) > 0 and −1 otherwise. Thinking geometrically, the positive examples lie “above” the hyperplane and the negative examples “below” the hyperplane.

When training the classifier, we want to ensure that the examples with positive labels are on the positive side of the hyperplane, i.e.,

⟨w, x_n⟩ + b > 0  when  y_n = +1 ,    (12.5)

and the examples with negative labels are on the negative side, i.e.,

⟨w, x_n⟩ + b < 0  when  y_n = −1 .    (12.6)

Refer to Figure 12.2 for a geometric intuition of positive and negative examples. These two conditions are often presented in a single equation

y_n(⟨w, x_n⟩ + b) > 0 .    (12.7)

Equation (12.7) is equivalent to (12.5) and (12.6) when we multiply both sides of (12.5) and (12.6) with y_n = 1 and y_n = −1, respectively.
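To make the classification rule after (12.3) and the condition (12.7) concrete, here is a minimal sketch in NumPy; the weight vector, intercept and data points are made-up values chosen purely for illustration and are not taken from the text.

    import numpy as np

    # Hypothetical hyperplane parameters (illustrative values).
    w = np.array([1.0, -2.0])
    b = 0.5

    def f(x):
        """Linear function f(x) = <w, x> + b from (12.2b)."""
        return w @ x + b

    def classify(x):
        """Predict +1 if f(x) > 0 and -1 otherwise."""
        return +1 if f(x) > 0 else -1

    # A few made-up labelled examples (x_n, y_n).
    X = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 1.0], [-1.0, 1.0]])
    y = np.array([+1, -1, +1, -1])

    for x_n, y_n in zip(X, y):
        on_correct_side = y_n * f(x_n) > 0   # condition (12.7)
        print(x_n, "predicted:", classify(x_n), "label:", y_n, "condition (12.7):", on_correct_side)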


Figure 12.3 Possible separating hyperplanes. There are many linear classifiers (green lines) that separate red crosses from blue dots.

12.2 Primal Support Vector Machine

Based on the concept of distances from points to a hyperplane, we are now in a position to discuss the support vector machine. For a dataset {(x_1, y_1), . . . , (x_N, y_N)} that is linearly separable, we have infinitely many candidate hyperplanes (refer to Figure 12.3), and therefore classifiers, that solve our classification problem without any (training) errors. To find a unique solution, one idea is to choose the separating hyperplane that maximizes the margin between the positive and negative examples. In other words, we want the positive and negative examples to be separated by a large margin (Section 12.2.1). (A classifier with large margin turns out to generalize well; Steinwart and Christmann, 2008.) In the following, we compute the distance between an example and a hyperplane to derive the margin. Recall that the closest point on the hyperplane to a given point (example x_n) is obtained by the orthogonal projection (Section 3.8).

12.2.1 Concept of the Margin

The concept of the margin is intuitively simple: It is the distance of the separating hyperplane to the closest examples in the dataset, assuming that the dataset is linearly separable. (There could be two or more closest examples to a hyperplane.) However, when trying to formalize this distance, there is a technical wrinkle that may be confusing. The technical wrinkle is that we need to define a scale at which to measure the distance. A potential scale is to consider the scale of the data, i.e., the raw values of x_n. There are problems with this, as we could change the units of measurement of x_n and change the values in x_n, and, hence, change the distance to the hyperplane. As we will see shortly, we define the scale based on the equation of the hyperplane (12.3) itself.
Consider a hyperplane ⟨w, x⟩ + b, and an example x_a as illustrated in Figure 12.4. Without loss of generality, we can consider the example x_a to be on the positive side of the hyperplane, i.e., ⟨w, x_a⟩ + b > 0. We would like to compute the distance r > 0 of x_a from the hyperplane. We do so by considering the orthogonal projection (Section 3.8) of x_a onto the hyperplane, which we denote by x'_a. Since w is orthogonal to the hyperplane, we know that the distance r is just a scaling of this vector w. If the length of w is known, then we can use this scaling factor r to work out the absolute distance between x_a and x'_a. For convenience we choose to use a vector of unit length (its norm is 1), and obtain this by dividing w by its norm, w/‖w‖. Using vector addition (Section 2.4) we obtain

x_a = x'_a + r w/‖w‖ .    (12.8)

Figure 12.4 Vector addition to express the distance to the hyperplane: x_a = x'_a + r w/‖w‖.
Another way of thinking about r is that it is the coordinate of x_a in the subspace spanned by w/‖w‖. We have now expressed the distance of x_a from the hyperplane as r, and if we choose x_a to be the point closest to the hyperplane, this distance r is the margin.
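As a small numerical check of (12.8), the following sketch (NumPy, with made-up values for w, b and x_a) computes the signed distance r = (⟨w, x_a⟩ + b)/‖w‖, which is the expression obtained in Section 12.2.2 by substituting (12.8) into the hyperplane equation, and verifies that the orthogonal projection x'_a indeed lies on the hyperplane.

    import numpy as np

    # Hypothetical hyperplane <w, x> + b = 0 and a point x_a on its positive side.
    w = np.array([3.0, 4.0])
    b = -5.0
    x_a = np.array([3.0, 4.0])

    w_unit = w / np.linalg.norm(w)
    r = (w @ x_a + b) / np.linalg.norm(w)   # signed distance of x_a from the hyperplane
    x_a_proj = x_a - r * w_unit             # orthogonal projection of x_a onto the hyperplane

    print("r =", r)
    print("f(x'_a) =", w @ x_a_proj + b)            # ~0: x'_a lies on the hyperplane
    print(np.allclose(x_a, x_a_proj + r * w_unit))  # vector addition (12.8)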
Recall that we would like the positive examples to be further than r from the hyperplane, and the negative examples to be further than distance r (in the negative direction) from the hyperplane. Analogously to the combination of (12.5) and (12.6) into (12.7), we formulate this objective as

y_n(⟨w, x_n⟩ + b) ≥ r .    (12.9)

In other words, we combine the requirements that examples are at least r away from the hyperplane (in the positive and negative direction) into one single inequality.
Since we are interested only in the direction, we add an assumption to our model that the parameter vector w is of unit length, i.e., ‖w‖ = 1, where we use the Euclidean norm ‖w‖ = √(w^⊤ w) (Section 3.1). (We will see other choices of inner products (Section 3.2) in Section 12.4.) This assumption also allows a more intuitive interpretation of the distance r (12.8), since it is the scaling factor of a vector of length 1.

Remark. A reader familiar with other presentations of the margin would notice that our definition of ‖w‖ = 1 is different from the standard presentation of the SVM provided by Schölkopf and Smola (2002), for example. In Section 12.2.3, we will show the equivalence of both approaches. ♦
Figure 12.5 Derivation of the margin: r = 1/‖w‖. (The figure shows the hyperplanes ⟨w, x⟩ + b = 0 and ⟨w, x⟩ + b = 1, the closest example x_a, and its projection x'_a.)

Collecting the three requirements into a single constrained optimization problem, we obtain the objective

max_{w,b,r}  r    (margin)
subject to  y_n(⟨w, x_n⟩ + b) ≥ r  (data fitting) ,  ‖w‖ = 1  (normalization) ,  r > 0 ,    (12.10)

which says that we want to maximize the margin r, while ensuring that the data lies on the correct side of the hyperplane.

Remark. The concept of the margin turns out to be highly pervasive in machine learning. It was used by Vladimir Vapnik and Alexey Chervonenkis to show that when the margin is large, the “complexity” of the function class is low, and, hence, learning is possible (Vapnik, 2000). It turns out that the concept is useful for various different approaches for theoretically analyzing generalization error (Shalev-Shwartz and Ben-David, 2014; Steinwart and Christmann, 2008). ♦

12.2.2 Traditional Derivation of the Margin

In the previous section, we derived (12.10) by making the observation that we are only interested in the direction of w and not its length, leading to the assumption that ‖w‖ = 1. In this section, we derive the margin maximization problem by making a different assumption. Instead of choosing that the parameter vector is normalised, we choose a scale for the data. We choose this scale such that the value of the predictor ⟨w, x⟩ + b is 1 at the closest example. (Recall that we currently consider linearly separable data.) Let us also denote the example in the dataset that is closest to the hyperplane by x_a.

Figure 12.5 is identical to Figure 12.4, except that now we rescaled the axes, such that the example x_a lies exactly on the margin, i.e., ⟨w, x_a⟩ + b = 1. Since x'_a is the orthogonal projection of x_a onto the hyperplane, it must by definition lie on the hyperplane, i.e.,

⟨w, x'_a⟩ + b = 0 .    (12.11)


By substituting (12.8) into (12.11) we obtain

⟨w, x_a − r w/‖w‖⟩ + b = 0 .    (12.12)

Exploiting the bilinearity of the inner product (see Section 3.2), we get

⟨w, x_a⟩ + b − r ⟨w, w⟩/‖w‖ = 0 .    (12.13)

Observe that the first term is 1 by our assumption of scale, i.e., ⟨w, x_a⟩ + b = 1. From (3.16) in Section 3.1 we know that ⟨w, w⟩ = ‖w‖². Hence, the second term reduces to r‖w‖. Using these simplifications, we obtain

r = 1/‖w‖ .    (12.14)

This means we derived the distance r in terms of the normal vector w of the hyperplane. (We can also think of the distance as the projection error incurred when projecting x_a onto the hyperplane.) At first glance this equation is counterintuitive, as we seem to have derived the distance from the hyperplane in terms of the length of the vector w, but we do not yet know this vector. One way to think about it is to consider the distance r to be a temporary variable that we only use for this derivation. Therefore, for the rest of this section we will denote the distance to the hyperplane by 1/‖w‖. In Section 12.2.3, we will see that the choice that the margin equals 1 is equivalent to our previous assumption of ‖w‖ = 1 in Section 12.2.1.
Similar to the argument to obtain (12.9), we want the positive and negative examples to be at least 1 away from the hyperplane, which yields the condition

y_n(⟨w, x_n⟩ + b) ≥ 1 .    (12.15)

Combining the margin maximization with the fact that examples need to be on the correct side of the hyperplane (based on their labels) gives us

max_{w,b}  1/‖w‖    (12.16)
subject to  y_n(⟨w, x_n⟩ + b) ≥ 1  for all  n = 1, . . . , N .    (12.17)

Instead of maximizing the reciprocal of the norm as in (12.16), we often minimize the squared norm. We also often include a constant 1/2 that does not affect the optimal w, b but yields a tidier form when we compute the gradient. (The squared norm results in a convex quadratic programming problem for the SVM; see Section 12.5.) Then, our objective becomes

min_{w,b}  (1/2)‖w‖²    (12.18)
subject to  y_n(⟨w, x_n⟩ + b) ≥ 1  for all  n = 1, . . . , N .    (12.19)
Equation (12.18) is known as the hard margin SVM. The reason for the expression “hard” is because the above formulation does not allow for any violations of the margin condition. We will see in Section 12.2.4 that this “hard” condition can be relaxed to accommodate violations if the data is not linearly separable.
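Because (12.18)-(12.19) is a convex quadratic program (see Section 12.5), it can be handed to a generic convex solver. The following is a minimal sketch using the CVXPY library and a tiny made-up, linearly separable dataset; the library choice and the data are illustrative assumptions, not part of the text.

    import cvxpy as cp
    import numpy as np

    # Made-up, linearly separable 2D data with labels in {+1, -1}.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
                  [-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])
    y = np.array([+1, +1, +1, -1, -1, -1])

    w = cp.Variable(2)
    b = cp.Variable()

    # Hard margin SVM (12.18)-(12.19): minimize 0.5 ||w||^2
    # subject to y_n (<w, x_n> + b) >= 1 for all n.
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

    print("w =", w.value, " b =", b.value)
    print("margin r = 1/||w|| =", 1.0 / np.linalg.norm(w.value))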

12.2.3 Why we can set the Margin to 1

In Section 12.2.1, we argued that we would like to maximize some value r, which represents the distance of the closest example to the hyperplane. In Section 12.2.2, we scaled the data such that the closest example is of distance 1 to the hyperplane. In this section, we relate the two derivations, and show that they are equivalent.
Theorem 12.1. Maximizing the margin r, where we consider normalized weights as in (12.10),

max_{w,b,r}  r    (margin)
subject to  y_n(⟨w, x_n⟩ + b) ≥ r  (data fitting) ,  ‖w‖ = 1  (normalization) ,  r > 0 ,    (12.20)

is equivalent to scaling the data, such that the margin is unity:

min_{w,b}  (1/2)‖w‖²    (margin)
subject to  y_n(⟨w, x_n⟩ + b) ≥ 1  (data fitting) .    (12.21)

Proof. Consider (12.20). Since the square is a strictly monotonic transformation for non-negative arguments, the maximum stays the same if we consider r² in the objective. Since ‖w‖ = 1, we can reparameterize the equation with a new weight vector w′ that is not normalized by explicitly using w′/‖w′‖. We obtain

max_{w′,b,r}  r²
subject to  y_n(⟨w′/‖w′‖, x_n⟩ + b) ≥ r ,  r > 0 .    (12.22)

Equation (12.22) explicitly states that the distance r is positive. (Note that r > 0 because we assumed linear separability, and hence there is no issue to divide by r.) Therefore, we can divide the first constraint by r, which yields

max_{w′,b,r}  r²
subject to  y_n(⟨w′/(‖w′‖ r), x_n⟩ + b/r) ≥ 1 ,  r > 0 ,    (12.23)
where we rename the parameters to w″ := w′/(‖w′‖ r) and b″ := b/r. Since w″ = w′/(‖w′‖ r), rearranging for r gives

‖w″‖ = ‖w′/(‖w′‖ r)‖ = (1/r) · ‖w′‖/‖w′‖ = 1/r .    (12.24)

By substituting this result into (12.23) we obtain

max_{w″,b″}  1/‖w″‖²  subject to  y_n(⟨w″, x_n⟩ + b″) ≥ 1 .    (12.25)

The final step is to observe that maximizing 1/‖w″‖² yields the same solution as minimizing (1/2)‖w″‖², which concludes the proof of Theorem 12.1.

Figure 12.6 (left) Linearly separable data, with a large margin. (right) Non-separable data.

12.2.4 Soft Margin SVM: Geometric View

In the case where data is not linearly separable, we may wish to allow some examples to fall within the margin region, or even to be on the wrong side of the hyperplane, as illustrated in Figure 12.6. The model that allows for some classification errors is called the soft margin SVM. In this section, we derive the resulting optimization problem using geometric arguments. In Section 12.2.5, we will derive an equivalent optimization problem using the idea of a loss function. Using Lagrange multipliers (Section 7.2), we will derive the dual optimization problem of the SVM in Section 12.3. This dual optimization problem allows us to observe a third interpretation of the SVM: as a hyperplane that bisects the line between convex hulls corresponding to the positive and negative data examples (Section 12.3.2).

Figure 12.7 The soft margin SVM allows examples to be within the margin or on the wrong side of the hyperplane. The slack variable ξ measures the distance of a positive example x₊ to the positive margin hyperplane ⟨w, x⟩ + b = 1 when x₊ is on the wrong side.

The key geometric idea is to introduce a slack variable ξ_n corresponding to each example-label pair (x_n, y_n) that allows a particular example to be within the margin or even on the wrong side of the hyperplane (refer to Figure 12.7). We subtract the value of ξ_n from the margin, constraining ξ_n to be non-negative. To encourage correct classification of the samples, we add ξ_n to the objective:

min_{w,b,ξ}  (1/2)‖w‖² + C Σ_{n=1}^{N} ξ_n    (12.26a)
subject to  y_n(⟨w, x_n⟩ + b) ≥ 1 − ξ_n    (12.26b)
            ξ_n ≥ 0    (12.26c)
for n = 1, . . . , N. In contrast to the optimization problem (12.18) for the hard margin SVM, this one is called the soft margin SVM. The parameter C > 0 trades off the size of the margin and the total amount of slack that we have. This parameter is called the regularization parameter since, as we will see in the following section, the margin term in the objective function (12.26a) is a regularization term. The margin term ‖w‖² is called the regularizer, and in many books on numerical optimization the regularization parameter is the factor that multiplies this term (Section 8.1.3). This is in contrast to our formulation in this section: here a large value of C implies low regularization, as we give the slack variables larger weight, hence giving more priority to examples that do not lie on the correct side of the margin. (There are alternative parametrizations of this regularization, which is why (12.26a) is also often referred to as the C-SVM.)

Remark. In the formulation of the soft margin SVM (12.26a), w is regularized, but b is not regularized. We can see this by observing that the regularization term does not contain b. The unregularized term b complicates theoretical analysis (Steinwart and Christmann, 2008, Chapter 1) and decreases computational efficiency (Fan et al., 2008). ♦
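The earlier hard margin sketch extends directly to the soft margin SVM (12.26a)-(12.26c) by adding one slack variable per example. Again, CVXPY, the made-up dataset (which now contains one stray point per class), and the value C = 1 are illustrative assumptions rather than anything prescribed by the text.

    import cvxpy as cp
    import numpy as np

    # Made-up data that is not linearly separable.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-0.5, -0.5],
                  [-1.0, -1.0], [-2.0, -1.5], [1.5, 1.5]])
    y = np.array([+1, +1, +1, -1, -1, -1])
    N, C = len(y), 1.0

    w = cp.Variable(2)
    b = cp.Variable()
    xi = cp.Variable(N)   # one slack variable per example

    # Soft margin SVM (12.26a)-(12.26c).
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()

    print("w =", w.value, " b =", b.value)
    print("slack:", np.round(xi.value, 3))   # nonzero slack marks margin violations

Re-running the sketch with a larger value of C gives the slack variables more weight, which corresponds to the low-regularization regime discussed above.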

12.2.5 Soft Margin SVM: Loss Function View

Let us consider a different approach for deriving the SVM, following the principle of empirical risk minimization (Section 8.1). For the SVM we choose hyperplanes as the hypothesis class, that is

f(x) = ⟨w, x⟩ + b .    (12.27)


We will see in this section that the margin corresponds to the regularization term. The remaining question is: What is the loss function? In contrast to Chapter 9, where we consider regression problems (the output of the predictor is a real number), in this chapter we consider binary classification problems (the output of the predictor is one of two labels {+1, −1}). Therefore, the error/loss function for each single example-label pair needs to be appropriate for binary classification. For example, the squared loss that is used for regression (9.10b) is not suitable for binary classification.

Remark. The ideal loss function between binary labels is to count the number of mismatches between the prediction and the label. This means that for a predictor f applied to an example x_n, we compare the output f(x_n) with the label y_n. We define the loss to be zero if they match, and one if they do not match. This is denoted by 1(f(x_n) ≠ y_n) and is called the zero-one loss. Unfortunately, the zero-one loss results in a combinatorial optimization problem for finding the best parameters w, b. Combinatorial optimization problems (in contrast to continuous optimization problems discussed in Chapter 7) are in general more challenging to solve. ♦

What is the loss function corresponding to the SVM? Consider the error between the output of a predictor f(x_n) and the label y_n. The loss describes the error that is made on the training data. An equivalent way to derive (12.26a) is to use the hinge loss

ℓ(t) = max{0, 1 − t}  where  t = y f(x) = y(⟨w, x⟩ + b) .    (12.28)

If f(x) is on the correct side (based on the corresponding label y) of the hyperplane, and further than distance 1, this means that t ≥ 1 and the hinge loss returns a value of zero. If f(x) is on the correct side but too close to the hyperplane (0 < t < 1), the example x is within the margin, and the hinge loss returns a positive value. When the example is on the wrong side of the hyperplane (t < 0), the hinge loss returns an even larger value, which increases linearly. In other words, we pay a penalty as soon as we are closer to the hyperplane than the margin, even if the prediction is correct, and the penalty increases linearly. An alternative way to express the hinge loss is by considering it as two linear pieces

ℓ(t) = 0  if t ≥ 1 ,  and  ℓ(t) = 1 − t  if t < 1 ,    (12.29)

as illustrated in Figure 12.8. The loss corresponding to the hard margin SVM (12.18) is defined as

ℓ(t) = 0  if t ≥ 1 ,  and  ℓ(t) = ∞  if t < 1 .    (12.30)

This loss can be interpreted as never allowing any examples inside the margin.
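The hinge loss and the zero-one loss from the remark above are one-liners; the following sketch (NumPy, with a few arbitrary values of t) evaluates both and illustrates that the hinge loss upper-bounds the zero-one loss, as Figure 12.8 shows. Treating t = 0 as a misclassification is a convention assumed here for the zero-one loss.

    import numpy as np

    def hinge_loss(t):
        """Hinge loss (12.28)/(12.29): max{0, 1 - t} with t = y * (<w, x> + b)."""
        return np.maximum(0.0, 1.0 - t)

    def zero_one_loss(t):
        """Zero-one loss: 1 if the prediction disagrees with the label (here t <= 0), else 0."""
        return (t <= 0).astype(float)

    t = np.array([2.0, 1.0, 0.5, 0.0, -1.5])
    print("t        :", t)
    print("hinge    :", hinge_loss(t))      # [0.  0.  0.5 1.  2.5]
    print("zero-one :", zero_one_loss(t))   # [0.  0.  0.  1.  1. ]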
Figure 12.8 The hinge loss is a convex upper bound of the zero-one loss.

For a given training set {(x_1, y_1), . . . , (x_N, y_N)} we seek to minimize the total loss, while regularizing the objective with ℓ₂-regularization (see Section 8.1.3). Using the hinge loss (12.28) gives us the unconstrained optimization problem

min_{w,b}  (1/2)‖w‖² + C Σ_{n=1}^{N} max{0, 1 − y_n(⟨w, x_n⟩ + b)} .    (12.31)

The first term in (12.31) is called the regularization term or the regularizer (see Section 9.2.3), and the second term is called the loss term or the error term. Recall from Section 12.2.4 that the term (1/2)‖w‖² arises directly from the margin. In other words, margin maximization can be interpreted as regularization.

In principle, the unconstrained optimization problem in (12.31) can be directly solved with (sub-)gradient descent methods as described in Section 7.1. To see that (12.31) and (12.26a) are equivalent, observe that the hinge loss (12.28) essentially consists of two linear parts, as expressed in (12.29). Consider the hinge loss for a single example-label pair (12.28). We can equivalently replace minimization of the hinge loss over t with a minimization of a slack variable ξ with two constraints. In equation form,

min_t  max{0, 1 − t}    (12.32)

is equivalent to

min_{ξ,t}  ξ
subject to  ξ ≥ 0 ,  ξ ≥ 1 − t .    (12.33)

By substituting this expression into (12.31) and rearranging one of the constraints, we obtain exactly the soft margin SVM (12.26a).
Remark. Let us contrast our choice of the loss function in this section to the loss function for linear regression in Chapter 9. Recall from Section 9.2.1 that for finding maximum likelihood estimators, we usually minimize the negative log-likelihood. Furthermore, since the likelihood term for linear regression with Gaussian noise is Gaussian, the negative log-likelihood for each example is a squared error function. The squared error function is the loss function that is minimized when looking for the maximum likelihood solution. ♦

12.3 Dual Support Vector Machine

The description of the SVM in the previous sections, in terms of the variables w and b, is known as the primal SVM. Recall that we consider inputs x ∈ R^D with D features. Since w is of the same dimension as x, this means that the number of parameters (the dimension of w) of the optimization problem grows linearly with the number of features.

In the following, we consider an equivalent optimization problem (the so-called dual view), which is independent of the number of features. Instead, the number of parameters increases with the number of examples in the training set. We saw a similar idea appear in Chapter 10, where we expressed the learning problem in a way that does not scale with the number of features. This is useful for problems where we have more features than the number of examples in the training dataset. The dual SVM also has the additional advantage that it easily allows kernels to be applied, as we shall see at the end of this chapter. The word “dual” appears often in mathematical literature, and in this particular case it refers to convex duality. The following subsections are essentially an application of convex duality, which we discussed in Section 7.2.

12.3.1 Convex Duality via Lagrange Multipliers

Recall the primal soft margin SVM (12.26a). We call the variables w, b and ξ corresponding to the primal SVM the primal variables. We use α_n ≥ 0 as the Lagrange multiplier corresponding to the constraint (12.26b) that the examples are classified correctly, and γ_n ≥ 0 as the Lagrange multiplier corresponding to the non-negativity constraint of the slack variable; see (12.26c). (In Chapter 7 we used λ as Lagrange multipliers. In this section we follow the notation commonly chosen in the SVM literature, and use α and γ.) The Lagrangian is then given by

L(w, b, ξ, α, γ) = (1/2)‖w‖² + C Σ_{n=1}^{N} ξ_n − Σ_{n=1}^{N} α_n (y_n(⟨w, x_n⟩ + b) − 1 + ξ_n) − Σ_{n=1}^{N} γ_n ξ_n ,    (12.34)

where the second sum corresponds to constraint (12.26b) and the third sum to constraint (12.26c). By differentiating the Lagrangian (12.34) with respect to the three primal variables w, b and ξ respectively, we obtain

∂L/∂w = w^⊤ − Σ_{n=1}^{N} α_n y_n x_n^⊤ ,    (12.35)
∂L/∂b = −Σ_{n=1}^{N} α_n y_n ,    (12.36)
∂L/∂ξ_n = C − α_n − γ_n .    (12.37)

We now find the maximum of the Lagrangian by setting each of these partial derivatives to zero. By setting (12.35) to zero we find

w = Σ_{n=1}^{N} α_n y_n x_n ,    (12.38)

which is a particular instance of the representer theorem (Kimeldorf and Wahba, 1970). Equation (12.38) states that the optimal weight vector in the primal is a linear combination of the examples x_n. Recall from Section 2.6.1 that this means that the solution of the optimization problem lies in the span of the training data. Additionally, the constraint obtained by setting (12.36) to zero implies that the optimal weight vector is an affine combination of the examples. (The representer theorem is actually a collection of theorems saying that the solution of minimizing empirical risk lies in the subspace (Section 2.4.3) defined by the examples.) The representer theorem turns out to hold for very general settings of regularized empirical risk minimization (Hofmann et al., 2008; Argyriou and Dinuzzo, 2014). The theorem has more general versions (Schölkopf et al., 2001), and necessary and sufficient conditions on its existence can be found in (Yu et al., 2013).

Remark. The representer theorem (12.38) also provides an explanation of the name Support Vector Machine. The examples x_n for which the corresponding parameters α_n = 0 do not contribute to the solution w at all. The other examples, where α_n > 0, are called support vectors since they “support” the hyperplane. ♦
By substituting the expression for w into the Lagrangian (12.34), we obtain the dual

D(ξ, α, γ) = (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} y_i y_j α_i α_j ⟨x_i, x_j⟩ − Σ_{i=1}^{N} y_i α_i ⟨Σ_{j=1}^{N} y_j α_j x_j, x_i⟩
             + C Σ_{i=1}^{N} ξ_i − b Σ_{i=1}^{N} y_i α_i + Σ_{i=1}^{N} α_i − Σ_{i=1}^{N} α_i ξ_i − Σ_{i=1}^{N} γ_i ξ_i .    (12.39)

Note that there are no longer any terms involving the primal variable w. By setting (12.36) to zero, we obtain Σ_{n=1}^{N} y_n α_n = 0. Therefore, the term involving b also vanishes. Recall that inner products are symmetric and bilinear (see Section 3.2). Therefore, the first two terms in (12.39) are over the same objects. These terms can be simplified, and we obtain the Lagrangian

D(ξ, α, γ) = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} y_i y_j α_i α_j ⟨x_i, x_j⟩ + Σ_{i=1}^{N} α_i + Σ_{i=1}^{N} (C − α_i − γ_i) ξ_i .    (12.40)

The last term in this equation is a collection of all terms that contain slack variables ξ_i. By setting (12.37) to zero, we see that the last term in (12.40) is also zero. Furthermore, by using the same equation and recalling that the Lagrange multipliers γ_i are non-negative, we conclude that α_i ≤ C. We now obtain the dual optimization problem of the SVM, which is expressed exclusively in terms of the Lagrange multipliers α_i. Recall from Lagrangian duality (Definition 7.1) that we maximize the dual problem. This is equivalent to minimizing the negative dual problem, such that we end up with the dual SVM

min_α  (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} y_i y_j α_i α_j ⟨x_i, x_j⟩ − Σ_{i=1}^{N} α_i
subject to  Σ_{i=1}^{N} y_i α_i = 0  and  0 ≤ α_i ≤ C  for all  i = 1, . . . , N .    (12.41)

The equality constraint in (12.41) is obtained from setting (12.36) to zero. The inequality constraint α_i ≥ 0 is the condition imposed on Lagrange multipliers of inequality constraints (Section 7.2). The inequality constraint α_i ≤ C is discussed in the previous paragraph.

The set of inequality constraints in the SVM are called “box constraints” because they limit the vector α = [α_1, . . . , α_N]^⊤ ∈ R^N of Lagrange multipliers to be inside the box defined by 0 and C on each axis. These axis-aligned boxes are particularly efficient to implement in numerical solvers (Dostál, 2009, Chapter 5).

Once we obtain the dual parameters α, we can recover the primal parameters w by using the representer theorem (12.38). Let us call the optimal primal parameter w*. However, there remains the question of how to obtain the parameter b*. Consider an example x_n that lies exactly on the margin's boundary, i.e., ⟨w*, x_n⟩ + b = y_n. Recall that y_n is either +1 or −1. Therefore, the only unknown is b, which can be computed by

b* = y_n − ⟨w*, x_n⟩ .    (12.42)

(It turns out that examples that lie exactly on the margin are examples whose dual parameters lie strictly inside the box constraints, 0 < α_i < C. This is derived using the Karush-Kuhn-Tucker conditions, for example in Schölkopf and Smola (2002).)

Remark. In principle, there may be no examples that lie exactly on the margin. In this case, we should compute |y_n − ⟨w*, x_n⟩| for all support vectors and take the median value of this absolute value difference to be the value of b*. A derivation of this can be found at [Link]eu/2012/06/07/the-svm-bias-term-conspiracy/. ♦
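Putting the pieces together, the following sketch solves the dual SVM (12.41) with the linear kernel for a small made-up dataset (CVXPY again; the data, C, and the numerical tolerances are illustrative assumptions), recovers w* via the representer theorem (12.38), and recovers b* via (12.42) from a dual parameter strictly inside the box constraints.

    import cvxpy as cp
    import numpy as np

    # Made-up 2D data with labels in {+1, -1}.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
                  [-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])
    y = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
    N, C = len(y), 10.0

    alpha = cp.Variable(N)
    # Dual objective (12.41): since sum_i y_i alpha_i x_i = w, the quadratic term
    # equals 0.5 ||X^T (y * alpha)||^2 for the linear kernel <x_i, x_j>.
    objective = cp.Minimize(0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)) - cp.sum(alpha))
    constraints = [y @ alpha == 0, alpha >= 0, alpha <= C]
    cp.Problem(objective, constraints).solve()

    a = alpha.value
    w_star = (a * y) @ X                                   # representer theorem (12.38)
    sv = np.where((a > 1e-5) & (a < C - 1e-5))[0][0]       # example on the margin boundary
    b_star = y[sv] - w_star @ X[sv]                        # (12.42)
    print("w* =", w_star, " b* =", b_star)
    print("support vectors (alpha_n > 0):", np.where(a > 1e-5)[0])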

Figure 12.9 Convex hulls. (a) Convex hull of points, some of which lie within the boundary. (b) Convex hulls around positive (blue) and negative (red) examples; the distance between the two convex sets is the length of the difference vector c − d.

12.3.2 Dual SVM: Convex Hull View

Another approach to obtain the dual SVM is to consider an alternative geometric argument. Consider the set of examples x_n with the same label. We would like to build a convex set that contains all the examples such that it is the smallest possible set. This is called the convex hull and is illustrated in Figure 12.9.

Let us first build some intuition about a convex combination of points. Consider two points x_1 and x_2 and corresponding non-negative weights α_1, α_2 ≥ 0 such that α_1 + α_2 = 1. The equation α_1 x_1 + α_2 x_2 describes each point on a line between x_1 and x_2. Consider what happens when we add a third point x_3 along with a weight α_3 ≥ 0 such that Σ_{n=1}^{3} α_n = 1. The convex combinations of these three points x_1, x_2, x_3 span a two-dimensional area. The convex hull of this area is the triangle formed by the edges corresponding to each pair of points. As we add more points, and the number of points becomes greater than the number of dimensions, some of the points will be inside the convex hull, as we can see in Figure 12.9(a).
In general, a convex hull can be built by introducing non-negative weights α_n ≥ 0 corresponding to each example x_n. Then the convex hull can be described as the set

conv(X) = { Σ_{n=1}^{N} α_n x_n :  Σ_{n=1}^{N} α_n = 1  and  α_n ≥ 0  for all  n = 1, . . . , N } .    (12.43)

If the two clouds of points corresponding to the positive and negative classes are separated, then the convex hulls do not overlap. Given the training data (x_1, y_1), . . . , (x_N, y_N), we form two convex hulls, corresponding to the positive and negative classes respectively.


We pick a point c, which is in the convex hull of the set of positive examples, and is closest to the negative class distribution. Similarly, we pick a point d in the convex hull of the set of negative examples, which is closest to the positive class distribution; see Figure 12.9(b). We define a difference vector between d and c as

w := c − d .    (12.44)

Picking the points c and d as above, and requiring them to be closest to each other, is the same as saying that we want to minimize the length/norm of w, so that we end up with the corresponding optimization problem

arg min_w ‖w‖ = arg min_w (1/2)‖w‖² .    (12.45)

Since c must be in the positive convex hull, it can be expressed as a convex combination of the positive examples, i.e., for non-negative coefficients α_n^+,

c = Σ_{n: y_n = +1} α_n^+ x_n .    (12.46)

In (12.46) we use the notation n : y_n = +1 to indicate the set of indices n for which y_n = +1. Similarly, for the examples with negative labels we obtain

d = Σ_{n: y_n = −1} α_n^− x_n .    (12.47)

By substituting (12.44), (12.46) and (12.47) into (12.45), we obtain the objective

min_α  (1/2) ‖ Σ_{n: y_n = +1} α_n^+ x_n − Σ_{n: y_n = −1} α_n^− x_n ‖² .    (12.48)

Let α be the set of all coefficients, i.e., the concatenation of α^+ and α^−. Recall that we require that the coefficients of each convex hull sum to one,

Σ_{n: y_n = +1} α_n^+ = 1  and  Σ_{n: y_n = −1} α_n^− = 1 .    (12.49)

This implies the constraint

Σ_{n=1}^{N} y_n α_n = 0 .    (12.50)

This result can be seen by multiplying out the individual classes:

Σ_{n=1}^{N} y_n α_n = Σ_{n: y_n = +1} (+1) α_n^+ + Σ_{n: y_n = −1} (−1) α_n^−    (12.51a)
                   = Σ_{n: y_n = +1} α_n^+ − Σ_{n: y_n = −1} α_n^− = 1 − 1 = 0 .    (12.51b)

The objective function (12.48) and the constraint (12.50), along with the assumption that α ≥ 0, give us a constrained (convex) optimization problem. This optimization problem can be shown to be the same as that of the dual hard margin SVM (Bennett and Bredensteiner, 2000a).

Remark. To obtain the soft margin dual, we consider the reduced hull. The reduced hull is similar to the convex hull but has an upper bound on the size of the coefficients α. The maximum possible value of the elements of α restricts the size that the convex hull can take. In other words, the bound on α shrinks the convex hull to a smaller volume (Bennett and Bredensteiner, 2000b). ♦
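To illustrate the convex hull view, the following sketch (CVXPY, with made-up separable data) solves (12.45) under the parametrization (12.46)-(12.49): it finds the closest points c and d of the two convex hulls, and w = c − d then gives the direction of the normal of the separating hyperplane.

    import cvxpy as cp
    import numpy as np

    # Made-up linearly separable data, split by class.
    X_pos = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0]])
    X_neg = np.array([[-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])

    a_pos = cp.Variable(len(X_pos))   # convex combination weights alpha^+ for the positive hull
    a_neg = cp.Variable(len(X_neg))   # convex combination weights alpha^- for the negative hull

    c = X_pos.T @ a_pos               # point in the positive convex hull, (12.46)
    d = X_neg.T @ a_neg               # point in the negative convex hull, (12.47)

    # Minimize 0.5 ||c - d||^2 subject to each set of weights being a convex combination.
    problem = cp.Problem(
        cp.Minimize(0.5 * cp.sum_squares(c - d)),
        [cp.sum(a_pos) == 1, cp.sum(a_neg) == 1, a_pos >= 0, a_neg >= 0],
    )
    problem.solve()

    w = (c - d).value                 # difference vector (12.44)
    print("c =", c.value, " d =", d.value)
    print("w = c - d =", w, "  distance between hulls:", np.linalg.norm(w))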

12.4 Kernels

Consider the formulation of the dual SVM (12.41). Notice that the inner product in the objective occurs only between examples x_i and x_j. There are no inner products between the examples and the parameters. Therefore, if we consider a set of features φ(x_i) to represent x_i, the only change in the dual SVM will be to replace the inner product. This modularity, where the choice of the classification method (the SVM) and the choice of the feature representation φ(x) can be considered separately, provides flexibility for us to explore the two problems independently. In this section we discuss the representation φ(x) and briefly introduce the idea of kernels, but do not go into the technical details.

Since φ(x) could be a non-linear function, we can use the SVM (which assumes a linear classifier) to construct classifiers that are nonlinear in the examples x_n. This provides a second avenue, in addition to the soft margin, for users to deal with a dataset that is not linearly separable. It turns out that there are many algorithms and statistical methods which have this property that we observed in the dual SVM: the only inner products are those that occur between examples. Instead of explicitly defining a non-linear feature map φ(·) and computing the resulting inner product between examples x_i and x_j, we define a similarity function k(x_i, x_j) between x_i and x_j. For a certain class of similarity functions, called kernels, the similarity function implicitly defines a non-linear feature map φ(·). Kernels are by definition functions k : X × X → R for which there exists a Hilbert space H and φ : X → H a feature map such that

k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩_H .    (12.52)

(The inputs X of the kernel function can be very general, and are not necessarily restricted to R^D.) There is a unique reproducing kernel Hilbert space associated with every kernel k (Aronszajn, 1950; Berlinet and Thomas-Agnan, 2004). In this unique association, φ(x) = k(·, x) is called the canonical feature map.

The generalization from an inner product to a kernel function (12.52) is known as the kernel trick (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004), as it hides away the explicit non-linear feature map.

The matrix K ∈ R^{N×N}, resulting from the inner products or the application of k(·, ·) to a dataset, is called the Gram matrix, and is often just referred to as the kernel matrix. Kernels must be symmetric and positive semi-definite functions, so that every kernel matrix K is symmetric and positive semi-definite (Section 3.2.3):

z^⊤ K z ≥ 0  for all  z ∈ R^N .    (12.53)

Some popular examples of kernels for multivariate real-valued data x_i ∈ R^D are the polynomial kernel, the Gaussian radial basis function kernel, and the rational quadratic kernel (Schölkopf and Smola, 2002; Rasmussen and Williams, 2006). Figure 12.10 illustrates the effect of different kernels on separating hyperplanes on an example dataset. Note that we are still solving for hyperplanes, that is, the hypothesis class of functions is still linear. The non-linear surfaces are due to the kernel function.
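The following sketch builds the Gram matrix of a Gaussian RBF kernel for a few made-up points and checks the properties stated around (12.53); the kernel parameter gamma = 0.5 is an arbitrary illustrative choice.

    import numpy as np

    def rbf_kernel(x_i, x_j, gamma=0.5):
        """Gaussian RBF kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
        return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

    # Made-up data; the Gram (kernel) matrix has entries K_ij = k(x_i, x_j).
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
    N = len(X)
    K = np.array([[rbf_kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

    print("symmetric:", np.allclose(K, K.T))
    print("eigenvalues nonnegative:", np.all(np.linalg.eigvalsh(K) >= -1e-10))   # (12.53)

    z = np.random.default_rng(0).standard_normal(N)
    print("z^T K z =", z @ K @ z)   # nonnegative for any z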
Remark. Unfortunately for the fledgling machine learner, there are multiple meanings of the word kernel. In this chapter, the word kernel comes from the idea of the Reproducing Kernel Hilbert Space (RKHS) (Aronszajn, 1950; Saitoh, 1988). We have discussed the idea of the kernel in linear algebra (Section 2.7.3), where the kernel is another word for the null space. The third common use of the word kernel in machine learning is the smoothing kernel in kernel density estimation (Section 11.5). ♦

Since the explicit representation φ(x) is mathematically equivalent to the kernel representation k(x_i, x_j), a practitioner will often design the kernel function such that it can be computed more efficiently than the inner product between explicit feature maps. For example, consider the polynomial kernel (Schölkopf and Smola, 2002), where the number of terms in the explicit expansion grows very quickly (even for polynomials of low degree) when the input dimension is large. The kernel function only requires one multiplication per input dimension, which can provide significant computational savings. Another example is the Gaussian radial basis function kernel (Schölkopf and Smola, 2002; Rasmussen and Williams, 2006), where the corresponding feature space is infinite dimensional. In this case, we cannot explicitly represent the feature space, but can still compute similarities between a pair of examples using the kernel. (The choice of kernel, as well as the parameters of the kernel, are often chosen using nested cross-validation (Section 8.5.2).)

Another useful aspect of the kernel trick is that there is no need for the original data to be already represented as multivariate real-valued data. Note that the inner product is defined on the output of the function φ(·), but does not restrict the input to real numbers. Hence, the function φ(·) and the kernel function k(·, ·) can be defined on any object, e.g., sets, sequences, strings, graphs and distributions (Ben-Hur et al., 2008; Gärtner, 2008; Shi et al., 2009; Vishwanathan et al., 2010; Sriperumbudur et al., 2010).
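To make the equivalence (12.52) concrete for one small case, the following sketch compares the degree-2 homogeneous polynomial kernel k(x, y) = ⟨x, y⟩² in 2D with its explicit feature map φ(x) = (x₁², √2 x₁x₂, x₂²); this particular kernel and feature map are a standard textbook pairing used here purely for illustration.

    import numpy as np

    def poly_kernel(x, y):
        """Homogeneous polynomial kernel of degree 2: k(x, y) = <x, y>^2."""
        return (x @ y) ** 2

    def phi(x):
        """Explicit feature map with <phi(x), phi(y)> = <x, y>^2 for 2D inputs."""
        return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    y = np.array([3.0, -1.0])
    print(poly_kernel(x, y))   # 1.0
    print(phi(x) @ phi(y))     # same value via the explicit feature map

For higher degrees and dimensions the explicit expansion has a number of terms that grows rapidly, whereas the kernel itself needs only one inner product and a power, which is the computational saving mentioned above.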


Figure 12.10 SVM with different kernels. Note that while the decision boundary is nonlinear, the underlying problem being solved is for a linear separating hyperplane (albeit with a nonlinear kernel).

12.5 Numerical Solution

We conclude our discussion of SVMs by looking at how to express the problems derived in this chapter in terms of the concepts presented in Chapter 7. We consider two different approaches for finding the optimal solution for the SVM. First, we consider the loss view of the SVM (Section 8.1.2) and express this as an unconstrained optimization problem. Then we express the constrained versions of the primal and dual SVMs as quadratic programs in standard form (Section 7.3.2).

Consider the loss function view of the SVM (12.31). This is a convex unconstrained optimization problem, but the hinge loss (12.28) is not differentiable. Therefore, we apply a subgradient approach for solving it. However, the hinge loss is differentiable almost everywhere, except for one single point at the hinge t = 1. At this point, the gradient is a set of possible values that lie between 0 and −1. Therefore, the subgradient g of the hinge loss is given by

g(t) = −1  if t < 1 ,    g(t) ∈ [−1, 0]  if t = 1 ,    g(t) = 0  if t > 1 .    (12.54)

Using this subgradient, we can apply the optimization methods presented in Section 7.1.
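A minimal subgradient descent sketch for the unconstrained objective (12.31) follows; the step size, iteration count, data and C are arbitrary illustrative choices, and at t = 1 the valid subgradient 0 is used.

    import numpy as np

    # Made-up data with labels in {+1, -1}.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-0.5, -0.5],
                  [-1.0, -1.0], [-2.0, -1.5], [1.5, 1.5]])
    y = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
    C, step, iters = 1.0, 0.01, 2000

    w, b = np.zeros(2), 0.0
    for _ in range(iters):
        t = y * (X @ w + b)            # t_n = y_n (<w, x_n> + b)
        active = t < 1                 # examples with nonzero hinge loss, subgradient -1
        # Subgradient of (12.31): w from the regularizer plus -C * sum of y_n x_n over active examples.
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= step * grad_w
        b -= step * grad_b

    objective = 0.5 * w @ w + C * np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
    print("w =", w, " b =", b, " objective =", objective)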
7329 Both the primal and the dual SVM result in a convex quadratic pro-
7330 gramming problem (constrained optimization). Note that the primal SVM
7331 in (12.26a) has optimization variables that have the size of the dimen-
7332 sion D of the input examples. The dual SVM in (12.41) has optimization
7333 variables that have the size of the number N of examples.
To express the primal SVM in the standard form (7.45) for quadratic
programming, let us assume that we use the dot product (3.5) as the
inner product. We rearrange the equation for the primal SVM (12.26a), Recall from
such that the optimization variables are all on the right and the inequality Section 3.2 that in
this book, we use
of the constraint matches the standard form. This yields the optimization
the phrase dot
N product to mean the
1 X
min kwk2 + C ξn inner product on
w,b,ξ 2 Euclidean vector
n=1 (12.55) space.
−yn x>
n w − yn b − ξn 6 −1
subject to
−ξn 6 0
n = 1, . . . , N . By concatenating the variables w, b, xn into a single vector,
and carefully collecting the terms, we obtain the following matrix form of
the soft margin SVM.
 >    
w   w
> w
1  ID 0D,N +1   
min b b + 0D+1,1 C1N,1  b 
w,b,ξ 2 ξ 0N +1,D 0N +1,N +1
ξ ξ
 
  w  
−Y X −y −I N   −1N,1
subject to b 6 .
0N,D+1 −I N 0N,1
ξ
(12.56)

7334 In the above optimization problem, the minimization is over [w> , b, ξ > ]> ∈
7335 RD+1+N , and we use the notation: I m to represent the identity matrix of
7336 size m × m, 0m,n to represent the matrix of zeros of size m × n, and 1m,n
7337 to represent the matrix of ones of size m × n. In addition y is the vector
7338 of labels [y1 , . . . , yN ]> , Y = diag(y) is an N by N matrix where the ele-
7339 ments of the diagonal are from y , and X ∈ RN ×D is the matrix obtained
7340 by concatenating all the examples.
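As a sanity check on (12.55)-(12.56), the following sketch assembles the block matrices with NumPy for a tiny made-up dataset and verifies, for arbitrary values of w, b and ξ, that the matrix form reproduces the objective and constraints of (12.55); all numbers are illustrative.

    import numpy as np

    # Made-up data to instantiate the matrices in (12.56).
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [0.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    N, D = X.shape
    C = 1.0
    Y = np.diag(y)

    # Constraint matrix A and right-hand side so that A [w; b; xi] <= rhs is (12.56).
    A = np.block([
        [-Y @ X, -y[:, None], -np.eye(N)],
        [np.zeros((N, D + 1)), -np.eye(N)],
    ])
    rhs = np.concatenate([-np.ones(N), np.zeros(N)])

    # Quadratic and linear parts of the objective in (12.56).
    Q = np.block([
        [np.eye(D), np.zeros((D, N + 1))],
        [np.zeros((N + 1, D)), np.zeros((N + 1, N + 1))],
    ])
    c = np.concatenate([np.zeros(D + 1), C * np.ones(N)])

    # Check against (12.55) for arbitrary w, b, xi.
    w, b, xi = np.array([0.3, -0.2]), 0.1, np.full(N, 0.5)
    v = np.concatenate([w, [b], xi])
    print(np.isclose(0.5 * v @ Q @ v + c @ v, 0.5 * w @ w + C * xi.sum()))
    print(np.allclose(A @ v, np.concatenate([-(y * (X @ w + b)) - xi, -xi])))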
We can similarly perform a collection of terms for the dual version of the SVM (12.41). To express the dual SVM in standard form, we first have to express the kernel matrix K such that each entry is K_{ij} = k(x_i, x_j), or, if we are using an explicit feature representation, K_{ij} = ⟨x_i, x_j⟩. For convenience of notation we introduce a matrix with zeros everywhere except on the diagonal, where we store the labels, that is, Y = diag(y). The dual SVM can be written as

min_α  (1/2) α^⊤ Y K Y α − 1_{N,1}^⊤ α    (12.57)
subject to  [y^⊤; −y^⊤; −I_N; I_N] α ≤ [0_{N+2,1}; C 1_{N,1}] .    (12.58)

Remark. In Sections 7.3.1 and 7.3.2 we introduced the standard forms of the constraints to be inequality constraints. We express the dual SVM's equality constraint as two inequality constraints, i.e.,

Ax = b  is replaced by  Ax ≤ b  and  Ax ≥ b .    (12.59)

Particular software implementations of convex optimization methods may provide the ability to express equality constraints. ♦
Since there are many different possible views of the SVM, there are many approaches for solving the resulting optimization problem. The approach presented here, expressing the SVM problem in standard convex optimization form, is not often used in practice. The two main implementations of SVM solvers are Chang and Lin (2011) (which is open source) and Joachims (1999). Since SVMs have a clear and well-defined optimization problem, many approaches based on numerical optimization techniques (Nocedal and Wright, 2006) can be applied (Shawe-Taylor and Sun, 2011).

12.6 Further Reading

The SVM is one of many approaches for studying binary classification. Other approaches include the perceptron, logistic regression, Fisher discriminant, nearest neighbor, naive Bayes, and random forest (Bishop, 2006; Murphy, 2012). A short tutorial on SVMs and kernels on discrete sequences can be found in Ben-Hur et al. (2008). The development of SVMs is closely linked to empirical risk minimization, discussed in Section 8.1. Hence, the SVM has strong theoretical properties (Vapnik, 2000; Steinwart and Christmann, 2008). The book about kernel methods (Schölkopf and Smola, 2002) includes many details of support vector machines and how to optimize them. A broader book about kernel methods (Shawe-Taylor and Cristianini, 2004) also includes many linear algebra approaches for different machine learning problems.
An alternative derivation of the dual SVM can be obtained using the idea of the Legendre-Fenchel transform (Section 7.3.3). The derivation considers each term of the unconstrained formulation of the SVM (12.31) separately and calculates their convex conjugates (Rifkin and Lippert, 2007). Readers interested in the functional analysis view (also the regularization methods view) of SVMs are referred to the work by Wahba (1990). Theoretical expositions of kernels (Manton and Amblard, 2015; Aronszajn, 1950; Schwartz, 1964; Saitoh, 1988) require a basic grounding in linear operators (Akhiezer and Glazman, 1993). The idea of kernels has been generalized to Banach spaces (Zhang et al., 2009) and Kreĭn spaces (Ong et al., 2004; Loosli et al., 2016).

Observe that the hinge loss has three equivalent representations, as shown in (12.28) and (12.29), as well as the constrained optimization problem in (12.33). The formulation (12.28) is often used when comparing the SVM loss function with other loss functions (Steinwart, 2007). The two-piece formulation (12.29) is convenient for computing subgradients, as each piece is linear. The third formulation (12.33), as seen in Section 12.5, enables the use of convex quadratic programming (Section 7.3.2) tools.

Since binary classification is a well-studied task in machine learning, other words are also sometimes used, such as discrimination, separation or decision. Furthermore, there are three quantities that can be the output of a binary classifier. First is the output of the linear function itself (often called the score), which can take any real value. This output can be used for ranking the examples, and binary classification can be thought of as picking a threshold on the ranked examples (Shawe-Taylor and Cristianini, 2004). The second quantity that is often considered the output of a binary classifier is obtained after the output is passed through a non-linear function to constrain its value to a bounded range, for example the interval [0, 1]. A common non-linear function is the sigmoid function (Bishop, 2006). When the non-linearity results in well-calibrated probabilities (Gneiting and Raftery, 2007; Reid and Williamson, 2011), this is called class probability estimation. The third output of a binary classifier is the final binary decision {+1, −1}, which is the one most commonly assumed to be the output of the classifier.

The SVM is a binary classifier that does not naturally lend itself to a probabilistic interpretation. There are several approaches for converting the raw output of the linear function (the score) into a calibrated class probability estimate P(Y = 1 | X = x), which involve an additional calibration step (Platt, 2000; Lin et al., 2007; Zadrozny and Elkan, 2001). From the training perspective, there are many related probabilistic approaches. We mentioned at the end of Section 12.2.5 that there is a relationship between the loss function and the likelihood (also compare Section 8.1 and Section 8.2). The maximum likelihood approach corresponding to a well-calibrated transformation during training is called logistic regression, which comes from a class of methods called generalized linear models. Details of logistic regression from this point of view can be found in Agresti (2002, Chapter 5) and McCullagh and Nelder (1989, Chapter 4).
Naturally, one could take a more Bayesian view of the classifier output by estimating a posterior distribution using Bayesian logistic regression. The Bayesian view also includes the specification of the prior, which includes design choices such as conjugacy (Section 6.6.1) with the likelihood. Additionally, one could consider latent functions as priors, which results in Gaussian process classification (Rasmussen and Williams, 2006, Chapter 3).
