Topic 4: SUPPORT VECTOR MACHINES
STAT 37710/CAAM 37710/CMSC 35400 Machine Learning
Risi Kondor, The University of Chicago
Regularized Risk Minimization (RRM)
Find the hypothesis f̂ by solving a problem of the form

    f̂ = argmin_{f ∈ F}  (1/m) Σ_{i=1}^m ℓ(f(x_i), y_i)  +  λ Ω[f],

where the first term is the training error and λ Ω[f] is the regularizer.
• F can be quite a rich hypothesis space.
• The purpose of the regularizer is to avoid overfitting.
• λ is a tunable parameter.
• ℓ(ŷ, y) : the loss function
• ℓ might or might not be the same loss as in E_true.
[Tykhonov regularization] [Vapnik 1970’s–]
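To make the template concrete, here is a minimal numerical sketch (added for illustration, not from the slides): RRM with squared loss and Ω[f] = ∥w∥² over linear functions f(x) = w · x, i.e. ridge regression, minimized by plain gradient descent. The name rrm_ridge and all parameter values are made up for this example.

    import numpy as np

    # RRM with squared loss and L2 regularizer over f(x) = w . x (ridge regression),
    # minimized by gradient descent on (1/m) sum_i (w.x_i - y_i)^2 + lam * ||w||^2.
    def rrm_ridge(X, y, lam=0.1, lr=0.01, steps=1000):
        m, n = X.shape
        w = np.zeros(n)
        for _ in range(steps):
            residual = X @ w - y                           # f(x_i) - y_i
            grad = (2.0 / m) * X.T @ residual + 2.0 * lam * w
            w -= lr * grad                                 # step on training error + lam * ||w||^2
        return w

    # Toy usage
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
    w_hat = rrm_ridge(X, y)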
Optimization: equality constraints
Problem:
    minimize_{x ∈ R^n}  f(x)    subject to    g(x) = c.
1. Form the Lagrangian L(x, λ) = f (x) − λ (g(x) − c) .
2. The solution must be at a critical point of L. → Setting

       ∂L(x, λ)/∂x_i = 0,    i = 1, 2, . . . , n,

   yields a curve of solutions x = γ(λ).
3. Reintroducing the constraint g(γ(λ))=c gives λ , hence the optimal x .
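A quick worked example (added for illustration): minimize f(x) = x_1² + x_2² subject to x_1 + x_2 = 1. The Lagrangian is L(x, λ) = x_1² + x_2² − λ(x_1 + x_2 − 1); setting ∂L/∂x_1 = 2x_1 − λ = 0 and ∂L/∂x_2 = 2x_2 − λ = 0 gives the curve x = γ(λ) = (λ/2, λ/2). Reintroducing the constraint, λ/2 + λ/2 = 1, so λ = 1 and the optimal point is x = (1/2, 1/2).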
Optimization: inequality constraints
Problem:
    minimize_{x ∈ R^n}  f(x)    subject to    g(x) ≥ c.
1. Form the Lagrangian L(x, λ) = f (x) − λ (g(x) − c) .
2. Introduce the dual function

       h(λ) = inf_x L(x, λ).
3. Solve the dual problem

       λ∗ = argmax_λ h(λ)    subject to    λ ≥ 0.
4. The optimal x is argmin_x L(x, λ∗) (assuming strong duality).
When f is a convex function and g(x) ≥ c defines a convex region of
space, this gives the global optimum.
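A quick worked example (added for illustration): minimize f(x) = x² subject to x ≥ 1, i.e. g(x) = x and c = 1. Then L(x, λ) = x² − λ(x − 1); the infimum over x is attained at x = λ/2, so h(λ) = λ − λ²/4. Maximizing over λ ≥ 0 gives λ∗ = 2, and the minimizer of L(x, λ∗) is x = λ∗/2 = 1, which lies exactly on the constraint boundary.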
Karush–Kuhn–Tucker conditions
At the optimal solution x∗ of
    minimize_{x ∈ R^n}  f(x)    subject to    g(x) ≥ c,
either
1. we are at the boundary → g(x∗) = c, or
2. we are at an interior point → λ∗ = 0.
→ Complementary slackness: λ∗ (g(x∗ ) − c) = 0.
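Continuing the illustrative example above: with the constraint x ≥ 1, the optimum sits on the boundary, so g(x∗) = c and λ∗ = 2 > 0. Had the constraint been x ≥ −1 instead, the unconstrained minimizer x = 0 would already satisfy it, so λ∗ = 0. In both cases λ∗ (g(x∗) − c) = 0.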
Support Vector Machines
Linear classifiers
To apply RRM, go back to binary classification in R^n with a linear (affine) hyperplane:
Input space: X = R^n
Output space: Y = {−1, +1}
Hypothesis:

    f(x) = w · x + b,        h(x) = sgn(f(x)).

(Note the sneaky difference between f and h.)
Question: Of all possible hyperplanes that separate the data, which one do we choose?
The margin
Recall, the margin of a point (x, y) to the hyperplane f(x) = w · x + b = 0 (with ∥w∥ = 1) is

    y (w · x + b).

The margin of a dataset S = {(x_1, y_1), . . . , (x_m, y_m)} to f is

    min_i y_i (w · x_i + b).
In the case of the perceptron we saw that having a large margin is desirable.
IDEA: Choose w and b explicitly to maximize the margin! → Support
Vector Machines (SVM)
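A small numerical sketch of these definitions (added for illustration; dataset_margin is a made-up name):

    import numpy as np

    # Margin of each point and of the dataset S to the hyperplane w . x + b = 0,
    # after rescaling so that ||w|| = 1 (same hyperplane, unit normal).
    def dataset_margin(X, y, w, b):
        norm = np.linalg.norm(w)
        w, b = w / norm, b / norm
        point_margins = y * (X @ w + b)       # y_i (w . x_i + b)
        return point_margins.min()            # min_i y_i (w . x_i + b)

    # Toy usage
    X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
    y = np.array([+1, +1, -1, -1])
    print(dataset_margin(X, y, w=np.array([1.0, 0.0]), b=0.0))   # -> 1.0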
Maximizing the margin
Choose the hyperplane that has the largest margin!
Hard Margin Support Vector Machine
Given a dataset S = {(x_1, y_1), . . . , (x_m, y_m)},

    maximize_{∥w∥=1, b}  δ    s.t.  y_i (w · x_i + b) ≥ δ  ∀i.
Equivalent formulation: drop the ∥w∥ = 1 constraint, and instead rescale (w, b) so that the smallest margin equals 1; maximizing the margin δ = 1/∥w∥ then amounts to solving

    minimize_{w, b}  (1/2) ∥w∥²    s.t.  y_i (w · x_i + b) ≥ 1  ∀i.
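A minimal numerical sketch of this QP (added for illustration, not part of the slides), assuming the cvxpy package is available; the name hard_margin_svm is made up for this example:

    import cvxpy as cp
    import numpy as np

    # Hard margin SVM primal QP (illustrative sketch).
    def hard_margin_svm(X, y):
        m, n = X.shape
        w = cp.Variable(n)
        b = cp.Variable()
        objective = cp.Minimize(0.5 * cp.sum_squares(w))     # (1/2) ||w||^2
        constraints = [cp.multiply(y, X @ w + b) >= 1]       # y_i (w . x_i + b) >= 1
        cp.Problem(objective, constraints).solve()
        return w.value, b.value

    # Toy usage on linearly separable data
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w_hat, b_hat = hard_margin_svm(X, y)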
The primal problem
The primal SVM optimization problem
    minimize_{w,b}  (1/2) ∥w∥²    s.t.  y_i (w · x_i + b) ≥ 1  ∀i
This is a nice convex optimization problem (a QP) with a unique minimum.
→ Introduce a Lagrangian.
From primal to dual
    minimize_{w,b}  (1/2) ∥w∥²    s.t.  y_i (w · x_i + b) ≥ 1  ∀i
Lagrangian:
    L(w, b, α) = (1/2) ∥w∥² − Σ_i α_i (y_i (w · x_i + b) − 1)

    ∂L(w, b, α)/∂w_j = 0    ⇒    w − Σ_i α_i y_i x_i = 0
    ∂L(w, b, α)/∂b = 0      ⇒    Σ_i α_i y_i = 0
Dual function (obtained by substituting w = Σ_i α_i y_i x_i back into L):

    L(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
The dual problem
The dual SVM optimization problem
    maximize_{α_1,...,α_m}  L(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)

    subject to  Σ_i y_i α_i = 0   and   α_i ≥ 0  ∀i
Still a QP, but now in the m dual variables α_i rather than in (w, b), and often easier to solve. In particular,

    h(x) = sgn[ Σ_i α_i y_i (x · x_i) + b ] = sgn[ Σ_i γ_i (x · x_i) + b ],

where γ_i = y_i α_i. → The solution lies in the span of the data,

    w = Σ_i γ_i x_i.
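A numerical sketch of the dual (again assuming cvxpy; added for illustration, with hard_margin_svm_dual a made-up name). It also recovers w from the span of the data and b from any example with α_i > 0:

    import cvxpy as cp
    import numpy as np

    # Hard margin SVM dual QP (illustrative sketch).
    def hard_margin_svm_dual(X, y):
        m = X.shape[0]
        alpha = cp.Variable(m)
        # sum_i alpha_i - (1/2) || sum_i alpha_i y_i x_i ||^2, where the quadratic
        # term equals (1/2) sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)
        objective = cp.Maximize(cp.sum(alpha)
                                - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
        constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
        cp.Problem(objective, constraints).solve()

        a = alpha.value
        w = X.T @ (a * y)                      # w = sum_i alpha_i y_i x_i
        sv = np.where(a > 1e-6)[0]             # support vectors: alpha_i > 0
        b = y[sv[0]] - X[sv[0]] @ w            # from y_k (w . x_k + b) = 1, y_k in {-1,+1}
        return w, b, sv

The indices in sv are exactly the "support vectors" discussed below; all other α_i come out (numerically) zero.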
Support vector machine  [figure]
Sparsity of support vectors
The KKT conditions prescribe that

    α_i (y_i (x_i · w + b) − 1) = 0    ∀i.

So α_i ≠ 0 only for those examples that lie exactly on the margin, and therefore only these "support vectors" influence the solution

    h(x) = sgn[ Σ_i α_i y_i (x · x_i) + b ].
→ Sparsity is a precious thing.
Question: But what about non-separable data? → Soft margin SVMs
The Soft Margin SVM
The primal SVM optimization problem

    minimize_{w, b, ξ_1,...,ξ_m}  (1/2) ∥w∥² + (C/m) Σ_i ξ_i    s.t.  y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0  ∀i

The ξ_i's are called slack variables and C is a "softness parameter".
[Cortes & Vapnik, 1995]
From primal to dual
    minimize_{w, b, ξ_1,...,ξ_m}  (1/2) ∥w∥² + (C/m) Σ_i ξ_i    s.t.  y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0  ∀i
Lagrangian:

    L(w, b, ξ, α, β) = (1/2) ∥w∥² + (C/m) Σ_i ξ_i − Σ_i α_i (y_i (w · x_i + b) − 1 + ξ_i) − Σ_i β_i ξ_i

    ∂L/∂w_j = 0    ⇒    w − Σ_i α_i y_i x_i = 0
    ∂L/∂b = 0      ⇒    Σ_i α_i y_i = 0
    ∂L/∂ξ_i = 0    ⇒    α_i + β_i = C/m
Soft margin SVM dual
The dual SVM optimization problem
    maximize_{α_1,...,α_m}  L(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)

    subject to  Σ_i y_i α_i = 0   and   0 ≤ α_i ≤ C/m  ∀i
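The only change from the hard margin dual is the upper bound C/m on each α_i. In the illustrative dual sketch given earlier, this corresponds to replacing its nonnegativity constraint with a box constraint (fragment reusing that sketch's variable names):

    # Soft margin version of the dual sketch above: box-constrain alpha.
    constraints = [alpha >= 0, alpha <= C / m, cp.sum(cp.multiply(y, alpha)) == 0]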
SVM is just a form of RRM
At the optimum of the primal problem the slacks are as small as possible:

    ξ_i = max{0, 1 − y_i (w · x_i + b)} = (1 − y_i (w · x_i + b))_{≥0} = ℓ_hinge(f(x_i), y_i),

where (z)_{≥0} = max(0, z).
The soft-margin SVM finds

    f̂ = argmin_{f ∈ F}  (1/m) Σ_{i=1}^m ℓ_hinge(f(x_i), y_i)  +  (1/(2C)) ∥w∥²,

where the first term is the empirical loss, the second term is the regularizer, and F is the hypothesis space of linear functions f(x) = w · x + b.
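A minimal sketch of minimizing this objective directly by subgradient descent (added for illustration, not from the slides; svm_rrm_subgradient, the step size, and the iteration count are arbitrary choices, with lam playing the role of 1/(2C)):

    import numpy as np

    # Subgradient descent on (1/m) sum_i hinge(f(x_i), y_i) + lam * ||w||^2,
    # where f(x) = w . x + b and hinge(yhat, y) = max(0, 1 - y * yhat).
    def svm_rrm_subgradient(X, y, lam=0.01, lr=0.1, steps=2000):
        m, n = X.shape
        w, b = np.zeros(n), 0.0
        for _ in range(steps):
            margins = y * (X @ w + b)
            active = margins < 1                               # points with nonzero hinge loss
            grad_w = -(y[active, None] * X[active]).sum(0) / m + 2 * lam * w
            grad_b = -y[active].sum() / m
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b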
Loss functions for classification  [figure]
Loss functions for regression  [figure]