Kernelized Perceptron
Maria-Florina Balcan
09/12/2018
Quick Recap about Perceptron and Margins
The Online Learning Model
• Examples arrive sequentially.
• We need to make a prediction; afterwards we observe the outcome.
For i = 1, 2, …:
  Example x_i arrives; we predict a label for it, then observe c*(x_i).
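A minimal Python sketch of this protocol (the names run_online, predict, and update are illustrative, not from the lecture):

def run_online(stream, predict, update):
    # stream yields (x_i, c*(x_i)) pairs; predict/update come from the learner.
    mistakes = 0
    for x_i, label in stream:       # examples arrive sequentially
        y_hat = predict(x_i)        # commit to a prediction first
        if y_hat != label:          # only afterwards observe c*(x_i)
            mistakes += 1
        update(x_i, label)          # the learner may adjust its state
    return mistakes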
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the
distance from x to the plane w ⋅ x = 0.
(Figure: positively and negatively labeled points around the separator, with the margins of individual examples, e.g., of example x_2, shown as distances to it.)
(Normalized margin: multiplying all points by 100, or dividing all points by 100,
doesn’t change the number of mistakes; algo is invariant to scaling.)
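For concreteness, a small Python sketch of this distance, using the standard point-to-hyperplane formula |w ⋅ x| / ||w|| (the function name is mine):

import numpy as np

def geometric_margin(x, w):
    # Distance from x to the plane w . x = 0: |w . x| / ||w||.
    return abs(np.dot(w, x)) / np.linalg.norm(w)

x, w = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(geometric_margin(x, w))         # margin of x
print(geometric_margin(100 * x, w))   # scaling all points rescales every margin
                                      # by the same factor, so the mistake
                                      # count of the algorithm is unaffected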
So far, talked about margins in the context of (nearly) linearly separable datasets.
What if the Data Is Not Linearly Separable?
Problem: data not linearly separable in the most natural feature representation.
Example: two classes of images with no good linear separator in the pixel representation.
Solutions:
• “Learn a more complex class of functions” (e.g., decision trees, neural networks, boosting).
• Use a kernel (the approach developed in the rest of this lecture).
Definition
K(⋅,⋅) is a kernel if it can be viewed as a legal definition of an inner product:
• ∃ ϕ: X → R^N s.t. K(x, z) = ϕ(x) ⋅ ϕ(z).
• The range of ϕ is called the Φ-space.
• N can be very large.
Example: ϕ: R^2 → R^4, (x_1, x_2) ↦ Φ(x) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1), so that Φ(x) ⋅ Φ(z) = (x ⋅ z)^2 = K(x, z).
Avoid explicitly expanding the features
Feature space can grow really large and really quickly…
Crucial to think of ϕ as implicit, not explicit!
k(x, z) = (x^T z)^d = ϕ(x) ⋅ ϕ(z): an O(n) computation, with no explicit expansion of the features.
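A quick numerical check of the quadratic example above, with ϕ and the kernel written out explicitly (the function names are illustrative):

import numpy as np

def phi(x):
    # Explicit feature map from the example above: R^2 -> R^4.
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, x1 * x2, x2 * x1])

def quadratic_kernel(x, z):
    return float(np.dot(x, z)) ** 2     # O(n) work, no explicit expansion

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(np.dot(phi(x), phi(z)), quadratic_kernel(x, z))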
Kernelizing a learning algorithm
• If all computations involving instances are in terms of inner products, then:
  Conceptually, work in a very high-dimensional space; the algorithm's performance depends only on linear separability in that extended space.
  Computationally, only need to modify the algorithm by replacing each x ⋅ z with a K(x, z).
• Many algorithms fit this pattern, e.g., the Perceptron (kernelized below) and clustering with k-means.
Kernelizing the Perceptron Algorithm
• Set t = 1, start with the all-zero vector w_1.
• Given example x, predict + iff w_t ⋅ x ≥ 0.
• On a mistake, update: w_{t+1} ← w_t + x if the mistake was on a positive example, w_{t+1} ← w_t − x if it was on a negative example.
• Hence w_t is a signed sum of the (mapped) examples on which mistakes were made, so w_t ⋅ ϕ(x) can be computed using only kernel evaluations K(x_i, x).
(Figure: positive and negative examples in the Φ-space, linearly separated by w*.)
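Below is a minimal Python sketch of this kernelized Perceptron, assuming a fixed batch of ±1-labeled examples revisited over several passes; the function names are illustrative, not from the lecture.

import numpy as np

def kernel_perceptron_train(X, y, K, epochs=5):
    # Mistake-driven training: alpha[i] counts mistakes made on example x_i.
    X, y = np.asarray(X), np.asarray(y)
    m = len(X)
    alpha = np.zeros(m)
    gram = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])
    for _ in range(epochs):
        for t in range(m):
            # w_t . phi(x_t) expressed with kernel values only:
            score = np.sum(alpha * y * gram[:, t])
            pred = 1 if score >= 0 else -1    # predict + iff w_t . phi(x) >= 0
            if pred != y[t]:                  # mistake: w gains +/- phi(x_t)
                alpha[t] += 1
    return alpha

def kernel_perceptron_predict(x, X, y, alpha, K):
    score = sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X))
    return 1 if score >= 0 else -1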
Kernels: More Examples
• Linear: K(x, z) = x ⋅ z
• Polynomial: K(x, z) = (x ⋅ z)^d or K(x, z) = (1 + x ⋅ z)^d
• Gaussian: K(x, z) = exp(−||x − z||^2 / (2σ^2))
• Laplace: K(x, z) = exp(−||x − z|| / (2σ^2))
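Sketches of these kernels as Python functions (σ, d, c are hyperparameters; the names are mine):

import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, d=2, c=1.0):
    return (c + np.dot(x, z)) ** d          # use c=0 for the (x . z)^d variant

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def laplace_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))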
Characterization: K(⋅,⋅) is a kernel iff, for any finite set of points x_1, …, x_m and any a ∈ R^m, a^T K a ≥ 0; i.e., the Gram matrix K = (K(x_i, x_j))_{i,j=1,…,m} is positive semi-definite.
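This condition can be checked numerically on any finite sample; a small sketch (names are mine):

import numpy as np

def is_psd_gram(K, X, tol=1e-9):
    # Build the Gram matrix (K(x_i, x_j))_{i,j} and test that all its
    # eigenvalues are nonnegative (up to tolerance), i.e., a^T K a >= 0.
    G = np.array([[K(xi, xj) for xj in X] for xi in X])
    return np.linalg.eigvalsh(G).min() >= -tol

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))
gaussian = lambda x, z: np.exp(-np.linalg.norm(x - z) ** 2 / 2.0)
print(is_psd_gram(gaussian, pts))   # True: the Gaussian kernel passes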
Kernel Methods
Closure properties:
• If K_1 and K_2 are kernels and c_1, c_2 ≥ 0, then K(x, z) = c_1 K_1(x, z) + c_2 K_2(x, z) is a kernel.
• If K_1 and K_2 are kernels, then K(x, z) = K_1(x, z) K_2(x, z) is a kernel.
  (Proof idea for the product: write K_1(x, z) = Σ_i ϕ_{1,i}(x) ϕ_{1,i}(z) and combine with the feature map of K_2 to exhibit an explicit feature map for K_1 K_2.)
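A quick numerical illustration of both closure properties on random points (the constants and base kernels here are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(15, 2))

K1 = lambda x, z: float(np.dot(x, z))                         # linear kernel
K2 = lambda x, z: np.exp(-np.linalg.norm(x - z) ** 2 / 2.0)   # Gaussian kernel

def gram(K):
    return np.array([[K(xi, xj) for xj in pts] for xi in pts])

G_sum = 0.5 * gram(K1) + 2.0 * gram(K2)   # c1*K1 + c2*K2 with c1, c2 >= 0
G_prod = gram(K1) * gram(K2)              # entrywise product = Gram of K1*K2
for G in (G_sum, G_prod):
    assert np.linalg.eigvalsh(G).min() >= -1e-9   # PSD, as the closure rules say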
Kernels, Discussion
• If all computations involving instances are in terms of inner products, then:
  Conceptually, work in a very high-dimensional space; the algorithm's performance depends only on linear separability in that extended space.
  Computationally, only need to modify the algorithm by replacing each x ⋅ z with a K(x, z).
• This applies well beyond the Perceptron, e.g., to clustering with k-means.