Kernel Methods in Machine Learning

Kernelized Perceptron

Maria-Florina Balcan
09/12/2018
Quick Recap about
Perceptron and Margins
The Online Learning Model
• Examples arrive sequentially.
• We need to make a prediction; afterwards we observe the outcome.

For i = 1, 2, … :
   Phase i: example 𝑥𝑖 arrives, the online algorithm predicts ℎ(𝑥𝑖 ),
   and then the true label 𝑐∗(𝑥𝑖 ) is observed.
Mistake bound model

• Analysis-wise, we make no distributional assumptions.

• Goal: minimize the number of mistakes.

Perceptron Algorithm in Online Model
WLOG homogeneous linear separators; instance space X = R^n.

• Set t = 1, start with the all-zero vector 𝑤1 .

• Given example 𝑥, predict + iff 𝑤𝑡 ⋅ 𝑥 ≥ 0.

• On a mistake, update as follows:

  • Mistake on positive: 𝑤𝑡+1 ← 𝑤𝑡 + 𝑥
  • Mistake on negative: 𝑤𝑡+1 ← 𝑤𝑡 − 𝑥

(Figure: positive (X) and negative (O) points separated by a hyperplane with normal vector w.)

Note 1: 𝑤𝑡 is a weighted sum of the incorrectly classified examples:

  𝑤𝑡 = 𝑎𝑖1 𝑥𝑖1 + ⋯ + 𝑎𝑖𝑘 𝑥𝑖𝑘 ,  so  𝑤𝑡 ⋅ 𝑥 = 𝑎𝑖1 (𝑥𝑖1 ⋅ 𝑥) + ⋯ + 𝑎𝑖𝑘 (𝑥𝑖𝑘 ⋅ 𝑥)

Note 2: The number of mistakes ever made depends only on the
geometric margin (amount of wiggle room) of the examples seen.

• No matter how long the sequence is or how high the dimension n is!
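
For concreteness, here is a minimal sketch of the online Perceptron above in Python (my own illustration, not code from the slides; the function name perceptron_online and the toy data are made up). Labels are assumed to be ±1 and separators homogeneous, as in the slide.

import numpy as np

def perceptron_online(stream):
    """Run the online Perceptron on a stream of (x, y) pairs, y in {-1, +1}."""
    w = None          # all-zero weight vector w_1 (created once the dimension is known)
    mistakes = 0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        if w is None:
            w = np.zeros_like(x)
        pred = 1 if w @ x >= 0 else -1   # predict + iff w_t . x >= 0
        if pred != y:
            w = w + y * x                # +x on a positive mistake, -x on a negative mistake
            mistakes += 1
    return w, mistakes

# Example: a small linearly separable toy stream
data = [(( 2.0,  1.0), +1), (( 1.0,  3.0), +1),
        ((-2.0, -1.0), -1), ((-1.0, -3.0), -1)]
w, m = perceptron_online(data)
print(w, m)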
Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear separator 𝑤 is the
distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.

If ‖𝑤‖ = 1, the margin of 𝑥 w.r.t. 𝑤 is |𝑥 ⋅ 𝑤|.

(Figure: the margins of two examples 𝑥1 and 𝑥2 relative to the separator with normal vector w.)
Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear separator 𝑤 is the
distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.

Definition: The margin 𝛾𝑤 of a set of examples 𝑆 w.r.t. a linear
separator 𝑤 is the smallest margin over points 𝑥 ∈ 𝑆.

Definition: The margin 𝛾 of a set of examples 𝑆 is the maximum
𝛾𝑤 over all linear separators 𝑤.

(Figure: a separator w with a band of width 𝛾 on each side; all + points lie on one side and all − points on the other.)
Poll time
Perceptron: Mistake Bound
Theorem: If the data are linearly separable by margin 𝛾 and all points lie inside
a ball of radius 𝑅, then Perceptron makes ≤ (𝑅/𝛾)² mistakes.

• No matter how long the sequence is or how high the dimension n is!

Margin: the amount of wiggle room available for a solution.

(Figure: + and − points inside a ball of radius R, separated by w∗ with a margin band of width 𝛾 on each side.)

(Normalized margin: multiplying all points by 100, or dividing all points by 100,
doesn’t change the number of mistakes; the algorithm is invariant to scaling.)
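
As a quick worked instance of the bound (the numbers here are my own illustration, not from the slides): if every example lies in a ball of radius 𝑅 = 10 and the data are separable with margin 𝛾 = 0.5, the theorem guarantees at most (10/0.5)² = 400 mistakes, no matter how long the sequence is or how high the dimension is. Scaling all points by 100 multiplies both 𝑅 and 𝛾 by 100, so the ratio 𝑅/𝛾, and hence the bound, is unchanged.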
So far, talked about margins in
the context of (nearly) linearly
separable datasets
What if Not Linearly Separable
Problem: data not linearly separable in the most natural
feature representation.

Example: no good linear separator in the pixel representation.
(Figure: two example images compared side by side.)

Solutions:
• “Learn a more complex class of functions”
  (e.g., decision trees, neural networks, boosting).

• “Use a Kernel” (a neat solution that attracted a lot of attention)

• “Use a Deep Network”

• “Combine Kernels and Deep Networks”
Overview of Kernel Methods
What is a Kernel?
A kernel K is a legal definition of a dot product: i.e., there exists an
implicit mapping Φ s.t. K(𝑥, 𝑦) = Φ(𝑥) ⋅ Φ(𝑦).

E.g., K(x, y) = (x ⋅ y + 1)^d

  Φ: (n-dimensional space) → (n^d)-dimensional space

Why Kernels matter?

• Many algorithms interact with data only via dot products.

• So, if we replace x ⋅ z with K(x, z), they act implicitly as if the data
  were in the higher-dimensional Φ-space.

• If the data is linearly separable by a large margin in the Φ-space,
  then good sample complexity.
Kernels

Definition
K(⋅,⋅) is a kernel if it can be viewed as a legal definition of an
inner product:
• ∃ ϕ: X → R^N s.t. K(x, z) = ϕ(x) ⋅ ϕ(z)
• The range of ϕ is called the Φ-space.
• N can be very large.

• But think of ϕ as implicit, not explicit!
Example
For n = 2, d = 2, the kernel K(x, z) = (x ⋅ z)^d corresponds to

  (𝑥1 , 𝑥2 ) → Φ(𝑥) = (𝑥1², 𝑥2², √2 𝑥1 𝑥2 )

Original space vs. Φ-space
(Figure: data that are not linearly separable in the original (x1 , x2 ) space
become linearly separable in the Φ-space.)
Example
ϕ: R² → R³ ,  (x1 , x2 ) → Φ(x) = (x1², x2², √2 x1 x2 )

ϕ(x) ⋅ ϕ(𝑧) = (x1², x2², √2 x1 x2 ) ⋅ (𝑧1², 𝑧2², √2 𝑧1 𝑧2 )

            = (x1 𝑧1 + x2 𝑧2 )² = (x ⋅ 𝑧)² = K(x, z)

Original space vs. Φ-space
(Figure: the same data shown in both spaces; linearly separable only in the Φ-space.)
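
This identity is easy to verify numerically; the snippet below is a small sketch of my own (not from the slides) comparing the explicit map Φ with the implicit kernel (x ⋅ z)² on random 2-D points.

import numpy as np

def phi(x):
    # Explicit feature map for n = 2, d = 2: (x1^2, x2^2, sqrt(2)*x1*x2)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly2_kernel(x, z):
    # Implicit version: K(x, z) = (x . z)^2
    return float(np.dot(x, z)) ** 2

rng = np.random.default_rng(0)
for _ in range(5):
    x, z = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(z), poly2_kernel(x, z))
print("Phi(x).Phi(z) == (x.z)^2 on all random test points")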
Kernels

Definition
K(⋅,⋅) is a kernel if it can be viewed as a legal definition of an
inner product:
• ∃ ϕ: X → R^N s.t. K(x, z) = ϕ(x) ⋅ ϕ(z)
• The range of ϕ is called the Φ-space.
• N can be very large.

• But think of ϕ as implicit, not explicit!

Example

Note: the feature space might not be unique.

ϕ: R² → R³ ,  (x1 , x2 ) → Φ(x) = (x1², x2², √2 x1 x2 )
ϕ(x) ⋅ ϕ(𝑧) = (x1², x2², √2 x1 x2 ) ⋅ (𝑧1², 𝑧2², √2 𝑧1 𝑧2 )
            = (x1 𝑧1 + x2 𝑧2 )² = (x ⋅ 𝑧)² = K(x, z)

ϕ: R² → R⁴ ,  (x1 , x2 ) → Φ(x) = (x1², x2², x1 x2 , x2 x1 )
ϕ(x) ⋅ ϕ(𝑧) = (x1², x2², x1 x2 , x2 x1 ) ⋅ (z1², z2², z1 z2 , z2 z1 )
            = (x ⋅ 𝑧)² = K(x, z)

Both maps realize the same kernel K(x, z) = (x ⋅ z)².
Avoid explicitly expanding the features
Feature spaces can grow really large and really quickly…
Crucial to think of ϕ as implicit, not explicit!

• Polynomial kernel of degree 𝑑:  𝑘(𝑥, 𝑧) = (𝑥⊤𝑧)^𝑑 = 𝜙(𝑥) ⋅ 𝜙(𝑧)

  – The features of 𝜙 are the monomials of degree 𝑑: 𝑥1^𝑑 , 𝑥1 𝑥2 ⋯ 𝑥𝑑 , 𝑥1² 𝑥2 ⋯ 𝑥𝑑−1 , …
  – The total number of such features is
        C(𝑑 + 𝑛 − 1, 𝑑) = (𝑑 + 𝑛 − 1)! / (𝑑! (𝑛 − 1)!)
  – For 𝑑 = 6, 𝑛 = 100 there are about 1.6 billion terms.

• But evaluating 𝑘(𝑥, 𝑧) = (𝑥⊤𝑧)^𝑑 = 𝜙(𝑥) ⋅ 𝜙(𝑧) directly is only 𝑂(𝑛) computation!
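
To put numbers on this, the short sketch below (my own illustration, not from the slides) counts the degree-d monomials for d = 6, n = 100 and contrasts that with the single n-dimensional dot product the kernel evaluation needs.

from math import comb
import numpy as np

n, d = 100, 6
num_features = comb(d + n - 1, d)     # number of degree-d monomials in n variables
print(num_features)                   # 1,609,344,100  (~1.6 billion)

# The implicit computation only needs one n-dimensional dot product:
rng = np.random.default_rng(0)
x, z = rng.normal(size=n), rng.normal(size=n)
k = float(x @ z) ** d                 # k(x, z) = (x^T z)^d, O(n) time
print(k)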
Kernelizing a learning algorithm
• If all computations involving instances are in terms of
  inner products, then:
   Conceptually, we work in a very high-dimensional space, and the algorithm's
     performance depends only on linear separability in that extended space.
   Computationally, we only need to modify the algorithm by replacing
     each x ⋅ z with a K(x, z).

• Examples of kernelizable algorithms:

  • classification: Perceptron, SVM.

  • regression: linear regression, ridge regression.

  • clustering: k-means.
Kernelizing the Perceptron Algorithm
• Set t = 1, start with the all-zero vector 𝑤1 .

• Given example 𝑥, predict + iff 𝑤𝑡 ⋅ 𝑥 ≥ 0.

• On a mistake, update as follows:

  • Mistake on positive: 𝑤𝑡+1 ← 𝑤𝑡 + 𝑥
  • Mistake on negative: 𝑤𝑡+1 ← 𝑤𝑡 − 𝑥

Easy to kernelize since 𝑤𝑡 is a weighted sum of the incorrectly
classified examples:  𝑤𝑡 = 𝑎𝑖1 𝑥𝑖1 + ⋯ + 𝑎𝑖𝑘 𝑥𝑖𝑘

Replace 𝑤𝑡 ⋅ 𝑥 = 𝑎𝑖1 (𝑥𝑖1 ⋅ 𝑥) + ⋯ + 𝑎𝑖𝑘 (𝑥𝑖𝑘 ⋅ 𝑥) with
        𝑎𝑖1 𝐾(𝑥𝑖1 , 𝑥) + ⋯ + 𝑎𝑖𝑘 𝐾(𝑥𝑖𝑘 , 𝑥)

Note: we need to store all the mistakes made so far.

Kernelizing the Perceptron Algorithm
Work implicitly in the Φ-space:

• Given 𝑥, predict + iff
      𝑎𝑖1 𝐾(𝑥𝑖1 , 𝑥) + ⋯ + 𝑎𝑖𝑡−1 𝐾(𝑥𝑖𝑡−1 , 𝑥) ≥ 0
  (equivalently, 𝑎𝑖1 𝜙(𝑥𝑖1 ) ⋅ 𝜙(𝑥) + ⋯ + 𝑎𝑖𝑡−1 𝜙(𝑥𝑖𝑡−1 ) ⋅ 𝜙(𝑥) ≥ 0).

• On the 𝑡-th mistake, update as follows:

  • Mistake on positive: set 𝑎𝑖𝑡 ← 1 and store 𝑥𝑖𝑡 .
  • Mistake on negative: set 𝑎𝑖𝑡 ← −1 and store 𝑥𝑖𝑡 .

Perceptron:  𝑤𝑡 = 𝑎𝑖1 𝑥𝑖1 + ⋯ + 𝑎𝑖𝑘 𝑥𝑖𝑘

  𝑤𝑡 ⋅ 𝑥 = 𝑎𝑖1 (𝑥𝑖1 ⋅ 𝑥) + ⋯ + 𝑎𝑖𝑘 (𝑥𝑖𝑘 ⋅ 𝑥)  →  𝑎𝑖1 𝐾(𝑥𝑖1 , 𝑥) + ⋯ + 𝑎𝑖𝑘 𝐾(𝑥𝑖𝑘 , 𝑥)

Exact same behavior/prediction rule as if we mapped the data into the
𝜙-space and ran Perceptron there!
We do this implicitly, so we get computational savings!
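
Putting the pieces together, here is a minimal sketch of the kernelized Perceptron in Python (my own illustration, not code from the slides; the names kernel_perceptron and score, and the toy circular dataset, are made up). It stores the mistake examples with their signs a_i and predicts through the kernel sum, never forming ϕ(x) explicitly. The example uses the polynomial kernel K(x, z) = (1 + x ⋅ z)², under which circularly separable data become linearly separable in the Φ-space.

import numpy as np

def kernel_perceptron(stream, K):
    """Online kernelized Perceptron. stream yields (x, y) with y in {-1, +1};
    K(x, z) is any kernel function."""
    support, alphas = [], []              # stored mistakes x_{i_j} and their signs a_{i_j}

    def score(x):
        # a_{i_1} K(x_{i_1}, x) + ... + a_{i_k} K(x_{i_k}, x)
        return sum(a * K(s, x) for a, s in zip(alphas, support))

    mistakes = 0
    for x, y in stream:
        pred = 1 if score(x) >= 0 else -1
        if pred != y:                     # on a mistake, store the example with sign y
            support.append(x)
            alphas.append(y)
            mistakes += 1
    return support, alphas, mistakes

# Example with the polynomial kernel K(x, z) = (1 + x . z)^2, on data labeled
# by a circle (not linearly separable in the original 2-D space).
K = lambda x, z: (1.0 + float(np.dot(x, z))) ** 2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
data = [(x, 1 if x @ x > 1.0 else -1) for x in X]
support, alphas, m = kernel_perceptron(data, K)
print("mistakes:", m)

As the slide notes, all mistakes must be stored, so prediction time grows with the number of mistakes made so far: each stored example contributes one kernel evaluation.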
Generalize Well if Good Margin
• If the data is linearly separable by a margin in the 𝜙-space,
  then small mistake bound.

• If the margin in the 𝜙-space is 𝛾, then Perceptron makes ≤ (𝑅/𝛾)² mistakes.

(Figure: + and − points in the Φ-space, separated by w∗ with margin 𝛾 inside a ball of radius R.)
Kernels: More Examples
• Linear: K(x, z) = x ⋅ 𝑧

• Polynomial: K(x, 𝑧) = (x ⋅ 𝑧)^d  or  K(x, 𝑧) = (1 + x ⋅ 𝑧)^d

• Gaussian: K(x, 𝑧) = exp(− ‖𝑥 − 𝑧‖² / (2 𝜎²))

• Laplace kernel: K(x, 𝑧) = exp(− ‖𝑥 − 𝑧‖ / (2 𝜎²))

• Kernels for non-vectorial data, e.g., measuring similarity
  between sequences.
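
For reference, a small sketch of my own implementing the four vector kernels above with NumPy (the function names are mine; the Laplace kernel follows the slide's ‖x − z‖ / (2σ²) parameterization):

import numpy as np

def linear(x, z):
    return float(np.dot(x, z))

def polynomial(x, z, d=2, c=1.0):
    # (c + x . z)^d ; set c = 0 for the homogeneous version (x . z)^d
    return (c + float(np.dot(x, z))) ** d

def gaussian(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2))
    return float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))

def laplace(x, z, sigma=1.0):
    # exp(-||x - z|| / (2 sigma^2)), following the slide's parameterization
    return float(np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2)))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(x, z), polynomial(x, z), gaussian(x, z), laplace(x, z))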
Properties of Kernels
Theorem (Mercer)
K is a kernel if and only if:
• K is symmetric, and
• for any set of training points 𝑥1 , 𝑥2 , … , 𝑥𝑚 and for
  any 𝑎1 , 𝑎2 , … , 𝑎𝑚 ∈ 𝑅, we have

      Σ𝑖,𝑗 𝑎𝑖 𝑎𝑗 𝐾(𝑥𝑖 , 𝑥𝑗 ) ≥ 0,   i.e.,  𝑎ᵀ𝐾𝑎 ≥ 0.

  That is, the Gram matrix 𝐾 = (𝐾(𝑥𝑖 , 𝑥𝑗 ))𝑖,𝑗=1,…,𝑚 is positive semi-definite.
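
One simple way to see this condition in practice (an illustrative sketch of my own, not from the slides) is to build the Gram matrix for a sample of points and check that its smallest eigenvalue is non-negative:

import numpy as np

def gram_matrix(K, points):
    # K_ij = K(x_i, x_j)
    m = len(points)
    return np.array([[K(points[i], points[j]) for j in range(m)] for i in range(m)])

gaussian = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
pts = [rng.normal(size=3) for _ in range(20)]
G = gram_matrix(gaussian, pts)

eigvals = np.linalg.eigvalsh(G)   # symmetric matrix -> real eigenvalues
print(eigvals.min())              # >= 0 up to numerical error, so G is PSD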
Kernel Methods

• Offer great modularity.

• No need to change the underlying learning


algorithm to accommodate a particular choice
of kernel function.
• Also, we can substitute a different algorithm
while maintaining the same kernel.
Kernel, Closure Properties

Easily create new kernels using basic ones!

Fact: If K1(⋅,⋅) and K2(⋅,⋅) are kernels and c1 ≥ 0, 𝑐2 ≥ 0,
then K(x, z) = c1 K1(x, z) + c2 K2(x, z) is a kernel.

Key idea: concatenate the 𝜙-spaces.

  ϕ(x) = ( √c1 ϕ1(x),  √c2 ϕ2(x) )
  ϕ(x) ⋅ ϕ(z) = c1 ϕ1(x) ⋅ ϕ1(z) + c2 ϕ2(x) ⋅ ϕ2(z) = c1 𝐾1(𝑥, 𝑧) + c2 𝐾2(𝑥, 𝑧)
Kernel, Closure Properties

Easily create new kernels using basic ones!

Fact: If K1(⋅,⋅) and K2(⋅,⋅) are kernels,
then K(x, z) = K1(x, z) K2(x, z) is a kernel.

Key idea: take all pairwise products of coordinates,

  ϕ(x) = ( ϕ1,i(x) ϕ2,j(x) )  for  𝑖 ∈ {1, … , 𝑛}, 𝑗 ∈ {1, … , 𝑚}

  ϕ(x) ⋅ ϕ(z) = Σ𝑖,𝑗 ϕ1,i(x) ϕ2,j(x) ϕ1,i(z) ϕ2,j(z)

              = ( Σ𝑖 ϕ1,i(x) ϕ1,i(z) ) ( Σ𝑗 ϕ2,j(x) ϕ2,j(z) )

              = K1(x, z) K2(x, z)
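
Both closure facts can be sanity-checked numerically; the sketch below (my own, with arbitrarily chosen c1, c2 and toy points) verifies that the sum and product of the linear and degree-2 polynomial kernels still give positive semi-definite Gram matrices.

import numpy as np

k1 = lambda x, z: float(np.dot(x, z))                 # linear kernel
k2 = lambda x, z: (1.0 + float(np.dot(x, z))) ** 2    # degree-2 polynomial kernel

def min_eig(K, pts):
    G = np.array([[K(a, b) for b in pts] for a in pts])   # Gram matrix
    return np.linalg.eigvalsh(G).min()

rng = np.random.default_rng(2)
pts = [rng.normal(size=4) for _ in range(15)]

k_sum  = lambda x, z: 3.0 * k1(x, z) + 0.5 * k2(x, z)  # c1 K1 + c2 K2 with c1, c2 >= 0
k_prod = lambda x, z: k1(x, z) * k2(x, z)              # K1 * K2

print(min_eig(k_sum, pts))    # >= 0 up to numerical error
print(min_eig(k_prod, pts))   # >= 0 up to numerical error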
Kernels, Discussion
• If all computations involving instances are in terms
  of inner products, then:
   Conceptually, we work in a very high-dimensional space, and the algorithm's
     performance depends only on linear separability in that extended space.
   Computationally, we only need to modify the algorithm by replacing
     each x ⋅ z with a K(x, z).

• Lots of machine learning algorithms are kernelizable:

  • classification: Perceptron, SVM.

  • regression: linear regression.

  • clustering: k-means.
Kernels, Discussion
• If all computations involving instances are in terms
  of inner products, then:
   Conceptually, we work in a very high-dimensional space, and the algorithm's
     performance depends only on linear separability in that extended space.
   Computationally, we only need to modify the algorithm by replacing
     each x ⋅ z with a K(x, z).

How to choose a kernel:

• Kernels often encode domain knowledge (e.g., string kernels).

• Use cross-validation to choose the parameters, e.g., 𝜎 for the
  Gaussian kernel K(x, 𝑧) = exp(− ‖𝑥 − 𝑧‖² / (2 𝜎²)).

• Learn a good kernel; e.g., [Lanckriet-Cristianini-Bartlett-El Ghaoui-Jordan’04].
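
As one concrete way to do the cross-validation step (a sketch assuming scikit-learn is available; it tunes an RBF-kernel SVM rather than the kernelized Perceptron, and scikit-learn parameterizes the Gaussian kernel by gamma = 1/(2σ²)):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy data: labels determined by a circle, so a Gaussian kernel fits well.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = np.where((X ** 2).sum(axis=1) > 1.0, 1, -1)

sigmas = [0.1, 0.3, 1.0, 3.0]
param_grid = {"gamma": [1.0 / (2 * s ** 2) for s in sigmas]}   # gamma = 1 / (2 sigma^2)

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)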
