Lecture 6
High dimensional representation
Outline
1. Motivation
2. Kernel method
Transformation between distance and similarity measure
Kernel similarity measurement
Kernel and reproducing kernel Hilbert space (RKHS)
Kernel trick
Kernel functions
Motivation
• The performance of machine learning methods is heavily influenced by the form of data representation on which they are applied.
• The PCA method is based on keeping only the eigenvectors that encode the most variation in the data.
• However, PCA is not always good enough for learning a representation of the data.
• PCA is a purely second-order representation, whereas much of the information in natural images lies in higher-order statistics of the data.
• A revolution in pattern analysis occurred with the introduction of kernel-based learning methods.
Kernel Method
• Transformation between distance and similarity measure
• The measure of distance is an important routine in data processing and analysis.
• One of the most widely used dissimilarity measures is the Euclidean distance. It is defined as the L2 norm (square root of the vector inner product) of the difference of two vectors or two points:
  $d(x_i, x_j) = \|x_i - x_j\|_2 = \sqrt{\langle x_i - x_j,\, x_i - x_j \rangle}$
• If the similarity is interpreted as a covariance, then the Euclidean distance can be rewritten in terms of a similarity matrix:
  $d^2(x_i, x_j) = \langle x_i, x_i \rangle - 2\langle x_i, x_j \rangle + \langle x_j, x_j \rangle = s_{ii} - 2 s_{ij} + s_{jj}$
• So a concept called the kernel arises, which can be considered as a transformation between a distance matrix and a similarity matrix (a small numerical sketch follows below).
• If the covariance (similarity) is of the form $s_{ij} = k(x_i, x_j)$ for some kernel function $k$, the kernel concept becomes a basis for a number of algorithms in machine learning.
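As a minimal illustration (my own sketch, not from the lecture itself, assuming NumPy is available), the following Python snippet converts a similarity (Gram) matrix built from inner products into the corresponding matrix of squared Euclidean distances:

```python
import numpy as np

def similarity_to_distance(S):
    """Convert a similarity (Gram) matrix S, with S[i, j] = <x_i, x_j>,
    into the matrix of squared Euclidean distances:
    d^2(x_i, x_j) = S[i, i] - 2 * S[i, j] + S[j, j]."""
    diag = np.diag(S)
    return diag[:, None] - 2.0 * S + diag[None, :]

# Example: build the Gram matrix from raw points and recover the distances.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
S = X @ X.T                      # similarity as plain inner products
D2 = similarity_to_distance(S)   # squared Euclidean distances
print(np.allclose(D2, ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)))  # True
```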
• Kernel similarity measurement
The kernel method comes up with a different idea for similarity measurement: the kernel calculates distance in the space of transformed features.
Given a transformation $\phi$ that maps the data from the original feature space to some higher-dimensional feature space, the kernel is defined on the mapped features.
As shown in the figure, $\phi$ takes points $x_i$ and $x_j$ and maps them into Gaussians centered on $x_i$ and $x_j$, respectively.
Figure: Graphical illustration of the feature space of the Gaussian kernel.
• The kernel is the same as a dot product of the mapped features: $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
• The kernel gives a large value if the two inputs are similar and, in contrast, a low value if the inputs are dissimilar.
• Distance in the transformed feature space is computed as follows (see the sketch below):
  $d_\phi^2(x_i, x_j) = \|\phi(x_i) - \phi(x_j)\|^2 = k(x_i, x_i) - 2\, k(x_i, x_j) + k(x_j, x_j)$
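A minimal sketch (my own illustration, assuming a Gaussian kernel with bandwidth sigma) of how this feature-space distance can be evaluated purely through kernel calls:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||xi - xj||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def kernel_distance_sq(xi, xj, kernel):
    """Squared distance between phi(xi) and phi(xj) in the (implicit) feature
    space, computed only from kernel evaluations:
    k(xi, xi) - 2 * k(xi, xj) + k(xj, xj)."""
    return kernel(xi, xi) - 2.0 * kernel(xi, xj) + kernel(xj, xj)

xi = np.array([0.0, 1.0])
xj = np.array([2.0, 0.5])
print(kernel_distance_sq(xi, xj, gaussian_kernel))  # lies in [0, 2] for the Gaussian kernel
```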
• Kernel and reproducing kernel Hilbert space (RKHS)
Riesz’ representation theorem
• Riesz’ representation theorem tells us that whenever there is a continuous linear functional $f$ on a Hilbert space $H$, it can be represented as a dot product with some element of $H$.
• $H$ can be defined as a complete inner product space.
• Formally, the theorem states that there is an element $g \in H$ such that $f$ can be written as:
  $f(x) = \langle x, g \rangle_H \quad \text{for all } x \in H$
• Using Riesz’ representation theorem, a Hilbert space $H$ of functions is a reproducing kernel Hilbert space (RKHS) with kernel $k$ if the reproducing property holds:
  $f(x) = \langle f, k(\cdot, x) \rangle_H \quad \text{for all } f \in H$
• Given a kernel $k$, one can construct the RKHS as the completion of the space of functions spanned by the set $\{k(\cdot, x) : x \in \mathcal{X}\}$, with an inner product defined as follows:
  $\left\langle \sum_i \alpha_i k(\cdot, x_i),\ \sum_j \beta_j k(\cdot, x_j) \right\rangle = \sum_{i,j} \alpha_i \beta_j\, k(x_i, x_j)$
• Note that $\langle k(\cdot, x_i), k(\cdot, x_j) \rangle = k(x_i, x_j)$, i.e., the feature map can be taken to be $\phi(x) = k(\cdot, x)$.
• Testing that this defines an inner product amounts to checking the following conditions:
1. Symmetry: $k(x_i, x_j) = k(x_j, x_i)$
2. Positive definiteness: $\sum_{i,j} c_i c_j\, k(x_i, x_j) \geq 0$, since it is a dot product of a vector with itself, which is $\geq 0$.
Summary
• So as long as we define a kernel function and construct the kernel matrix, and that kernel is positive definite,
• it means we can find a mapping $\phi$ such that it is possible to rewrite the kernel function in terms of an inner product of the mapped features.
• Conversely, for every RKHS there exists an associated reproducing kernel which is symmetric and positive definite (PD).
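As an illustration (my own sketch, not part of the lecture), these two conditions can be checked numerically for a given kernel matrix, assuming NumPy; positive semi-definiteness is tested here via the eigenvalues of the symmetric Gram matrix:

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-10):
    """Check that a Gram matrix K is symmetric and positive semi-definite."""
    symmetric = np.allclose(K, K.T)
    # Eigenvalues of a symmetric matrix are real; PSD means none is (significantly) negative.
    psd = symmetric and np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

# Gram matrix of a Gaussian kernel on a few random points.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)
print(is_valid_kernel_matrix(K))  # True
```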
• Kernel trick
If $\phi(x)$ is extremely high dimensional, constructing the kernel explicitly would need:
• computing the extremely high-dimensional feature vectors $\phi(x_i)$ and $\phi(x_j)$,
• then computing the inner products in the feature space, which seems computationally inefficient and very expensive.
However, by using the kernel trick, all that is needed is:
• evaluating the kernel $k(x_i, x_j)$ and knowing that there exists a map $\phi$ and an inner product such that $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
• The evaluation of the kernel function is much easier than the computation of the feature transformation followed by the inner product computation.
Example
• The basic idea of the kernel trick is given in the following example. The example shows that the inner products in the feature space can be evaluated implicitly in the input space.
• Assume there is a transformation mapping the original 2D features to a higher, three-dimensional set of features:
  $\phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$
• Explicitly, on the order of $n^2$ operations are needed just to compute the mapped features of a degree-2 map on n-dimensional inputs; then only O(n) is needed to compute the kernel, which equals the inner product in the feature space:
  $\langle \phi(x_i), \phi(x_j) \rangle = x_{i1}^2 x_{j1}^2 + 2\, x_{i1} x_{i2} x_{j1} x_{j2} + x_{i2}^2 x_{j2}^2 = (x_{i1} x_{j1})^2 + 2 (x_{i1} x_{j1})(x_{i2} x_{j2}) + (x_{i2} x_{j2})^2 = \langle x_i, x_j \rangle^2$
• The terms $x_{i1} x_{j1}$ and $x_{i2} x_{j2}$ are the components of the dot product taken in the input space.
• The expression $\langle x_i, x_j \rangle^2$ is the dot product taken in the input space raised to the power of 2.
• So, just take the inner product between $x_i$ and $x_j$, which is O(n), then square it, and the kernel function is computed (as verified in the sketch below).
• The above example examined only the 2D case, but the n-dimensional case is just a generalization of the 2D case.
• So you have n-dimensional vectors $x_i$ and $x_j$; calculate the dot product (a single number) and raise it to the power of r. This is a simple summation operation, and it does not matter whether r is small or large, since the computational complexity stays the same.
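A small Python check (my own sketch, assuming the degree-2 map written above) confirming that the kernel trick and the explicit feature map agree:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2D inputs: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(xi, xj):
    """Kernel trick: square the input-space inner product, O(n) work."""
    return float(np.dot(xi, xj)) ** 2

xi = np.array([1.0, 3.0])
xj = np.array([2.0, -1.0])
print(np.dot(phi(xi), phi(xj)))  # explicit inner product in the feature space
print(poly2_kernel(xi, xj))      # same value via the kernel trick
```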
• Kernel functions
• The typical kernel functions that express the similarity between $x_i$ and $x_j$ are:
• Linear: $k(x_i, x_j) = \langle x_i, x_j \rangle$ (i.e., there is no transformation; it just computes the inner product of the two input vectors).
• Polynomial: $k(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^r$
• Just compute the inner product of the two vectors and raise it to the power r, where r is the parameter that specifies the maximum degree of the polynomial function.
• Sigmoid kernel: $k(x_i, x_j) = \tanh(\hbar \langle x_i, x_j \rangle + \theta)$, where ћ and θ are the steepness and offset parameters, respectively.
• Laplacian radial basis function: $k(x_i, x_j) = \exp(-\|x_i - x_j\| / \sigma)$
• Gaussian radial basis function: $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$
• The Gaussian RBF is considered one of the preferred kernel functions; it computes a Gaussian of the squared distance between $x_i$ and $x_j$.
• It takes the points and maps them into Gaussian functions centered on the points $x_i$ and $x_j$.
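The following Python sketch (my own, assuming NumPy and the parameter names r, h for ћ, theta for θ, and sigma for σ used above) implements these kernel functions:

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, r=2):
    # Inner product raised to the power r (plus 1 so that r is the maximum degree).
    return (np.dot(xi, xj) + 1.0) ** r

def sigmoid_kernel(xi, xj, h=1.0, theta=0.0):
    # h is the steepness parameter, theta the offset.
    return np.tanh(h * np.dot(xi, xj) + theta)

def laplacian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) / sigma)

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, sigmoid_kernel, laplacian_kernel, gaussian_kernel):
    print(k.__name__, k(xi, xj))
```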
Application of kernel
Support Vector Machine
• This section addresses one of the main applications of the kernel method, which explains why the kernel method can be useful. The SVM is a powerful classification algorithm that looks for a decision surface separating two groups of data points.
• The kernel method first appeared in the form of the SVM, one of the most powerful binary classification algorithms.
• For non-linearly separable data, the kernel method helps researchers build an efficient linear SVM classifier in a high-dimensional feature space.
Figure 1: The idea of the kernel-based SVM classifier (left: linearly separable data; right: non-linearly separable data).
• Figure 1 shows training data sets for which it seems impossible to use a linear separator; thus a more complicated nonlinear classifier (a curve instead of a line) is needed.
• Applying the kernel trick is a way to create kernel-based SVM classifiers, and this allows the algorithm to separate the data points using a hyperplane in a transformed, higher-dimensional (e.g., 3D) feature space.
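As a minimal illustration (my own sketch, assuming scikit-learn is available), an RBF-kernel SVM can separate data that is not linearly separable in the input space, such as two concentric circles:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM struggles here, while the Gaussian (RBF) kernel separates the classes
# by implicitly working in a higher-dimensional feature space.
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))
```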
Projects
• Group 6
Handwriting Recognition