IVPML Unit III
SVM
Linear SVM: A linear SVM is used for linearly separable data. If a dataset can be divided into two classes by a single straight line, the data is termed linearly separable, and the classifier used is called a linear SVM classifier.
Non-linear SVM: A non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by a straight line, the data is termed non-linear, and the classifier used is called a non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary for classifying the data points. This best boundary is known as the hyperplane of the SVM.
Support Vectors: The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Because these vectors support the hyperplane, they are called support vectors.
SVM
• SVMs are supervised learning models that analyze data, used for classification and regression analysis.
Identify the right hyperplane
• Here, we have three hyperplanes (A, B and C). Now, identify the right hyperplane to classify star and circle.
• Find a, b, c such that ax + by ≥ c for red points and ax + by ≤ c (or < c) for green points.
Linear Classifiers
A linear classifier has the form f(x, w, b) = sign(w · x + b), where w is the weight vector, x is the data vector, and b is the bias; one class is denoted +1 and the other −1.
[Figures: several different straight lines can separate the +1 and −1 points, so many linear classifiers fit the same data.]
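As a quick illustration (not taken from the slides), the following MATLAB sketch evaluates such a linear classifier on two points; the weight vector, bias and test points are made-up values chosen only to show the mechanics.

% Minimal linear-classifier sketch: f(x, w, b) = sign(w·x + b)
w = [2; -1];                 % hypothetical weight vector (normal to the decision line)
b = -1;                      % hypothetical bias
f = @(x) sign(w' * x + b);   % the classifier

x_pos = [3; 1];              % example point expected on the +1 side
x_neg = [0; 2];              % example point expected on the -1 side
fprintf('f(x_pos) = %d, f(x_neg) = %d\n', f(x_pos), f(x_neg));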
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.
Why Maximum Margin?
Intuitively, the boundary that is farthest from the nearest points of both classes is the safest choice: small movements of the data points or of the boundary are least likely to cause misclassification, so the maximum margin classifier tends to generalize best.
How to calculate the distance from a point to a line?
The hyperplane is wx + b = 0, where x is the data vector, w is the normal vector, and b is a scale (bias) value.
http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
In our case w1*x1 + w2*x2 + b = 0, thus w = (w1, w2) and x = (x1, x2). The distance from a point x to the line is |w · x + b| / ||w||.
Estimate the Margin
For the hyperplane wx + b = 0 (x: data vector, w: normal vector, b: scale value), the distance of a data point x from the hyperplane is d(x) = |w · x + b| / ||w||, and the margin is twice the distance of the closest point.
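A small MATLAB sketch of this distance computation, using the same hypothetical w and b as above; the data points are arbitrary illustrative values.

% Distance from each point to the hyperplane w·x + b = 0
w = [2; -1];  b = -1;                    % hypothetical hyperplane parameters
X = [3 1; 0 2; 1 0];                     % three example points, one per row
d = abs(X * w + b) / norm(w);            % |w·x + b| / ||w|| for every point
margin_estimate = 2 * min(d);            % margin = twice the distance of the closest point
disp(d'); disp(margin_estimate);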
Use of Dot Product in SVM:
• Consider a random point X; we want to know whether it lies on the right side of the plane or the left side (positive or negative). To find this, we first treat the point as a vector X and then take a vector w that is perpendicular to the hyperplane. Let the distance from the origin to the decision boundary along w be 'c'. Now we take the projection of the vector X onto w; the projection of one vector onto another is obtained with the dot product. Hence, we take the dot product of X and w. If the dot product is greater than 'c', the point lies on the right side; if it is less than 'c', the point lies on the left side; and if it equals 'c', the point lies on the decision boundary.
Computing the margin m
For a support vector xa on the positive margin and xb on the negative margin:
w^T xa + b = +1
w^T xb + b = −1
Hyperplane: w^T x + b = 0
• Extra scale constraint: min over i = 1, …, n of |w^T xi + b| = 1
• This implies: w^T (xa − xb) = 2
• Margin: m = ||xa − xb||_2 = 2 / ||w||_2
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible
– We should maximize the margin, m
[Figure: two classes (Class 1, Class 2) separated by a boundary with margin m.]
Example: Linear SVM, Separable Case
[Figures: a separable two-class data set with the maximum-margin boundary. The direction of w must be perpendicular to the decision boundary; points on one side of it are labelled +ve and points on the other side −ve.] A sketch of fitting such a classifier follows.
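The following MATLAB sketch fits a linear SVM to a small separable data set and reads off w, b and the margin 2/||w||. It assumes the Statistics and Machine Learning Toolbox function fitcsvm is available; the data values are illustrative.

% Toy separable data: two classes on either side of x1 = 2
X = [1 0; 0 1; 0 -1; -1 0;      % class -1
     3 1; 3 -1; 6 1; 6 -1];     % class +1
Y = [-1; -1; -1; -1; 1; 1; 1; 1];

SVMModel = fitcsvm(X, Y);       % linear kernel is the default
w = SVMModel.Beta               % weight vector of the separating hyperplane (≈ [1; 0])
b = SVMModel.Bias               % bias term (≈ -2), so the boundary is approximately x1 = 2
margin = 2 / norm(w)            % geometric margin 2/||w||
label = predict(SVMModel, [4 0; 1 1])   % classify two new points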
Real-life applications of SVM
Face detection – An SVM classifies parts of an image as face or non-face and creates a square boundary around the face.
Text and hypertext categorization – SVMs support text and hypertext categorization for both inductive and transductive models. They use training data to classify documents into different categories, categorizing on the basis of the generated score and comparing it with a threshold value.
Examples:
Classification of news articles into "business" and "movies"
Classification of web pages into personal home pages and others
Classification of images – SVMs provide better search accuracy for image classification than traditional query-based searching techniques.
Bioinformatics – This includes protein classification and cancer classification. SVMs are used to classify genes, and to classify patients on the basis of their genes and other biological problems.
Protein fold and remote homology detection – SVM algorithms are applied to protein remote homology detection and DNA-binding protein identification.
Handwriting recognition – SVMs are widely used to recognize handwritten characters.
Generalized predictive control (GPC) – SVM-based GPC is used to control chaotic dynamics with useful parameters.
Importance
The idea behind SVMs is to make use of a (nonlinear) mapping function Φ that transforms data in input space to data in feature space in such a way as to render a problem linearly separable. The SVM then automatically discovers the optimal separating hyperplane (which, when mapped back into input space via Φ−1, can be a complex decision surface).
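To make the idea concrete, here is a small MATLAB sketch (not from the slides) using a made-up one-dimensional data set that is not linearly separable in input space but becomes separable after the hypothetical mapping Φ(x) = (x, x²):

% 1-D data: class +1 lies far from the origin, class -1 near it (illustrative values)
x = [-3 -2.5 2.5 3  -0.5 0 0.5 1]';
y = [ 1  1    1  1   -1 -1  -1 -1]';
% No single threshold on x separates the classes, but after mapping
% phi(x) = [x, x.^2] the classes are separable by a line in feature space.
Phi = [x, x.^2];
SVMModel = fitcsvm(Phi, y);            % linear SVM trained in the feature space
predict(SVMModel, [2 4; 0.2 0.04])     % classify the mapped points phi(2) and phi(0.2)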
Example
=101 =311 = 3 -1 1
= 101 =311
101 101
101 2 301 4
= 3 -1 1 = 3 -1 1
10 1 3 11
301 4 9 -1 1 9
11 11
Y=wX+b
0=X-2
X=2
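A few lines of MATLAB can reproduce this small calculation (the support vectors and labels are the ones listed above):

% Augmented support vectors (last component = 1 absorbs the bias term)
S = [1  0 1;     % s1, class -1
     3  1 1;     % s2, class +1
     3 -1 1];    % s3, class +1
y = [-1; 1; 1];
alpha = (S * S') \ y;    % solve K*alpha = y, where K(i,j) = si · sj
w_tilde = S' * alpha     % = [1; 0; -2], i.e. w = (1, 0), b = -2
% Decision boundary: 1*x1 + 0*x2 - 2 = 0  ->  x1 = 2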
Activation Function
A neural network without an activation function is essentially just a linear regression model. The activation function performs the non-linear transformation of the input, making the network capable of learning and performing more complex tasks.
Linear Function
•Equation: A linear function has an equation similar to that of a straight line, i.e. y = ax.
•No matter how many layers we have, if all of them are linear in nature, the final activation of the last layer is nothing but a linear function of the input of the first layer.
•Range: −inf to +inf
•Uses: The linear activation function is used in just one place, i.e. the output layer.
Sigmoid Function
•It is a function which is plotted as an 'S'-shaped graph.
•Equation: A = 1/(1 + e^(−x))
•Nature: Non-linear. Notice that for x values between −2 and 2 the curve is very steep; small changes in x bring about large changes in the value of Y in this region.
•Value Range: 0 to 1
For example, e^(−5) = 2.718281828^(−5) ≈ 0.00674 and e^(10) = 2.718281828^(10) ≈ 22026.466.
Tanh Function
•Equation: f(x) = tanh(x) = 2/(1 + e^(−2x)) − 1, or tanh(x) = 2 * sigmoid(2x) − 1
•Value Range: −1 to +1
•Nature: non-linear
SIGN/SIGNUM Function
The signum function outputs +1 for positive input and −1 for negative input (and 0 at exactly 0); it is the thresholding used in f(x, w, b) = sign(w · x + b) above.
RELU
•RELU stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of a neural network.
•Equation: A(x) = max(0, x). It gives an output of x if x is positive and 0 otherwise.
•Value Range: [0, inf)
•Nature: non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.
Leaky RELU
Leaky ReLU is a variant of ReLU with a small non-zero slope for negative inputs, commonly A(x) = x for x > 0 and A(x) = 0.01·x otherwise, so that negative inputs are not completely zeroed out.
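A compact MATLAB sketch of these activation functions; the 0.01 slope for leaky ReLU is the commonly used default, assumed here.

% Element-wise activation functions as anonymous functions
sigmoid    = @(x) 1 ./ (1 + exp(-x));
tanh_fn    = @(x) tanh(x);                 % equivalently 2*sigmoid(2*x) - 1
relu       = @(x) max(0, x);
leaky_relu = @(x) max(0.01 * x, x);        % small slope 0.01 for negative inputs
signum     = @(x) sign(x);

x = -3:1:3;
disp([x; sigmoid(x); relu(x); leaky_relu(x)])   % compare the outputs side by side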
Softmax
Softmax Function: The softmax function is also a type of sigmoid function but is handy when we are trying to handle classification problems.
•Nature: non-linear
•Uses: Usually used when trying to handle multiple classes. The softmax function squeezes the output for each class to between 0 and 1 and also divides by the sum of the outputs.
•Output: The softmax function is ideally used in the output layer of the classifier, where we are actually trying to obtain the probabilities that define the class of each input.
Useful for finding the most probable output with respect to the other outputs. The softmax activation function is commonly used as an activation function for multi-class classification problems in machine learning. The output of the softmax is interpreted as the probability that the input belongs to each class.
Softmax
Let's consider a neural network that classifies a given image as cat, dog, tiger, or none. Let X be the feature vector (i.e. X = [x1, x2, x3, x4]).
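For instance, if the network's final layer produced the (made-up) scores below for the four classes, softmax converts them into probabilities that sum to 1:

softmax = @(z) exp(z) ./ sum(exp(z));
z = [3.2 1.3 0.2 0.8];          % hypothetical raw scores for [cat dog tiger none]
p = softmax(z)                  % ≈ [0.78 0.12 0.04 0.07]; highest probability -> "cat"
sum(p)                          % = 1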
• X1 = [4.7, 1.3];               % a new observation with two features
• label1 = predict(SVMModel, X1) % SVMModel: a previously trained binary SVM (e.g. from fitcsvm)
• X2 = [5.2, 2.1];
• label2 = predict(SVMModel, X2) % predicted class label for the second observation
SVM Classifier
Speeded Up Robust Features (SURF) are used to find feature descriptors, followed by K-means to obtain a visual vocabulary. The number of K-means clusters is the size of our visual vocabulary and the size of our features.
bag = bagOfFeatures(imds) returns a bag of visual features. imds is an ImageDatastore object. By default, SURF features are used to generate the vocabulary, and the vocabulary is quantized using the K-means algorithm.
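A sketch of how this pipeline is typically wired together in MATLAB (Computer Vision Toolbox); the folder path is a placeholder, and the choice of trainImageCategoryClassifier as the classifier-training step is an assumption, not something specified in the notes.

% Images organised in one sub-folder per category (placeholder path)
imds = imageDatastore('imageFolder', 'IncludeSubfolders', true, ...
                      'LabelSource', 'foldernames');
bag = bagOfFeatures(imds);                             % SURF features + K-means vocabulary
classifier = trainImageCategoryClassifier(imds, bag);  % SVM on the bag-of-features encoding
% Classify one image from the datastore
img = readimage(imds, 1);
[labelIdx, score] = predict(classifier, img);
classifier.Labels(labelIdx)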
Clustering
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
• More informally: finding natural groupings among objects.
Clustering is subjective
Examples
• Cluster customers based on their purchase histories
• Cluster products based on the sets of customers who purchased them
• Cluster documents based on similar words
• Regions of images
Aspects of clustering
• A clustering algorithm
– Partitional clustering
– Hierarchical clustering
–…
• A distance (similarity, or dissimilarity) function
• Clustering quality
– Inter-clusters distance maximized
– Intra-clusters distance minimized
• The quality of a clustering result depends on
the algorithm, the distance function, and the
application.
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.]
Similarity
The quality or state of being similar; likeness; resemblance;
as, a similarity of features. Similarity is hard to define, but
we know it when we see it.
Similarity Measures
The most popular way to evaluate a similarity measure is the use of distance measures. Distances are normally used to measure the similarity or dissimilarity between two data objects. Common choices are:
• Manhattan
• Euclidean
• Chebychev
• Cosine distance
• Minkowski Distance
For x = (x1, x2, …, xn) and y = (y1, y2, …, yn):
d(x, y) = ( |x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p )^(1/p),  p > 0
– p = 1: Manhattan (city block) distance
d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
– p = 2: Euclidean distance
d(x, y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2 )
– Weighted Euclidean distance
dist(x, y) = sqrt( w1·(x1 − y1)^2 + w2·(x2 − y2)^2 + … + wr·(xr − yr)^2 )
• Chebychev distance: used when one wants to define two data points as "different" if they differ on any one of the attributes:
d(x, y) = max over i of |xi − yi|
• Cosine distance
d(x, y) = 1 − cos(x, y)
– Property: 0 ≤ d(x, y) ≤ 2
– Nonmetric vector objects: keywords in documents, gene features in micro-arrays, …
– Applications: information retrieval, biologic taxonomy, ...
Manhattan Distance between two points (x1, y1) and (x2, y2) is:
|x1 – x2| + |y1 – y2|
[Figure: four points p1, p2, p3, p4 plotted on a grid.]
• Example: Manhattan and Euclidean distances
Point coordinates:
point   A   B
p1      0   2
p2      2   0
p3      3   1
p4      5   1
Manhattan distance, d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|:
L1      p1      p2      p3      p4
p1      0       4       4       6
p2      4       0       2       4
p3      4       2       0       2
p4      6       4       2       0
Euclidean distance, d(x, y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2):
L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
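These matrices can be reproduced in MATLAB with pdist/squareform (Statistics and Machine Learning Toolbox):

P = [0 2; 2 0; 3 1; 5 1];                 % rows are p1..p4
L1 = squareform(pdist(P, 'cityblock'))    % Manhattan distance matrix
L2 = squareform(pdist(P, 'euclidean'))    % Euclidean distance matrix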
• Example: Cosine measure
cos(x, y) = (x1·y1 + … + xn·yn) / ( sqrt(x1^2 + … + xn^2) · sqrt(y1^2 + … + yn^2) )
d(x, y) = 1 − cos(x, y)
x1 = (3, 2, 0, 5, 2, 0, 0),  x2 = (1, 0, 0, 0, 1, 0, 2)
x1 · x2 = 3·1 + 2·0 + 0·0 + 5·0 + 2·1 + 0·0 + 0·2 = 5
||x1|| = sqrt(3^2 + 2^2 + 0^2 + 5^2 + 2^2 + 0^2 + 0^2) = sqrt(42) ≈ 6.48
||x2|| = sqrt(1^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2) = sqrt(6) ≈ 2.45
cos(x1, x2) = 5 / (6.48 × 2.45) ≈ 0.32
d(x1, x2) = 1 − cos(x1, x2) = 1 − 0.32 = 0.68
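The same numbers can be checked in a couple of MATLAB lines:

x1 = [3 2 0 5 2 0 0];  x2 = [1 0 0 0 1 0 2];
cosSim  = dot(x1, x2) / (norm(x1) * norm(x2))   % ≈ 0.32
cosDist = 1 - cosSim                            % ≈ 0.68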
Distance Measures
• Distance for Binary Features
– For binary features, their value can be converted into 1 or 0.
– Contingency table for binary feature vectors x and y: it counts how many positions have each of the value pairs (1,1), (1,0), (0,1) and (0,0).
[Figure: example data grouped into clusters 0, 1 and 3.]
Clustering Techniques
• Partitioning-based
• Hierarchical-based
• Density-based
Example: divide the sample {1, 2, 3, 4} into 2 clusters.
Manual calculations: start with initial centroids C1 = 1 and C2 = 2.
Point 3: |3 − 1| = 2, |3 − 2| = 1, so assign 3 to cluster 2 and update C2 = (2 + 3)/2 = 2.5.
Point 4: |4 − 1| = 3, |4 − 2.5| = 1.5, so assign 4 to cluster 2 and update C2 = (2 + 3 + 4)/3 = 3.
Result: cluster 1 = {1} with centroid C1 = 1, cluster 2 = {2, 3, 4} with centroid C2 = 3.
Partitioning-based clustering as exemplified by the approach in the k-means algorithm
Goal: partition N instances into k clusters.
Steps of the algorithm (a code sketch follows the list):
1. Select k instances and allocate these as initial means (centroids, prototypes)
2. Calculate the distance (typically Euclidean) from each instance to all the centroids
3. Associate each instance with the closest mean (centroid, prototype)
4. Let the resulting subsets of instances constitute the initial clusters
5. Create new means (centroids, prototypes) as the centroid of all instances in each cluster
6. Recalculate and reallocate all instances. An instance can change cluster when the centroids are recomputed.
7. Reiterate from 4 until the centroids remain stable.
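A minimal MATLAB sketch of these steps, assuming the Statistics and Machine Learning Toolbox (for pdist2); the data matrix and k are made-up illustrative values.

% Minimal k-means following the steps above (Euclidean distance)
X = [1 1; 1.5 2; 3 4; 5 7; 3.5 5; 4.5 5; 3.5 4.5];   % N-by-2 example data
k = 2;
rng(1);                                    % reproducible choice of initial centroids
C = X(randperm(size(X, 1), k), :);         % step 1: k instances as initial centroids
for iter = 1:100
    D = pdist2(X, C);                      % step 2: distance from every instance to every centroid
    [~, idx] = min(D, [], 2);              % steps 3-4: assign each instance to its closest centroid
    Cnew = C;
    for j = 1:k                            % step 5: new centroid = mean of the instances in the cluster
        if any(idx == j)
            Cnew(j, :) = mean(X(idx == j, :), 1);
        end
    end
    if isequal(Cnew, C), break; end        % step 7: stop when the centroids remain stable
    C = Cnew;                              % step 6: reallocate in the next iteration
end
idx'          % cluster assignment of each instance
C             % final centroids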
Partitional Clustering
• Nonhierarchical, each instance is placed in
exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of
clusters K.
K-means Clustering: Steps 1–5
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figures: five snapshots of k-means on data plotted as expression in condition 1 vs. expression in condition 2. The three centroids k1, k2, k3 are placed, each point is assigned to its nearest centroid, the centroids are moved to the mean of their clusters, and the assignment/update steps repeat until the clusters no longer change.]
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing
and genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex
shapes
Hierarchical-Based Clustering
Two approaches: divisive and agglomerative.
• Divisive (top-down): all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Hierarchical clustering
Hierarchical clustering is a clustering technique which seeks to build a hierarchy of clusters. The results of hierarchical clustering are usually presented in a dendrogram.
The proximity of two clusters is the average of the distances between the instances in the two clusters. A proximity matrix for clusters can be calculated from a distance matrix for the instances. The proximity matrix is recalculated in each step of the algorithm.
Agglomerative
Initially consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters (it is a bottom-up method). At first every data point is considered an individual entity or cluster. At every iteration, clusters merge with other clusters until one cluster is formed.
Algorithm for Agglomerative Hierarchical Clustering:
1. Consider every data point as an individual cluster
2. Calculate the similarity of each cluster with all the other clusters (calculate the proximity matrix)
3. Merge the clusters which are highly similar or close to each other
4. Recalculate the proximity matrix for each cluster
5. Repeat steps 3 and 4 until only a single cluster remains
What is a natural grouping among these objects?
(How-to) Hierarchical Clustering
The number of dendrograms with n leafs = (2n − 3)! / (2^(n−2) (n − 2)!)
Number of Leafs    Number of Possible Dendrograms
2                  1
3                  3
4                  15
5                  105
...                ...
10                 34,459,425
Since we cannot test all possible trees, we will have to do a heuristic search over the possible trees. We could do this bottom-up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
At each level of the bottom-up process: consider all possible merges and choose the best; repeat until everything is merged.
Step 1. Start from the raw data, the coordinates of the six points:
        x       y
p1      0.40    0.53
p2      0.22    0.38
p3      0.35    0.32
p4      0.26    0.19
p5      0.08    0.41
p6      0.45    0.30
Step 2. Calculate the distance from each object (point) to all other
points, using Euclidean distance measure, and place the numbers in a
distance matrix.
Step 3. Identify the two clusters with the shortest distance in the matrix and merge them. Re-compute the distance matrix, as those two clusters are now a single cluster and no longer exist by themselves.
By looking at the distance matrix, we see that p3 and p6 have the smallest distance of all: 0.11. So we merge those two into a single cluster and re-compute the distance matrix.
Since we have merged (p3, p6) into one cluster, we now have one entry for (p3, p6) in the table and no longer have p3 or p6 separately. Therefore, we need to re-compute the distance from each point to our new cluster (p3, p6), using the minimum (single-linkage) distance:
dist( (p3, p6), p1 ) = MIN( dist(p3, p1), dist(p6, p1) ) = MIN( 0.22, 0.23 ) = 0.22
dist( (p3, p6), p2 ) = MIN( dist(p3, p2), dist(p6, p2) ) = MIN( 0.15, 0.25 ) = 0.15
dist( (p3, p6), p4 ) = MIN( dist(p3, p4), dist(p6, p4) ) = MIN( 0.15, 0.22 ) = 0.15
dist( (p3, p6), p5 ) = MIN( dist(p3, p5), dist(p6, p5) ) = MIN( 0.28, 0.39 ) = 0.28
dist( (p3, p6), (p2, p5) ) = MIN( dist(p3, p2), dist(p6, p2), dist(p3, p5), dist(p6, p5) ) = MIN( 0.15, 0.25, 0.28, 0.39 ) = 0.15
dist( (p2, p5), p1 ) = MIN( dist(p2, p1), dist(p5, p1) ) = MIN( 0.24, 0.34 ) = 0.24
dist( (p2, p5), p4 ) = MIN( dist(p2, p4), dist(p5, p4) ) = MIN( 0.2, 0.29 ) = 0.2
dist( ((p2, p5), (p3, p6)), p1 ) = MIN( dist((p2, p5), p1), dist((p3, p6), p1) ) = MIN( 0.24, 0.22 ) = 0.22
dist( ((p2, p5), (p3, p6)), p4 ) = MIN( dist((p2, p5), p4), dist((p3, p6), p4) ) = MIN( 0.2, 0.15 ) = 0.15
dist( (p2, p5, p3, p6, p4), p1 ) = 0.22
We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is not obvious.
•Single linkage (nearest neighbor): In this method the distance between two
clusters is determined by the distance of the two closest objects (nearest
neighbors) in the different clusters.
•Complete linkage (furthest neighbor): In this method, the distances between
clusters are determined by the greatest distance between any two objects in the
different clusters (i.e., by the "furthest neighbors").
•Group average linkage: In this method, the distance between two clusters is
calculated as the average distance between all pairs of objects in the two different
clusters.
•Ward's linkage: In this method, we try to minimize the variance of the merged clusters.
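The worked example above can be reproduced in MATLAB with pdist/linkage/dendrogram (Statistics and Machine Learning Toolbox); 'single' selects single linkage (nearest neighbour).

P = [0.40 0.53; 0.22 0.38; 0.35 0.32; 0.26 0.19; 0.08 0.41; 0.45 0.30];  % p1..p6
D = pdist(P, 'euclidean');       % pairwise distances (step 2)
Z = linkage(D, 'single');        % agglomerative merging with single linkage (steps 3-5)
dendrogram(Z, 'Labels', {'p1','p2','p3','p4','p5','p6'})   % visualise the hierarchy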
Density-based clustering
Density-based clustering is a clustering technique which groups together instances
that are closely packed together (instances with many nearby neighbors), marking as
outliers instances that lie alone in low-density regions (whose nearest neighbors are
far away).
Properties of algorithms:
• Clusters are dense regions in the instance space, separated by regions of lower instance density
• A cluster is defined as a set of connected instances with maximal density
• Does not need a predefined target value for the number of clusters, but needs definitions of thresholds for reachability and density
• Discovers clusters of arbitrary shape
• Is insensitive to noise
Examples of algorithms:
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
Density-based clustering as exemplified by the approach in DBSCAN
Instances are classified as core instances, reachable instances or outliers.
• A core instance has a minimum number of instances within a threshold radius.
• An instance is density reachable from another instance if it is within a threshold radius of a core instance.
• An instance is density connected to another instance if both instances are density reachable from a third instance or if they are directly density reachable from each other.
• All instances not reachable from any other instance are considered outliers (possibly noise).
• If p is a core instance, then it forms a cluster together with all instances that are reachable from it. Each cluster contains at least one core instance; non-core points can be part of a cluster, but they form its "edge".
• All points within the cluster are mutually density-connected.
• If a point is density-reachable from any point of the cluster, it is part of the cluster as well.
Density-Based Spatial Clustering of Applications with Noise
•Core – This is a point that has at least m points within distance n from itself.
•Border – This is a point that has at least one Core point at a distance n.
•Noise – This is a point that is neither a Core nor a Border, and it has fewer than m points within distance n from itself.
Point A and the other red instances are core instances, because the area surrounding each of them within an ε radius contains the specified minimum of 4 points. Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.
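A minimal MATLAB sketch, assuming a release that provides the dbscan function (R2019a or later); the data, epsilon and minpts values are illustrative.

rng(2);
X = [randn(50,2)*0.3; randn(50,2)*0.3 + 3; 5*rand(10,2) - 1];  % two dense blobs plus scattered points
epsilon = 0.5;     % neighbourhood radius (the "distance n" above)
minpts  = 4;       % minimum neighbours for a core point (the "m points" above)
idx = dbscan(X, epsilon, minpts);   % cluster labels; -1 marks noise points
gscatter(X(:,1), X(:,2), idx)       % visualise the clusters and the noise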
Density-Based Clustering:
Locates regions of high density that are separated from one another
by regions of low density.