IVPML Unit III

SVMs are supervised learning models that analyze data for classification and regression analysis. They work by identifying the hyperplane that maximizes the margin between two classes of data points. The hyperplane with the maximum margin is the optimal decision boundary and is known as the maximum-margin hyperplane.

SVM

•SVM builds a hyperplane, or a set of hyperplanes, in a high-dimensional space (one dimension per feature) which can be used for classification or regression.
•It is most widely used for classification.
•In two-dimensional space the hyperplane is a line that divides the plane into two classes. To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between data points of both classes.
SVM
Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train the model with many images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), so it looks at the extreme cases of cat and dog. On the basis of the support vectors, it will classify the creature as a cat. Consider the diagram below.
Types of SVM
SVM can be of two types:

Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary for classifying the data points. This best boundary is known as the hyperplane of the SVM.
Support Vectors: The data points or vectors that are closest to the hyperplane and which affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
SVM
•SVMs are supervised learning models that analyze data for classification and regression analysis.
Identify the right hyperplane
• Here, we have three hyperplanes (A, B and C). Now identify the right hyperplane to classify the stars and circles.
• Hyperplane “B” performs this job best.
SVM
Input: a set of (input, output) training pair samples; the input sample features are called x1, x2, ..., xn and the output result is y. Typically there can be many input features.

Output: a set of weights w (one wi per feature) whose linear combination predicts the value of y (as in neural networks).
In order to classify a dataset like the one above, we move from a 2-D view of the data to a 3-D view. Our hyperplane can no longer be a line; it must now be a plane, as shown above.
Support vectors

Find a, b, c such that ax + by ≥ c for red points and ax + by ≤ c for green points.

The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (we use Lagrange multipliers to get the problem into a form that can be solved analytically).
Linearly separable                Not linearly separable
Linear Classifiers

Estimation: f(x, w, b) = sign(w · x + b)
  w: weight vector, x: data vector
(Legend: one marker denotes +1, the other denotes -1.)

How would you classify this data?


Linear Classifiers
f(x, w, b) = sign(w · x + b)

Any of these separating lines would be fine... but which is best?

Classifier Margin
f(x, w, b) = sign(w · x + b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin
f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

Maximum Margin
f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (an LSVM, or linear SVM). The support vectors are those datapoints that the margin pushes up against.
Why Maximum Margin?
f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin, and the support vectors are the datapoints that the margin pushes up against. Intuitively, a boundary that is as far as possible from both classes is least sensitive to small perturbations of the data and tends to generalize better.
How to calculate the distance from a point to a line?

The decision boundary is w·x + b = 0, where
  x – data vector
  w – normal vector to the line
  b – scale (offset) value

 http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
 In our case, w1*x1 + w2*x2 + b = 0; thus w = (w1, w2) and x = (x1, x2).
Estimate the Margin

For the decision boundary w·x + b = 0 (x – data vector, w – normal vector, b – scale value), what is the distance expression for a point x to the line w·x + b = 0?

  d(x) = |w·x + b| / ||w||_2 = |w·x + b| / sqrt( Σ_{i=1..d} wi^2 )
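As a quick numerical check of this formula, the following MATLAB lines compute d(x) for one example line and point; the values of w, b and x are made up for illustration only.

% Distance from a point x to the line w*x' + b = 0, i.e. |w*x' + b| / ||w||_2.
w = [3 4];    % normal vector of the line 3*x1 + 4*x2 - 12 = 0
b = -12;      % offset
x = [1 1];    % query point

d = abs(w*x' + b) / norm(w)   % = |3 + 4 - 12| / 5 = 1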
Use of the Dot Product in SVM
Consider a random point X; we want to know whether it lies on the right side of the hyperplane or the left side (positive or negative). To find this, we first treat the point as a vector X and then take a vector w which is perpendicular to the hyperplane. Let the distance from the origin to the decision boundary along w be 'c'. Now we take the projection of the vector X onto w; we already know that the projection of one vector onto another is obtained from the dot product, so we take the dot product of the X and w vectors. If the dot product is greater than 'c', the point lies on the right side; if it is less than 'c', the point is on the left side; and if it equals 'c', the point lies on the decision boundary.
Linear Support Vector Machine (SVM)

Let xa and xb be points on the two margin planes:
  wᵀxa + b = +1
  wᵀxb + b = -1
• Hyperplane: wᵀx + b = 0
• Extra scale constraint: min over i = 1, ..., n of |wᵀxi + b| = 1

• This implies:
  wᵀ(xa - xb) = 2
  m = ||xa - xb||_2 = 2 / ||w||_2
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible.
• We should maximize the margin, m (the figure shows Class 1, Class 2 and the margin m between them).
Example: Linear SVM, Separable Case
(The series of figures shows the separable case: the direction of w must be perpendicular to the decision boundary, with the negative class on one side and the positive class on the other.)
Real life applications of SVM
Face detection – SVMs classify parts of an image as face or non-face and create a square boundary around the face.
Text and hypertext categorization – SVMs allow text and hypertext categorization for both inductive and transductive models. They use training data to classify documents into different categories. A document is categorized on the basis of the score generated, which is then compared with a threshold value.
Examples:
Classification of news articles into “business” and “movies”
Classification of web pages into personal home pages and others
Classification of images – The use of SVMs provides better search accuracy for image classification; it gives better accuracy than traditional query-based searching techniques.
Bioinformatics – This includes protein classification and cancer classification. We use SVMs to classify genes, to classify patients on the basis of their genes, and for other biological problems.
Protein fold and remote homology detection – SVM algorithms are applied for protein remote homology detection and DNA-binding protein identification.
Handwriting recognition – SVMs are widely used to recognize handwritten characters.
Generalized predictive control (GPC) – SVM-based GPC is used to control chaotic dynamics with useful parameters.
Importance
The idea behind SVMs is to make use of a (nonlinear) mapping function Φ that transforms data in input space to data in feature space in such a way as to render a problem linearly separable. The SVM then automatically discovers the optimal separating hyperplane (which, when mapped back into input space via Φ⁻¹, can be a complex decision surface).
Example

With a 1 as a bias input, discover a simple SVM that accurately discriminates the two classes.
Since the data is linearly separable, we can use a linear SVM (that is, one whose mapping function Φ() is the identity function). By inspection, it should be obvious that there are three support vectors:
  s1 = (1, 0), s2 = (3, 1), s3 = (3, -1)
Appending the bias input 1 gives the augmented vectors
  s̃1 = (1, 0, 1), s̃2 = (3, 1, 1), s̃3 = (3, -1, 1)
Now, computing the dot products results in
  s̃1·s̃1 = 2    s̃1·s̃2 = 4    s̃1·s̃3 = 4
  s̃2·s̃2 = 11   s̃2·s̃3 = 9    s̃3·s̃3 = 11
With these values the weights work out so that the decision boundary Y = wX + b reduces to 0 = X - 2, i.e. the separating line is X = 2.
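To see where the boundary X = 2 comes from, the following MATLAB sketch solves for the multipliers using the dot products above. It assumes, as the reconstructed numbers suggest, that s1 = (1, 0) is the negative example and s2 = (3, 1), s3 = (3, -1) are the positive examples; treat it as an illustrative check rather than part of the original slides.

% Augmented support vectors (bias input 1 appended) and their class labels.
S = [1  0 1;     % s1, class -1 (assumed)
     3  1 1;     % s2, class +1
     3 -1 1];    % s3, class +1
y = [-1; 1; 1];

G = S*S';             % Gram matrix of dot products: [2 4 4; 4 11 9; 4 9 11]
alpha = G\y;          % solve sum_j alpha_j (s_i . s_j) = y_i  ->  [-3.5; 0.75; 0.75]
w_tilde = S'*alpha;   % = [1; 0; -2], i.e. w = (1, 0), b = -2

% Decision boundary: w*x + b = 0  ->  x - 2 = 0  ->  X = 2, as stated above.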
Activation Function
A neural network without an activation function is essentially just a linear regression model. The activation function performs the non-linear transformation of the input that makes the network capable of learning and performing more complex tasks.
Linear Function
•Equation: a linear function has an equation similar to that of a straight line, i.e. y = ax.
•No matter how many layers we have, if all of them are linear in nature, the final activation of the last layer is nothing but a linear function of the input of the first layer.
•Range: -inf to +inf
•Uses: the linear activation function is used in just one place, i.e. the output layer.
Sigmoid Function
•It is a function which is plotted as an ‘S’-shaped graph.
•Equation:
  A = 1 / (1 + e^(-x))
•Nature: non-linear. Notice that for x values between -2 and 2 the curve is very steep; small changes in x bring about large changes in the value of Y.
•Value range: 0 to 1

  e^(-5) = 2.718281828^(-5) = 0.00674
  e^(10) = 2.718281828^(10) = 22026.46576
Tanh Function
•Equation: f(x) = tanh(x) = 2 / (1 + e^(-2x)) - 1, or tanh(x) = 2 * sigmoid(2x) - 1
•Value range: -1 to +1
•Nature: non-linear
SIGN/SIGNUM Function
RELU
•RELU :- Stands for Rectified linear unit. It is the most widely
used activation function. Chiefly implemented in hidden layers
of Neural network.
•Equation :- A(x) = max(0,x). It gives an output x if x is positive
and 0 otherwise.
•Value Range :- [0, inf)
•Nature :- non-linear, which means we can easily backpropagate the
errors and have multiple layers of neurons being activated by the
ReLU function.
Leaky RELU
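Since the leaky ReLU heading above carries no worked numbers, here is a small MATLAB sketch comparing the activations covered so far on a few sample inputs; the 0.01 negative slope used for leaky ReLU is an assumed common default, not a value taken from these notes.

% Evaluate the activation functions on a few sample inputs.
x = -3:1:3;

sigm  = 1 ./ (1 + exp(-x));   % sigmoid, range (0, 1)
tanhv = tanh(x);              % tanh, range (-1, 1); equals 2*sigmoid(2x) - 1
relu  = max(0, x);            % ReLU, range [0, inf)
leaky = max(0.01*x, x);       % leaky ReLU (assumed 0.01 negative slope)

disp([x; sigm; tanhv; relu; leaky])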
Softmax
Softmax Function :- The softmax function is also a type of
sigmoid function but is handy when we are trying to handle
classification problems.
•Nature :- non-linear
•Uses :- Usually used when trying to handle multiple classes. The
softmax function would squeeze the outputs for each class between
0 and 1 and would also divide by the sum of the outputs.
•Output:- The softmax function is ideally used in the output layer
of the classifier where we are actually trying to attain the
probabilities to define the class of each input.

Useful for finding most probable occurrence of output with respect to other outputs. The
softmax activation function is commonly used as an activation function in the case of multi-class
classification problems in machine learning. The output of the softmax is interpreted as the
Softmax
Let’s consider a neural network that classifies a given image as cat, dog, tiger, or none. Let X be the feature vector (i.e. X = [x1, x2, x3, x4]).

Suppose we input an image and obtain these output values. By observing the probability distribution, we can say that the supplied image is of a dog.
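A minimal MATLAB sketch of that softmax step is shown below; the score values for the four classes are made up for illustration.

% Softmax over illustrative scores for the classes (cat, dog, tiger, none).
z = [1.2, 3.4, 0.3, -0.5];   % assumed raw network outputs (logits)

p = exp(z) ./ sum(exp(z));   % each output in (0, 1), and they sum to 1
disp(p)                      % the largest probability (dog here) gives the prediction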
A square-law based RBF kernel eliminates the exponential term found in the Gaussian RBF kernel.
Application
• Y = label: 'versicolor', 'virginica'
• X features (versicolor):
    4.7  1.4
    4.5  1.5
    4.9  1.5
• X features (virginica):
    5.2  2.0
    5.4  2.3
    5.1  1.8
Application
load fisheriris                       % Fisher iris data: meas (features), species (labels)
inds = ~strcmp(species,'setosa');     % keep only versicolor and virginica
X = meas(inds,3:4);                   % petal length and width as features
Y = species(inds);                    % class labels
SVMModel = fitcsvm(X,Y);              % train a binary SVM classifier

X1 = [4.7,1.3];                       % point near the versicolor examples above
label1 = predict(SVMModel,X1)

X2 = [5.2,2.1];                       % point near the virginica examples above
label2 = predict(SVMModel,X2)
SVM Classifier
Speeded-Up Robust Features (SURF) are used to find feature descriptors, followed by k-means to obtain a visual vocabulary. The number of k-means clusters is the size of our visual vocabulary and the size of our features.
bag = bagOfFeatures(imds) returns a bag of visual features. imds is an ImageDatastore object. By default, SURF features are used to generate the vocabulary, and the vocabulary is quantized using the k-means algorithm.
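A minimal sketch of this bag-of-visual-words pipeline in MATLAB is given below. It assumes the Computer Vision Toolbox, a placeholder image folder 'imageFolder', and the trainImageCategoryClassifier step (which the slides do not mention explicitly), so treat it as an outline rather than the exact code behind these slides.

% Build a SURF/k-means visual vocabulary and train a category classifier on it.
imds = imageDatastore('imageFolder', ...              % placeholder folder path
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');

bag = bagOfFeatures(imds);                            % SURF features + k-means vocabulary
classifier = trainImageCategoryClassifier(imds, bag); % classifier on visual-word histograms

img = readimage(imds, 1);                             % classify one image
[labelIdx, score] = predict(classifier, img);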
Clustering
Clustering is also called unsupervised learning; it is sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
•More informally, it is finding natural groupings among objects.

Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
What is a natural grouping among these objects?

Clustering is subjective: the same set of people could be grouped as Simpson's family vs. school employees, or as females vs. males.
Clustering
•Organizing a group of objects that share similar characteristics is called clustering.
•Clustering is a process for finding similar groups in data.
•A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
•Clustering is often called unsupervised learning because there are no predefined classes.
Clustering
Three distinct groupings of data are shown (red dots, green dots and black dots); each of these groups is a cluster.
Examples
• Cluster customers based on their purchase histories
• Cluster products based on the sets of customers who purchased them
• Cluster documents based on similar words
• Regions of images
Cluster documents based on similar words
Cluster regions of images
Aspects of clustering
• A clustering algorithm
– Partitional clustering
– Hierarchical clustering
–…
• A distance (similarity or dissimilarity) function
• Clustering quality
  – Inter-cluster distance → maximized
  – Intra-cluster distance → minimized
• The quality of a clustering result depends on the algorithm, the distance function, and the application.

(Figure: inter-cluster distances are maximized between the red, green and black dot clusters, while intra-cluster distances are minimized within each cluster.)
Similarity
The quality or state of being similar; likeness; resemblance;
as, a similarity of features. Similarity is hard to define, but
we know it when we see it.
Similarity Measures
The most popular way to evaluate a similarity measure
is the use of distance measures. Distances are normally
used to measure the similarity or dissimilarity between
two data objects

Manhattan

Euclidean

Chebychev

Cosine distance
• Minkowski Distance
For x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn):

  d(x, y) = ( |x1 - y1|^p + |x2 - y2|^p + ... + |xn - yn|^p )^(1/p),  p > 0

– p = 1: Manhattan (city block) distance
  d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|

– p = 2: Euclidean distance
  d(x, y) = sqrt( |x1 - y1|^2 + |x2 - y2|^2 + ... + |xn - yn|^2 )

– Weighted Euclidean distance
  dist(x, y) = sqrt( w1 (x1 - y1)^2 + w2 (x2 - y2)^2 + ... + wr (xr - yr)^2 )

• Chebychev distance: used when one wants to define two data points as "different" if they differ on any one of the attributes.

  dist(x, y) = max( |x1 - y1|, |x2 - y2|, ..., |xr - yr| )
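The following MATLAB lines evaluate these distances for one pair of example vectors; the values are made up purely to illustrate the three formulas.

% Manhattan, Euclidean and Chebychev distances between two example vectors.
x = [0 3 4 5];
y = [7 6 3 -1];

d_manhattan = sum(abs(x - y))         % p = 1:     7 + 3 + 1 + 6 = 17
d_euclidean = sqrt(sum((x - y).^2))   % p = 2:     sqrt(49 + 9 + 1 + 36) = sqrt(95)
d_chebychev = max(abs(x - y))         % p -> inf:  7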
Properties
  d(x, y) ≥ 0
  d(x, x) = 0
  d(x, y) = d(y, x)
  d(x, y) ≤ d(x, k) + d(k, y)
• Cosine Measure (Similarity vs. Distance)
For x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn):

  cos(x, y) = (x1 y1 + ... + xn yn) / ( sqrt(x1^2 + ... + xn^2) * sqrt(y1^2 + ... + yn^2) )

  d(x, y) = 1 - cos(x, y)

– Property: 0 ≤ d(x, y) ≤ 2
– Nonmetric vector objects: keywords in documents, gene features in micro-arrays, ...
– Applications: information retrieval, biological taxonomy, ...
Manhattan distance between two points (x1, y1) and (x2, y2) is |x1 – x2| + |y1 – y2|.

• point1 = {-1, 5}; point2 = {1, 6}
• point3 = {3, 5};  point4 = {2, 3}

• The distances of {1, 6}, {3, 5}, {2, 3} from {-1, 5} are 3, 4 and 5 respectively.
• Example: Calculate Manhattan and Euclidean distances for the points below.

Data matrix:
  point   A   B
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

Manhattan distance: d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|

Distance matrix for Manhattan (L1) distance:
  L1    p1   p2   p3   p4
  p1     0    4    4    6
  p2     4    0    2    4
  p3     4    2    0    2
  p4     6    4    2    0

Euclidean distance: d(x, y) = sqrt( |x1 - y1|^2 + |x2 - y2|^2 + ... + |xn - yn|^2 )

Distance matrix for Euclidean (L2) distance:
  L2    p1      p2      p3      p4
  p1    0       2.828   3.162   5.099
  p2    2.828   0       1.414   3.162
  p3    3.162   1.414   0       2
  p4    5.099   3.162   2       0
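Both distance matrices can be reproduced with a few MATLAB lines; this sketch assumes the Statistics and Machine Learning Toolbox (for pdist and squareform).

% Reproduce the Manhattan and Euclidean distance matrices for p1..p4.
P = [0 2;    % p1
     2 0;    % p2
     3 1;    % p3
     5 1];   % p4

L1 = squareform(pdist(P, 'cityblock'))   % Manhattan distance matrix
L2 = squareform(pdist(P, 'euclidean'))   % Euclidean distance matrix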
• Example: Cosine measure
  x1 = (3, 2, 0, 5, 2, 0, 0),  x2 = (1, 0, 0, 0, 1, 0, 2)

  x1 · x2 = 3*1 + 2*0 + 0*0 + 5*0 + 2*1 + 0*0 + 0*2 = 5

  ||x1|| = sqrt(3^2 + 2^2 + 0^2 + 5^2 + 2^2 + 0^2 + 0^2) = sqrt(42) ≈ 6.48
  ||x2|| = sqrt(1^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2) = sqrt(6) ≈ 2.45

  cos(x1, x2) = 5 / (6.48 * 2.45) ≈ 0.32
  d(x1, x2) = 1 - cos(x1, x2) = 1 - 0.32 = 0.68
Distance Measures
• Distance for Binary Features
– For binary features, their values can be converted into 1 or 0.
– Contingency table for binary feature vectors x and y:

            y = 1   y = 0
  x = 1       a       b
  x = 0       c       d

  a: number of features that equal 1 for both x and y
  b: number of features that equal 1 for x but 0 for y
  c: number of features that equal 0 for x but 1 for y
  d: number of features that equal 0 for both x and y
Distance Measures
• Distance for Binary Features
– Distance for symmetric binary features
  Both states are equally valuable and carry the same weight, i.e. there is no preference on which outcome should be coded as 1 or 0 (e.g. gender).

    d(x, y) = (b + c) / (a + b + c + d)

– Distance for asymmetric binary features
  The outcomes of the states are not equally important, e.g. the positive and negative outcomes of a disease test; the rarer one is coded as 1 and the other as 0.

    d(x, y) = (b + c) / (a + b + c)

  The corresponding similarity, a / (a + b + c), is the Jaccard coefficient.
Distance Measures
• Example: Distance for binary features

  Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack   M       Y      N      P       N       N       N
  Mary   F       Y      N      P       N       P       N
  Jim    M       Y      Y      N       N       N       N

  ("Y": yes, "P": positive, "N": negative)

– Gender is a symmetric feature (less important here).
– The remaining features are asymmetric binary.
– Set the values "Y" and "P" to 1, and the value "N" to 0.
Distance Measures
• Example: Distance for binary features

  Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack   M       1      0      1       0       0       0
  Mary   F       1      0      1       0       1       0
  Jim    M       1      1      0       0       0       0

  ("Y": yes, "P": positive, "N": negative)

– Gender is a symmetric feature (less important here).
– The remaining features are asymmetric binary.
– The values "Y" and "P" are set to 1, and the value "N" to 0.

  d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
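These three distances can be verified with a short MATLAB helper; the anonymous function dAsym below is my own illustrative construction of the asymmetric binary distance, not part of the original slides.

% Asymmetric binary distance d = (b + c) / (a + b + c), excluding the
% symmetric gender attribute (feature order: Fever, Cough, Test-1..Test-4).
jack = [1 0 1 0 0 0];
mary = [1 0 1 0 1 0];
jim  = [1 1 0 0 0 0];

dAsym = @(x, y) (sum(x == 1 & y == 0) + sum(x == 0 & y == 1)) / ...
                (sum(x == 1 & y == 1) + sum(x == 1 & y == 0) + sum(x == 0 & y == 1));

d_jack_mary = dAsym(jack, mary)   % (0 + 1) / (2 + 0 + 1) = 0.33
d_jack_jim  = dAsym(jack, jim)    % (1 + 1) / (1 + 1 + 1) = 0.67
d_jim_mary  = dAsym(jim, mary)    % (1 + 2) / (1 + 1 + 2) = 0.75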
Cluster Images Based on Visual Similarity
• Feature vector
• PCA for reducing the dimensions of our feature vector
• K-Means as the clustering algorithm we're going to use
(The figure shows the resulting image groups, e.g. Cluster 0, Cluster 1, Cluster 3.)
Clustering Techniques

• Partitioning based: K-Means
• Hierarchical based: Agglomerative
• Density based: DBSCAN
Working of K-Means Algorithm
• We can understand the working of the K-Means clustering algorithm with the help of the following steps:
• Step 1 – First, specify the number of clusters, K, to be generated by the algorithm.
• Step 2 – Next, randomly select K data points and assign each data point to a cluster. In simple words, classify the data based on the number of data points.
• Step 3 – Now compute the cluster centroids.
• Step 4 – Keep iterating the following until the optimal centroids are found, i.e. until the assignment of data points to clusters no longer changes:
  – 4.1 – First, compute the sum of squared distances between the data points and the centroids.
  – 4.2 – Then assign each data point to the cluster whose centroid is closer than the other clusters' centroids.
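A minimal run of these steps in MATLAB is sketched below using the built-in kmeans function (Statistics and Machine Learning Toolbox); the data points and k = 2 are illustrative only.

% Run k-means on a small 2-D dataset.
X = [1 1; 1.5 2; 3 4; 5 7; 3.5 5; 4.5 5; 3.5 4.5];
k = 2;

[idx, C] = kmeans(X, k);   % idx: cluster index per point, C: k-by-2 centroids
disp(idx), disp(C)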
K-means Clustering
Example

a = [1 2; 3 4]; a = uint8(a); L = imsegkmeans(a,2);

L =
  1  1
  2  2

Manual calculation (treating the pixel values 1, 2, 3, 4 as 1-D points):
  Initial centroids: C1 = 1, C2 = 2 (labels so far: 1 → cluster 1, 2 → cluster 2).
  Point 3: |3 - 1| = 2, |3 - 2| = 1, so assign 3 to cluster 2; C2 = (2 + 3)/2 = 2.5.
  Point 4: |4 - 1| = 3, |4 - 2.5| = 1.5, so assign 4 to cluster 2; C2 = (2 + 3 + 4)/3 = 3.
  Final labels: 1 → 1, 2 → 2, 3 → 2, 4 → 2, matching L above.
Example
Divide the sample into 2 clusters (the figure shows the state at Step 4).

Partitioning-based clustering as exemplified by the approach in the k-means algorithm
Goal: partition N instances into k clusters.
Steps of the algorithm:
1. Select k instances and allocate these as initial means (centroids, prototypes).
2. Calculate the distance (typically Euclidean) from each instance to all the centroids.
3. Associate each instance with the closest mean (centroid, prototype).
4. Let the resulting subsets of instances constitute the initial clusters.
5. Create new means (centroids, prototypes) as the centroid of all instances in each cluster.
6. Recalculate and reallocate all instances. An instance can change cluster when the centroids are recomputed.
7. Reiterate from step 4 until the centroids remain stable.
Partitional Clustering
• Nonhierarchical, each instance is placed in
exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of
clusters K.
K-means Clustering: Steps 1–5
Algorithm: k-means; distance metric: Euclidean distance.
(The figures plot expression in condition 1 against expression in condition 2 and show, step by step, the three centroids k1, k2 and k3 being placed, the points being assigned to the nearest centroid, the centroids being recomputed, and the assignments being updated until they stabilize.)
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing
and genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex
shapes
Hierarchical Based

Hierarchical clustering is either divisive or agglomerative; it proceeds successively in an:

• Agglomerative fashion: a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

• Divisive fashion: a top-down approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Hierarchical clustering
Hierarchical clustering is a clustering technique which seeks to build a hierarchy of clusters. The results of hierarchical clustering are usually presented in a dendrogram.

In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which factors are merged (bottom-up view) or clusters are broken up (top-down view).

Dendrogram
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.
Splits and merges are typically performed based on a proximity matrix between clusters.

The proximity of two clusters is the average of the distances between the instances in the two clusters. A proximity matrix for clusters can be calculated from a distance matrix for the instances. The proximity matrix is recalculated in each step of the algorithm.
Agglomerative
Initially consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters (it is a bottom-up method). At first every data point is considered an individual entity or cluster. At every iteration, clusters merge with other clusters until one cluster is formed.
The algorithm for agglomerative hierarchical clustering is:
•Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
•Consider every data point as an individual cluster.
•Merge the clusters which are most similar or closest to each other.
•Recalculate the proximity matrix for each cluster.
•Repeat steps 3 and 4 until only a single cluster remains.
(How-to) Hierarchical Clustering
The number of dendrograms with n leafs = (2n - 3)! / ( 2^(n-2) (n-2)! )

  Number of leafs   Number of possible dendrograms
  2                 1
  3                 3
  4                 15
  5                 105
  ...               ...
  10                34,459,425

Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this as follows:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
We begin with a distance matrix which contains the distances between every pair of objects in our database, e.g.

  0  8  8  7  7
     0  2  4  4
        0  3  3
           0  1
              0

(The D(·,·) values for each pair of objects are read off this matrix.)
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

At each iteration, consider all possible merges and choose the best one; the accompanying figures repeat this "consider all possible merges, choose the best" step until all items are fused into a single hierarchy.
There is only one dataset that can be
perfectly clustered using a hierarchy…
Example
Problem: Assume that the database D is given by the table below. Follow the single-link technique to find clusters in D, using the Euclidean distance measure.
x y
p1 0.40 0.53
p2 0.22 0.38
p3 0.35 0.32
p4 0.26 0.19
p5 0.08 0.41
p6 0.45 0.30
Step 1. Plot the objects in n-dimensional space (where n is the number
of attributes). In our case we have 2 attributes – x and y, so we
plot the objects p1, p2, … p6 in 2-dimensional space:

x y
p1 0.40 0.53
p2 0.22 0.38
p3 0.35 0.32
p4 0.26 0.19
p5 0.08 0.41
p6 0.45 0.30
Step 2. Calculate the distance from each object (point) to all other
points, using Euclidean distance measure, and place the numbers in a
distance matrix.

x y
p1 0.40 0.53
p2 0.22 0.38
p3 0.35 0.32
p4 0.26 0.19
p5 0.08 0.41
p6 0.45 0.30
Step 3. Identify the two clusters with the shortest distance in the matrix and merge them together. Re-compute the distance matrix, as those two clusters are now a single cluster (they no longer exist by themselves).
By looking at the distance matrix above, we see that p3 and p6 have the smallest distance of all: 0.11. So we merge those two into a single cluster and re-compute the distance matrix.
Since we have merged (p3, p6) into a cluster, we now have one entry for (p3, p6) in the table and no longer have p3 or p6 separately. Therefore, we need to re-compute the distance from each point to our new cluster, (p3, p6).
dist( (p3, p6), p1 ) = MIN ( dist(p3, p1) , dist(p6, p1) ) = MIN ( 0.22 , 0.23 )
= 0.22

dist( (p3, p6), p2 )=MIN(dist(p3, p2) , dist(p6, p2) ) = MIN ( 0.15 , 0.25 )
= 0.15

dist( (p3, p6), p4 )=MIN(dist(p3, p4) , dist(p6, p4) ) = MIN ( 0.15 , 0.22 )
= 0.15
dist( (p3, p6), p5 )=MIN(dist(p3, p5) , dist(p6, p5) ) = MIN ( 0.28 , 0.39)
= 0.28
dist( (p3, p6), (p2, p5) ) = MIN ( dist(p3, p2) , dist(p6, p2), dist(p3, p5), dist(p6, p5) )
= MIN ( 0.15 , 0.25, 0.28, 0.39) =
0.15
dist( (p2, p5), p1 ) = MIN ( dist(p2, p1), dist(p5, p1) ) = MIN ( 0.24, 0.34 ) = 0.24

dist( (p2, p5), p4 ) = MIN ( dist(p2, p4), dist(p5, p4) ) = MIN ( 0.2, 0.29 ) = 0.2

dist( ((p2, p5), (p3, p6)), p1 ) = MIN ( dist((p2, p5), p1), dist((p3, p6), p1) ) = MIN ( 0.24, 0.22 ) = 0.22

dist( ((p2, p5), (p3, p6)), p4 ) = MIN ( dist((p2, p5), p4), dist((p3, p6), p4) ) = MIN ( 0.2, 0.15 ) = 0.15

dist( (p2, p5, p3, p6, p4), p1 ) = 0.22
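The same single-link result can be checked with MATLAB's hierarchical clustering functions (pdist, linkage, dendrogram and cluster are in the Statistics and Machine Learning Toolbox); this is a sketch of that check, not part of the original worked example.

% Single-link (nearest-neighbor) hierarchical clustering of p1..p6.
D = [0.40 0.53;   % p1
     0.22 0.38;   % p2
     0.35 0.32;   % p3
     0.26 0.19;   % p4
     0.08 0.41;   % p5
     0.45 0.30];  % p6

Z = linkage(pdist(D, 'euclidean'), 'single');  % merge order and merge distances
dendrogram(Z)                                  % draw the hierarchy
T = cluster(Z, 'maxclust', 2);                 % cut the tree into two clusters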
We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is not obvious.

•Single linkage (nearest neighbor): In this method the distance between two
clusters is determined by the distance of the two closest objects (nearest
neighbors) in the different clusters.
•Complete linkage (furthest neighbor): In this method, the distances between
clusters are determined by the greatest distance between any two objects in the
different clusters (i.e., by the "furthest neighbors").
•Group average linkage: In this method, the distance between two clusters is
calculated as the average distance between all pairs of objects in the two different
clusters.
•Ward's linkage: In this method, we try to minimize the variance of the merged clusters.
Density-based clustering
Density-based clustering is a clustering technique which groups together instances
that are closely packed together (instances with many nearby neighbors), marking as
outliers instances that lie alone in low-density regions (whose nearest neighbors are
far away).

Properties of the algorithms:
• Clusters are dense regions in the instance space, separated by regions of lower instance density.
• A cluster is defined as a set of connected instances with maximal density.
• They do not need a predefined target value for the number of clusters, but they need thresholds for reachability and density to be defined.
• They discover clusters of arbitrary shape.
• They are insensitive to noise.

Example of such an algorithm:
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
Density-based clustering as exemplified by the approach in DBSCAN
Instances are classified as core instances, reachable instances or outliers.
• A core instance has a minimum number of instances within a threshold radius.
• An instance is density reachable from another instance if it is within the threshold radius of a core instance.
• An instance is density connected to another instance if both instances are density reachable from a third instance, or if they are directly density reachable from each other.
• All instances not reachable from any other instance are considered outliers (possibly noise).
• If p is a core instance, then it forms a cluster together with all instances that are reachable from it. Each cluster contains at least one core instance; non-core points can be part of a cluster, but they form its "edge".
• All points within the cluster are mutually density-connected.
• If a point is density-reachable from any point of the cluster, it is part of the cluster as well.
Density-Based Spatial Clustering of Applications with Noise

•Core – a point that has at least m points within distance n of itself.
•Border – a point that has at least one core point within distance n.
•Noise – a point that is neither a core nor a border point, and has fewer than m points within distance n of itself.

Point A and the other red instances are core instances, because the area surrounding each of them within an ε radius contains a specified minimum number of points (4 in this example). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.
Density-Based Clustering:

Locates regions of high density that are separated from one another by regions of low density.

Choosing the value of Epsilon (EPS/eps):
Eps: two points are considered neighbors if the distance between them is below the threshold epsilon.
MinPts: the minimum number of neighbors a given point should have in order to be classified as a core point.
The Algorithm

1. Randomly select a point p.
2. Retrieve all points density-reachable from p with respect to Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p, and the algorithm visits the next data point.
5. Continue the process until all points have been processed.
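A minimal DBSCAN run in MATLAB is sketched below (the dbscan function is in the Statistics and Machine Learning Toolbox, R2019a or later); the points and the Eps/MinPts values are illustrative, not taken from these slides.

% DBSCAN on a small 2-D dataset.
X = [1 1; 1.1 1.2; 0.9 1.1; 5 5; 5.1 5.2; 4.9 5.1; 9 1];
epsilon = 0.5;   % Eps: neighborhood radius
minpts  = 3;     % MinPts: minimum neighbors for a core point

idx = dbscan(X, epsilon, minpts);   % cluster index per point; -1 marks noise/outliers
disp(idx)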
Why DBSCAN?

• Partitioning and hierarchical clustering work well for finding spherically shaped or convex clusters. These methods are severely affected by the presence of noise and outliers in the data.
DBSCAN vs. the K-Means Method
Automatic border detection in dermoscopy images using DBSCAN