Cosine Similarity
Clustering consists of grouping objects that are similar to each other. It can be used to
decide whether two items are similar or dissimilar in their properties.
In a data mining sense, the similarity measure is a distance whose dimensions describe object
features. If the distance between two data points is small, there is a high degree of
similarity between the objects, and vice versa. Similarity is subjective and depends heavily
on the context and application. For example, similarity among vegetables can be determined
from their taste, size, color, etc.
Most clustering approaches use distance measures to assess the similarities or differences
between a pair of objects. The most popular measures are:
1. Euclidean distance
2. Manhattan distance
3. Jaccard index (similarity)
4. Minkowski distance
5. Cosine index (similarity)
The first four have been assigned to four of my classmates to discuss briefly. Let me
introduce the fifth one: cosine similarity.
Cosine similarity is a measure of how similar data objects are, irrespective of their
size. For example, we can measure the similarity between two sentences in Python using
cosine similarity. In cosine similarity, each data object in a dataset is treated as a
vector. The formula to find the cosine similarity between two vectors is:
Cos(x, y) = (x . y) / (||x|| * ||y||)
where,
x . y = dot product of the two vectors ‘x’ and ‘y’.
||x|| and ||y|| = lengths (magnitudes) of the two vectors ‘x’ and ‘y’.
||x|| * ||y|| = product of the magnitudes of the two vectors ‘x’ and ‘y’.
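The formula can be written as a short Python function. This is only a minimal sketch in plain Python; the function name cosine_similarity and the choice to raise an error for a zero-length vector are assumptions made for illustration.

```python
import math

def cosine_similarity(x, y):
    """Return Cos(x, y) for two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(x, y))      # x . y
    norm_x = math.sqrt(sum(a * a for a in x))   # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))   # ||y||
    if norm_x == 0 or norm_y == 0:
        # The angle is undefined when one vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

Calling cosine_similarity([1, 0], [0, 1]) gives 0, because the two vectors are perpendicular.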
Example:
Consider an example to find the similarity between two vectors ‘x’ and ‘y’, using Cosine
Similarity.
The dissimilarity between the two vectors ‘x’ and ‘y’ is given by:
Dis(x, y) = 1 - Cos(x, y)
If θ = 0°, the ‘x’ and ‘y’ vectors point in the same direction (Cos(x, y) = 1), which
indicates that they are similar.
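To make the calculation concrete, here is a small worked sketch in Python. The vectors x = [3, 4] and y = [4, 3] are hypothetical values picked for easy arithmetic, and the dissimilarity is assumed to be 1 - Cos(x, y), the usual definition of cosine distance.

```python
import math

# Hypothetical vectors chosen for easy arithmetic (illustration only).
x = [3, 4]
y = [4, 3]

dot = sum(a * b for a, b in zip(x, y))       # 3*4 + 4*3 = 24
norm_x = math.sqrt(sum(a * a for a in x))    # sqrt(9 + 16) = 5
norm_y = math.sqrt(sum(b * b for b in y))    # sqrt(16 + 9) = 5

similarity = dot / (norm_x * norm_y)         # 24 / 25 = 0.96
dissimilarity = 1 - similarity               # assumed definition: 1 - Cos(x, y)
theta = math.degrees(math.acos(similarity))  # angle between x and y in degrees

print(f"Cos(x, y) = {similarity:.2f}")       # Cos(x, y) = 0.96
print(f"Dis(x, y) = {dissimilarity:.2f}")    # Dis(x, y) = 0.04
print(f"theta     = {theta:.2f} degrees")    # theta     = 16.26 degrees
```

The small angle (about 16°) and the similarity close to 1 both say the same thing: these two vectors point in nearly the same direction.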
Advantages:
Cosine similarity is beneficial because even if two similar data objects are far
apart by Euclidean distance because of their size, they can still have a small angle
between them. The smaller the angle, the higher the similarity.
When plotted in a multi-dimensional space, the cosine similarity captures the orientation
(the angle) of the data objects and not the magnitude.
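The point about orientation versus magnitude can be checked with a quick sketch. The vectors and the helper names below are assumptions made for illustration: the second vector is just the first one scaled by ten, so it has the same direction but a much larger size.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

small = [1, 2, 3]
large = [10 * v for v in small]   # same direction, ten times the magnitude

# Euclidean distance is large because the sizes differ (roughly 33.7 here).
print(euclidean_distance(small, large))
# Cosine similarity is 1 up to floating-point rounding: the angle is 0.
print(cosine_similarity(small, large))
```

Euclidean distance treats the two objects as far apart, while cosine similarity treats them as pointing the same way.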
In the equivalent form cos(θ) = (A . B) / (||A|| * ||B||), θ (theta) is the angle between
the two vectors, and A and B are n-dimensional vectors.
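The earlier remark about comparing two sentences in Python can be sketched as follows. This is one simple approach, not the only one: each sentence is turned into a vector of raw word counts (a bag-of-words model, which is an assumption here), and the function name sentence_cosine_similarity is hypothetical.

```python
import math
from collections import Counter

def sentence_cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity of two sentences using raw word counts."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    # Only words that occur in both sentences contribute to the dot product.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an empty sentence
    return dot / (norm_a * norm_b)

# The shared words "the", "cat" and "on" give a similarity of about 0.75.
print(sentence_cosine_similarity("the cat sat on the mat",
                                 "the cat lay on the rug"))
```

Two sentences with no words in common get 0, and a sentence compared with itself gets 1 (up to rounding). The sentence length does not matter, only the mix of words.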