Cosine Similarity

The document discusses different distance measures used for clustering in data mining including Euclidean distance, Manhattan distance, Jaccard index, Minkowski distance, and Cosine similarity. It provides details on calculating Cosine similarity between vectors including an example. Cosine distance measure determines the cosine of the angle between vectors.

Uploaded by

Sami Bulti
Copyright
© All Rights Reserved

Jimma University

Jimma Institute of Technology


Faculty of Computing and Informatics
Department of Data Science
Course Title: Natural Language Processing (NLP)
Course Code: CIMDS 51206
Assignment title: Distance functions

Submitted to: Workineh Tessema(PhDC)


Submitted by: Samuel Bulti(RM0230/15-0)
Email: [email protected]

Academic Year: 2023, Year 1, Sem II

Submission date: /apr/2023


Jimma, Oromia, Ethiopia
Measures of Distance in Data Mining

Clustering consists of grouping objects that are similar to each other. A distance measure can be
used to decide whether two items are similar or dissimilar in their properties.

In a data mining sense, a similarity measure is a distance whose dimensions describe object
features. That means if the distance between two data points is small, there is a high degree
of similarity between the objects, and vice versa. Similarity is subjective and depends heavily
on the context and application. For example, similarity among vegetables can be determined
from their taste, size, color, etc.

Most clustering approaches use distance measures to assess the similarities or differences
between a pair of objects. The most popular distance measures are:

1. Euclidean distance
2. Manhattan distance
3. Jaccard index (similarity)
4. Minkowski distance
5. Cosine index (similarity)

The first four were assigned to four of my classmates to discuss briefly. Let me introduce the
5th one, which is Cosine Similarity.

Cosine similarity is a measure that helps determine how similar data objects are,
irrespective of their size (magnitude). For example, we can measure the similarity between two
sentences in Python using Cosine Similarity. In cosine similarity, each data object in a dataset
is treated as a vector. The formula to find the cosine similarity between two vectors is –

Cos(x, y) = (x . y) / (||x|| * ||y||)

where,

x . y = dot product of the vectors ‘x’ and ‘y’.

||x|| and ||y|| = lengths (Euclidean norms) of the two vectors ‘x’ and ‘y’.

||x|| * ||y|| = ordinary product of the two lengths (a scalar, not a cross product).
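
The formula can be sketched directly in Python. This is a minimal illustration using only the standard library, not a reference implementation; the function name `cosine_similarity` and the choice to raise an error for a zero vector (where the formula divides by zero) are assumptions made here.

```python
import math

def cosine_similarity(x, y):
    """Return (x . y) / (||x|| * ||y||) for two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(x, y))        # x . y
    norm_x = math.sqrt(sum(a * a for a in x))     # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))     # ||y||
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat the undefined 0/0 case as an error.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

Calling `cosine_similarity([1, 0], [0, 1])` gives 0.0, since the dot product of perpendicular vectors is zero.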

Example:
Consider an example to find the similarity between two vectors ‘x’ and ‘y’, using Cosine
Similarity.

The ‘x’ vector has values, x = { 3, 2, 0, 5 }


The ‘y’ vector has values, y = { 1, 0, 0, 0 }

Distance functions Page 1


The formula for calculating the cosine similarity is: Cos(x, y) = (x . y) / (||x|| * ||y||). Hence:

x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3

||x|| = √((3)^2 + (2)^2 + (0)^2 + (5)^2) = √38 ≈ 6.16

||y|| = √((1)^2 + (0)^2 + (0)^2 + (0)^2) = √1 = 1

∴ Cos(x, y) = 3 / (6.16 * 1) = 0.49

The dissimilarity between the two vectors ‘x’ and ‘y’ is given by:

∴ Dis(x, y) = 1 - Cos(x, y) = 1 - 0.49 = 0.51
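
The same arithmetic can be checked with a few lines of Python. This is only a sketch that repeats the steps of the worked example; the variable names are chosen here for illustration.

```python
import math

x = [3, 2, 0, 5]
y = [1, 0, 0, 0]

dot = sum(a * b for a, b in zip(x, y))       # 3*1 + 2*0 + 0*0 + 5*0
norm_x = math.sqrt(sum(a * a for a in x))    # sqrt(38)
norm_y = math.sqrt(sum(b * b for b in y))    # sqrt(1)

cos_xy = dot / (norm_x * norm_y)             # similarity
dis_xy = 1 - cos_xy                          # dissimilarity

print(dot)                 # 3
print(round(norm_x, 2))    # 6.16
print(round(cos_xy, 2))    # 0.49
print(round(dis_xy, 2))    # 0.51
```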

The cosine similarity between two vectors depends on the angle ‘θ’ between them.

If θ = 0°, the ‘x’ and ‘y’ vectors point in the same direction, so Cos(x, y) = 1 and they are maximally similar.

If θ = 90°, the ‘x’ and ‘y’ vectors are orthogonal, so Cos(x, y) = 0 and they are dissimilar.

Figure – Cosine similarity between two vectors.
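
To see the link between the similarity value and the angle, the angle θ can be recovered with the inverse cosine. The sketch below is illustrative only; the helper name `angle_degrees` is an assumption, and the result is clamped to [-1, 1] because floating-point rounding can push the ratio slightly outside that range.

```python
import math

def angle_degrees(x, y):
    """Return the angle between x and y in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = dot / (norm_x * norm_y)
    # Guard against rounding errors such as 1.0000000000000002 before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

same_direction = angle_degrees([1, 2], [2, 4])   # parallel vectors, angle near 0
right_angle = angle_degrees([1, 0], [0, 1])      # orthogonal vectors, angle near 90
example = angle_degrees([3, 2, 0, 5], [1, 0, 0, 0])

print(same_direction, right_angle, example)
```

For the vectors x = { 3, 2, 0, 5 } and y = { 1, 0, 0, 0 } used in the example, the angle comes out at roughly 61°, which sits between the two extreme cases.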

Advantages:

Cosine similarity is beneficial because even if two similar data objects are far
apart by Euclidean distance because of their size, they can still have a small angle
between them. The smaller the angle, the higher the similarity.

When plotted in a multi-dimensional space, the cosine similarity captures the orientation
(the angle) of the data objects and not the magnitude.
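
A small sketch can make this advantage concrete. Here two vectors point in the same direction but differ in magnitude by a factor of ten; the vectors and function names are made up for illustration.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

small = [1, 2, 3]
large = [10, 20, 30]   # same orientation, ten times the magnitude

distance = euclidean_distance(small, large)   # large: about 33.67
similarity = cosine_similarity(small, large)  # approximately 1.0

print(distance, similarity)
```

The Euclidean distance reports the two objects as far apart, while the cosine similarity stays at (approximately) 1 because only the orientation is compared.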


Cosine distance measure for clustering is based on the cosine of the angle between two vectors,
given by the following formula.

cos(θ) = (A . B) / (||A|| * ||B||)

Here θ (theta) is the angle between the two vectors, and A and B are n-dimensional vectors.

Figure – Cosine Distance
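
Because the vectors can have any number of dimensions, the same measure applies to text once each sentence is turned into a vector. The sketch below is one possible way to do this, assuming a simple bag-of-words representation (lowercased, split on whitespace); the function name, the example sentences, and the choice to return 0.0 for an empty sentence are all assumptions made for illustration.

```python
import math
from collections import Counter

def sentence_cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity of two sentences using word-count vectors."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    # One dimension per distinct word that appears in either sentence.
    vocabulary = sorted(set(counts_a) | set(counts_b))
    a = [counts_a[word] for word in vocabulary]
    b = [counts_b[word] for word in vocabulary]
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # assumption: an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)

s1 = "the cat sat on the mat"
s2 = "the cat lay on the rug"
score = sentence_cosine_similarity(s1, s2)
print(round(score, 2))   # 0.75
```

The two sentences share "the" (twice), "cat" and "on", giving a dot product of 6, and each word-count vector has length √8, so the similarity is 6 / 8.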
