Cosine Similarity

The document discusses different distance measures used for clustering in data mining including Euclidean distance, Manhattan distance, Jaccard index, Minkowski distance, and Cosine similarity. It provides details on calculating Cosine similarity between vectors including an example. Cosine distance measure determines the cosine of the angle between vectors.

Uploaded by

Sami Bulti


Jimma University

Jimma Institute of Technology

Faculty of Computing and Informatics
Department of Data Science
Course Title: Natural Language Processing (NLP)
Course Code: CIMDS 51206
Assignment title: Distance functions

Submitted to: Workineh Tessema (PhDC)

Submitted by: Samuel Bulti (RM0230/15-0)
Email: [email protected]

Academic Year: 2023, Year 1, Sem II

Submission date: /apr/2023

Jimma, Oromia, Ethiopia

Measures of Distance in Data Mining

Clustering consists of grouping objects that are similar to each other. It therefore depends on
a way to decide whether two items are similar or dissimilar in their properties.

In a data mining sense, a similarity measure is a distance whose dimensions describe object
features. If the distance between two data points is small, there is a high degree of similarity
between the objects, and vice versa. Similarity is subjective and depends heavily on the context
and application. For example, similarity among vegetables can be determined from their taste,
size, color, etc.

Most clustering approaches use distance measures to assess the similarities or differences
between a pair of objects. The most popular distance measures are:

1. Euclidean distance
2. Manhattan distance
3. Jaccard index (similarity)
4. Minkowski distance
5. Cosine index (similarity)

The first four have been assigned to four of my classmates to discuss briefly. Let me introduce
the fifth one, Cosine Similarity.

Cosine similarity is a measure of how similar data objects are, irrespective of their size
(magnitude). For example, we can measure the similarity between two sentences in Python
using cosine similarity. In cosine similarity, each data object in a dataset is treated as a
vector. The formula for the cosine similarity between two vectors is:

Cos(x, y) = (x · y) / (||x|| * ||y||)

where,

x · y = dot product of the vectors ‘x’ and ‘y’.

||x|| and ||y|| = lengths (Euclidean norms) of the two vectors ‘x’ and ‘y’.

||x|| * ||y|| = ordinary product of the two lengths (a scalar, not a cross product).
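
The formula above can be sketched in Python as follows. This is only a minimal illustration, not a library implementation. The function names `cosine_similarity` and `sentence_to_vectors` are hypothetical, and the bag-of-words step (lower-casing, splitting on whitespace, counting words) is an assumed, simplified way of turning two sentences into vectors.

```python
import math
from collections import Counter


def cosine_similarity(x, y):
    """Return (x . y) / (||x|| * ||y||) for two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def sentence_to_vectors(s1, s2):
    """Assumed preprocessing: word-count vectors over the combined vocabulary."""
    c1 = Counter(s1.lower().split())
    c2 = Counter(s2.lower().split())
    vocab = sorted(set(c1) | set(c2))
    return [c1[w] for w in vocab], [c2[w] for w in vocab]


v1, v2 = sentence_to_vectors("the cat sat on the mat", "the cat lay on the rug")
print(round(cosine_similarity(v1, v2), 2))  # 0.75
```

The two sample sentences share "the" (twice), "cat" and "on", so the dot product is 6 and each vector has squared length 8, which gives 6 / 8 = 0.75.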

Example:
Consider an example to find the similarity between two vectors ‘x’ and ‘y’, using Cosine
Similarity.

The ‘x’ vector has values, x = { 3, 2, 0, 5 }

The ‘y’ vector has values, y = { 1, 0, 0, 0 }


The formula for calculating the cosine similarity is Cos(x, y) = (x · y) / (||x|| * ||y||). Hence:

x · y = 3*1 + 2*0 + 0*0 + 5*0 = 3

||x|| = √((3)^2 + (2)^2 + (0)^2 + (5)^2) = √38 ≈ 6.16

||y|| = √((1)^2 + (0)^2 + (0)^2 + (0)^2) = 1

∴ Cos(x, y) = 3 / (6.16 * 1) ≈ 0.49

The dissimilarity between the two vectors ‘x’ and ‘y’ is given by:

∴ Dis(x, y) = 1 - Cos(x, y) = 1 - 0.49 = 0.51
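
The arithmetic in this example can be checked with a few lines of Python. This is a small sketch that uses only the standard library. The variable names (`dot`, `norm_x`, `norm_y`, `cos_xy`, `dis_xy`) are chosen here for illustration and are not part of the assignment.

```python
import math

x = [3, 2, 0, 5]
y = [1, 0, 0, 0]

dot = sum(a * b for a, b in zip(x, y))        # 3*1 + 2*0 + 0*0 + 5*0
norm_x = math.sqrt(sum(a * a for a in x))     # sqrt(38)
norm_y = math.sqrt(sum(b * b for b in y))     # sqrt(1)

cos_xy = dot / (norm_x * norm_y)              # cosine similarity
dis_xy = 1 - cos_xy                           # dissimilarity

print(dot, round(norm_x, 2), round(norm_y, 2))  # 3 6.16 1.0
print(round(cos_xy, 2), round(dis_xy, 2))       # 0.49 0.51
```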

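• Cosine similarity is the cosine of the angle θ between the two vectors.

• If θ = 0°, then Cos(x, y) = 1: the ‘x’ and ‘y’ vectors point in the same direction (overlap), so they are as similar as possible.

• If θ = 90°, then Cos(x, y) = 0: the ‘x’ and ‘y’ vectors are orthogonal, so they are dissimilar.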
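
The link between the angle θ and similarity can also be illustrated in code. The sketch below recovers θ in degrees from the cosine similarity using the inverse cosine. The helper name `angle_degrees` is hypothetical, and the clamping step is a precaution added here against floating-point rounding.

```python
import math


def angle_degrees(x, y):
    """Angle (in degrees) between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = dot / (norm_x * norm_y)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


print(round(angle_degrees([3, 0], [5, 0]), 1))              # 0.0  (same direction)
print(round(angle_degrees([1, 0], [0, 1]), 1))              # 90.0 (orthogonal)
print(round(angle_degrees([3, 2, 0, 5], [1, 0, 0, 0]), 1))  # 60.9 (the example vectors)
```

For the example vectors ‘x’ and ‘y’ used earlier, the angle of roughly 61° lies between the two extremes, which matches the moderate similarity of about 0.49.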