Measures of Distance in Data Mining
Last Updated: 30 Jul, 2024
Clustering groups objects that are similar to each other, and the same notion of similarity can be used to decide whether two items are alike or different in their properties. In data mining, similarity is usually measured as a distance in a space whose dimensions describe the features of the objects. If the distance between two data points is small, the objects are highly similar; if it is large, they are dissimilar. Similarity is subjective and depends heavily on the context and application. For example, the similarity of vegetables can be judged by taste, size, colour, and so on. Most clustering approaches use a distance measure to assess the similarity or difference between a pair of objects. The most popular measures are:
1. Euclidean Distance:
Euclidean distance is the traditional metric for geometric problems. It is simply the ordinary straight-line distance between two points, and it is one of the most widely used distance measures in cluster analysis; K-means is one algorithm that commonly uses it. Mathematically, it is the square root of the sum of squared differences between the coordinates of the two objects.
\begin{aligned}
d(\mathbf{p}, \mathbf{q})=d(\mathbf{q}, \mathbf{p}) &=\sqrt{\left(q_{1}-p_{1}\right)^{2}+\left(q_{2}-p_{2}\right)^{2}+\cdots+\left(q_{n}-p_{n}\right)^{2}} \\
&=\sqrt{\sum_{i=1}^{n}\left(q_{i}-p_{i}\right)^{2}}
\end{aligned}
Figure - Euclidean Distance
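The formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `euclidean_distance` and the sample points are assumptions chosen for the example, not part of the original article.

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


# Sample points forming a 3-4-5 right triangle.
print(euclidean_distance((1, 2), (4, 6)))  # prints 5.0
```

The same function works for any number of dimensions, as long as both points have the same length.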
2. Manhattan Distance:
Manhattan distance is the sum of the absolute differences between the coordinates of two points. To find the distance between two points P and Q, add up how far apart they are along each axis. In a plane with P at (x1, y1) and Q at (x2, y2), the Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|

In the figure, the total length of the red line gives the Manhattan distance between the two points.
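As a rough sketch, the sum of absolute coordinate differences can be computed as follows. The function name `manhattan_distance` and the sample points are illustrative assumptions.

```python
def manhattan_distance(p, q):
    """Sum of absolute differences between matching coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


# |1 - 4| + |2 - 6| = 3 + 4
print(manhattan_distance((1, 2), (4, 6)))  # prints 7
```

For the same pair of points the Manhattan distance (7) is larger than the Euclidean distance (5), because it follows the axes instead of the straight line.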
3. Jaccard Index:
The Jaccard index measures the similarity of two sets of data items as the size of their intersection divided by the size of their union. The corresponding Jaccard distance is 1 - J(A, B).
J(A, B)=\frac{|A \cap B|}{|A \cup B|}=\frac{|A \cap B|}{|A|+|B|-|A \cap B|}
Figure - Jaccard Index
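A small sketch with Python's built-in sets shows the intersection-over-union calculation. The function names, the sample item sets, and the convention of returning 1.0 for two empty sets are assumptions made for this example.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Dissimilarity derived from the index: 1 - J(A, B)."""
    return 1.0 - jaccard_index(a, b)


basket_1 = {"apple", "banana", "carrot"}
basket_2 = {"banana", "carrot", "daikon", "eggplant"}

# 2 shared items out of 5 distinct items overall.
print(jaccard_index(basket_1, basket_2))  # prints 0.4
```

Because the index is a similarity (1 means identical sets), clustering code that expects a distance would typically use `jaccard_distance` instead.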
4. Minkowski distance:
It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as
(x1, x2, ..., xN)
Consider two points P1 and P2:
P1: (x1, x2, ..., xN)
P2: (y1, y2, ..., yN)
Then the Minkowski distance of order p (with p ≥ 1) between P1 and P2 is given as:
\sqrt[p]{\left|x_{1}-y_{1}\right|^{p}+\left|x_{2}-y_{2}\right|^{p}+\ldots+\left|x_{N}-y_{N}\right|^{p}}
- When p = 2, Minkowski distance is the same as the Euclidean distance.
- When p = 1, Minkowski distance is the same as the Manhattan distance.
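The two special cases can be checked with a short sketch. The function name `minkowski_distance` and the sample points are assumptions for illustration; the code takes absolute differences so that odd values of p also give a valid distance, and it rejects p below 1.

```python
def minkowski_distance(x, y, p):
    """p-th root of the sum of absolute coordinate differences raised to p."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)


a, b = (1, 2), (4, 6)
print(minkowski_distance(a, b, 1))  # Manhattan case: 3 + 4
print(minkowski_distance(a, b, 2))  # Euclidean case: sqrt(9 + 16)
```

Larger values of p give more weight to the single largest coordinate difference.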
5. Cosine Index:
The cosine similarity measure for clustering is the cosine of the angle between two vectors, given by the following formula.
\operatorname{sim}(A, B)=\cos (\theta)=\frac{A \cdot B}{\|A\|\|B\|}
Here \theta is the angle between the two vectors, and A and B are n-dimensional vectors. The corresponding cosine distance is 1 - sim(A, B).
Figure - Cosine Distance
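The dot-product formula can be sketched with the standard library alone. The function name `cosine_similarity`, the sample vectors, and the choice to raise an error for zero vectors (where the angle is undefined) are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity((1, 0), (0, 1)))  # orthogonal vectors: prints 0.0
print(cosine_similarity((1, 2), (2, 4)))  # same direction: approximately 1.0
```

The second pair differs in length but not in direction, which is why the measure suits data where magnitude matters less than orientation, such as term-frequency vectors.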