Distance Measure
A set of points is called a space, and a distance measure is always defined on some space. Let x and y be two points in the space. A distance measure is a function that takes the two points x and y as input and produces the distance between them as output. The distance function is denoted as : d(x, y)
− The output produced by the function d is a real number which satisfies the following
axioms:
1. Non-negativity : The distance between any two points can never be negative.
d(x, y) ≥ 0
2. Zero distance : The distance between a point and itself is zero.
d(x, y) = 0 iff x = y
3. Symmetry : The distance from x to y is the same as the distance from y to x.
d(x, y) = d(y, x)
4. Triangle inequality : The direct distance between x and y is always smaller than or equal to the distance between x and y via another point z. In other words, the distance measure is the length of the shortest path between the two points x and y.
d(x, y) ≤ d(x, z) + d(z, y)
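The four axioms can be tested mechanically on a finite sample of points. The sketch below is only an illustration: the function name satisfies_distance_axioms, the sample values and the two candidate functions are assumptions made for this example, and passing the check on a sample does not prove that a function is a distance measure on the whole space.

```python
from itertools import product

def satisfies_distance_axioms(d, points):
    """Check the four axioms for d on every triple drawn from the sample points."""
    for x, y, z in product(points, repeat=3):
        if d(x, y) < 0:                      # 1. non-negativity
            return False
        if (d(x, y) == 0) != (x == y):       # 2. zero distance iff x = y
            return False
        if d(x, y) != d(y, x):               # 3. symmetry
            return False
        if d(x, y) > d(x, z) + d(z, y):      # 4. triangle inequality
            return False
    return True

# Hypothetical sample of points on the real line.
sample = [-3, 0, 2, 7]

def abs_diff(x, y):
    return abs(x - y)

def squared_diff(x, y):
    return (x - y) ** 2

print(satisfies_distance_axioms(abs_diff, sample))      # True
print(satisfies_distance_axioms(squared_diff, sample))  # False
```

The squared difference fails only the triangle inequality: for x = -3, y = 7 and z = 0 it gives 100 on the left but 9 + 49 = 58 on the right.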
The Euclidean distance is the most popular of all the distance measures.
− The Euclidean distance is measured on the Euclidean space. In an n-dimensional Euclidean space, each point is a vector of n real numbers. For example, in the two-dimensional Euclidean space each point is represented by (x1, x2), where x1 and x2 are real numbers.
Q. Consider the two points (10, 4) and (6, 7) in the two-dimensional Euclidean space. Find the Euclidean distance between them.
Soln. :
(1) L2-norm (the Euclidean distance) = √((10 – 6)² + (4 – 7)²)
= √(4² + 3²)
= √(16 + 9)
= √25
= 5
(2) L1-norm = |10 – 6| + |4 – 7|
= 4 + 3
= 7
(3) L∞-norm = max(|10 – 6|, |4 – 7|)
= max(4, 3)
= 4
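The three norms in the solution can be reproduced with a few lines of Python. This is a minimal sketch; the function name norm_distances is an assumption chosen for illustration.

```python
import math

def norm_distances(p, q):
    """Return the (L1, L2, L-infinity) distances between two points of equal dimension."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    l1 = sum(diffs)                              # sum of absolute differences
    l2 = math.sqrt(sum(d * d for d in diffs))    # Euclidean distance
    linf = max(diffs)                            # largest absolute difference
    return l1, l2, linf

l1, l2, linf = norm_distances((10, 4), (6, 7))
print(l1, l2, linf)  # 7 5.0 4
```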
Jaccard Distance
Jaccard distance is measured in the space of sets. The Jaccard distance between two sets x and y is defined as :
d(x, y) = 1 – SIM(x, y)
SIM(x, y) is the Jaccard similarity, which measures the closeness of two sets. It is the ratio of the size of the intersection to the size of the union of the sets x and y :
SIM(x, y) = |x ∩ y| / |x ∪ y|
We can verify the distance axioms on the Jaccard distance :
1. Non-negativity : The size of the intersection of two sets can never be more than the size of the union. This means the ratio SIM(x, y) is always less than or equal to 1. Thus d(x, y) will never be negative.
2. Zero distance : If x = y, then x ∪ x = x ∩ x = x. In this case SIM(x, y) = |x|/|x| = 1. Hence, d(x, y) = 1 – 1 = 0. In other words, the Jaccard distance between a set and itself is zero.
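The definition d(x, y) = 1 – SIM(x, y) translates directly into code. In this sketch the function name jaccard_distance and the example sets are assumptions; treating two empty sets as being at distance 0 is also an assumption, since the ratio is undefined when the union is empty.

```python
def jaccard_distance(x, y):
    """Jaccard distance: 1 - |x ∩ y| / |x ∪ y|."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1 - len(x & y) / len(union)

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(jaccard_distance(a, b))  # intersection has 2 elements, union has 6, so about 0.667
print(jaccard_distance(a, a))  # 0.0
```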
Cosine Distance
The cosine distance is measured in those spaces which have dimensions. Examples of such
spaces are :
1. Euclidean spaces in which the vector components are real numbers, and
2. Discrete versions of Euclidean spaces in which the vector components are integers or
Boolean (0 and 1).
Cosine distance is the angle between the two vectors drawn from the origin to the two points in the space. This angle ranges from 0 to 180 degrees.
The steps involved in calculating the cosine distance given two vectors x and y are :
1. Find the dot product x.y :
4. Triangle inequality : The sum of the rotations from x to z and then z to y can never
be less than the direct rotation from x to y.
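The calculation can be sketched in Python. Only the first step (the dot product) is spelled out in the list above; the remaining steps used here, dividing by the product of the vector lengths and taking the arc cosine, follow the standard definition of the angle between two vectors and are an assumption about the omitted steps. The function name cosine_distance_degrees is hypothetical, and both vectors are assumed to be non-zero.

```python
import math

def cosine_distance_degrees(x, y):
    """Angle in degrees between two non-zero vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(x, y))        # dot product x.y
    norm_x = math.sqrt(sum(a * a for a in x))     # length of x
    norm_y = math.sqrt(sum(b * b for b in y))     # length of y
    cos_theta = dot / (norm_x * norm_y)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

print(cosine_distance_degrees((1, 0), (0, 1)))    # perpendicular vectors: about 90 degrees
print(cosine_distance_degrees((1, 0), (-1, 0)))   # opposite vectors: about 180 degrees
```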
Edit Distance
① ② ③ ④ ⑤ ⑥ Positions
Clearly, positions ②, ③ and ⑥ have different characters in x and y, so we need to make the necessary insertions and deletions at these three positions.
x :        K  L  –
y :        L  O  P
Position : 2  3  6
From position ② of string x, we have to delete the character K. The characters following
K will be shifted one position to the left.
In the final step, the character P has to be inserted in the string x at position 6.
After the third and final edit operation (insertion) the status of the string x is :
x = J L O M N P
①②③④⑤⑥
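When only insertions and deletions are allowed, as in the example, the edit distance can be computed from the longest common subsequence (LCS) of the two strings: every character outside the LCS is either deleted from x or inserted into x. The sketch below assumes the starting string x = JKLMN and the target y = JLOMNP, which are inferred from the position table and the final status of x; the function name edit_distance is also an assumption.

```python
def edit_distance(x, y):
    """Insert/delete edit distance: len(x) + len(y) - 2 * LCS(x, y)."""
    m, n = len(x), len(y)
    # lcs[i][j] is the LCS length of the prefixes x[:i] and y[:j].
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return m + n - 2 * lcs[m][n]

# Strings inferred from the example: delete K, insert O, insert P.
print(edit_distance("JKLMN", "JLOMNP"))  # 3
```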
Hamming Distance
Hamming distance is applicable in the space of vectors. Hamming distance between two
vectors is the number of components in which they differ from each other.
− For example, let us consider the following two vectors :
x=1 0 0 0 1 1
y=1 1 1 0 1 0
① ② ③ ④ ⑤ ⑥ → Positions
− The Hamming distance between the above two vectors is 3 because components at
positions 2, 3 and 6 are different.
− The distance axioms on the Hamming distance may be verified as follows :
1. Non-negativity : Any two vectors differ in zero or more component positions, so the Hamming distance can never be negative.
2. Zero distance : The Hamming distance is zero only when the two vectors are identical.
3. Symmetry : The Hamming distance will be the same whether x is compared with y or y
is compared with x.
4. Triangle inequality : The number of differences between x and z, plus the number of
differences between z and y can never be less than the number of differences between x
and y.
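The Hamming distance of the example vectors can be computed by counting the positions at which the components differ. This is a minimal sketch; the function name hamming_distance is an assumption, and vectors of different lengths are rejected because the distance is defined only for vectors with the same number of components.

```python
def hamming_distance(x, y):
    """Number of component positions in which x and y differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return sum(a != b for a, b in zip(x, y))

x = [1, 0, 0, 0, 1, 1]
y = [1, 1, 1, 0, 1, 0]
print(hamming_distance(x, y))  # 3 (positions 2, 3 and 6 differ)
```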
Stream Computing
− Stream computing is useful in real-time systems, for example counting the items placed on a conveyor belt.
− IBM announced a stream computing system in 2007. It runs on 800 microprocessors and enables software applications to split into tasks and then rearrange the data into an answer.
− ATI Technologies derives stream computing from graphics processors (GPUs) working alongside high-performance, low-latency CPUs to resolve computational issues.
− ATI preferred stream computing to run applications on the GPU instead of the CPU.
− The BDMO algorithm has a complex structure and is designed to give guaranteed performance even in the worst case.
− BDMO was designed by B. Babcock, M. Datar, R. Motwani and L. O'Callaghan.
A small size ‘p’ is chosen for a bucket, where p is a power of 2. The timestamp of a bucket is the timestamp of the most recent point in the bucket.
− The points in a bucket are clustered by a specific strategy. The method chosen for clustering at the initial stage provides the centroids or clustroids, which become the record for each cluster.
Let,
* ‘p’ be the smallest bucket size.
* Every p points, a new bucket is created; the bucket is timestamped and stores the clusters of its points.
* Any bucket older than N is dropped.
* If there are 3 buckets of size p, merge the oldest two.
− The merge may then propagate to larger sizes (2p, 4p, …).
− While merging buckets, a new bucket is created by reviewing the sequence of buckets.
− If a bucket's timestamp is more than N time units prior to the current time, nothing in that bucket lies within the window, so the bucket is dropped.
− If we have created three buckets of size p, the two oldest of the three are merged. The newly merged bucket has size 2p, which may in turn require merging buckets of increasing sizes.
− When two consecutive buckets are merged, the size of the new bucket is twice the size of each of the two buckets being merged. The timestamp of the newly merged bucket is the more recent of the timestamps of the two consecutive buckets. The decision on whether to merge clusters is taken by computing a few parameters.
− Let the space be Euclidean with k-means clustering. A cluster is represented by its number of points (n) and its centroid (c).
Put p = k, or larger, and apply k-means clustering while creating a bucket.
To merge two clusters : n = n1 + n2, c = (n1c1 + n2c2) / (n1 + n2)
− Let the space be non-Euclidean. A cluster is represented by its clustroid and CSD. To choose the new clustroid while merging, the k points furthest from the clustroids are selected as candidates.
CSDm(P) = CSD1(P) + N2 (d²(P, c1) + d²(c1, c2)) + CSD2(c2)
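Two of the rules above can be sketched in Python: the Euclidean merge of two clusters, and the propagated merge of bucket sizes. This is a simplified illustration under stated assumptions, not the full BDMO algorithm: the names merge_clusters and add_bucket are hypothetical, buckets are reduced to their sizes (listed from newest to oldest), and timestamps, dropping of buckets older than N and the clusters stored inside each bucket are left out.

```python
def merge_clusters(n1, c1, n2, c2):
    """Merge two Euclidean clusters: n = n1 + n2, c = (n1*c1 + n2*c2) / (n1 + n2)."""
    n = n1 + n2
    c = tuple((n1 * a + n2 * b) / n for a, b in zip(c1, c2))
    return n, c

def add_bucket(sizes, p):
    """Add a new bucket of size p to sizes (ordered newest to oldest).

    Whenever three buckets share a size, the two oldest of them are merged
    into one bucket of twice that size, and the check repeats for that size.
    """
    sizes.insert(0, p)
    size = p
    while sizes.count(size) == 3:
        oldest = len(sizes) - 1 - sizes[::-1].index(size)  # index of the oldest bucket of this size
        sizes[oldest - 1:oldest + 1] = [2 * size]          # replace the two oldest by one merged bucket
        size *= 2
    return sizes

print(merge_clusters(3, (0.0, 0.0), 1, (4.0, 8.0)))  # (4, (1.0, 2.0))

sizes = []
for _ in range(7):
    add_bucket(sizes, 1)
print(sizes)  # [1, 2, 4]
```

With p = 1, the seventh bucket triggers a merge at size 1 that propagates to size 2, leaving buckets of sizes 1, 2 and 4.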
Answering Queries
− Given m, choose the smallest set of buckets that covers the most recent m points. These buckets contain at most 2m points.
− Bucket construction and solution generation are the two steps used for query rewriting in the shared-variable bucket algorithm, one of the efficient approaches for answering queries.
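The bucket selection for a query over the most recent m points can be sketched as follows. The sketch assumes that buckets are reduced to their sizes and listed from newest to oldest; the function name buckets_for_query is hypothetical, and pooling the clusters of the chosen buckets into an answer is left out.

```python
def buckets_for_query(sizes, m):
    """Return the newest buckets (by size) needed to cover the most recent m points."""
    chosen = []
    covered = 0
    for size in sizes:            # sizes are ordered newest to oldest
        if covered >= m:
            break
        chosen.append(size)
        covered += size
    if covered < m:
        raise ValueError("the window holds fewer than m points")
    return chosen

chosen = buckets_for_query([1, 2, 4, 8], 5)
print(chosen, sum(chosen))  # [1, 2, 4] 7
```

Here the three newest buckets cover 7 points, which is within the 2m = 10 bound for m = 5.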