
Similarity and Dissimilarity Measures

• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to either a similarity or a dissimilarity
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.

Attribute Type      Dissimilarity                      Similarity
Nominal             d = 0 if x = y, d = 1 if x ≠ y     s = 1 if x = y, s = 0 if x ≠ y
Ordinal             d = |x − y| / (n − 1)              s = 1 − d
                    (values mapped to integers 0 to n − 1, where n is the number of values)
Interval or Ratio   d = |x − y|                        s = −d, s = 1 / (1 + d), or s = e^(−d)
Euclidean Distance
• Euclidean Distance

dist(x, y) = sqrt( Σ (x_k − y_k)² ), summing over k = 1, …, n

where n is the number of dimensions (attributes) and x_k and y_k are,
respectively, the kth attributes (components) of data objects x and y.
• x = (3, 6, 0, 3, 6)
• y = (1, 2, 0, 1, 2)
• dist(x, y) = sqrt( (3 − 1)² + (6 − 2)² + (0 − 0)² + (3 − 1)² + (6 − 2)² )
  dist(x, y) = sqrt(40) ≈ 6.325

Standardization is necessary if scales differ.
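The worked example above can be checked with a short Python sketch (the `euclidean` helper name is illustrative, not a fixed API):

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
print(round(euclidean(x, y), 3))  # 6.325, i.e. sqrt(40)
```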


Euclidean Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
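The distance matrix can be reproduced with a few lines of Python (a minimal sketch; the dictionary layout and helper name are illustrative):

```python
import math

# Points from the example: p1..p4
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))

names = sorted(points)
matrix = [[round(euclidean(points[r], points[c]), 3) for c in names] for r in names]
for name, row in zip(names, matrix):
    print(name, row)  # e.g. p1 [0.0, 2.828, 3.162, 5.099]
```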
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance:

dist(x, y) = ( Σ |x_k − y_k|^r )^(1/r), summing over k = 1, …, n

where r is a parameter, n is the number of dimensions (attributes) and
x_k and y_k are, respectively, the kth attributes (components) of
data objects x and y.

Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming
distance, which is just the number of bits that are different between
two binary vectors

• r = 2. Euclidean distance

• r → ∞. “supremum” (Lmax norm, L∞ norm) distance.


• This is the maximum difference between any component of the
vectors
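The three cases above (r = 1, r = 2, r → ∞) can be sketched with a single parameterized function (illustrative helper, not from the slides):

```python
def minkowski(x, y, r):
    """Minkowski (Lr) distance; r=1 is city block (Manhattan),
    r=2 is Euclidean, r=float('inf') is the supremum (Lmax) distance."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

x, y = (0, 2), (5, 1)
print(minkowski(x, y, 1))             # 6.0   (city block)
print(round(minkowski(x, y, 2), 3))   # 5.099 (Euclidean)
print(minkowski(x, y, float("inf")))  # 5     (supremum)
```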
Hamming Distance
• Hamming distance is the number of positions in
which bit-vectors differ.
• Example: p1 = 10101
p2 = 10011.
• d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th
positions.
• The Hamming distance is the L1 norm of the difference between the binary vectors.
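A sketch of this in Python (function name illustrative):

```python
def hamming(p, q):
    """Number of positions at which two equal-length bit strings differ."""
    if len(p) != len(q):
        raise ValueError("bit strings must be the same length")
    return sum(a != b for a, b in zip(p, q))

print(hamming("10101", "10011"))  # 2 (bits 3 and 4 differ)
```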
Distances for real vectors
• Vectors x = (x1, …, xd) and y = (y1, …, yd)
• Lp norms or Minkowski distance:

Lp(x, y) = ( |x1 − y1|^p + … + |xd − yd|^p )^(1/p)

• L2 norm: Euclidean distance:

L2(x, y) = sqrt( (x1 − y1)² + … + (xd − yd)² )

• L1 norm: Manhattan distance:

L1(x, y) = |x1 − y1| + … + |xd − yd|

• L∞ norm:

L∞(x, y) = max( |x1 − y1|, …, |xd − yd| )

• The limit of Lp as p goes to infinity.
Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1    p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1   p2   p3   p4
p1    0    2    3    5
p2    2    0    1    3
p3    3    1    0    2
p4    5    3    2    0

Distance Matrix
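The L1, L2, and L∞ matrices can all be regenerated in one loop; a small sketch reusing the point coordinates from the table (the `minkowski` helper is illustrative):

```python
points = [("p1", (0, 2)), ("p2", (2, 0)), ("p3", (3, 1)), ("p4", (5, 1))]

def minkowski(a, b, r):
    """Minkowski (Lr) distance; r = float('inf') gives the supremum norm."""
    diffs = [abs(ak - bk) for ak, bk in zip(a, b)]
    return max(diffs) if r == float("inf") else sum(d ** r for d in diffs) ** (1 / r)

matrices = {}
for label, r in [("L1", 1), ("L2", 2), ("Linf", float("inf"))]:
    matrices[label] = [
        [round(minkowski(p, q, r), 3) for _, q in points] for _, p in points
    ]

for label, rows in matrices.items():
    print(label)
    for row in rows:
        print(row)
```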
Example of Distances
• x = (5, 5), y = (9, 8)

• L2-norm: dist(x, y) = sqrt(4² + 3²) = 5

• L1-norm: dist(x, y) = 4 + 3 = 7

• L∞-norm: dist(x, y) = max(4, 3) = 4
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.

• A distance that satisfies these properties is a metric
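The three metric properties can be checked mechanically on the sample points from the earlier distance matrix; a small sketch (point values taken from the slides, helper name illustrative):

```python
import itertools
import math

def euclidean(a, b):
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1..p4 from the distance-matrix example

for x, y, z in itertools.product(points, repeat=3):
    assert euclidean(x, y) >= 0                        # positivity
    assert (euclidean(x, y) == 0) == (x == y)          # zero only for identical points
    assert euclidean(x, y) == euclidean(y, x)          # symmetry
    # triangle inequality, with a tiny tolerance for floating-point error
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12
print("all metric properties hold on the sample points")
```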


Similarity Between Binary Vectors
• Common situation is that objects, x and y, have only binary
attributes
• Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
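The same counts can be computed directly from the two vectors; a minimal sketch (function name illustrative):

```python
def smc_and_jaccard(x, y):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    smc = (f11 + f00) / (f11 + f00 + f10 + f01)
    nonzero = f11 + f10 + f01
    jaccard = f11 / nonzero if nonzero else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(x, y))  # (0.7, 0.0)
```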


Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| · ||d2||),
where <d1, d2> indicates the inner product (vector dot product) of the vectors d1 and d2,
and ||d|| is the length of vector d.

cos(X, Y) = Σ (x_i · y_i) / ( sqrt(Σ x_i²) · sqrt(Σ y_i²) ), summing over i

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> ( d1 • d2 ) =3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.449
cos(d1, d2 ) = 0.3150
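A small sketch of the computation above (helper name illustrative):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two vectors: dot product over lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```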
Common Properties of a Similarity
• Similarities also have some well-known properties.
1. s(x, y) = 1 (or maximum similarity) only if x = y.
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects),


x and y.
Similarities into distances
• Jaccard distance:
𝐽𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 – 𝐽𝑆𝑖𝑚(𝑋, 𝑌)

• Jaccard Distance is a metric

• Cosine distance:
𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 − cos(𝑋, 𝑌)
