
Similarity and Dissimilarity Measures

• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to either a similarity or a dissimilarity
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.

Attribute Type      Dissimilarity                      Similarity
Nominal             d = 0 if x = y, d = 1 if x ≠ y     s = 1 if x = y, s = 0 if x ≠ y
Ordinal             d = |x − y| / (n − 1)              s = 1 − d
                    (values mapped to integers 0 to n − 1, where n is the number of values)
Interval or Ratio   d = |x − y|                        s = −d, s = 1 / (1 + d), or s = e^(−d)
Euclidean Distance
• Euclidean Distance

dist(x, y) = sqrt( Σ (x_k − y_k)² ), summing over k = 1, …, n

where n is the number of dimensions (attributes) and x_k and y_k are,
respectively, the kth attributes (components) of data objects x and y.
• x = (3, 6, 0, 3, 6)
• y = (1, 2, 0, 1, 2)
• dist(x, y) = sqrt( (3 − 1)² + (6 − 2)² + (0 − 0)² + (3 − 1)² + (6 − 2)² )
  dist(x, y) = sqrt(40) ≈ 6.325

Standardization is necessary if scales differ.
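The worked example above can be checked with a short Python sketch (the `euclidean` helper name is illustrative, not a fixed API):

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
print(round(euclidean(x, y), 3))  # 6.325, i.e. sqrt(40)
```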


Euclidean Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
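The distance matrix can be reproduced with a few lines of Python (a minimal sketch; the dictionary layout and helper name are illustrative):

```python
import math

# Points from the example: p1..p4
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))

names = sorted(points)
matrix = [[round(euclidean(points[r], points[c]), 3) for c in names] for r in names]
for name, row in zip(names, matrix):
    print(name, row)  # e.g. p1 [0.0, 2.828, 3.162, 5.099]
```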
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance:

dist(x, y) = ( Σ |x_k − y_k|^r )^(1/r), summing over k = 1, …, n

where r is a parameter, n is the number of dimensions (attributes) and
x_k and y_k are, respectively, the kth attributes (components) of
data objects x and y.

Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming
distance, which is just the number of bits that are different between
two binary vectors

• r = 2. Euclidean distance

• r → ∞. “supremum” (Lmax norm, L∞ norm) distance.


• This is the maximum difference between any component of the
vectors
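The three cases above (r = 1, r = 2, r → ∞) can be sketched with a single parameterized function (illustrative helper, not from the slides):

```python
def minkowski(x, y, r):
    """Minkowski (Lr) distance; r=1 is city block (Manhattan),
    r=2 is Euclidean, r=float('inf') is the supremum (Lmax) distance."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

x, y = (0, 2), (5, 1)
print(minkowski(x, y, 1))             # 6.0   (city block)
print(round(minkowski(x, y, 2), 3))   # 5.099 (Euclidean)
print(minkowski(x, y, float("inf")))  # 5     (supremum)
```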
Hamming Distance
• Hamming distance is the number of positions in
which bit-vectors differ.
• Example: p1 = 10101
p2 = 10011.
• d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th
positions.
• The Hamming distance is the L1 norm of the difference between the binary vectors.
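A sketch of this in Python (function name illustrative):

```python
def hamming(p, q):
    """Number of positions at which two equal-length bit strings differ."""
    if len(p) != len(q):
        raise ValueError("bit strings must be the same length")
    return sum(a != b for a, b in zip(p, q))

print(hamming("10101", "10011"))  # 2 (bits 3 and 4 differ)
```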
Distances for real vectors
• Vectors x = (x1, …, xd) and y = (y1, …, yd)
• Lp norms or Minkowski distance:

Lp(x, y) = ( |x1 − y1|^p + … + |xd − yd|^p )^(1/p)

• L2 norm: Euclidean distance:

L2(x, y) = sqrt( (x1 − y1)² + … + (xd − yd)² )

• L1 norm: Manhattan distance:

L1(x, y) = |x1 − y1| + … + |xd − yd|

• L∞ norm:

L∞(x, y) = max( |x1 − y1|, …, |xd − yd| )

• The limit of Lp as p goes to infinity.
Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1    p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1   p2   p3   p4
p1    0    2    3    5
p2    2    0    1    3
p3    3    1    0    2
p4    5    3    2    0

Distance Matrix
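The L1, L2, and L∞ matrices can all be regenerated in one loop; a small sketch reusing the point coordinates from the table (the `minkowski` helper is illustrative):

```python
points = [("p1", (0, 2)), ("p2", (2, 0)), ("p3", (3, 1)), ("p4", (5, 1))]

def minkowski(a, b, r):
    """Minkowski (Lr) distance; r = float('inf') gives the supremum norm."""
    diffs = [abs(ak - bk) for ak, bk in zip(a, b)]
    return max(diffs) if r == float("inf") else sum(d ** r for d in diffs) ** (1 / r)

matrices = {}
for label, r in [("L1", 1), ("L2", 2), ("Linf", float("inf"))]:
    matrices[label] = [
        [round(minkowski(p, q, r), 3) for _, q in points] for _, p in points
    ]

for label, rows in matrices.items():
    print(label)
    for row in rows:
        print(row)
```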
Example of Distances
• x = (5, 5), y = (9, 8)

• L2-norm: dist(x, y) = sqrt(4² + 3²) = 5

• L1-norm: dist(x, y) = 4 + 3 = 7

• L∞-norm: dist(x, y) = max(4, 3) = 4
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.

• A distance that satisfies these properties is a metric
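The three metric properties can be checked mechanically on the sample points from the earlier distance matrix; a small sketch (point values taken from the slides, helper name illustrative):

```python
import itertools
import math

def euclidean(a, b):
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1..p4 from the distance-matrix example

for x, y, z in itertools.product(points, repeat=3):
    assert euclidean(x, y) >= 0                        # positivity
    assert (euclidean(x, y) == 0) == (x == y)          # zero only for identical points
    assert euclidean(x, y) == euclidean(y, x)          # symmetry
    # triangle inequality, with a tiny tolerance for floating-point error
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12
print("all metric properties hold on the sample points")
```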


Similarity Between Binary Vectors
• Common situation is that objects, x and y, have only binary
attributes
• Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
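The same counts can be computed directly from the two vectors; a minimal sketch (function name illustrative):

```python
def smc_and_jaccard(x, y):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    smc = (f11 + f00) / (f11 + f00 + f10 + f01)
    nonzero = f11 + f10 + f01
    jaccard = f11 / nonzero if nonzero else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(x, y))  # (0.7, 0.0)
```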


Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| · ||d2||),
where <d1, d2> indicates the inner product (vector dot product) of the vectors d1 and d2,
and ||d|| is the length of vector d.

cos(X, Y) = Σ (x_i · y_i) / ( sqrt(Σ x_i²) · sqrt(Σ y_i²) ), summing over i

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> ( d1 • d2 ) =3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.449
cos(d1, d2 ) = 0.3150
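A small sketch of the computation above (helper name illustrative):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two vectors: dot product over lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```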
Common Properties of a Similarity
• Similarities also have some well-known properties.
1. s(x, y) = 1 (or maximum similarity) only if x = y.
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects),


x and y.
Similarities into distances
• Jaccard distance:
𝐽𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 – 𝐽𝑆𝑖𝑚(𝑋, 𝑌)

• Jaccard Distance is a metric

• Cosine distance:
𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 − cos(𝑋, 𝑌)
