
Distance Metrics - Billard

The document discusses symbolic data analysis focusing on dissimilarity, similarity, and distance measures for clustering, particularly in the context of veterinary data. It defines various distance measures, including dissimilarity measures, distance metrics, and ultrametric measures, along with their properties and examples. Additionally, it introduces the Gowda-Diday dissimilarity measure for multi-valued observations and provides examples of distance calculations for different types of variables.


Symbolic Data Analysis:

Dissimilarity/Similarity/Distance Measures
(for Clustering)

Lynne Billard

Department of Statistics
University of Georgia
[email protected]

COMPSTAT - August 2010

Billard Symbolic Data


Distances Clustering

Consider Veterinary Data (Table 7.5)


ωu Animal Y1 Height Y2 Weight
ω1 Horse M [120.0, 180.0] [222.2, 354.0]
ω2 Horse F [158.0, 160.0] [322.0, 355.0]
ω3 Bear M [175.0, 185.0] [117.2, 152.0]
ω4 Deer M [37.9, 62.9] [22.2, 35.0]
ω5 Deer F [25.8, 39.6] [15.0, 36.2]
ω6 Dog F [22.8, 58.6] [15.0, 51.8]
ω7 Rabbit M [22.0, 45.0] [0.8, 11.0]
ω8 Rabbit F [18.0, 53.0] [0.4, 2.5]
ω9 Cat M [40.3, 55.8] [2.1, 4.5]
ω10 Cat F [38.4, 72.4] [2.5, 6.1]

[Figure: plots of the veterinary data in two panels: all animals ωu, u = 1, ..., 10, and animals ωu, u = 4, ..., 10.]

Dis/Similarity / Distance Measures

Distance Measures, Similarity/Dissimilarity Matrices:

Goal is to subdivide the complete set of observations E into subsets that form a partition

$$P_r = (C_1, \dots, C_r) \equiv E \quad \text{with} \quad \cup_k C_k = E \quad \text{and} \quad C_{k'} \cap C_k = \emptyset, \; k' \ne k.$$

Mathematically, use distance measures to produce what we see visually in the veterinary data.

Let the dissimilarity measure between objects a and b be d(a, b), and the
corresponding similarity measure be s(a, b).

[Typically, d(a, b) and s(a, b) have a reciprocal/inverse relationship, e.g., d(a, b) = 1 − s(a, b). So, consider d(a, b).]

Definition 7.1: Let a and b be any two objects in E . Then, a dissimilarity measure
d(a, b) is a measure that satisfies
(i) d(a, b) = d(b, a);
(ii) d(a, a) = d(b, b) < d(a, b) for all a ≠ b;
(iii) d(a, a) = 0 for all a ∈ E .

Definition 7.2: A distance measure (or metric) is a dissimilarity measure as defined in Definition 7.1 which further satisfies
(iv) d(a, b) = 0 implies a = b;
(v) d(a, b) ≤ d(a, c) + d(c, b) for all a, b, c ∈ E.

Then from property (i), the dissimilarity d(a, b) is symmetric, and (v) is the triangle property.

Definition 7.3: An ultrametric measure is a distance measure as defined in Definition 7.2 which also satisfies
(vi) d(a, b) ≤ Max{d(a, c), d(c, b)} for all a, b, c ∈ E.

Ultrametrics and hierarchies are in 1-1 correspondence, so ultrametrics are needed to compare hierarchies.

[Figure: two triangles on objects a, b, c. First panel: d(a, b) ≤ max{d(a, c), d(b, c)}, ultrametric. Second panel: d(a, b) ≥ max{d(a, c), d(b, c)}, NOT ultrametric.]

Definition 7.4: For the collection of objects $a_1, \dots, a_m \in E$, the dissimilarity matrix (or distance matrix) is the m × m matrix D with elements $d(a_i, a_j)$, i, j = 1, ..., m.

E.g., for three objects a, b, c (a dot marks an entry given by symmetry):

$$D = \begin{pmatrix} 0 & 2 & 3 \\ 2 & 0 & 3 \\ 3 & 3 & 0 \end{pmatrix} \;\text{(ultrametric)}, \qquad D = \begin{pmatrix} 0 & 2 & 1.5 \\ \cdot & 0 & 1.2 \\ \cdot & \cdot & 0 \end{pmatrix} \;\text{(NOT ultrametric)}$$

Notice that property (v), d(a, b) ≤ d(a, c) + d(c, b) for all a, b, c, holds in both matrices.
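
As a minimal sketch of how properties (v) and (vi) can be checked on a small distance matrix, the Python below tests every triple of objects. The function names `satisfies_triangle` and `is_ultrametric` are illustrative only, and the two example matrices are written out in full symmetric form.

```python
from itertools import permutations


def satisfies_triangle(D):
    """Property (v): d(a, b) <= d(a, c) + d(c, b) for all a, b, c."""
    idx = range(len(D))
    return all(D[a][b] <= D[a][c] + D[c][b] for a, b, c in permutations(idx, 3))


def is_ultrametric(D):
    """Property (vi): d(a, b) <= max{d(a, c), d(c, b)} for all a, b, c."""
    idx = range(len(D))
    return all(D[a][b] <= max(D[a][c], D[c][b]) for a, b, c in permutations(idx, 3))


# The two 3 x 3 example matrices, filled in by symmetry.
D_left = [[0, 2, 3], [2, 0, 3], [3, 3, 0]]
D_right = [[0, 2, 1.5], [2, 0, 1.2], [1.5, 1.2, 0]]

print(satisfies_triangle(D_left), is_ultrametric(D_left))
print(satisfies_triangle(D_right), is_ultrametric(D_right))
```

Checking all ordered triples is cubic in the number of objects, which is fine for small illustrative matrices like these.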

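Definition 7.5: A dissimilarity (or distance) matrix whose elements d(a, b) monotonically increase as they move away from the diagonal (by column and by row) is called a Robinson matrix. (Some authors require only monotonically non-decreasing elements.)

Robinson matrices are in 1-1 correspondence with indexed pyramids.

$$D = \begin{pmatrix} 0 & 2 & 3 \\ 2 & 0 & 3 \\ 3 & 3 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 2 & 1.5 \\ \cdot & 0 & 1.2 \\ \cdot & \cdot & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 2 & 3 \\ \cdot & 0 & 2.5 \\ \cdot & \cdot & 0 \end{pmatrix}$$

First matrix: ultrametric; (Not ?) Robinson, since its last column reads 3, 3 moving away from the diagonal, which is non-decreasing but not strictly increasing.
Second matrix: NOT ultrametric; Not Robinson.
Third matrix: Robinson; NOT ultrametric, since 3 > Max{2, 2.5} violates (vi).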
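
The Robinson property of Definition 7.5 can be checked mechanically. The sketch below is one possible reading of the definition, not a reference implementation: `is_robinson` and its `strict` flag are hypothetical names, and only the upper triangle is scanned because the matrices are symmetric.

```python
def is_robinson(D, strict=True):
    """Check that entries grow along rows and columns moving away from the diagonal.

    strict=True requires strictly increasing entries; strict=False accepts
    the non-decreasing convention.
    """
    m = len(D)
    grows = (lambda near, far: near < far) if strict else (lambda near, far: near <= far)
    for i in range(m):
        for j in range(i + 1, m - 1):
            if not grows(D[i][j], D[i][j + 1]):  # along row i, to the right
                return False
    for j in range(m):
        for i in range(1, j):
            if not grows(D[i][j], D[i - 1][j]):  # up column j
                return False
    return True


# The three example matrices, filled in by symmetry.
D_a = [[0, 2, 3], [2, 0, 3], [3, 3, 0]]
D_b = [[0, 2, 1.5], [2, 0, 1.2], [1.5, 1.2, 0]]
D_c = [[0, 2, 3], [2, 0, 2.5], [3, 2.5, 0]]

for D in (D_a, D_b, D_c):
    print(is_robinson(D), is_robinson(D, strict=False))
```

The first matrix passes only under the non-decreasing convention, which is the ambiguity behind the "(Not ?)" label.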

Definition 7.6: The Cartesian join $A \oplus B = (A_1 \oplus B_1, \dots, A_p \oplus B_p)$ between two sets A and B is their componentwise union, where $A_j \oplus B_j$ = "$A_j \cup B_j$". When A and B are multi-valued objects with $A_j = \{a_{j1}, \dots, a_{js_j}\}$ and $B_j = \{b_{j1}, \dots, b_{jt_j}\}$, then

$$A_j \oplus B_j = \{a_{j1}, \dots, b_{jt_j}\}, \quad j = 1, \dots, p, \tag{7.1}$$

is the set of values in $A_j$, $B_j$ or both. When A and B are interval-valued objects with $A_j = [a_{jA}, b_{jA}]$ and $B_j = [a_{jB}, b_{jB}]$, then

$$A_j \oplus B_j = [\text{Min}(a_{jA}, a_{jB}), \text{Max}(b_{jA}, b_{jB})]. \tag{7.2}$$

Definition 7.7: The Cartesian meet $A \otimes B = (A_1 \otimes B_1, \dots, A_p \otimes B_p)$ between two sets A and B is their componentwise intersection, where $A_j \otimes B_j$ = "$A_j \cap B_j$". When A and B are multi-valued objects, then $A_j \otimes B_j$ is the list of possible values from $Y_j$ common to both. When A and B are interval-valued objects forming overlapping intervals on $Y_j$,

$$A_j \otimes B_j = [\text{Max}(a_{jA}, a_{jB}), \text{Min}(b_{jA}, b_{jB})] \tag{7.3}$$

and when $A_j \cap B_j = \emptyset$, then $A_j \otimes B_j = 0$.

E.g. 1, multi-valued variables:

A = ({blue, gray, pink, green}, {shirt, dress}, {small, large})
B = ({blue, white}, {shirt, slacks, dress}, {small, medium})

Then, the join is
A ⊕ B = ({blue, gray, pink, green, white}, {shirt, slacks, dress}, {small, medium, large}),
and the meet is
A ⊗ B = ({blue}, {shirt, dress}, {small}).

E.g. 2, interval-valued variables:

A = ([6, 12], [16, 22]), B = ([8, 10], [18, 24])

Then, the join is
A ⊕ B = ([6, 12], [16, 24]),
and the meet is
A ⊗ B = ([8, 10], [18, 22]).

E.g. 3, mixed variables (multi- and interval-valued):

Let A = ([6, 12], {shirt, dress}), B = ([8, 10], {shirt, slacks, dress}).
Then, A ⊕ B = ([6, 12], {shirt, slacks, dress}), A ⊗ B = ([8, 10], {shirt, dress}).
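
The join and meet of Definitions 7.6 and 7.7 translate directly into code. The following is a minimal sketch under two assumptions that are not in the slides: intervals are stored as `(low, high)` tuples, and an empty interval meet is represented by `None`. The helper names `join` and `meet` are illustrative.

```python
def join(a, b):
    """Cartesian join of one component: set union or interval hull."""
    if isinstance(a, (set, frozenset)):
        return a | b
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a, b):
    """Cartesian meet of one component: set intersection or interval overlap.

    Returns None when two intervals do not overlap (empty meet).
    """
    if isinstance(a, (set, frozenset)):
        return a & b
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# E.g. 3: one interval-valued and one multi-valued variable.
A = [(6, 12), {"shirt", "dress"}]
B = [(8, 10), {"shirt", "slacks", "dress"}]

A_join_B = [join(x, y) for x, y in zip(A, B)]
A_meet_B = [meet(x, y) for x, y in zip(A, B)]
print(A_join_B)
print(A_meet_B)
```

Applying the helpers componentwise with `zip` mirrors the "componentwise union/intersection" wording of the definitions.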

Multi-valued Variables:
Write observations ξ(ωu) as

$$\xi(\omega_u) = (\{Y_{u1k_1}, k_1 = 1, \dots, k_{1u}\}; \dots; \{Y_{upk_p}, k_p = 1, \dots, k_{pu}\}). \tag{7.14}$$

Definition 7.15: The Gowda-Diday dissimilarity measure between two multi-valued observations ξ(ω1) and ξ(ω2) of the form (7.14) is

$$D(\omega_1, \omega_2) = \sum_{j=1}^{p} [D_{1j}(\omega_1, \omega_2) + D_{2j}(\omega_1, \omega_2)]$$

where

$$D_{1j}(\omega_1, \omega_2) = |k_{j1} - k_{j2}| / k_j, \quad j = 1, \dots, p, \tag{7.15}$$

$$D_{2j}(\omega_1, \omega_2) = (k_{j1} + k_{j2} - 2k_j^*) / k_j, \quad j = 1, \dots, p, \tag{7.16}$$

where $k_j$ is the number of values from $Y_j$ in the join and $k_j^*$ is the number in the meet of ξ(ω1) and ξ(ω2), respectively, and $k_{ju}$ is the number of values from $Y_j$ in ωu.

$D_{1j}(\omega_1, \omega_2)$ is a span distance (relative sizes) component, and $D_{2j}(\omega_1, \omega_2)$ is a relative content component, of the distance.

Write $D(\omega_1, \omega_2) = \sum_j \phi_j(\omega_1, \omega_2)$.

E.g., Color and Habitat of Birds (Table 7.2)

ωu   Species    Y1 = Color           Y2 = Habitat
ω1   species1   {red, black}         {urban, rural}
ω2   species2   {red}                {urban}
ω3   species3   {red, black, blue}   {rural}
ω4   species4   {red, black, blue}   {urban, rural}

Recall

$$D(\omega_1, \omega_2) = \sum_{j=1}^{p} [D_{1j}(\omega_1, \omega_2) + D_{2j}(\omega_1, \omega_2)] = \sum_j \phi_j(\omega_1, \omega_2),$$

$$D_{1j}(\omega_1, \omega_2) = |k_{j1} - k_{j2}| / k_j, \qquad D_{2j}(\omega_1, \omega_2) = (k_{j1} + k_{j2} - 2k_j^*) / k_j, \quad j = 1, \dots, p, \tag{7.15, 7.16}$$

where $k_j$ is the number of values from $Y_j$ in the join and $k_j^*$ is the number in the meet of ξ(ω1) and ξ(ω2), respectively, and $k_{ju}$ is the number of values from $Y_j$ in ωu.

For Y1: D11(ω1, ω3) = |2 − 3|/3 = 1/3; D21(ω1, ω3) = (2 + 3 − 2 × 2)/3 = 1/3.

For Y2: D12(ω1, ω3) = |2 − 1|/2 = 1/2; D22(ω1, ω3) = (2 + 1 − 2 × 1)/2 = 1/2.

φ1(ω1, ω3) = D11(ω1, ω3) + D21(ω1, ω3) = 1/3 + 1/3 = 2/3;

φ2(ω1, ω3) = D12(ω1, ω3) + D22(ω1, ω3) = 1/2 + 1/2 = 1;

D(ω1, ω3) = Σj φj(ω1, ω3) = 2/3 + 1 = 5/3.
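
The worked example can be reproduced with exact fractions. This is a minimal sketch, assuming each observation is stored as a list of Python sets (one per variable); the function name `gowda_diday_multi` and the `birds` dictionary are illustrative, not part of the source.

```python
from fractions import Fraction


def gowda_diday_multi(x, y):
    """Sum over variables of the span (D1j) and content (D2j) components."""
    total = Fraction(0)
    for xj, yj in zip(x, y):
        k_join = len(xj | yj)   # k_j: number of values in the join
        k_meet = len(xj & yj)   # k_j^*: number of values in the meet
        d1 = Fraction(abs(len(xj) - len(yj)), k_join)
        d2 = Fraction(len(xj) + len(yj) - 2 * k_meet, k_join)
        total += d1 + d2
    return total


birds = {
    1: [{"red", "black"}, {"urban", "rural"}],
    2: [{"red"}, {"urban"}],
    3: [{"red", "black", "blue"}, {"rural"}],
    4: [{"red", "black", "blue"}, {"urban", "rural"}],
}

print(gowda_diday_multi(birds[1], birds[3]))  # 5/3, as in the worked example
```

Using `Fraction` keeps the results in the same form as the hand calculation and avoids floating-point rounding.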

The complete table of Gowda-Diday distances, D(ωu, ωu') ≡ φ(ωu, ωu'):

             Y1 = Color          Y2 = Habitat        (Y1, Y2)
(ωu, ωu')    D1    D2    φ1      D1    D2    φ2      φ
(ω1, ω2)     1/2   1/2   1       1/2   1/2   1       2
(ω1, ω3)     1/3   1/3   2/3     1/2   1/2   1       5/3
(ω1, ω4)     1/3   1/3   2/3     0     0     0       2/3
(ω2, ω3)     2/3   2/3   4/3     0     1     1       7/3
(ω2, ω4)     2/3   2/3   4/3     1/2   1/2   1       7/3
(ω3, ω4)     0     0     0       1/2   1/2   1       1

Distance matrix is:

$$D = \begin{pmatrix} 0 & 2 & 5/3 & 2/3 \\ \cdot & 0 & 7/3 & 7/3 \\ \cdot & \cdot & 0 & 1 \\ \cdot & \cdot & \cdot & 0 \end{pmatrix}$$

This is not normalized for scale differences.

To account for scale differences, use $\phi'_j(\omega_u, \omega_{u'}) = \phi_j(\omega_u, \omega_{u'}) / |Y_j|$ and $\phi'(\omega_u, \omega_{u'}) = \sum_j \phi'_j(\omega_u, \omega_{u'})$, where $|Y_j|$ is the number of possible values of $Y_j$ covered by E.

The complete table of Gowda-Diday distances, non-normalized (φ) and normalized (φ'), with |Y1| = 3 and |Y2| = 2:

             Y1 = Color      Y2 = Habitat     (Y1, Y2)
(ωu, ωu')    φ1     φ'1      φ2     φ'2       φ      φ'
(ω1, ω2)     1      1/3      1      1/2       2      5/6
(ω1, ω3)     2/3    2/9      1      1/2       5/3    13/18
(ω1, ω4)     2/3    2/9      0      0         2/3    2/9
(ω2, ω3)     4/3    4/9      1      1/2       7/3    17/18
(ω2, ω4)     4/3    4/9      1      1/2       7/3    17/18
(ω3, ω4)     0      0        1      1/2       1      1/2

Gowda-Diday distance matrix:

Normalized:

$$D' = \begin{pmatrix} 0 & 5/6 & 13/18 & 2/9 \\ \cdot & 0 & 17/18 & 17/18 \\ \cdot & \cdot & 0 & 1/2 \\ \cdot & \cdot & \cdot & 0 \end{pmatrix}$$

Non-normalized:

$$D = \begin{pmatrix} 0 & 2 & 5/3 & 2/3 \\ \cdot & 0 & 7/3 & 7/3 \\ \cdot & \cdot & 0 & 1 \\ \cdot & \cdot & \cdot & 0 \end{pmatrix}$$

Recall that observations ξ(ωu) are written in the form of Equation (7.14).

Definition 7.16: The Ichino-Yaguchi dissimilarity measure between two multi-valued observations ξ(ω1) and ξ(ω2) of the form of Equation (7.14) for the variable $Y_j$, j = 1, ..., p, is

$$\phi_j(\omega_1, \omega_2) = k_j - k_j^* + \gamma(2k_j^* - k_{j1} - k_{j2}), \quad j = 1, \dots, p, \tag{7.17}$$

where $k_j$ is the number of values from $Y_j$ in the join and $k_j^*$ is the number in the meet of ξ(ω1) and ξ(ω2), respectively, with $k_{ju}$ the number of values from $Y_j$ in observation ωu; and where 0 ≤ γ ≤ 0.5 is a prespecified constant.

For the Bird Data (Table 7.4), with the numerical columns evaluated at γ = 1/2:

             φj(ωu, ωu')                     Non-Normalized      Normalized†
(ωu, ωu')    Y1 = Color     Y2 = Habitat     q = 1    q = 2      q = 1    q = 2
(ω1, ω2)     1 + γ(−1)      1 + γ(−1)        0.500    0.707      0.208    0.300
(ω1, ω3)     1 + γ(−1)      1 + γ(−1)        0.500    0.707      0.208    0.300
(ω1, ω4)     1 + γ(−1)      0                0.250    0.500      0.083    0.167
(ω2, ω3)     2 + γ(−2)      2 + γ(−2)        1.000    1.414      0.417    0.601
(ω2, ω4)     2 + γ(−2)      1 + γ(−1)        0.750    1.118      0.181    0.417
(ω3, ω4)     0              1 + γ(−1)        0.250    0.500      0.125    0.250
† Normalized by |Yj|
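
A minimal sketch of Equation (7.17) for a single multi-valued variable follows. The function name `ichino_yaguchi_multi` and the default γ = 1/2 are choices made here for illustration; the source only requires 0 ≤ γ ≤ 0.5.

```python
from fractions import Fraction


def ichino_yaguchi_multi(xj, yj, gamma=Fraction(1, 2)):
    """phi_j = k_j - k_j^* + gamma * (2 k_j^* - k_j1 - k_j2) for one variable."""
    k_join = len(xj | yj)   # k_j
    k_meet = len(xj & yj)   # k_j^*
    return k_join - k_meet + gamma * (2 * k_meet - len(xj) - len(yj))


# Color for species 2 and species 3: phi_1 = 2 + gamma * (-2).
color_2, color_3 = {"red"}, {"red", "black", "blue"}
print(ichino_yaguchi_multi(color_2, color_3, gamma=0))
print(ichino_yaguchi_multi(color_2, color_3))
```

With γ = 0 the measure reduces to the size of the join minus the size of the meet; increasing γ subtracts a share of the non-overlapping content.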

Interval-valued data:

$$\xi_u \equiv \xi(\omega_u) = ([a_{uj}, b_{uj}], \; j = 1, \dots, p), \quad u = 1, \dots, m$$

Definition 7.18: The Ichino-Yaguchi dissimilarity measure between two interval-valued observations $\xi(\omega_{u_1})$ and $\xi(\omega_{u_2})$ of the form $\xi(\omega_u) = [a_{uj}, b_{uj}]$, u = 1, ..., m, for the variable $Y_j$, j = 1, ..., p, is

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = |\omega_{u_1 j} \oplus \omega_{u_2 j}| - |\omega_{u_1 j} \otimes \omega_{u_2 j}| + \gamma(2|\omega_{u_1 j} \otimes \omega_{u_2 j}| - |\omega_{u_1 j}| - |\omega_{u_2 j}|) \tag{7.27}$$

where |A| is the length of the interval A = [a, b], i.e., |A| = b − a, and 0 ≤ γ ≤ 0.5 is a prespecified constant.

Definition 7.19: The generalized Minkowski distance of order q ≥ 1 between two interval-valued objects $\omega_{u_1}$ and $\omega_{u_2}$ is

$$d_q(\omega_{u_1}, \omega_{u_2}) = \Big( \sum_{j=1}^{p} w_j^* [\phi_j(\omega_{u_1}, \omega_{u_2})]^q \Big)^{1/q} \tag{7.28}$$

where $\phi_j(\omega_{u_1}, \omega_{u_2})$ is the Ichino-Yaguchi distance (of Definition 7.18, Equation (7.27)) and $w_j^*$ is an appropriate weight function associated with $Y_j$, j = 1, ..., p.

When q = 1, this is the City Block distance.
When q = 2, this is the Euclidean distance.


Take the first 3 observations only of the veterinary data:

ωu   Animal    Y1 Height         Y2 Weight
ω1   Horse M   [120.0, 180.0]    [222.2, 354.0]
ω2   Horse F   [158.0, 160.0]    [322.0, 355.0]
ω3   Bear M    [175.0, 185.0]    [117.2, 152.0]

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = |\omega_{u_1 j} \oplus \omega_{u_2 j}| - |\omega_{u_1 j} \otimes \omega_{u_2 j}| + \gamma(2|\omega_{u_1 j} \otimes \omega_{u_2 j}| - |\omega_{u_1 j}| - |\omega_{u_2 j}|) \tag{7.27}$$

$$A_j \oplus B_j = [\text{Min}(a_{jA}, a_{jB}), \text{Max}(b_{jA}, b_{jB})] \tag{7.2}$$

$$A_j \otimes B_j = [\text{Max}(a_{jA}, a_{jB}), \text{Min}(b_{jA}, b_{jB})] \tag{7.3}$$

For (HorseF, BearM) and Y1,

$$\begin{aligned}
\phi_1(\omega_{u_1}, \omega_{u_2}) &= |[\text{Min}(158, 175), \text{Max}(160, 185)]| - |[\text{Max}(158, 175), \text{Min}(160, 185)]| \\
&\quad + \gamma(2|[\text{Max}(158, 175), \text{Min}(160, 185)]| - |160 - 158| - |185 - 175|) \\
&= |[158, 185]| - |[175, 160]| + \gamma(2 \times 0 - 2 - 10) \\
&= 27 - 0 + \gamma(-12) = 27 + \gamma(-12).
\end{aligned}$$

Note, the meet [175, 160] is empty, so its length is taken as 0.
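
The interval-valued Ichino-Yaguchi measure (7.27) needs only the lengths of the join, the meet and the two intervals. A minimal sketch, assuming intervals are `(a, b)` tuples and using the hypothetical name `ichino_yaguchi_interval`:

```python
def ichino_yaguchi_interval(x, y, gamma=0.5):
    """phi_j for two intervals x = (a, b) and y = (a, b), following (7.27)."""
    join_len = max(x[1], y[1]) - min(x[0], y[0])
    meet_len = max(0.0, min(x[1], y[1]) - max(x[0], y[0]))  # empty meet -> 0
    own_lens = (x[1] - x[0]) + (y[1] - y[0])
    return join_len - meet_len + gamma * (2 * meet_len - own_lens)


# (HorseF, BearM) on Y1 = Height: 27 + gamma * (-12)
horse_f_height, bear_m_height = (158.0, 160.0), (175.0, 185.0)
print(ichino_yaguchi_interval(horse_f_height, bear_m_height))  # 21.0 at gamma = 1/2
```

Clamping the meet length at zero handles disjoint intervals, matching the note that the meet is empty in this example.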


For the first 3 observations only of the veterinary data, the complete set of Ichino-Yaguchi dissimilarity measures is:

                     φj(ωu1, ωu2)                          γ = 1/2
(ωu1, ωu2)           j = 1          j = 2                  j = 1    j = 2
(HorseM, HorseF)     58 + γ(−58)    100.8 + γ(−100.8)      29       50.4
(HorseM, BearM)      60 + γ(−60)    236.8 + γ(−166.6)      30       153.5
(HorseF, BearM)      27 + γ(−12)    237.8 + γ(−67.8)       21       203.9

Recall the generalized Minkowski distance of order q ≥ 1 of Definition 7.19, Equation (7.28): q = 1 gives the City Block distance and q = 2 gives the Euclidean distance.

The normalized Euclidean distance between two objects $\omega_{u_1}$ and $\omega_{u_2}$ is

$$d_2(\omega_{u_1}, \omega_{u_2}) = \Big( \frac{1}{p} \sum_{j=1}^{p} w_j^* [\phi_j(\omega_{u_1}, \omega_{u_2})]^2 \Big)^{1/2} \tag{7.30}$$

where $\phi_j(\omega_{u_1}, \omega_{u_2})$ is the Ichino-Yaguchi distance (of Definition 7.18, Equation (7.27)) and $w_j^*$ is an appropriate weight function associated with $Y_j$, j = 1, ..., p.

With γ = 1/2, the Ichino-Yaguchi measures for (HorseF, BearM) are φ1 = 21 and φ2 = 203.9, and the normalized Euclidean distance is

$$d_2(\omega_{u_1}, \omega_{u_2}) = \Big( \frac{1}{p} \sum_{j=1}^{p} w_j^* [\phi_j(\omega_{u_1}, \omega_{u_2})]^2 \Big)^{1/2}, \qquad w_j^* = 1 \text{ or } w_j^* = 1/|Y_j|.$$

Unweighted (i.e., $w_j^* = 1$), the normalized Euclidean distance for (HorseF, BearM) is

$$d_2(\text{HorseF}, \text{BearM}) = \big( (1/2)[(21)^2 + (203.9)^2] \big)^{1/2} = 144.94.$$

Weighted (i.e., $w_j^* = 1/|Y_j|$, with |Y1| = 65 and |Y2| = 237.8 for these three animals), the normalized Euclidean distance for (HorseF, BearM) is

$$d_2(\text{HorseF}, \text{BearM}) = \big( (1/2)[(1/65)(21)^2 + (1/237.8)(203.9)^2] \big)^{1/2} = 9.529.$$
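
The unweighted and weighted calculations differ only in the weights, so one small function covers both. This is a sketch of the normalized form with the 1/p factor; the name `normalized_minkowski` and its argument layout are assumptions made for illustration.

```python
def normalized_minkowski(phis, weights=None, q=2):
    """((1/p) * sum_j w_j * phi_j**q) ** (1/q); weights default to 1."""
    p = len(phis)
    if weights is None:
        weights = [1.0] * p
    return (sum(w * phi ** q for w, phi in zip(weights, phis)) / p) ** (1.0 / q)


phis = [21.0, 203.9]                 # (HorseF, BearM), gamma = 1/2
span_weights = [1 / 65, 1 / 237.8]   # w_j = 1/|Y_j| for the first 3 animals

print(round(normalized_minkowski(phis), 3))                # unweighted Euclidean
print(round(normalized_minkowski(phis, span_weights), 3))  # weighted Euclidean
print(round(normalized_minkowski(phis, q=1), 3))           # unweighted City Block
```

Setting `q=1` gives the City Block variant with the same 1/p factor.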


Normalized Euclidean distances using Ichino-Yaguchi dissimilarity measures (γ = 1/2):

                     φj(ωu1, ωu2)       d2(ωu1, ωu2)
(ωu1, ωu2)           j = 1    j = 2     Unweighted   Weighted
(HorseM, HorseF)     29       50.4      41.117       3.437
(HorseM, BearM)      30       153.5     110.594      7.514
(HorseF, BearM)      21       203.9     144.942      9.529

Normalized Euclidean distance matrices:

Unweighted ($w_j^* = 1$):

$$D = \begin{pmatrix} 0 & 41.117 & 110.594 \\ \cdot & 0 & 144.942 \\ \cdot & \cdot & 0 \end{pmatrix}$$

Weighted ($w_j^* = 1/|Y_j|$):

$$D = \begin{pmatrix} 0 & 3.437 & 7.514 \\ \cdot & 0 & 9.529 \\ \cdot & \cdot & 0 \end{pmatrix}$$

Normalized weighted Euclidean distance matrix D using Ichino-Yaguchi dissimilarity measures (γ = 1/2), for all 10 animals:

Animal     HorseM  HorseF  BearM  DeerM  DeerF  DogF   RabbitM  RabbitF  CatM   CatF
HorseM     0       2.47    5.99   11.16  11.76  11.28  12.37    12.45    12.06  11.85
HorseF     .       0       7.74   13.07  13.62  13.16  14.25    14.35    13.97  13.77
BearM      .       .       0      8.13   9.04   8.52   9.36     9.35     8.74   8.39
DeerM      .       .       .      0      0.98   0.70   1.26     1.31     0.98   0.95
DeerF      .       .       .      .      0      0.67   0.78     1.08     1.19   1.48
DogF       .       .       .      .      .      0      1.11     1.23     1.26   1.36
RabbitM    .       .       .      .      .      .      0        0.37     0.81   1.21
RabbitF    .       .       .      .      .      .      .        0        0.69   1.09
CatM       .       .       .      .      .      .      .        .        0      0.51
CatF       .       .       .      .      .      .      .        .        .      0

For the first 3 animals (HorseM, HorseF, BearM) alone we had:

$$D = \begin{pmatrix} 0 & 3.437 & 7.514 \\ \cdot & 0 & 9.529 \\ \cdot & \cdot & 0 \end{pmatrix}$$

The difference is due to differing weights: the span |Yj| in $w_j^* = 1/|Y_j|$ is taken over all 10 animals here, but over only the first 3 animals before.

                     φj(ωu1, ωu2)     Euclidean d2(ωu1, ωu2)     City Block d1(ωu1, ωu2)
(ωu1, ωu2)           j = 1   j = 2    Unweighted   Weighted      Unweighted   Weighted
(HorseM, HorseF)     29      50.4     41.117       3.437         39.70        0.329
(HorseM, BearM)      30      153.5    110.594      7.514         91.75        0.554
(HorseF, BearM)      21      203.9    144.942      9.529         112.45       0.590

Ichino-Yaguchi measures:

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = |\omega_{u_1 j} \oplus \omega_{u_2 j}| - |\omega_{u_1 j} \otimes \omega_{u_2 j}| + \gamma(2|\omega_{u_1 j} \otimes \omega_{u_2 j}| - |\omega_{u_1 j}| - |\omega_{u_2 j}|)$$

Normalized weighted Minkowski distance:

$$d_q(\omega_{u_1}, \omega_{u_2}) = \Big( \frac{1}{p} \sum_{j=1}^{p} w_j^* [\phi_j(\omega_{u_1}, \omega_{u_2})]^q \Big)^{1/q}$$

Unweighted: $w_j^* = 1$; weighted: $w_j^* = 1/|Y_j|$, i.e., $w_1^* = 1/65$, $w_2^* = 1/237.8$.

City Block:

$$d_1(\omega_{u_1}, \omega_{u_2}) = \sum_{j=1}^{p} c_j w_j^* \phi_j(\omega_{u_1}, \omega_{u_2})$$

with City Block factor/weight $c_j = 1/p = 1/2$.

Normalized Euclidean:

$$d_2(\omega_{u_1}, \omega_{u_2}) = \Big( \frac{1}{p} \sum_{j=1}^{p} w_j^* [\phi_j(\omega_{u_1}, \omega_{u_2})]^2 \Big)^{1/2}$$

These are important for Divisive Clustering methodology.

City Block distance matrices (unweighted, weighted):

$$D = \begin{pmatrix} 0 & 39.70 & 91.75 \\ \cdot & 0 & 112.45 \\ \cdot & \cdot & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 0.33 & 0.55 \\ \cdot & 0 & 0.59 \\ \cdot & \cdot & 0 \end{pmatrix}$$

Euclidean distance matrices (unweighted, weighted):

$$D = \begin{pmatrix} 0 & 41.12 & 110.59 \\ \cdot & 0 & 144.94 \\ \cdot & \cdot & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 0.35 & 0.56 \\ \cdot & 0 & 0.65 \\ \cdot & \cdot & 0 \end{pmatrix}$$

None appear to be Robinson matrices.

However, reordering the objects as (HorseF, HorseM, BearM) gives, for City Block (unweighted, weighted),

$$D = \begin{pmatrix} 0 & 39.70 & 112.45 \\ \cdot & 0 & 91.75 \\ \cdot & \cdot & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 0.33 & 0.59 \\ \cdot & 0 & 0.55 \\ \cdot & \cdot & 0 \end{pmatrix}$$

and for Euclidean (unweighted, weighted),

$$D = \begin{pmatrix} 0 & 41.12 & 144.94 \\ \cdot & 0 & 110.59 \\ \cdot & \cdot & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 0 & 0.35 & 0.65 \\ \cdot & 0 & 0.56 \\ \cdot & \cdot & 0 \end{pmatrix}$$

ALL are Robinson matrices.
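
Whether a distance matrix can be made Robinson depends on the order of the objects. For a handful of objects, a brute-force search over all orderings is enough. The sketch below uses that exhaustive search, which is not a method given in the slides, and the names `is_robinson`, `reorder` and `robinson_orders` are hypothetical.

```python
from itertools import permutations


def is_robinson(D):
    """Strictly increasing entries moving away from the diagonal (upper triangle)."""
    m = len(D)
    rows_ok = all(D[i][j] < D[i][j + 1] for i in range(m) for j in range(i + 1, m - 1))
    cols_ok = all(D[i][j] < D[i - 1][j] for j in range(m) for i in range(1, j))
    return rows_ok and cols_ok


def reorder(D, order):
    """Permute rows and columns of D simultaneously."""
    return [[D[i][j] for j in order] for i in order]


def robinson_orders(D):
    """All object orderings under which D becomes a Robinson matrix."""
    return [order for order in permutations(range(len(D)))
            if is_robinson(reorder(D, order))]


names = ["HorseM", "HorseF", "BearM"]
D_city = [[0, 39.70, 91.75], [39.70, 0, 112.45], [91.75, 112.45, 0]]  # unweighted City Block

print(is_robinson(D_city))
print([[names[i] for i in order] for order in robinson_orders(D_city)])
```

The search is factorial in the number of objects, so it is only meant for small examples such as the three animals here.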


Hausdorff Distances for interval-valued data:

Hausdorff
Euclidean Hausdorff
Normalized Euclidean Hausdorff
Span Normalized Euclidean Hausdorff

(Important for Divisive Clustering methodology)

Definition 7.20: The Hausdorff distance between two interval-valued objects $\omega_{u_1}$ and $\omega_{u_2}$, with $\xi_{uj} = [a_{uj}, b_{uj}]$, j = 1, ..., p, u = 1, ..., m, for $Y_j$, is

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = \text{Max}[|a_{u_1 j} - a_{u_2 j}|, |b_{u_1 j} - b_{u_2 j}|]. \tag{7.31}$$

Definition 7.21: The Euclidean Hausdorff distance between two interval-valued objects $\omega_{u_1}$ and $\omega_{u_2}$, with $\xi_{uj} = [a_{uj}, b_{uj}]$, is

$$d(\omega_{u_1}, \omega_{u_2}) = \Big( \sum_{j=1}^{p} [\phi_j(\omega_{u_1}, \omega_{u_2})]^2 \Big)^{1/2}. \tag{7.32}$$

Definition 7.22: The Normalized Euclidean Hausdorff distance between two interval-valued objects $\omega_{u_1}$ and $\omega_{u_2}$, with $\xi_{uj} = [a_{uj}, b_{uj}]$, is

$$d(\omega_{u_1}, \omega_{u_2}) = \Big( \sum_{j=1}^{p} [\phi_j(\omega_{u_1}, \omega_{u_2}) / H_j]^2 \Big)^{1/2} \tag{7.33}$$

$$H_j^2 = \frac{1}{2m^2} \sum_{u_1=1}^{m} \sum_{u_2=1}^{m} [\phi_j(\omega_{u_1}, \omega_{u_2})]^2 \tag{7.34}$$

The Normalized Euclidean Hausdorff distance is also called a Dispersion Normalization.

If the data are classical, then this Normalized Euclidean distance is equivalent to a Euclidean distance on $R^2$, with $H_j$ corresponding to the standard deviation of $Y_j$.

Definition 7.23: The Span Normalized Euclidean Hausdorff distance between two interval-valued objects $\omega_{u_1}$ and $\omega_{u_2}$, with $\xi_{uj} = [a_{uj}, b_{uj}]$, is

$$d(\omega_{u_1}, \omega_{u_2}) = \Big( \sum_{j=1}^{p} [\phi_j(\omega_{u_1}, \omega_{u_2}) / |Y_j|]^2 \Big)^{1/2} \tag{7.35}$$

where from (7.26) the span is $|Y_j| = \max_u(b_{uj}) - \min_u(a_{uj})$.

This Span Normalization is also called a maximum deviation distance.

ωu   Animal    Y1 Height         Y2 Weight
ω1   Horse M   [120.0, 180.0]    [222.2, 354.0]
ω2   Horse F   [158.0, 160.0]    [322.0, 355.0]
ω3   Bear M    [175.0, 185.0]    [117.2, 152.0]

Hausdorff distance:

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = \text{Max}[|a_{u_1 j} - a_{u_2 j}|, |b_{u_1 j} - b_{u_2 j}|] \tag{7.31}$$

For (HorseF, BearM) and Y1, we have

φ1(HorseF, BearM) = Max[|158 − 175|, |160 − 185|] = Max[17, 25] = 25

For (HorseF, BearM) and Y2, we have

φ2(HorseF, BearM) = Max[|322 − 117.2|, |355 − 152|] = Max[204.8, 203] = 204.8

Complete set of Hausdorff distances (first 3 animals):

                     φj(ωu1, ωu2)
(ωu1, ωu2)           j = 1    j = 2
(HorseM, HorseF)     38       99.8
(HorseM, BearM)      55       202.0
(HorseF, BearM)      25       204.8
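
The per-variable Hausdorff distance (7.31) and its Euclidean combination (7.32) can be sketched in a few lines. Intervals are assumed to be `(a, b)` tuples, and the function names are illustrative.

```python
from math import sqrt


def hausdorff(x, y):
    """phi_j = max(|a1 - a2|, |b1 - b2|) for intervals x = (a1, b1), y = (a2, b2)."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))


def euclidean_hausdorff(obj1, obj2):
    """Square root of the summed squared per-variable Hausdorff distances (7.32)."""
    return sqrt(sum(hausdorff(x, y) ** 2 for x, y in zip(obj1, obj2)))


horse_f = [(158.0, 160.0), (322.0, 355.0)]   # (Height, Weight)
bear_m = [(175.0, 185.0), (117.2, 152.0)]

print([hausdorff(x, y) for x, y in zip(horse_f, bear_m)])
print(round(euclidean_hausdorff(horse_f, bear_m), 3))
```

Because the distance compares lower endpoints with lower endpoints and upper with upper, it does not need the join or meet of the intervals.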

Complete set of Hausdorff distances (first 3 animals):

                     φj(ωu1, ωu2)       Euclidean      Normalized Euclidean
(ωu1, ωu2)           j = 1    j = 2     d(ωu1, ωu2)    d^n(ωu1, ωu2)
(HorseM, HorseF)     38       99.8      106.790        2.653
(HorseM, BearM)      55       202.0     209.354        4.314
(HorseF, BearM)      25       204.8     206.320        3.217

Hausdorff distance:

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = \text{Max}[|a_{u_1 j} - a_{u_2 j}|, |b_{u_1 j} - b_{u_2 j}|] \tag{7.31}$$

Euclidean Hausdorff distance:

$$d(\omega_{u_1}, \omega_{u_2}) = \Big( \sum_{j=1}^{p} [\phi_j(\omega_{u_1}, \omega_{u_2})]^2 \Big)^{1/2} \tag{7.32}$$

Normalized Euclidean Hausdorff distance:

$$d^n(\omega_{u_1}, \omega_{u_2}) = \Big( \sum_{j=1}^{p} [\phi_j(\omega_{u_1}, \omega_{u_2}) / H_j]^2 \Big)^{1/2}, \tag{7.33}$$

$$H_j^2 = \frac{1}{2m^2} \sum_{u_1=1}^{m} \sum_{u_2=1}^{m} [\phi_j(\omega_{u_1}, \omega_{u_2})]^2 \tag{7.34}$$

$$H_1^2 = (1/[2 \times 3^2])[38^2 + 55^2 + 25^2] = 283; \quad H_1 = 16.823$$

$$H_2^2 = (1/[2 \times 3^2])[99.8^2 + 202^2 + 204.8^2] = 5150.39; \quad H_2 = 71.766$$

For (HorseF, BearM), we have

$$d^n(\text{HorseF}, \text{BearM}) = [(25/16.823)^2 + (204.8/71.766)^2]^{1/2} = 3.217$$
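
The dispersion normalization can be sketched as below. One assumption is made to match the arithmetic shown for H1 and H2: each unordered pair of animals is counted once in the sum, whereas reading the double sum in (7.34) over all ordered pairs would double the value of Hj². All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

animals = {
    "HorseM": [(120.0, 180.0), (222.2, 354.0)],
    "HorseF": [(158.0, 160.0), (322.0, 355.0)],
    "BearM": [(175.0, 185.0), (117.2, 152.0)],
}


def hausdorff(x, y):
    """Per-variable Hausdorff distance between two (a, b) intervals."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))


def dispersion_squared(data, j):
    """H_j^2, assuming each unordered pair of objects is counted once."""
    m = len(data)
    total = sum(hausdorff(a[j], b[j]) ** 2 for a, b in combinations(data.values(), 2))
    return total / (2 * m ** 2)


def normalized_euclidean_hausdorff(data, u1, u2):
    """d^n: per-variable Hausdorff distances scaled by H_j, then combined (7.33)."""
    p = len(data[u1])
    return sqrt(sum(hausdorff(data[u1][j], data[u2][j]) ** 2 / dispersion_squared(data, j)
                    for j in range(p)))


print(round(dispersion_squared(animals, 0), 2))
print(round(normalized_euclidean_hausdorff(animals, "HorseF", "BearM"), 3))
```

Recomputing Hj² inside every call keeps the sketch short; for larger data sets the dispersions would normally be computed once and reused.
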

Set of Span/Normalized/Euclidean Hausdorff distances (Veterinary Clinic Data):

                     φj(ωu1, ωu2)       Euclidean      Normalized Euclidean   Span Normalized Euclidean
(ωu1, ωu2)           j = 1    j = 2     d(ωu1, ωu2)    d^n(ωu1, ωu2)          d^s(ωu1, ωu2)
(HorseM, HorseF)     38       99.8      106.790        2.653                  0.720
(HorseM, BearM)      55       202.0     209.354        4.314                  1.199
(HorseF, BearM)      25       204.8     206.320        3.217                  0.943

Hausdorff distance:

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = \text{Max}[|a_{u_1 j} - a_{u_2 j}|, |b_{u_1 j} - b_{u_2 j}|] \tag{7.31}$$

Euclidean Hausdorff distance matrix D1, Normalized Euclidean Hausdorff distance matrix D2, and Span Normalized Euclidean Hausdorff distance matrix D3:

$$D_1 = \begin{pmatrix} 0 & 106.790 & 209.354 \\ \cdot & 0 & 206.320 \\ \cdot & \cdot & 0 \end{pmatrix}, \quad D_2 = \begin{pmatrix} 0 & 2.653 & 4.314 \\ \cdot & 0 & 3.217 \\ \cdot & \cdot & 0 \end{pmatrix}, \quad D_3 = \begin{pmatrix} 0 & 0.720 & 1.199 \\ \cdot & 0 & 0.943 \\ \cdot & \cdot & 0 \end{pmatrix}$$

ALL Robinson matrices

Definition 7.17: The Gowda-Diday dissimilarity measure between two interval-valued observations $\xi(\omega_{u_1})$ and $\xi(\omega_{u_2})$ of the form $\xi(\omega_u) = [a_{uj}, b_{uj}]$ is

$$D(\omega_1, \omega_2) = \sum_{j=1}^{p} [D_{j1}(\omega_1, \omega_2) + D_{j2}(\omega_1, \omega_2) + D_{j3}(\omega_1, \omega_2)]$$

where, for j = 1, ..., p,

$$D_{j1}(\omega_1, \omega_2) = \big| |b_{u_1 j} - a_{u_1 j}| - |b_{u_2 j} - a_{u_2 j}| \big| / k_j, \tag{7.23}$$

$$D_{j2}(\omega_1, \omega_2) = (|b_{u_1 j} - a_{u_1 j}| + |b_{u_2 j} - a_{u_2 j}| - 2I_j) / k_j, \tag{7.24}$$

$$D_{j3}(\omega_1, \omega_2) = |a_{u_1 j} - a_{u_2 j}| / |Y_j| \tag{7.25}$$

where

$$k_j = |\text{Max}(b_{u_1 j}, b_{u_2 j}) - \text{Min}(a_{u_1 j}, a_{u_2 j})|,$$

$$I_j = |\text{Max}(a_{u_1 j}, a_{u_2 j}) - \text{Min}(b_{u_1 j}, b_{u_2 j})|,$$

$$|Y_j| = \max_u(b_{uj}) - \min_u(a_{uj}).$$

Here, $k_j$ is the length of the entire distance spanned by $\omega_{u_1}$ and $\omega_{u_2}$, $I_j$ is the length of the intersection of the intervals $[a_{u_1 j}, b_{u_1 j}]$ and $[a_{u_2 j}, b_{u_2 j}]$ (so $I_j = 0$ when the intervals do not overlap), and $|Y_j|$ is the total length in Y covered by observed values of $Y_j$.
So, $D_{j1}(\omega_1, \omega_2)$ is the span component, $D_{j2}(\omega_1, \omega_2)$ is the relative content component, and $D_{j3}(\omega_1, \omega_2)$ is the relative position component of the distance measure.

Gowda-Diday distances:

                     Y1 = Height                   Y2 = Weight                   (Y1, Y2)
(ωu1, ωu2)           D11    D12    D13    D1       D21    D22    D23    D2       D
(HorseM, HorseF)     .967   .967   .584   2.518    .744   .759   .420   1.922    4.440
(HorseM, BearM)      .769   .923   .846   2.538    .409   .703   .442   1.554    4.093
(HorseF, BearM)      .296   .444   .262   1.002    .008   .285   .861   1.154    2.156

$$D = \begin{pmatrix} 0 & 4.440 & 4.093 \\ \cdot & 0 & 2.156 \\ \cdot & \cdot & 0 \end{pmatrix}$$
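
The three components of the interval-valued Gowda-Diday measure can be sketched as follows. The assumptions here are that intervals are `(a, b)` tuples, that the intersection length is taken as 0 for disjoint intervals, and that the span |Yj| is computed from the three animals only; `gowda_diday_interval` and `spans` are illustrative names.

```python
animals = {
    "HorseM": [(120.0, 180.0), (222.2, 354.0)],
    "HorseF": [(158.0, 160.0), (322.0, 355.0)],
    "BearM": [(175.0, 185.0), (117.2, 152.0)],
}

# |Y_j| = max_u b_uj - min_u a_uj over the observed animals.
spans = [max(obs[j][1] for obs in animals.values()) - min(obs[j][0] for obs in animals.values())
         for j in range(2)]


def gowda_diday_interval(x, y, spans):
    """Sum over variables of span, content and position components."""
    total = 0.0
    for (a1, b1), (a2, b2), span in zip(x, y, spans):
        k = max(b1, b2) - min(a1, a2)                    # length spanned by both
        inter = max(0.0, min(b1, b2) - max(a1, a2))      # intersection length
        d_span = abs((b1 - a1) - (b2 - a2)) / k
        d_content = ((b1 - a1) + (b2 - a2) - 2 * inter) / k
        d_position = abs(a1 - a2) / span
        total += d_span + d_content + d_position
    return total


print(round(gowda_diday_interval(animals["HorseF"], animals["BearM"], spans), 3))
```

The sketch divides by `k`, so it assumes the two intervals are not both the same single point.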

Clustering

Use the distance matrices, D, calculated from symbolic data in the same way as the distance matrices, D, calculated from classical data are used to construct
partitions
hierarchies
pyramids

E.g., Veterinary dataset:

Denote the rth partition by $P_r = (C_1, \dots, C_r)$.

P1 = C1: E ≡ C1 = {1, ..., 10} = {HorseM, HorseF, BearM, DeerM, DeerF, DogF, RabbitM, RabbitF, CatM, CatF}
P4 = (C1, ..., C4): C1 = {1, 2}, C2 = {3}, C3 = {4, 5, 6}, C4 = {7, 8, 9, 10}
P5 = (C1, ..., C5): C1 = {1, 2}, C2 = {3}, C3 = {4, 5, 6}, C4 = {7, 8}, C5 = {9, 10}
OR, P5' = (C1, ..., C5): C1 = {1, 2}, C2 = {3}, C3 = {4, 5, 6}, C4 = {7, 8}, C5 = {8, 9, 10}

P5 is a hierarchy; and P5' is a pyramid, since its clusters C4 and C5 overlap (both contain 8).
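
The difference between the two five-cluster examples is whether the clusters are pairwise disjoint. A minimal sketch of that check, with hypothetical helper names `covers` and `is_partition`:

```python
def covers(clusters, universe):
    """True if the union of the clusters is the whole set E."""
    return set().union(*clusters) == set(universe)


def is_partition(clusters, universe):
    """True if the clusters cover E and are pairwise disjoint."""
    seen = set()
    for cluster in clusters:
        if seen & cluster:      # overlap with an earlier cluster
            return False
        seen |= cluster
    return seen == set(universe)


E = range(1, 11)
P5 = [{1, 2}, {3}, {4, 5, 6}, {7, 8}, {9, 10}]
P5_prime = [{1, 2}, {3}, {4, 5, 6}, {7, 8}, {8, 9, 10}]

print(is_partition(P5, E), covers(P5, E))
print(is_partition(P5_prime, E), covers(P5_prime, E))
```

Both collections cover E, but only the first is a partition; the second has overlapping clusters, which is what a pyramid allows and a hierarchy does not.
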

Veterinary dataset:
{HorseM, HorseF, BearM, DeerM, DeerF, DogF, RabbitM, RabbitF, CatM, CatF}

[Figure: two panels for the veterinary dataset, titled "Hierarchy" and "Pyramid".]