Distance Metrics - Billard
Dissimilarity/Similarity/Distance Measures
(for Clustering)
Lynne Billard
Department of Statistics
University of Georgia
Mathematically, we use distance measures to reproduce what we see visually in the veterinary data.
Let the dissimilarity measure between objects a and b be d(a, b), and the
corresponding similarity measure be s(a, b).
E.g., if d(a, b) ≤ max{d(a, c), d(b, c)} for every triple a, b, c, the measure is ultrametric:

$$D = \begin{pmatrix} 0 & 2 & 3 \\ 2 & 0 & 3 \\ 3 & 3 & 0 \end{pmatrix}$$

However, if d(a, b) > max{d(a, c), d(b, c)} for some triple, the measure is NOT ultrametric:

$$D = \begin{pmatrix} 0 & 2 & 1.5 \\ \cdot & 0 & 1.2 \\ \cdot & \cdot & 0 \end{pmatrix}$$
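The ultrametric condition can be checked mechanically by testing every ordered triple. The following is a minimal sketch; the function name `is_ultrametric` is an illustrative choice, and the second matrix is filled in symmetrically from its upper triangle.

```python
from itertools import permutations

def is_ultrametric(D, tol=1e-12):
    """Return True if d(a, b) <= max(d(a, c), d(b, c)) holds for every triple."""
    n = len(D)
    return all(
        D[a][b] <= max(D[a][c], D[b][c]) + tol
        for a, b, c in permutations(range(n), 3)
    )

# First example matrix from the slide.
D_ultra = [[0, 2, 3],
           [2, 0, 3],
           [3, 3, 0]]

# Second example matrix, completed symmetrically from the upper triangle.
D_not = [[0, 2, 1.5],
         [2, 0, 1.2],
         [1.5, 1.2, 0]]

print(is_ultrametric(D_ultra))  # True
print(is_ultrametric(D_not))    # False
```

In the second matrix the check fails because d(a, b) = 2 exceeds max{1.5, 1.2}.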
When A and B are multi-valued objects, the join $A_j \oplus B_j$ is the set of values in $A_j$, $B_j$ or both. When A and B are interval-valued objects with $A_j = [a_{jA}, b_{jA}]$ and $B_j = [a_{jB}, b_{jB}]$, then

$$A_j \oplus B_j = [\min(a_{jA}, a_{jB}), \max(b_{jA}, b_{jB})] \quad (7.2)$$

Definition 7.7: The Cartesian meet $A \otimes B = (A_1 \otimes B_1, \ldots, A_p \otimes B_p)$ between two sets A and B is their componentwise intersection, where $A_j \otimes B_j$ = "$A_j \cap B_j$". When A and B are multi-valued objects, $A_j \otimes B_j$ is the list of possible values from $Y_j$ common to both. When A and B are interval-valued objects forming overlapping intervals on $Y_j$,

$$A_j \otimes B_j = [\max(a_{jA}, a_{jB}), \min(b_{jA}, b_{jB})] \quad (7.3)$$

and when $A_j \cap B_j = \emptyset$, then $A_j \otimes B_j = 0$.
Then, the join is
$A \oplus B$ = ({blue, gray, pink, green, white}, {shirt, slacks, dress}, {small, medium, large}),
and the meet is
$A \otimes B$ = ({blue}, {shirt, dress}, {small}).
Then the join is
$A \oplus B$ = ([6, 12], [16, 24]),
and the meet is
$A \otimes B$ = ([8, 10], [18, 22]).
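The interval join and meet of (7.2) and (7.3) can be sketched as follows. The slides do not show the input objects for this example, so the intervals `A` and `B` below are hypothetical values chosen only because they reproduce the stated join and meet. The function names are illustrative.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]

def join_interval(a: Interval, b: Interval) -> Interval:
    """Join (7.2): smallest interval covering both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def meet_interval(a: Interval, b: Interval) -> Optional[Interval]:
    """Meet (7.3): the overlap of a and b, or None when they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def cartesian_join(A: List[Interval], B: List[Interval]) -> List[Interval]:
    """Componentwise join over the variables Y_1, ..., Y_p."""
    return [join_interval(a, b) for a, b in zip(A, B)]

def cartesian_meet(A: List[Interval], B: List[Interval]) -> List[Optional[Interval]]:
    """Componentwise meet over the variables Y_1, ..., Y_p."""
    return [meet_interval(a, b) for a, b in zip(A, B)]

# Hypothetical inputs (not given in the source), chosen to match the slide's result.
A = [(6, 10), (18, 24)]
B = [(8, 12), (16, 22)]

print(cartesian_join(A, B))  # [(6, 12), (16, 24)]
print(cartesian_meet(A, B))  # [(8, 10), (18, 22)]
```

For multi-valued objects stored as Python sets, the same two operations are the set union `a | b` and intersection `a & b`.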
Multi-valued Variables:
Write observations ξ(ωu ) as
For each $Y_j$, $j = 1, \ldots, p$, the distance components are

$$D_{1j}(\omega_1, \omega_2) = \frac{|k_{j1} - k_{j2}|}{k_j}, \qquad D_{2j}(\omega_1, \omega_2) = \frac{k_{j1} + k_{j2} - 2k_j^*}{k_j} \quad (7.14\text{-}7.15)$$

where $k_j$ is the number of values from $Y_j$ in the join and $k_j^*$ is the number in the meet of $\xi(\omega_1)$ and $\xi(\omega_2)$, respectively, and $k_{ju}$ is the number of values from $Y_j$ in $\omega_u$.
For Y1 : D11 (ω1 , ω3 ) = (|2 − 3|)/3 = 1/3; D21 (ω1 , ω3 ) = (2 + 3 − 2 × 2)/3 = 1/3.
For Y2 : D12 (ω1 , ω3 ) = (|2 − 1|)/2 = 1/2; D22 (ω1 , ω3 ) = (2 + 1 − 2 × 1)/2 = 1/2.
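The two worked lines above can be reproduced with exact fractions. The underlying observations are not shown on the slides, so the sets below are hypothetical placeholders chosen only to have the same counts (2 and 3 values on Y1 with 2 in common; 2 and 1 values on Y2 with 1 in common). Summing the four components gives 5/3, which appears to be how the (ω1, ω3) entry of the non-normalized distance matrix is formed.

```python
from fractions import Fraction

def multi_valued_components(x: set, y: set):
    """Return (D1j, D2j) of (7.14)-(7.15) for one multi-valued variable."""
    k_join = len(x | y)   # k_j: number of values in the join
    k_meet = len(x & y)   # k_j*: number of values in the meet
    d1 = Fraction(abs(len(x) - len(y)), k_join)
    d2 = Fraction(len(x) + len(y) - 2 * k_meet, k_join)
    return d1, d2

# Hypothetical value sets (not from the source); only the counts matter here.
omega1 = [{"a", "b"}, {"x", "y"}]
omega3 = [{"a", "b", "c"}, {"x"}]

components = [multi_valued_components(x, y) for x, y in zip(omega1, omega3)]
total = sum(d1 + d2 for d1, d2 in components)

for j, (d1, d2) in enumerate(components, start=1):
    print(f"Y{j}: D1 = {d1}, D2 = {d2}")
print("sum over j:", total)  # 5/3
```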
Distance matrix is:

$$D = \begin{pmatrix} 0 & 2 & 5/3 & 2/3 \\ \cdot & 0 & 7/3 & 5/3 \\ \cdot & \cdot & 0 & 1 \\ \cdot & \cdot & \cdot & 0 \end{pmatrix}$$
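Normalized:

$$D' = \begin{pmatrix} 0 & 5/6 & 13/18 & 2/9 \\ \cdot & 0 & 17/18 & 13/18 \\ \cdot & \cdot & 0 & 1/2 \\ \cdot & \cdot & \cdot & 0 \end{pmatrix}$$

Non-normalized:

$$D = \begin{pmatrix} 0 & 2 & 5/3 & 2/3 \\ \cdot & 0 & 7/3 & 5/3 \\ \cdot & \cdot & 0 & 1 \\ \cdot & \cdot & \cdot & 0 \end{pmatrix}$$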
Interval-valued data -
ξu ≡ ξ(ωu ) = ([auj , buj ], j = 1, . . . , p), u = 1, . . . , m
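$$\phi_j(\omega_{u_1}, \omega_{u_2}) = |\omega_{u_1 j} \oplus \omega_{u_2 j}| - |\omega_{u_1 j} \otimes \omega_{u_2 j}| + \gamma\,(2|\omega_{u_1 j} \otimes \omega_{u_2 j}| - |\omega_{u_1 j}| - |\omega_{u_2 j}|) \quad (7.27)$$

$$A_j \oplus B_j = [\min(a_{jA}, a_{jB}), \max(b_{jA}, b_{jB})] \quad (7.2)$$

$$A_j \otimes B_j = [\max(a_{jA}, a_{jB}), \min(b_{jA}, b_{jB})] \quad (7.3)$$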
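$$\begin{aligned}
\phi_1(\omega_{u_1}, \omega_{u_2}) &= |[\min(158, 175), \max(160, 185)]| - |[\max(158, 175), \min(160, 185)]| \\
&\quad + \gamma\,(2|[\max(158, 175), \min(160, 185)]| - |160 - 158| - |185 - 175|) \\
&= |[158, 185]| - |[175, 160]| + \gamma\,(2 \times 0 - 2 - 10) \\
&= 27 - 0 + \gamma\,(-12) = 27 - 12\gamma
\end{aligned}$$

Since 175 > 160, the two intervals do not overlap, so the meet is empty and its length is 0.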
q = 1 → City Block distance; q = 2 → Euclidean distance.
The normalized distance of order q between two objects $\omega_{u_1}$ and $\omega_{u_2}$ is

$$d_q(\omega_{u_1}, \omega_{u_2}) = \left(\frac{1}{p} \sum_{j=1}^{p} w_j^* [\phi_j(\omega_{u_1}, \omega_{u_2})]^q\right)^{1/q} \quad (7.30)$$
where φj (ωu1 , ωu2 ) is the Ichino-Yaguchi distance (of Definition 7.18, eqn(7.27)) and
wj∗ is an appropriate weight function associated with Yj , j = 1, . . . , p.
For q = 2:

$$d_2(\omega_{u_1}, \omega_{u_2}) = \left(\frac{1}{p} \sum_{j=1}^{p} w_j^* [\phi_j(\omega_{u_1}, \omega_{u_2})]^2\right)^{1/2},$$

with the weight $w_j^*$ based on the span $|Y_j|$.
Unweighted (i.e., $w_j^* = 1$), the normalized Euclidean distance for (HorseF, BearM) is

$$d_2(\omega_{u_1}, \omega_{u_2}) = \left(\frac{1}{p} \sum_{j=1}^{p} w_j^* [\phi_j(\text{HorseF}, \text{BearM})]^2\right)^{1/2}$$

Weighted (i.e., $w_j^*$ based on the span $|Y_j|$), the normalized Euclidean distance for (HorseF, BearM) is

$$d_2(\omega_{u_1}, \omega_{u_2}) = \left(\frac{1}{p} \sum_{j=1}^{p} w_j^* [\phi_j(\text{HorseF}, \text{BearM})]^2\right)^{1/2}$$
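Ichino-Yaguchi measures:

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = |\omega_{u_1 j} \oplus \omega_{u_2 j}| - |\omega_{u_1 j} \otimes \omega_{u_2 j}| + \gamma\,(2|\omega_{u_1 j} \otimes \omega_{u_2 j}| - |\omega_{u_1 j}| - |\omega_{u_2 j}|)$$

City Block:

$$d_1(\omega_{u_1}, \omega_{u_2}) = \frac{1}{p} \sum_{j=1}^{p} c_j w_j^* [\phi_j(\omega_{u_1}, \omega_{u_2})]$$

$$D = \begin{pmatrix} 0 & 39.70 & 112.45 \\ \cdot & 0 & 91.75 \\ \cdot & \cdot & 0 \end{pmatrix} \qquad
D = \begin{pmatrix} 0 & 0.33 & 0.59 \\ \cdot & 0 & 0.55 \\ \cdot & \cdot & 0 \end{pmatrix}$$

$$D = \begin{pmatrix} 0 & 41.12 & 144.94 \\ \cdot & 0 & 110.59 \\ \cdot & \cdot & 0 \end{pmatrix} \qquad
D = \begin{pmatrix} 0 & 0.35 & 0.65 \\ \cdot & 0 & 0.56 \\ \cdot & \cdot & 0 \end{pmatrix}$$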
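The Ichino-Yaguchi computations can be sketched in a few lines. Several things here are assumptions rather than statements from the slides: only the Y1 intervals [158, 160] and [175, 185] appear in the text, and the remaining interval endpoints are values chosen because they reproduce the tabulated distances; γ = 0.5; the spans |Yj| are taken over these three animals only; and the span weight is applied as φj/|Yj| before raising to the power q. Under those assumptions the four matrices above are recovered to two decimals.

```python
from itertools import combinations

# Interval data for (Y1, Y2). Assumed values: only the HorseF and BearM Y1
# intervals appear in the text; the others are chosen to match the tables.
DATA = {
    "HorseM": [(120.0, 180.0), (222.2, 354.0)],
    "HorseF": [(158.0, 160.0), (322.0, 355.0)],
    "BearM":  [(175.0, 185.0), (117.2, 152.0)],
}

def length(iv):
    """Interval length; an empty meet (lower end above upper end) has length 0."""
    return max(0.0, iv[1] - iv[0])

def ichino_yaguchi(a, b, gamma=0.5):
    """phi_j of (7.27) for two intervals a and b."""
    join = (min(a[0], b[0]), max(a[1], b[1]))
    meet = (max(a[0], b[0]), min(a[1], b[1]))
    return (length(join) - length(meet)
            + gamma * (2 * length(meet) - length(a) - length(b)))

def spans(data):
    """|Y_j| = max_u b_uj - min_u a_uj over the objects in data."""
    rows = list(data.values())
    return [max(r[j][1] for r in rows) - min(r[j][0] for r in rows)
            for j in range(len(rows[0]))]

def distance(x, y, q, weights, gamma=0.5):
    """Normalized distance of order q; the weight scales phi_j (assumption)."""
    p = len(x)
    total = sum((w * ichino_yaguchi(a, b, gamma)) ** q
                for a, b, w in zip(x, y, weights))
    return (total / p) ** (1.0 / q)

unweighted = [1.0, 1.0]
span_weights = [1.0 / s for s in spans(DATA)]

for u, v in combinations(DATA, 2):
    # Order: city block unweighted, city block weighted,
    #        Euclidean unweighted, Euclidean weighted.
    row = [distance(DATA[u], DATA[v], q, w)
           for q in (1, 2) for w in (unweighted, span_weights)]
    print(u, v, [round(d, 2) for d in row])
```

With these inputs, (HorseM, HorseF) corresponds to the entries 39.70, 0.33, 41.12, 0.35; (HorseF, BearM) to 112.45, 0.59, 144.94, 0.65; and (HorseM, BearM) to 91.75, 0.55, 110.59, 0.56.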
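Definition 7.20: The Hausdorff distance between two interval-valued objects $\omega_{u_1}$ and $\omega_{u_2}$, with $\xi_{uj} = [a_{uj}, b_{uj}]$, $j = 1, \ldots, p$, $u = 1, \ldots, m$, for $Y_j$, is

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = \max[\,|a_{u_1 j} - a_{u_2 j}|,\ |b_{u_1 j} - b_{u_2 j}|\,] \quad (7.31)$$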
Definition 7.21: The Euclidean Hausdorff distance between two interval-valued objects $\omega_{u_1}$ and $\omega_{u_2}$, with $\xi_{uj} = [a_{uj}, b_{uj}]$, is

$$d(\omega_{u_1}, \omega_{u_2}) = \left(\sum_{j=1}^{p} [\phi_j(\omega_{u_1}, \omega_{u_2})]^2\right)^{1/2} \quad (7.32)$$
If the data are classical, then this Normalized Euclidean distance is equivalent to a Euclidean distance on $\mathbb{R}^2$, with $H_j$ corresponding to the standard deviation of $Y_j$.
Definition 7.23: The Span Normalized Euclidean Hausdorff distance between two interval-valued objects $\omega_{u_1}$ and $\omega_{u_2}$, with $\xi_{uj} = [a_{uj}, b_{uj}]$, is

$$d(\omega_{u_1}, \omega_{u_2}) = \left(\sum_{j=1}^{p} \left[\frac{\phi_j(\omega_{u_1}, \omega_{u_2})}{|Y_j|}\right]^2\right)^{1/2} \quad (7.35)$$

where from (7.26) the span is $|Y_j| = \max_u(b_{uj}) - \min_u(a_{uj})$.
                      φj(ωu1, ωu2)        Euclidean       Normalized Euclidean
(ωu1, ωu2)            j = 1    j = 2      d(ωu1, ωu2)     d^n(ωu1, ωu2)
(HorseM, HorseF)      38       99.8       106.790         2.653
(HorseM, BearM)       55       202.0      209.354         4.314
(HorseF, BearM)       25       204.8      206.320         3.217
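Hausdorff distance:

$$\phi_j(\omega_{u_1}, \omega_{u_2}) = \max[\,|a_{u_1 j} - a_{u_2 j}|,\ |b_{u_1 j} - b_{u_2 j}|\,] \quad (7.31)$$

Euclidean Hausdorff distance:

$$d(\omega_{u_1}, \omega_{u_2}) = \left(\sum_{j=1}^{p} [\phi_j(\omega_{u_1}, \omega_{u_2})]^2\right)^{1/2} \quad (7.32)$$

Normalized Euclidean Hausdorff distance:

$$d^n(\omega_{u_1}, \omega_{u_2}) = \left(\sum_{j=1}^{p} \left[\frac{\phi_j(\omega_{u_1}, \omega_{u_2})}{H_j}\right]^2\right)^{1/2}, \quad (7.33)$$

$$H_j^2 = \frac{1}{2m^2} \sum_{u_1=1}^{m} \sum_{u_2=1}^{m} [\phi_j(\omega_{u_1}, \omega_{u_2})]^2 \quad (7.34)$$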
                      φj(ωu1, ωu2)        Euclidean       Normalized Euclidean    Span Normalized Euclidean
(ωu1, ωu2)            j = 1    j = 2      d(ωu1, ωu2)     d^n(ωu1, ωu2)           d^s(ωu1, ωu2)
(HorseM, HorseF)      38       99.8       106.790         2.653                   0.720
(HorseM, BearM)       55       202.0      209.354         4.314                   1.199
(HorseF, BearM)       25       204.8      206.320         3.217                   0.943
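$$D_1 = \begin{pmatrix} 0 & 106.790 & 209.354 \\ \cdot & 0 & 206.320 \\ \cdot & \cdot & 0 \end{pmatrix} \qquad
D_2 = \begin{pmatrix} 0 & 2.653 & 4.314 \\ \cdot & 0 & 3.217 \\ \cdot & \cdot & 0 \end{pmatrix} \qquad
D_3 = \begin{pmatrix} 0 & 0.720 & 1.199 \\ \cdot & 0 & 0.943 \\ \cdot & \cdot & 0 \end{pmatrix}$$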
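The three Hausdorff-based distances can be reproduced as follows. The interval endpoints are assumed values (only [158, 160] and [175, 185] appear in the text) chosen because they give the tabulated φj. One further assumption matters: the tabulated d^n values are reproduced when the sum in (7.34) counts each unordered pair of objects once, which is the convention used in this sketch. Counting every ordered pair would make H_j larger and d^n smaller by a factor of √2. The spans |Yj| are taken over these three animals only.

```python
from itertools import combinations
from math import sqrt

# Assumed interval data for (Y1, Y2); chosen to match the tabulated phi_j.
DATA = {
    "HorseM": [(120.0, 180.0), (222.2, 354.0)],
    "HorseF": [(158.0, 160.0), (322.0, 355.0)],
    "BearM":  [(175.0, 185.0), (117.2, 152.0)],
}

def hausdorff(a, b):
    """phi_j of (7.31) for two intervals."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

names = list(DATA)
m, p = len(names), 2
pairs = list(combinations(names, 2))

# phi[(u, v)][j] for every unordered pair of objects.
phi = {(u, v): [hausdorff(DATA[u][j], DATA[v][j]) for j in range(p)]
       for u, v in pairs}

# H_j per (7.34), summing over unordered pairs (assumption, see lead-in).
H = [sqrt(sum(phi[pq][j] ** 2 for pq in pairs) / (2 * m ** 2)) for j in range(p)]

# Span |Y_j| per (7.26), over the three objects above.
span = [max(DATA[u][j][1] for u in names) - min(DATA[u][j][0] for u in names)
        for j in range(p)]

euclid = {pq: sqrt(sum(f ** 2 for f in phi[pq])) for pq in pairs}                      # (7.32)
normalized = {pq: sqrt(sum((f / h) ** 2 for f, h in zip(phi[pq], H))) for pq in pairs}  # (7.33)
span_norm = {pq: sqrt(sum((f / s) ** 2 for f, s in zip(phi[pq], span))) for pq in pairs}  # (7.35)

for pq in pairs:
    print(pq, round(euclid[pq], 3), round(normalized[pq], 3), round(span_norm[pq], 3))
```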
Here, for $j = 1, \ldots, p$, $k_j$ is the length of the entire distance spanned by $\omega_{u_1}$ and $\omega_{u_2}$, $I_j$ is the length of the intersection of the intervals $[a_{u_1 j}, b_{u_1 j}]$ and $[a_{u_2 j}, b_{u_2 j}]$, and $|Y_j|$ is the total length in $Y$ covered by observed values of $Y_j$.
So, $D_{j1}(\omega_1, \omega_2)$ is the span component, $D_{j2}(\omega_1, \omega_2)$ is the relative content component, and $D_{j3}(\omega_1, \omega_2)$ is the relative position component of the distance measure.
Gowda-Diday distances:

$$D = \begin{pmatrix} 0 & 4.440 & 4.093 \\ \cdot & 0 & 2.156 \\ \cdot & \cdot & 0 \end{pmatrix}$$
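The component formulas are not shown on the slides, so the sketch below assumes the usual Gowda-Diday forms for intervals: span component |(b1 − a1) − (b2 − a2)|/kj, relative content component ((b1 − a1) + (b2 − a2) − 2Ij)/kj, and relative position component |a1 − a2|/|Yj|, summed over j. The interval endpoints are assumed values, and |Yj| is taken over these three animals only. With these assumptions the matrix above is recovered to within rounding.

```python
from itertools import combinations

# Assumed interval data for (Y1, Y2); not listed in full in the source.
DATA = {
    "HorseM": [(120.0, 180.0), (222.2, 354.0)],
    "HorseF": [(158.0, 160.0), (322.0, 355.0)],
    "BearM":  [(175.0, 185.0), (117.2, 152.0)],
}

def gowda_diday_j(a, b, span):
    """Sum of the three components for one interval-valued variable (assumed forms)."""
    k = max(a[1], b[1]) - min(a[0], b[0])               # k_j: total length spanned
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # I_j: intersection length
    len_a, len_b = a[1] - a[0], b[1] - b[0]
    d_span = abs(len_a - len_b) / k
    d_content = (len_a + len_b - 2 * inter) / k
    d_position = abs(a[0] - b[0]) / span
    return d_span + d_content + d_position

def gowda_diday(x, y, spans):
    """Distance between two objects: sum of the per-variable contributions."""
    return sum(gowda_diday_j(a, b, s) for a, b, s in zip(x, y, spans))

names = list(DATA)
SPANS = [max(DATA[u][j][1] for u in names) - min(DATA[u][j][0] for u in names)
         for j in range(2)]

for u, v in combinations(names, 2):
    print(u, v, round(gowda_diday(DATA[u], DATA[v], SPANS), 3))
```

This sketch does not guard against k = 0, which would occur only if both intervals were the same single point.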
Clustering:
Use the distance matrices, D, calculated from symbolic data in the same way as distance matrices calculated from classical data are used, to construct
- partitions
- hierarchies
- pyramids
P1 = C1: E ≡ C1 = {1, ..., 10} = {HorseM, HorseF, BearM, DeerM, DeerF, DogF, RabbitM, RabbitF, CatM, CatF}
P4 = (C1, ..., C4): C1 = {1, 2}, C2 = {3}, C3 = {4, 5, 6}, C4 = {7, 8, 9, 10}
P5 = (C1, ..., C5): C1 = {1, 2}, C2 = {3}, C3 = {4, 5, 6}, C4 = {7, 8}, C5 = {9, 10}
Or, P5' = (C1, ..., C5): C1 = {1, 2}, C2 = {3}, C3 = {4, 5, 6}, C4 = {7, 8}, C5 = {8, 9, 10}
P5 is a hierarchy, and P5' is a pyramid.
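The difference between the two five-cluster structures is that the second one lets clusters overlap (object 8 sits in both C4 and C5), which a pyramid permits and a hierarchy does not. A minimal sketch of that check, with illustrative function names:

```python
def is_partition(clusters, universe):
    """True if the clusters are pairwise disjoint and together cover the universe."""
    seen = set()
    for c in clusters:
        if seen & c:          # overlap with an earlier cluster
            return False
        seen |= c
    return seen == universe

def refines(fine, coarse):
    """True if every cluster of `fine` lies inside some cluster of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)

E = set(range(1, 11))
P4 = [{1, 2}, {3}, {4, 5, 6}, {7, 8, 9, 10}]
P5 = [{1, 2}, {3}, {4, 5, 6}, {7, 8}, {9, 10}]
P5_overlapping = [{1, 2}, {3}, {4, 5, 6}, {7, 8}, {8, 9, 10}]

print(is_partition(P5, E))              # True: disjoint clusters
print(refines(P5, P4))                  # True: P5 is nested inside P4
print(is_partition(P5_overlapping, E))  # False: clusters overlap
print(P5_overlapping[3] & P5_overlapping[4])  # {8}
```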
[Figure: hierarchy and pyramid for the veterinary dataset {HorseM, HorseF, BearM, DeerM, DeerF, DogF, RabbitM, RabbitF, CatM, CatF}]