Quiz 3 Solution (No 1-4)

Worked solutions to four quiz problems. 1. Distance metrics between the points (1,2) and (3,4): the L1 distance is 4, the L2 distance is 2√2, and the L-infinity distance is 2. 2. Similarity measures between the sets {A,B,C} and {A,C,D,E}: the match-based similarity is 2 for p = 1 and √2 for p = 2, the cosine similarity is √3/3, and the Jaccard coefficient is 2/5. 3. Edit distances: 2 between ababcabc and babcbc, and 2 between cbacbacba and acbacbacb. 4. The normalized-cosine measure between two short sentences is 0.

Uploaded by irham

1. Compute the Lp-norm between (1,2) and (3,4) for p = 1, 2, ∞ (that is, the Manhattan distance, the Euclidean distance, and the infinity norm).
$$L_1 = \sum_{i=1}^{d} |x_i - y_i| = |x_1 - y_1| + |x_2 - y_2| = |1 - 3| + |2 - 4| = 2 + 2 = 4$$

$$L_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2} = \sqrt{(1 - 3)^2 + (2 - 4)^2} = \sqrt{4 + 4} = 2\sqrt{2}$$

$$L_\infty = \max_i |x_i - y_i| = \max(|1 - 3|, |2 - 4|) = 2$$
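
The three norms can be checked numerically. The following is a minimal sketch; the helper name `lp_distance` is hypothetical and not part of the original solution.

```python
def lp_distance(x, y, p):
    """Lp distance between two equal-length points; p may be float('inf')."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        # The infinity norm is the largest coordinate-wise difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = (1, 2), (3, 4)
l1 = lp_distance(x, y, 1)                # Manhattan distance
l2 = lp_distance(x, y, 2)                # Euclidean distance
linf = lp_distance(x, y, float("inf"))   # infinity norm
print(l1, l2, linf)
```

The printed values should agree with 4, 2√2 (about 2.828), and 2.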

2. Compute the match-based similarity, cosine similarity, and the Jaccard coefficient between the two sets {A,B,C} and {A,C,D,E}. If the measure only applies to numeric data, you can transform the data into numeric first.

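Match-based similarity:
Encode each set as a binary vector over the dimensions A, B, …, E, and discretize each dimension into 1 equidepth bucket of range [0,1]. Therefore m_i = 1 and n_i = 0 for every dimension.

| # | A | B | C | D | E |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 1 |

Matched dimensions of the two sets (proximity set) = {A,B,C,D,E}

$$PSelect(\bar{X}, \bar{Y}, k_d) = \left[ \sum_{i \in S(\bar{X}, \bar{Y}, k_d)} \left( 1 - \frac{|x_i - y_i|}{m_i - n_i} \right)^p \right]^{1/p}$$

The dimensions B, D, and E each contribute 1 − |1 − 0|/(1 − 0) = 0, so only the terms for A and C remain:

For p=1:
$$\left(1 - \frac{|1 - 1|}{1 - 0}\right) + \left(1 - \frac{|1 - 1|}{1 - 0}\right) = 2$$

For p=2:
$$\left[ \left(1 - \frac{|1 - 1|}{1 - 0}\right)^2 + \left(1 - \frac{|1 - 1|}{1 - 0}\right)^2 \right]^{1/2} = \sqrt{2}$$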
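
As a rough check of the match-based computation, the sketch below evaluates the same sum under the solution's own assumption of a single bucket [0,1] per dimension, so every dimension belongs to the proximity set. The function name `match_similarity` and its `lo`/`hi` parameters are hypothetical.

```python
def match_similarity(x, y, p, lo=0.0, hi=1.0):
    """Match-based similarity assuming one shared bucket [lo, hi] per dimension.

    With a single bucket, every dimension is in the proximity set, so the sum
    runs over all coordinates (an assumption taken from the worked solution).
    """
    total = 0.0
    for xi, yi in zip(x, y):
        total += (1 - abs(xi - yi) / (hi - lo)) ** p
    return total ** (1.0 / p)

# Binary encodings of {A,B,C} and {A,C,D,E} over the dimensions A..E.
x = [1, 1, 1, 0, 0]
y = [1, 0, 1, 1, 1]
sim_p1 = match_similarity(x, y, 1)
sim_p2 = match_similarity(x, y, 2)
print(sim_p1, sim_p2)
```

The two values should agree with 2 and √2 (about 1.414).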

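Cosine similarity:

$$\frac{\sum_{i=1}^{d} x_i \, y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \, \sqrt{\sum_{i=1}^{d} y_i^2}} = \frac{(1)(1) + (1)(0) + (1)(1) + (0)(1) + (0)(1)}{\sqrt{1^2 + 1^2 + 1^2 + 0^2 + 0^2} \, \sqrt{1^2 + 0^2 + 1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{3}\sqrt{4}} = \frac{1}{\sqrt{3}} = \frac{\sqrt{3}}{3} \approx 0.577$$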

Jaccard coefficient:

$$\frac{|S_X \cap S_Y|}{|S_X \cup S_Y|} = \frac{|\{A, C\}|}{|\{A, B, C, D, E\}|} = \frac{2}{5} = 0.4$$
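
The cosine and Jaccard values can be reproduced with a short sketch. It assumes the same binary encoding over A..E used for the sets; the function names are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(s, t):
    """Size of the intersection divided by size of the union."""
    return len(s & t) / len(s | t)

universe = ["A", "B", "C", "D", "E"]
set_x = {"A", "B", "C"}
set_y = {"A", "C", "D", "E"}
# Turn each set into a 0/1 vector so the cosine formula applies.
vec_x = [1 if e in set_x else 0 for e in universe]
vec_y = [1 if e in set_y else 0 for e in universe]

cos_xy = cosine_similarity(vec_x, vec_y)
jac_xy = jaccard(set_x, set_y)
print(cos_xy, jac_xy)
```

The cosine should come out near 0.577 and the Jaccard coefficient at 0.4.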
3. Compute the edit distance between (a) ababcabc and babcbc, and (b) cbacbacba and acbacbacb. Assume an equal cost of insertion, deletion, or replacement.
A. ababcabc → babcbc
ababcabc → babcabc: delete the a at position 1
babcabc → babcbc: delete the a at position 6 (counted in the original string)
Cost = 2
B. cbacbacba → acbacbacb
cbacbacba → acbacbacba: insert an a at the front (position 0)
acbacbacba → acbacbacb: delete the trailing a (position 9, counting from 0)
Cost = 2
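
Both costs can be confirmed as minimal with the standard dynamic-programming (Levenshtein) recurrence at unit cost per operation. This is a minimal sketch, not necessarily the method the solution's author used; the name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit cost for insert, delete, and replace."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or replace
        prev = curr
    return prev[-1]

dist_a = edit_distance("ababcabc", "babcbc")
dist_b = edit_distance("cbacbacba", "acbacbacb")
print(dist_a, dist_b)
```

Keeping only two rows of the table is enough here because just the final distance is needed, not the edit script itself.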

4. Compute the normalized-cosine measure between the following two sentences:
(a) “The sly fox jumped over the lazy dog.”
(b) “The dog jumped at the intruder.”
For TF, use the raw count, while for IDF use the standard inverse document frequency.

Possible Answer #1: “The” and “the” are treated as the same word

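Here N = 2 is the number of sentences, n is the number of sentences containing the word, and log is base 10.

| Word | TF(a) | TF(b) | N/n | IDF = log(N/n) | TF-IDF(a) | TF-IDF(b) |
|---|---|---|---|---|---|---|
| the | 2 | 2 | 2/2 | log(2/2) | 0 | 0 |
| sly | 1 | 0 | 2/1 | log(2) | 0.301 | 0 |
| fox | 1 | 0 | 2/1 | log(2) | 0.301 | 0 |
| jumped | 1 | 1 | 2/2 | log(2/2) | 0 | 0 |
| over | 1 | 0 | 2/1 | log(2) | 0.301 | 0 |
| lazy | 1 | 0 | 2/1 | log(2) | 0.301 | 0 |
| dog | 1 | 1 | 2/2 | log(2/2) | 0 | 0 |
| at | 0 | 1 | 2/1 | log(2) | 0 | 0.301 |
| intruder | 0 | 1 | 2/1 | log(2) | 0 | 0.301 |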


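Normalized-cosine similarity:

$$\frac{\sum_{i=1}^{d} h(x_i) \, h(y_i)}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = \frac{0}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = 0$$

The numerator is 0 because no word has a nonzero TF-IDF weight in both sentences.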

Possible Answer #2: “The” and “the” are treated as different words
(a) “The sly fox jumped over the lazy dog.”
(b) “The dog jumped at the intruder.”

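| Word | TF(a) | TF(b) | N/n | IDF = log(N/n) | TF-IDF(a) | TF-IDF(b) |
|---|---|---|---|---|---|---|
| The | 1 | 1 | 1 | log(2/2) | 0 | 0 |
| sly | 1 | 0 | 2 | log(2/1) | 0.301 | 0 |
| fox | 1 | 0 | 2 | log(2/1) | 0.301 | 0 |
| jumped | 1 | 1 | 1 | log(2/2) | 0 | 0 |
| over | 1 | 0 | 2 | log(2/1) | 0.301 | 0 |
| the | 1 | 1 | 1 | log(2/2) | 0 | 0 |
| lazy | 1 | 0 | 2 | log(2/1) | 0.301 | 0 |
| dog | 1 | 1 | 1 | log(2/2) | 0 | 0 |
| at | 0 | 1 | 2 | log(2/1) | 0 | 0.301 |
| intruder | 0 | 1 | 2 | log(2/1) | 0 | 0.301 |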

Normalized-cosine similarity:

$$\frac{\sum_{i=1}^{d} h(x_i) \, h(y_i)}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = \frac{0}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \, \sqrt{\sum_{i=1}^{d} h(y_i)^2}} = 0$$
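
The TF-IDF weighting and the resulting cosine can be sketched in a few lines for both case conventions. The sketch assumes raw-count TF, a base-10 IDF of log(N/n) over the two sentences, and a simple letters-only tokenizer; the names `tokenize` and `tfidf_cosine` are hypothetical.

```python
import math
import re

def tokenize(text, case_sensitive):
    """Split text into alphabetic words, optionally folding case."""
    words = re.findall(r"[A-Za-z]+", text)
    return words if case_sensitive else [w.lower() for w in words]

def tfidf_cosine(doc_a, doc_b, case_sensitive=False):
    """Cosine similarity of raw-count TF times log10(N/n) IDF vectors."""
    docs = [tokenize(doc_a, case_sensitive), tokenize(doc_b, case_sensitive)]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)
    # n = number of documents containing the word (always 1 or 2 here).
    idf = {w: math.log10(n_docs / sum(w in d for d in docs)) for w in vocab}
    vecs = [[d.count(w) * idf[w] for w in vocab] for d in docs]
    dot = sum(a * b for a, b in zip(*vecs))
    norm = (math.sqrt(sum(a * a for a in vecs[0]))
            * math.sqrt(sum(b * b for b in vecs[1])))
    # Guard against an all-zero vector, which would make the cosine undefined.
    return dot / norm if norm else 0.0

sentence_a = "The sly fox jumped over the lazy dog."
sentence_b = "The dog jumped at the intruder."
sim_folded = tfidf_cosine(sentence_a, sentence_b, case_sensitive=False)
sim_cased = tfidf_cosine(sentence_a, sentence_b, case_sensitive=True)
print(sim_folded, sim_cased)
```

Under either convention every word shared by the two sentences has IDF 0, so the dot product vanishes and both similarities are 0.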
