Class 1c - Data Fundamentals
Heitor S Lopes
Prof. Thiago H Silva
Dataset
Collection of data objects and their attributes
Each column (attribute) can have a different type
Nominal
Only enough information to distinguish one object from another (=, ≠)
Ex: ID number, eye color, zip code
Ordinal
Sufficient information to order objects (<, >)
Ex: grades, height {high, medium, low}
Numeric
Measure of dissimilarity
● Numerical measure of how different two objects are
● Minimal dissimilarity is usually 0
Euclidean distance: d(x, y) = ( Σ k=1..n (xk - yk)^2 )^0.5
where n is the number of dimensions (attributes) and xk and yk are the kth
attributes of data objects x and y.
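The description of n, xk and yk above matches the usual definition of Euclidean distance. Assuming that is the measure intended, a minimal sketch could look like the following; the function name `euclidean_distance` is illustrative and not taken from the slides.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric sequences.

    Assumption: x and y hold the n attribute values of two data objects.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    # Square root of the sum of squared per-attribute differences.
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))        # 5.0
print(euclidean_distance([1, 2, 3], [1, 2, 3]))  # 0.0 (identical objects)
```

Identical objects give 0, which agrees with the note that the minimal dissimilarity is usually 0.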
Distance matrix
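A distance matrix stores the pairwise distance between every two objects in a dataset. The sketch below assumes Euclidean distance and uses hypothetical 2-D points; `distance_matrix` is an illustrative name.

```python
import math

def distance_matrix(points):
    """Return an n x n list of lists with pairwise Euclidean distances."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # available in Python 3.8+
            # The matrix is symmetric, so fill both halves at once.
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix

# Hypothetical points, chosen only for illustration.
points = [(0, 2), (2, 0), (3, 1), (5, 1)]
for row in distance_matrix(points):
    print([round(d, 3) for d in row])
```

The diagonal is always 0 and the matrix is symmetric, so only the upper triangle needs to be computed.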
Comparison
Similarity between binary vectors
Common situation: objects p and q have only binary attributes
Compute the similarity like this:
f01 = # of attributes where p is 0 and q is 1
f10 = # of attributes where p is 1 and q is 0
f00 = # of attributes where p is 0 and q is 0
f11 = # of attributes where p is 1 and q is 1
Simple Matching Coefficient (SMC) and Jaccard Coefficient (J)
SMC = number of matches (“11” and “00”) / number of attributes
    = (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of “11” matches / number of attributes that are not “00” matches
  = f11 / (f01 + f10 + f11)
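The two coefficients can be sketched directly from the four counts. The helper names (`binary_counts`, `smc`, `jaccard`) and the sample vectors are illustrative. Returning 0.0 for Jaccard when both vectors are all zeros is an assumed convention, not something the slides specify.

```python
def binary_counts(p, q):
    """Return (f01, f10, f00, f11) for two equal-length 0/1 sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    pairs = list(zip(p, q))
    return (pairs.count((0, 1)), pairs.count((1, 0)),
            pairs.count((0, 0)), pairs.count((1, 1)))

def smc(p, q):
    """Simple Matching Coefficient: all matches over all attributes."""
    f01, f10, f00, f11 = binary_counts(p, q)
    return (f11 + f00) / (f01 + f10 + f11 + f00)

def jaccard(p, q):
    """Jaccard coefficient: 1-1 matches over attributes that are not 0-0."""
    f01, f10, f00, f11 = binary_counts(p, q)
    denom = f01 + f10 + f11
    # Assumed convention: similarity is 0.0 when both vectors are all zeros.
    return f11 / denom if denom else 0.0

# Hypothetical vectors for illustration.
p = [1, 0, 1, 1, 0]
q = [1, 1, 0, 1, 0]
print(smc(p, q), jaccard(p, q))  # 0.6 0.5
```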
SMC vs Jaccard
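x = 1000000000
y = 0000001001
f01 = 2
f10 = 1
f00 = 7
f11 = 0
SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0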
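The counts for this example can be checked with a few lines of Python. The snippet below is a sketch that treats x and y as strings of bits. The two vectors share no 1s, so Jaccard is 0, while SMC is high because the 0-0 matches dominate.

```python
x = "1000000000"
y = "0000001001"

# Pair up the bits position by position and count each combination.
pairs = list(zip(x, y))
f01 = pairs.count(("0", "1"))
f10 = pairs.count(("1", "0"))
f00 = pairs.count(("0", "0"))
f11 = pairs.count(("1", "1"))

smc_value = (f11 + f00) / (f01 + f10 + f11 + f00)
jaccard_value = f11 / (f01 + f10 + f11)

print(f01, f10, f00, f11)        # 2 1 7 0
print(smc_value, jaccard_value)  # 0.7 0.0
```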
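Cosine similarity
cos(d1, d2) = <d1, d2> / (|| d1 || * || d2 ||)
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150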
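The arithmetic above is the cosine similarity: the dot product divided by the product of the two vector lengths. The sketch below reproduces it, with d1 and d2 read off the terms of the worked products; `cosine_similarity` is an illustrative name.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)

# Vectors recovered from the terms of the worked example.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```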
Linear correlation measure
B = (0, 0, 1, 0, 0, 1, 1)
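The notes name only a linear correlation measure. Assuming Pearson's correlation coefficient is meant, a minimal sketch follows. The function name and both data sets are hypothetical and are not the example from the slides.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    # The 1/n factors cancel between numerator and denominator.
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: y = 2x is perfectly linear, y = x^2 is not linear.
print(pearson_correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0
print(pearson_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # 0.0
```

The second call shows why the measure is described as linear: y is fully determined by x, yet the correlation is 0 because the relationship is not a straight line.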
Binarization
Maps a continuous or categorical attribute to one or
more binary variables
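Two common ways to binarize can be sketched as follows: one binary variable per category for a categorical attribute (often called one-hot encoding), and a threshold for a continuous attribute. The function names, the sample values and the threshold are all illustrative assumptions.

```python
def one_hot(values):
    """Map a categorical attribute to one binary variable per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

def threshold_binarize(values, threshold):
    """Map a continuous attribute to 1 if above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

# Hypothetical attribute values.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(threshold_binarize([1.5, 3.2, 0.7], threshold=1.0))  # [1, 1, 0]
```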
Normalization (z-score)
Also known as standardization
z = (x - μ) / σ
where x is the original value, μ is the mean (average) and σ is the standard deviation of the attribute
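A z-score subtracts the attribute's mean from each value and divides by the attribute's standard deviation. The sketch below assumes the population standard deviation (`statistics.pstdev`); using the sample version instead is an equally plausible choice. The function name and the data are illustrative.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population std dev
    if sigma == 0:
        raise ValueError("cannot standardize a constant attribute")
    return [(v - mu) / sigma for v in values]

# Hypothetical data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(data))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After the transformation, each value says how many standard deviations it lies above or below the mean.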