Class 1c - Data Fundamentals

The document discusses key aspects of data mining and knowledge discovery, focusing on the types of data attributes, including categorical, numeric, discrete, and continuous attributes. It also covers methods for measuring similarity and dissimilarity between data objects, such as Euclidean distance, cosine similarity, and correlation measures. Additionally, it addresses normalization techniques and their importance in preparing data for analysis.


Prof. Heitor S Lopes
Prof. Thiago H Silva

Data Mining & Knowledge Discovery

1c - Data - Important Aspects


Data -> Knowledge
Appropriate languages and tools help; in this course, the examples use dataframes.

Dataframe
Supports variables of various types and simplifies manipulation; each column can hold a different type.

Dataset
Collection of data objects and their attributes
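As a small sketch of these ideas (assuming pandas, a common choice alongside the scikit-learn library cited in the references; the attribute values are made up):

```python
import pandas as pd

# Hypothetical dataset: each row is a data object, each column an attribute.
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green"],   # categorical (nominal)
    "grade": ["low", "high", "medium"],        # categorical (ordinal)
    "temperature_c": [36.5, 37.1, 36.8],       # numeric
    "weight_kg": [70, 55, 80],                 # numeric
})

print(df.dtypes)  # each column keeps its own type
```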

An attribute is a property of an object

Examples: eye color, temperature, etc.

An attribute is also known as a variable, characteristic, or feature


Attribute types
Categorical
● Nominal: values are just different names. Ex: ID number, eye color, zip code
● Ordinal: values carry sufficient information to order them. Ex: grades, height {high, medium, low}
Numeric
● Interval: the intervals between values are meaningful (equally divided). Ex: dates, temperature in Celsius
● Ratio: the data has a natural zero point, which allows comparisons of the type "x is twice as much as y". Ex: monetary amounts, weight
Attribute types
User ID in an e-mail system – Nominal, Ordinal, or Interval?


Attribute types
Discrete attribute
● Has only a finite or countably infinite set of values
● Ex: zip codes or the set of words in a collection
● Typically represented as integer variables
Continuous attribute
● Has real numbers as attribute values
● Ex: temperature, height or weight.
● Typically represented as floating point variables

Is age continuous or discrete?


Typical and complex datasets
● Matrix data
● Structured text: DNA/protein sequences
● Transactions: a special type of record, where each record (transaction) involves a set of items. Ex: supermarket basket
● Unstructured text
● Spatio-temporal data
● Time series
● Graphs
Proximity notion
Measure of similarity
● Numerical measure of how similar two data objects are
● It is larger when they are more similar
● Usually in the range [0,1]

Measure of dissimilarity
● Numerical measure of how different two objects are
● Minimal dissimilarity is usually 0

For convenience, proximity refers to similarity or dissimilarity


Euclidean distance

dist(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)² )

where n is the number of dimensions (attributes) and x_k and y_k are the kth
attributes of data objects x and y.

Standardization is necessary if the scales of the attributes differ.
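A minimal sketch of this formula in plain Python (the function name is illustrative):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(euclidean([0, 0], [3, 4]))  # → 5.0
```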


Euclidean distance
(Slides show a worked example: a set of points, their coordinates, the resulting distance matrix, and a comparison of distance matrices.)
Similarity between binary vectors
Common situation: objects p and q have only binary attributes
Compute the similarity from these counts:
f01 = number of attributes where p is 0 and q is 1
f10 = number of attributes where p is 1 and q is 0
f00 = number of attributes where p is 0 and q is 0
f11 = number of attributes where p is 1 and q is 1

Simple Matching Coefficient (SMC) and Jaccard coefficient (J):
SMC = number of "11" and "00" matches / total number of attributes
    = (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of "11" matches / number of attributes that are not both zero
  = f11 / (f01 + f10 + f11)
SMC vs Jaccard
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1

f01 = 2
f10 = 1
f00 = 7
f11 = 0

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
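The counts and both coefficients can be sketched in plain Python (names are illustrative), reproducing the example above:

```python
def binary_similarity(x, y):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarity(x, y))  # → (0.7, 0.0)
```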


Cosine similarity
Does not take 0-0 matches into account (as in Jaccard), and works for non-binary vectors.
If d1 and d2 are numeric vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates the dot product of the vectors d1 and d2, and ||d|| is the
magnitude of the vector d.
Ex.:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
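A minimal sketch of the same computation in plain Python, checked against the example above:

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = <d1, d2> / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # → 0.315
```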
Linear correlation measure

corr(x, y) = cov(x, y) / (σ_x σ_y)

corr(x, y) = 1 means a perfect positive linear correlation between the two variables.

corr(x, y) = -1 means a perfect negative linear correlation between the two variables - that is, when one increases, the other always decreases.

corr(x, y) = 0 means that the two variables do not depend linearly on each other. However, there may be a non-linear dependence.
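The three cases above can be checked with a short sketch of the Pearson correlation in plain Python (names are illustrative):

```python
import math

def corr(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(corr([1, 2, 3], [2, 4, 6]))  # ≈ 1.0 (perfect positive)
print(corr([1, 2, 3], [6, 4, 2]))  # ≈ -1.0 (perfect negative)
```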
What about categorical data?
A = (-3, -2, -1, 0, 1, 2, 3)
B = (a, a, b, a, a, b, b)

Coding a = 0, b = 1:
B = (0, 0, 1, 0, 0, 1, 1)
Binarization
Maps a continuous or categorical attribute to one or more binary variables
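A small illustrative sketch of binarizing a categorical attribute (one indicator variable per category; the values are made up):

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(["a", "a", "b"]))  # → [[1, 0], [1, 0], [0, 1]]
```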
Normalization (z-score)
Also known as standardization:

z = (x − μ) / σ

where μ is the mean (average) and σ is the standard deviation.

Standardizes features so that they are centered around 0 with a standard deviation of 1.
This can be a general requirement for many machine learning algorithms.
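A minimal sketch of z-score standardization in plain Python (population standard deviation, as is common in this context):

```python
import math

def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]

z = z_score([3000, 3500, 3500])
print([round(v, 4) for v in z])  # → [-1.4142, 0.7071, 0.7071]
```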
Normalization (MIN-MAX)
Typically:

x' = (x − min) / (max − min)

In this approach, data is scaled to a fixed range - usually from 0 to 1.
Normalization - Example
Data:
[[   3.9    5.  3000. ]
 [   5.     5.5 3500. ]
 [  10.     6.  3500. ]]
Distances between non-normalized objects:
[[  0.     500.0014 500.0382]
 [500.0014   0.       5.0249]
 [500.0382   5.0249   0.    ]]
Distances between min-max normalized objects:
[[0.     1.1325 1.7321]
 [1.1325 0.     0.9601]
 [1.7321 0.9601 0.    ]]
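The normalized distances above can be reproduced with a short sketch in plain Python (data values taken from the slide):

```python
import math

def min_max(column):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

data = [[3.9, 5.0, 3000.0], [5.0, 5.5, 3500.0], [10.0, 6.0, 3500.0]]

# Normalize column by column, then rebuild the rows.
cols = [min_max(c) for c in zip(*data)]
norm = list(zip(*cols))

print(round(euclidean(norm[0], norm[1]), 4))  # → 1.1325
print(round(euclidean(norm[0], norm[2]), 4))  # → 1.7321
```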
References
Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to Data Mining. Pearson Education India.

Thanks to Professors Josh Starmer, Yi Zhang, and Vincent Spruyt for some of the images used.

Official documentation of the scikit-learn library: scikit-learn.org
