Class 1c - Data Fundamentals

The document discusses key aspects of data mining and knowledge discovery, focusing on the types of data attributes, including categorical, numeric, discrete, and continuous attributes. It also covers methods for measuring similarity and dissimilarity between data objects, such as Euclidean distance, cosine similarity, and correlation measures. Additionally, it addresses normalization techniques and their importance in preparing data for analysis.


Prof. Heitor S Lopes
Prof. Thiago H Silva

Data Mining & Knowledge Discovery

1c - Data - Important Aspects


Data -> Knowledge
Appropriate languages and tools help; in this course, the examples use dataframes.

Dataframe
Supports variables of various types and simplifies manipulation; each column can hold a different type.

Dataset
Collection of data objects and their attributes
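As a small sketch of these ideas (assuming pandas, a common choice alongside the scikit-learn library cited in the references; the attribute values are made up):

```python
import pandas as pd

# Hypothetical dataset: each row is a data object, each column an attribute.
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green"],   # categorical (nominal)
    "grade": ["low", "high", "medium"],        # categorical (ordinal)
    "temperature_c": [36.5, 37.1, 36.8],       # numeric
    "weight_kg": [70, 55, 80],                 # numeric
})

print(df.dtypes)  # each column keeps its own type
```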

An attribute is a property of an object

Examples: eye color, temperature, etc.

An attribute is also known as a variable, characteristic, or feature


Attribute types
Categorical
● Nominal: values are just different names. Ex: ID number, eye color, zip code
● Ordinal: values carry sufficient information to order them. Ex: grades, height {high, medium, low}
Numeric
● Interval: the intervals between values are meaningful (equally divided). Ex: dates, temperature in Celsius
● Ratio: the data has a natural zero point, which allows comparisons of the type "x is twice as much as y". Ex: monetary amounts, weight
Attribute types
User ID in an e-mail system – Nominal, Ordinal, or Interval?


Attribute types
Discrete attribute
● Has only a finite or countably infinite set of values
● Ex: zip codes or the set of words in a collection
● Typically represented as integer variables
Continuous attribute
● Has real numbers as attribute values
● Ex: temperature, height or weight.
● Typically represented as floating point variables

Is age continuous or discrete?


Typical and complex datasets
● Matrix data
● Structured text: DNA/protein sequences
● Transactions: a special type of record, where each record (transaction) involves a set of items. Ex: supermarket basket
● Unstructured text
● Spatio-temporal data
● Time series
● Graphs
Proximity notion
Measure of similarity
● Numerical measure of how similar two data objects are
● It is larger when they are more similar
● Usually in the range [0,1]

Measure of dissimilarity
● Numerical measure of how different two objects are
● Minimal dissimilarity is usually 0

For convenience, proximity refers to similarity or dissimilarity


Euclidean distance

dist(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)² )

where n is the number of dimensions (attributes) and x_k and y_k are the kth
attributes of data objects x and y.

Standardization is necessary if the scales of the attributes differ.
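A minimal sketch of this formula in plain Python (the function name is illustrative):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(euclidean([0, 0], [3, 4]))  # → 5.0
```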


Euclidean distance
(Slides show a worked example: a set of points, their coordinates, the resulting distance matrix, and a comparison of distance matrices.)
Similarity between binary vectors
Common situation: objects p and q have only binary attributes
Compute the similarity from these counts:
f01 = number of attributes where p is 0 and q is 1
f10 = number of attributes where p is 1 and q is 0
f00 = number of attributes where p is 0 and q is 0
f11 = number of attributes where p is 1 and q is 1

Simple Matching Coefficient (SMC) and Jaccard coefficient (J):
SMC = number of "11" and "00" matches / total number of attributes
    = (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of "11" matches / number of attributes that are not both zero
  = f11 / (f01 + f10 + f11)
SMC vs Jaccard
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1

f01 = 2
f10 = 1
f00 = 7
f11 = 0

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
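The counts and both coefficients can be sketched in plain Python (names are illustrative), reproducing the example above:

```python
def binary_similarity(x, y):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarity(x, y))  # → (0.7, 0.0)
```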


Cosine similarity
Does not take 0-0 matches into account (as in Jaccard), and works for non-binary vectors.
If d1 and d2 are numeric vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates the dot product of the vectors d1 and d2, and ||d|| is the
magnitude of the vector d.
Ex.:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
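A minimal sketch of the same computation in plain Python, checked against the example above:

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = <d1, d2> / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # → 0.315
```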
Linear correlation measure

corr(x, y) = cov(x, y) / (σ_x σ_y)

corr(x, y) = 1 means a perfect positive linear correlation between the two variables.

corr(x, y) = -1 means a perfect negative linear correlation between the two variables - that is, when one increases, the other always decreases.

corr(x, y) = 0 means that the two variables do not depend linearly on each other. However, there may be a non-linear dependence.
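The three cases above can be checked with a short sketch of the Pearson correlation in plain Python (names are illustrative):

```python
import math

def corr(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(corr([1, 2, 3], [2, 4, 6]))  # ≈ 1.0 (perfect positive)
print(corr([1, 2, 3], [6, 4, 2]))  # ≈ -1.0 (perfect negative)
```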
What about categorical data?
A = (-3, -2, -1, 0, 1, 2, 3)
B = (a, a, b, a, a, b, b)

Coding a = 0, b = 1:
B = (0, 0, 1, 0, 0, 1, 1)
Binarization
Maps a continuous or categorical attribute to one or more binary variables
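A small illustrative sketch of binarizing a categorical attribute (one indicator variable per category; the values are made up):

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(["a", "a", "b"]))  # → [[1, 0], [1, 0], [0, 1]]
```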
Normalization (z-score)
Also known as standardization:

z = (x − μ) / σ

where μ is the mean (average) and σ is the standard deviation.

Standardizes features so that they are centered around 0 with a standard deviation of 1.
This can be a general requirement for many machine learning algorithms.
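A minimal sketch of z-score standardization in plain Python (population standard deviation, as is common in this context):

```python
import math

def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]

z = z_score([3000, 3500, 3500])
print([round(v, 4) for v in z])  # → [-1.4142, 0.7071, 0.7071]
```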
Normalization (MIN-MAX)
Typically:

x' = (x − min) / (max − min)

In this approach, data is scaled to a fixed range - usually from 0 to 1.
Normalization - Example
Data:
[[   3.9    5.  3000. ]
 [   5.     5.5 3500. ]
 [  10.     6.  3500. ]]
Distances between non-normalized objects:
[[  0.     500.0014 500.0382]
 [500.0014   0.       5.0249]
 [500.0382   5.0249   0.    ]]
Distances between min-max normalized objects:
[[0.     1.1325 1.7321]
 [1.1325 0.     0.9601]
 [1.7321 0.9601 0.    ]]
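The normalized distances above can be reproduced with a short sketch in plain Python (data values taken from the slide):

```python
import math

def min_max(column):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

data = [[3.9, 5.0, 3000.0], [5.0, 5.5, 3500.0], [10.0, 6.0, 3500.0]]

# Normalize column by column, then rebuild the rows.
cols = [min_max(c) for c in zip(*data)]
norm = list(zip(*cols))

print(round(euclidean(norm[0], norm[1]), 4))  # → 1.1325
print(round(euclidean(norm[0], norm[2]), 4))  # → 1.7321
```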
References
Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to Data Mining. Pearson Education India.

Thanks to Professors Josh Starmer, Yi Zhang, and Vincent Spruyt for some of the images used.

Official documentation of the scikit-learn library: scikit-learn.org
