How To Work On Data You Have

This document provides an overview of data mining concepts, focusing on data attributes, types of data, data quality, and similarity measures. It discusses various types of attributes, their properties, and the importance of data quality in data mining processes. Additionally, it covers different data structures, including record data, graph data, and ordered data, along with the implications of data quality issues such as noise, outliers, and missing values.

Uploaded by

wusmantech


Data Mining: Data

Lecture 2

Outline

 Attributes and Objects

 Types of Data

 Data Quality

 Similarity and Distance

 Data Preprocessing
What is Data?

 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature
 A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
A More Complete View of Data

 Data may have parts

 The different parts of the data may have relationships

 More generally, data may have structure

 Data can be incomplete

 We will discuss this in more detail later

Attribute Values

 Attribute values are numbers or symbols assigned to an attribute for a particular object

 Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
 Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
 Example: Attribute values for ID and age are integers
 But properties of attribute values can be different
Types of Attributes

 There are different types of attributes

– Nominal
 Examples: ID numbers, eye color, zip codes
– Ordinal
 Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height {tall, medium, short}
– Interval
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
 Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values

 The type of an attribute depends on which of the following properties/operations it possesses:
– Distinctness: = ≠
– Order: < >
– Differences are meaningful: + −
– Ratios are meaningful: * /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations
Category | Attribute Type | Description | Examples | Operations
Categorical (Qualitative) | Nominal | Nominal attribute values only distinguish. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Categorical (Qualitative) | Ordinal | Ordinal attribute values also order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Numeric (Quantitative) | Interval | For interval attributes, differences between values are meaningful. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Numeric (Quantitative) | Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, current | geometric mean, harmonic mean, percent variation
Discrete and Continuous Attributes

 Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
 Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
 Only presence (a non-zero attribute value) is regarded as important
 Words present in documents
 Items present in customer transactions

 We need two asymmetric binary attributes to represent one ordinary binary attribute
– Association analysis uses asymmetric attributes

 Asymmetric attributes typically arise from objects that are sets
Types of data sets
 Record
– Data Matrix
– Document Data
– Transaction Data
 Graph
– World Wide Web
– Molecular Structures
 Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data

 Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix

 If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

 Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
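
A minimal sketch of the m by n view, using the two objects from the slide and plain Python lists. The snake_case column names and the helper `column` are illustrative choices, not part of the original material.

```python
# Column names adapted from the slide's table header (naming is an assumption).
columns = ["projection_of_x_load", "projection_of_y_load",
           "distance", "load", "thickness"]

# Each row is one data object; each column is one attribute.
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)     # number of objects (rows)
n = len(data_matrix[0])  # number of attributes (columns)

def column(matrix, name):
    """Return the values of one attribute across all objects."""
    k = columns.index(name)
    return [row[k] for row in matrix]

print(m, n)
print(column(data_matrix, "load"))
```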
Document Data

 Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
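
The term-vector idea can be sketched with a simple word count. The vocabulary and the sample document below are hypothetical, and splitting on whitespace is a simplifying assumption (real pipelines usually also handle punctuation, stemming, and stop words).

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "team play team score play play ball team"

vec = term_vector(doc, vocabulary)
print(vec)
```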
Transaction Data

 A special type of record data, where

– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
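
One way to work with such transactions is to map each one to a vector of asymmetric binary attributes, one per item, where 1 means the item is present. The sketch below does this for the five transactions above; the variable names are illustrative.

```python
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# One binary attribute per distinct item, in alphabetical order.
items = sorted(set().union(*transactions.values()))

# 1 marks presence of the item in the transaction, 0 marks absence.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(items)
for tid, row in binary_rows.items():
    print(tid, row)
```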
Graph Data
 Examples: Generic graph, a molecule, and webpages

[Figure: a generic graph, a benzene molecule (C6H6), and linked web pages]

Ordered Data

 Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data

 Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Quality

 Poor data quality negatively affects many data processing efforts
“The most important point is that poor data quality is an unfolding disaster. Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.”
Thomas C. Redman, DM Review, August 2004
 Data mining example: a classification model for detecting
people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
Data Quality …

 What kinds of data quality problems exist?

 How can we detect problems with the data?
 What can we do about these problems?

 Examples of data quality problems:

– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
Noise

 For objects, noise is an extraneous object

 For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen

[Figure: Two Sine Waves; Two Sine Waves + Noise]

Outliers

 Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
– Case 1: Outliers are noise that interferes with data analysis
– Case 2: Outliers are the goal of our analysis
 Credit card fraud
 Intrusion detection
Missing Values

 Reasons for missing values

– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values

– Eliminate data objects or variables
– Estimate missing values
– Ignore the missing value during analysis
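
The three handling options can be sketched on a single attribute. The ages below are hypothetical, `None` is used as the missing-value marker, and mean imputation is only one of many possible estimation strategies.

```python
# Hypothetical attribute values; None marks a missing value.
ages = [23, None, 31, 40, None, 26]

# Option 1: eliminate the data objects whose value is missing.
observed = [a for a in ages if a is not None]

# Option 2: estimate the missing values, here with the mean of the observed ones.
mean_age = sum(observed) / len(observed)
imputed = [a if a is not None else mean_age for a in ages]

# Option 3: ignore missing values during analysis, i.e. compute
# a statistic over the observed values only and keep the data as-is.
max_age = max(a for a in ages if a is not None)

print(observed, imputed, max_age)
```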
Missing Values …

 Missing completely at random (MCAR)

– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
 Missing at Random (MAR)
– Missingness is related to other variables
– Fill in values based on other values
– Almost always produces a bias in the analysis
 Missing Not at Random (MNAR)
– Missingness is related to unobserved measurements
– Informative or non-ignorable missingness
 Not possible to know the situation from the data
Duplicate Data

 Data set may include data objects that are duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

 Examples:
– Same person with multiple email addresses

 Data cleaning
– Process of dealing with duplicate data issues

 When should duplicate data not be removed?

Similarity and Dissimilarity Measures

 Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
 Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
 Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.
Euclidean Distance

 Euclidean Distance

$$ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} $$

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

 Standardization is necessary, if scales differ.
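
One common way to standardize is the z-score, which rescales each attribute to zero mean and unit standard deviation so that no attribute dominates the distance. This is a sketch under that assumption; the slides do not say which standardization is intended, and the sample values are hypothetical.

```python
import statistics

def standardize(column):
    """Rescale a numeric attribute to zero mean and unit standard deviation.

    Assumes the attribute is not constant (a zero standard deviation
    would cause a division by zero).
    """
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(v - mean) / std for v in column]

# Hypothetical attributes measured on very different scales.
income = [125000, 100000, 70000, 120000]
age = [25, 40, 31, 52]

income_z = standardize(income)
age_z = standardize(age)
print(income_z)
print(age_z)
```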

Euclidean Distance

[Figure: the four points p1 to p4 plotted in the x-y plane]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
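
The distance matrix for the four points can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function and variable names are illustrative.

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    """Square root of the summed squared differences over all attributes."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

names = list(points)
dist = {(a, b): euclidean(points[a], points[b]) for a in names for b in names}

# Print the matrix row by row, rounded to three decimals.
for a in names:
    print(a, [round(dist[a, b], 3) for b in names])
```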
Minkowski Distance

 Minkowski Distance is a generalization of Euclidean Distance

$$ d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r} $$

where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples

 r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

 r = 2. Euclidean distance

 r → ∞. “supremum” (Lmax norm, L∞ norm) distance.
– This is the maximum difference between any component of the vectors

 Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1     p2     p3     p4
p1    0      4      4      6
p2    4      0      2      4
p3    4      2      0      2
p4    6      4      2      0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1     p2     p3     p4
p1    0      2      3      5
p2    2      0      1      3
p3    3      1      0      2
p4    5      3      2      0

Distance Matrix
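
The three matrices can be checked with one function that takes r as a parameter. This is a sketch, assuming `math.inf` is used to request the supremum distance; the names are illustrative.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum distance."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if math.isinf(r):
        # Limit r -> infinity: the largest per-attribute difference.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

for label, r in [("L1", 1), ("L2", 2), ("Linf", math.inf)]:
    print(label)
    for a in points:
        print(" ", a, [round(minkowski(points[a], points[b], r), 3) for b in points])
```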
Common Properties of a Distance
 Distances, such as the Euclidean distance, have some well known properties.
1. d(x, y) ≥ 0 for all x and y and d(x, y) = 0 if and only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.

 A distance that satisfies these properties is a metric
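
The three properties can be checked numerically on a small set of points. The sketch below does so for the Euclidean distance on the four example points from the earlier slides; passing such a check on a sample is only an illustration, not a proof that a distance is a metric.

```python
import itertools
import math

pts = [(0, 2), (2, 0), (3, 1), (5, 1)]
d = math.dist  # Euclidean distance (Python 3.8+)

# 1. Positive definiteness: d >= 0, and d == 0 exactly when the points coincide.
positive = all(
    d(x, y) >= 0 and ((d(x, y) == 0) == (x == y))
    for x, y in itertools.product(pts, repeat=2)
)

# 2. Symmetry.
symmetric = all(
    math.isclose(d(x, y), d(y, x))
    for x, y in itertools.product(pts, repeat=2)
)

# 3. Triangle inequality; the small tolerance guards against rounding error.
triangle = all(
    d(x, z) <= d(x, y) + d(y, z) + 1e-12
    for x, y, z in itertools.product(pts, repeat=3)
)

print(positive, symmetric, triangle)
```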
Common Properties of a Similarity

 Similarities also have some well known properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y.

2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects), x and y.
Similarity Between Binary Vectors
 Common situation is that objects, p and q, have only
binary attributes
 Compute similarities using the following quantities
f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

 Simple Matching and Jaccard Coefficients

SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)

J = number of 11 matches / number of non-zero attributes

= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example

x = 1000000000
y = 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)

= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
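
The same example can be worked in code. This is a minimal sketch; the function name is illustrative, and returning 0.0 for Jaccard when both vectors are all zeros is an assumption made here to avoid division by zero.

```python
def smc_and_jaccard(x, y):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)

    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    non_zero = f01 + f10 + f11
    # Assumption: define Jaccard as 0.0 when neither vector has a 1.
    jaccard = f11 / non_zero if non_zero else 0.0
    return smc, jaccard

x = [int(c) for c in "1000000000"]
y = [int(c) for c in "0000001001"]

smc, jaccard = smc_and_jaccard(x, y)
print(smc, jaccard)
```

SMC counts the seven shared zeros as matches, while Jaccard ignores them, which is why the two measures disagree so strongly on sparse vectors.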

Proximity Measures: Categorical Attributes

Method 1: Simple matching

❑ d(i, j) = (p − m) / p, where m: # of matches, p: total # of variables

Method 2: Map it to binary variables

❑ create a new binary attribute for each of the M nominal states of the attribute
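
Both methods can be sketched briefly. Method 1 is written here in the commonly used form d = (p − m) / p, which is an assumption about the formula intended on the slide; the objects and state lists are hypothetical.

```python
def simple_matching_dissimilarity(a, b):
    """Method 1: fraction of variables on which the two objects do not match."""
    p = len(a)                             # total number of variables
    m = sum(u == v for u, v in zip(a, b))  # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per nominal state of the attribute."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by three categorical attributes.
obj_i = ("red", "single", "yes")
obj_j = ("red", "married", "yes")
print(simple_matching_dissimilarity(obj_i, obj_j))

marital_states = ["single", "married", "divorced"]
print(one_hot("married", marital_states))
```

After the binary mapping, measures defined for binary vectors (such as SMC or Jaccard) can be applied to the encoded objects.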
Cosine Similarity

 If d1 and d2 are two document vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates the inner product or vector dot product of vectors d1 and d2, and ||d|| is the length of vector d.
 Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
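
The worked example can be reproduced in code. This is a minimal sketch with illustrative names; it assumes neither vector is all zeros, since the cosine is undefined in that case.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    # Assumes both vectors are non-zero; otherwise this divides by zero.
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

cos = cosine_similarity(d1, d2)
print(round(cos, 4))
```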
Correlation measures the linear relationship between objects
Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.
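
The slides do not give a formula for correlation; the sketch below assumes Pearson's correlation coefficient, the usual measure of linear relationship, and uses hypothetical data. It assumes neither input is constant, since the coefficient is undefined in that case.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # The 1/n factors in covariance and standard deviations cancel out.
    return cov / (spread_x * spread_y)

# Hypothetical data: a perfect increasing and a perfect decreasing linear relationship.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3], [3, 2, 1])
print(r_up, r_down)
```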
