— Chapter 3 —
Slides Courtesy of Textbook
Chapter 3: Data Preprocessing
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Data Quality: Why Preprocess the Data?
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
Chapter 3: Data Preprocessing
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Data in Real World is Dirty!
- For various reasons, e.g., faulty instruments, human or computer error, transmission errors, etc.
- incomplete: lacking attribute values, lacking certain attributes of interest
- technology limitation
How to Handle Noisy Data?
- Binning (see the sketch after this list)
  - first sort the data and partition it into (equal-frequency) bins
  - then smooth by bin means, bin medians, or bin boundaries
- Clustering
  - detect and remove outliers
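A minimal sketch, in Python with NumPy, of the binning procedure described above: sort, partition into equal-frequency bins, then smooth by bin means. The sample values and bin count are illustrative, not from the slides.

```python
import numpy as np

# Equal-frequency binning with smoothing by bin means (illustrative data).
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3

sorted_data = np.sort(data)                      # 1) sort
bins = np.array_split(sorted_data, n_bins)       # 2) equal-frequency partition

# 3) smooth by bin means: replace every value with its bin's mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```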
Data Cleaning as a Process
- Step 1: Discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
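As an illustration of metadata-driven discrepancy detection, the sketch below checks values against a declared range and domain. The column names and constraints are hypothetical, chosen only to make the idea concrete.

```python
import pandas as pd

# Flag values that violate the metadata (hypothetical constraints).
df = pd.DataFrame({"age": [25, -3, 47, 215], "gender": ["F", "M", "X", "F"]})

metadata = {
    "age": {"min": 0, "max": 120},          # valid range
    "gender": {"domain": {"F", "M"}},       # valid domain
}

bad_age = df[(df["age"] < metadata["age"]["min"]) | (df["age"] > metadata["age"]["max"])]
bad_gender = df[~df["gender"].isin(metadata["gender"]["domain"])]

print(bad_age)      # rows whose age falls outside the declared range
print(bad_gender)   # rows whose gender is outside the declared domain
```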
Chapter 3: Data Preprocessing
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Data Integration
- Data integration:
  - Combines data from multiple sources into a coherent store
- Schema integration: e.g., A.cust-id ≡ B.cust-# (a merge sketch follows this slide)
  - Integrate metadata from different sources
- Entity identification problem:
  - Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
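A minimal schema-integration sketch in Python/pandas for the A.cust-id ≡ B.cust-# example above: map B's column name onto A's schema, then combine the two sources into one store. The tables and values are illustrative.

```python
import pandas as pd

a = pd.DataFrame({"cust-id": [1, 2, 3], "city": ["Oslo", "Lyon", "Kyoto"]})
b = pd.DataFrame({"cust-#": [2, 3, 4], "balance": [120.0, 75.5, 19.9]})

b = b.rename(columns={"cust-#": "cust-id"})       # resolve A.cust-id ≡ B.cust-#
merged = a.merge(b, on="cust-id", how="outer")    # combine into one coherent store
print(merged)
```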
Why Data Integration: Handling Redundancies & Inconsistencies
- Contingency table of play_chess vs. like_science_fiction; numbers in parentheses are expected counts, calculated from the data distribution in the two categories:

                              play chess   not play chess   Sum
    like science fiction        250 (90)      200 (360)      450
    not like science fiction     50 (210)    1000 (840)     1050
    Sum                          300          1200           1500

- Example: Expected count of people who play chess and like science fiction: 450 * 300 / 1500 = 90
- χ² (chi-square) calculation:

    \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

- Degrees of freedom for the 2x2 table: (2-1)*(2-1) = 1 → Looking up the chi-square table, we can reject the hypothesis that like_science_fiction and play_chess are independent with high confidence → they are correlated!
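A small Python sketch that reproduces the χ² calculation above; the contingency table is the one implied by the slide's numbers.

```python
import numpy as np

# rows = like_science_fiction (yes, no), columns = play_chess (yes, no)
observed = np.array([[250, 200],
                     [50, 1000]], dtype=float)

row_sums = observed.sum(axis=1, keepdims=True)   # 450, 1050
col_sums = observed.sum(axis=0, keepdims=True)   # 300, 1200
total = observed.sum()                           # 1500

expected = row_sums @ col_sums / total           # 90, 360, 210, 840
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)   # ≈ 507.94 (the slide rounds to 507.93)
```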
Correlation Analysis (for Numeric Data)
- Correlation coefficient (also called Pearson's product moment coefficient):

    r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}
  where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product.
- If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher r_{A,B}, the stronger the correlation.
- r_{A,B} = 0: A and B are uncorrelated (no linear relationship; independence implies r = 0, but not the converse).
- r_{A,B} < 0: A and B are negatively correlated. (A computational sketch follows this list.)
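A minimal sketch of the formula above in Python/NumPy, using population standard deviations (ddof = 0) as the formula assumes. The two series are illustrative values, not from the slides.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())
print(r)                        # manual computation of r_{A,B}
print(np.corrcoef(a, b)[0, 1])  # NumPy's built-in gives the same value
```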
Visually Evaluating Correlation
[Figure: scatter plots illustrating correlation coefficients ranging from –1 to 1.]
Covariance (for Numeric Data)
- Covariance is similar to correlation:

    Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}

  Correlation coefficient:

    r_{A,B} = \frac{Cov(A,B)}{\sigma_A\,\sigma_B}
- Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
- Question: If the stocks are affected by the same industry trends, will their prices rise or fall together? (Worked out in the sketch below.)
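Working out the stock example above in Python: the covariance of the two price series is positive, so the prices tend to rise and fall together.

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])    # stock A, one value per day
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0]) # stock B

cov = ((a - a.mean()) * (b - b.mean())).mean()   # Cov(A,B) as defined above
print(a.mean(), b.mean(), cov)   # 4.0 9.6 4.0 -> positive, so they rise together
```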
Chapter 3: Data Preprocessing
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Data Reduction Strategies
- Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
- Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
- Data reduction strategies
  - Dimensionality reduction, e.g., remove unimportant attributes
    - Wavelet transforms (a one-level Haar sketch follows this list)
  - Data compression
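Since wavelet transforms are listed as a dimensionality-reduction tool, here is a minimal one-level Haar transform sketch in NumPy; keeping only the approximation coefficients halves the stored data. The input values are illustrative, and a real application would use a library such as PyWavelets with multiple levels.

```python
import numpy as np

# One level of the orthonormal Haar wavelet transform (illustrative data).
x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-frequency averages
detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-frequency differences

# Data reduction: keep only the approximation coefficients (half the values);
# reconstructing with detail set to 0 gives a smoothed approximation of x.
x_approx = np.empty_like(x)
x_approx[0::2] = approx / np.sqrt(2)
x_approx[1::2] = approx / np.sqrt(2)
print(approx)    # 4 coefficients stored instead of 8 values
print(x_approx)  # [2. 2. 1. 1. 4. 4. 4. 4.]
```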
Data Reduction 1: Dimensionality Reduction
- Curse of dimensionality
  - When dimensionality increases, data becomes increasingly sparse
  - Density and distance between points, which are critical to clustering, become less meaningful
- Dimensionality reduction
  - Avoid the curse of dimensionality
- Redundant attributes
  - Duplicate much or all of the information contained in one or more other attributes
  - E.g., purchase price of a product and the amount of sales tax paid (the sketch after this list detects this kind of redundancy)
- Irrelevant attributes
  - Contain no information that is useful for the data mining task at hand
  - E.g., students' ID is often irrelevant to the task of predicting students' GPA
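A minimal sketch of spotting a redundant attribute with NumPy: sales tax is a fixed fraction of price, so the two columns are perfectly correlated and one of them can be dropped. The column names, tax rate, and values are illustrative.

```python
import numpy as np

price = np.array([100.0, 250.0, 80.0, 400.0, 150.0])
tax = 0.08 * price                               # tax is 8% of price -> redundant
units_sold = np.array([3.0, 1.0, 7.0, 2.0, 5.0])

data = np.column_stack([price, tax, units_sold])
corr = np.corrcoef(data, rowvar=False)           # pairwise correlation matrix
print(np.round(corr, 2))
# |corr[0, 1]| = 1.0 flags price/tax as redundant; keep one, drop the other.
```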
Attribute Subset Selection by Heuristic Search
- Domain-specific
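The slide names heuristic search for attribute subset selection without spelling out a method. Below is a hedged sketch of one common heuristic, stepwise forward selection, scoring candidate subsets by the R² of a least-squares fit; the scoring function and data are illustrative, not necessarily what the original slide had in mind.

```python
import numpy as np

def r2(X, y):
    """Coefficient of determination of a least-squares fit of y on X."""
    X1 = np.column_stack([np.ones(len(y)), X])     # add intercept
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, k):
    """Greedy stepwise forward selection: repeatedly add the attribute that
    most improves the fit, until k attributes are chosen."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: r2(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Illustrative data: y depends on columns 0 and 2, column 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=50)
print(forward_select(X, y, 2))   # expected to pick columns 0 and 2
```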
Parametric Data Reduction: Regression Analysis
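A minimal sketch of parametric data reduction by regression: fit a linear model and store only its parameters (slope and intercept) instead of the raw points. The data values are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares fit y ≈ slope*x + intercept
print(slope, intercept)                      # store only these two parameters
print(slope * x + intercept)                 # values regenerated from the model
```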
Non-parametric Data Reduction: Histogram Analysis
[Figure: histogram of values binned from 20,000 to 100,000 in widths of 10,000.]
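A minimal sketch of histogram-based reduction: store bin edges and per-bucket counts instead of the raw values. The price values below are illustrative; the bin edges mirror the figure's axis range.

```python
import numpy as np

prices = np.array([23_500, 31_200, 47_800, 52_100, 55_000, 61_300, 78_900, 94_000])

counts, edges = np.histogram(prices, bins=np.arange(20_000, 110_000, 10_000))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:>6}, {hi:>6}): {c}")   # raw values summarized by bucket counts
```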
Non-parametric Data Reduction: Clustering
- Partition the data set into clusters based on similarity, and store the cluster representations (e.g., centroid and diameter) only
- Can be very effective if data is clustered, but not if data is "smeared"
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in depth in Chapter 10
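A minimal k-means sketch of the idea above: reduce the data set to k stored centroids. The points, k, and iteration count are illustrative; a real application would use a library clustering routine.

```python
import numpy as np

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
k = 2

centroids = points[rng.choice(len(points), k, replace=False)]
for _ in range(20):                                   # a few Lloyd iterations
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)                         # assign to nearest centroid
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)   # 100 points reduced to 2 stored centroids
```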
Non-parametric Data Reduction: Sampling
Types of Sampling
- Stratified sampling:
  - Partition the data set, and draw samples from each partition (see the sketch below)
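A minimal stratified-sampling sketch in Python/pandas: partition on a class attribute, then draw the same fraction from every stratum. The column names and data are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": range(10),
    "segment": ["gold", "gold", "silver", "silver", "silver",
                "bronze", "bronze", "bronze", "bronze", "bronze"],
})

# Draw roughly 40% from each segment (stratum).
sample = df.groupby("segment").sample(frac=0.4, random_state=0)
print(sample)
```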
Sampling: With or without Replacement
[Figure: samples drawn from the raw data with and without replacement.]
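A tiny sketch of simple random sampling with and without replacement; the "raw data" is an illustrative array of record IDs.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = np.arange(20)                               # 20 illustrative records

without = rng.choice(raw, size=5, replace=False)  # without replacement: no duplicates
with_repl = rng.choice(raw, size=5, replace=True) # with replacement: duplicates possible
print(without, with_repl)
```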
Sampling: Cluster or Stratified Sampling
Non-parametric Data Reduction: Data Cube Aggregation
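The data cube aggregation figure did not survive extraction; as a stand-in, here is a minimal sketch of the idea in pandas: roll quarterly sales up to yearly totals so far fewer cells need to be stored. The table values are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 402, 391, 620],
})

yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)   # two aggregated cells replace eight detailed ones
```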
Data Compression
- String compression: typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
[Figure: lossless compression recovers the original data exactly; lossy compression yields only an approximation of the original data.]
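To make the lossless case concrete, a tiny sketch using Python's standard zlib module: the compressed bytes decompress back to exactly the original data (the sample string is illustrative). Lossy audio/video codecs, by contrast, discard detail and reproduce only an approximation.

```python
import zlib

data = b"AAAABBBCC" * 100              # highly repetitive, so it compresses well
compressed = zlib.compress(data)
print(len(data), len(compressed))       # original vs. compressed size in bytes

restored = zlib.decompress(compressed)
assert restored == data                 # lossless: the original is recovered exactly
```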