Data Mining
Concepts and Techniques
Dr. Mohamad Shady Ahmad Alrahhal
1
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Dr. Mohamad Shady Ahmad Alrahhal
2
Chapter 3: Data Preprocessing
■ Data Preprocessing: An Overview
■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary
3
Data Quality: Why Preprocess the Data?
■ Measures for data quality: A multidimensional view
■ Accuracy: correct or wrong, accurate or not
■ Completeness: not recorded, unavailable, …
■ Consistency: some modified but some not, dangling, …
■ Timeliness: timely update?
■ Believability: how much are the data trusted to be correct?
■ Interpretability: how easily the data can be
understood?
4
Major Tasks in Data Preprocessing
■ Data cleaning
■ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
■ Data integration
■ Integration of multiple databases, data cubes, or files
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data discretization
■ Normalization
■ Concept hierarchy generation
5
Chapter 3: Data Preprocessing
■ Data Preprocessing: An Overview
■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary
6
Data Cleaning
■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
■ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
■ e.g., Occupation = “ ” (missing data)
■ noisy: containing noise, errors, or outliers
■ e.g., Salary = “−10” (an error)
■ inconsistent: containing discrepancies in codes or names, e.g.,
■ Age = “42”, Birthday = “03/07/2010”
■ Was rating “1, 2, 3”, now rating “A, B, C”
■ discrepancy between duplicate records
■ Intentional (e.g., disguised missing data)
■ Jan. 1 as everyone’s birthday?
7
Incomplete (Missing) Data
■ Data is not always available
■ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
■ Missing data may be due to
■ equipment malfunction
■ inconsistent with other recorded data and thus deleted
■ data not entered due to misunderstanding
■ certain data may not be considered important at the time
of entry
■ history or changes of the data not recorded
■ Missing data may need to be inferred
8
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill it in automatically with
■ a global constant : e.g., “unknown”, a new class?!
■ the attribute mean
■ the attribute mean for all samples belonging to the same
class: smarter
■ the most probable value: inference-based such as Bayesian
formula or decision tree
9
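A minimal sketch of the automatic fill-in strategies above, assuming a pandas DataFrame; the table, column names, and class labels are invented for illustration:

import pandas as pd
import numpy as np

# Hypothetical customer table with a missing income value
df = pd.DataFrame({
    "income": [31000.0, np.nan, 58000.0, 47000.0],
    "class":  ["budget", "budget", "premium", "premium"],
})

# Global constant (for nominal attributes, a label such as "unknown" could be used)
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all samples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean for samples belonging to the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)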
Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments
■ data entry problems
■ data transmission problems
■ technology limitation
■ inconsistency in naming conventions
■ Other data problems which require data cleaning
■ duplicate records
■ incomplete data
■ inconsistent data
10
How to Handle Noisy Data?
■ Binning
■ first sort data and partition into (equal-frequency) bins
■ then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
■ Regression
■ smooth by fitting the data into regression functions
■ Clustering
■ detect and remove outliers
■ Combined computer and human inspection
■ detect suspicious values and check by human (e.g., deal
with possible outliers)
11
Chapter 3: Data Preprocessing
■ Data Preprocessing: An Overview
■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary
12
Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ Schema integration: e.g., A.cust-id ≡ B.cust-#
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from different sources are
different
■ Possible reasons: different representations, different scales, e.g., metric
vs. British units
13
Handling Redundancy in Data Integration
■ Redundant data often occur when multiple databases are integrated
■ Object identification: The same attribute or object may
have different names in different databases
■ Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
■ Redundant attributes may be detected by correlation analysis and
covariance analysis
■ Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
14
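As a small illustration of correlation/covariance analysis for spotting a derivable attribute, here is a NumPy sketch; the revenue figures are made up:

import numpy as np

rng = np.random.default_rng(3)

# monthly_revenue and annual_revenue: the second is derivable from the first
monthly_revenue = np.array([10.0, 12.0, 9.5, 14.0, 11.0])
annual_revenue = 12 * monthly_revenue + rng.normal(0, 0.1, size=5)

# Covariance and Pearson correlation coefficient
cov = np.cov(monthly_revenue, annual_revenue)[0, 1]
corr = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]

print(f"covariance = {cov:.2f}, correlation = {corr:.3f}")
# A correlation close to +1 (or -1) suggests one of the attributes is redundant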
Chapter 3: Data Preprocessing
■ Data Preprocessing: An Overview
■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary
15
Data Reduction Strategies
■ Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the same) analytical
results
■ Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the complete
data set.
■ Data reduction strategies
■ Dimensionality reduction, e.g., remove unimportant attributes
■ Wavelet transforms
■ Principal Components Analysis (PCA)
■ Feature subset selection, feature creation
■ Numerosity reduction (some simply call it: Data Reduction)
■ Regression and Log-Linear Models
■ Histograms, clustering, sampling
■ Data cube aggregation
■ Data compression
16
Data Reduction 1: Dimensionality
Reduction
■ Curse of dimensionality
■ When dimensionality increases, data becomes increasingly sparse
■ Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
■ The possible combinations of subspaces will grow exponentially
■ Dimensionality reduction
■ Avoid the curse of dimensionality
■ Help eliminate irrelevant features and reduce noise
■ Reduce time and space required in data mining
■ Allow easier visualization
■ Dimensionality reduction techniques
■ Wavelet transforms
■ Principal Component Analysis
■ Supervised and nonlinear techniques (e.g., feature selection)
17
Mapping Data to a New Space
■ Fourier transform
■ Wavelet transform
[Figure: two sine waves; the same two sine waves plus noise; the resulting frequency spectrum]
18
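The slide's figure (two sine waves, a noisy version, and the frequency view) can be approximated with a short NumPy sketch; the 50 Hz and 120 Hz frequencies and the noise level are arbitrary assumptions:

import numpy as np

fs = 1000                          # sampling rate (Hz), arbitrary
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 120 * t)   # two sine waves
noisy = signal + 0.5 * np.random.randn(t.size)                      # add noise

# Map the noisy time-domain data to the frequency domain (Fourier transform)
spectrum = np.abs(np.fft.rfft(noisy)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The two dominant peaks still appear near 50 Hz and 120 Hz despite the noise
for f in (50, 120):
    idx = np.argmin(np.abs(freqs - f))
    print(f"magnitude near {f} Hz: {spectrum[idx]:.3f}")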
What Is Wavelet Transform?
■ Decomposes a signal into
different frequency subbands
■ Applicable to n-dimensional
signals
■ Data are transformed to
preserve relative distance
between objects at different
levels of resolution
■ Allow natural clusters to
become more distinguishable
■ Used for image compression
19
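To make the idea of frequency subbands concrete, here is a toy one-level Haar decomposition written by hand in NumPy (a real application would likely use a wavelet library such as PyWavelets):

import numpy as np

def haar_level(x):
    """One level of the Haar transform: pairwise averages give the
    low-frequency approximation, pairwise differences the high-frequency detail."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
approx, detail = haar_level(data)
print("approximation:", approx)   # coarse, lower-resolution view of the signal
print("detail:       ", detail)   # fine-resolution differences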
Principal Component Analysis (PCA)
■ Find a projection that captures the largest amount of variation in
data
■ The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
[Figure: 2-D data points with the principal component directions plotted over the original x1, x2 axes]
20
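A compact sketch of the PCA recipe stated above, on synthetic 3-D data: center the data, take the eigenvectors of the covariance matrix, and project onto the highest-variance directions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)   # third attribute nearly redundant

# 1. Center the data
Xc = X - X.mean(axis=0)

# 2. Eigen-decomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order

# 3. Keep the k eigenvectors with the largest eigenvalues and project onto them
k = 2
components = eigvecs[:, ::-1][:, :k]                 # largest-variance directions first
X_reduced = Xc @ components

print("variance captured:", eigvals[::-1][:k].sum() / eigvals.sum())
print("reduced shape:", X_reduced.shape)             # (200, 2)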
Attribute Subset Selection
■ Another way to reduce dimensionality of data
■ Redundant attributes
■ Duplicate much or all of the information contained in one or
more other attributes
■ E.g., purchase price of a product and the amount of sales
tax paid
■ Irrelevant attributes
■ Contain no information that is useful for the data mining
task at hand
■ E.g., students' ID is often irrelevant to the task of predicting
students' GPA
21
Attribute Creation (Feature Generation)
■ Create new attributes (features) that can capture the important
information in a data set more effectively than the original
ones
■ Three general methodologies
■ Attribute extraction
■ Domain-specific
■ Mapping data to new space (see: data reduction)
■ E.g., Fourier transformation, wavelet transformation,
manifold approaches (not covered)
■ Attribute construction
■ Combining features (see: discriminative frequent
patterns in Chapter on “Advanced Classification”)
■ Data discretization
22
Similarity and Dissimilarity
● Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
● Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
● Proximity refers to a similarity or dissimilarity
Euclidean Distance
● Euclidean Distance
dist(p, q) = √( Σ_{k=1}^{n} (p_k − q_k)² )
Where n is the number of dimensions (attributes) and p_k and q_k
are, respectively, the kth attributes (components) of data
objects p and q.
● Standardization is necessary if scales differ.
Euclidean Distance
[Example: a small set of 2-D points and their pairwise distance matrix]
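A brief NumPy sketch of the Euclidean distance and of building a pairwise distance matrix; the four 2-D points stand in for the omitted example:

import numpy as np

points = np.array([[0.0, 2.0],
                   [2.0, 0.0],
                   [3.0, 1.0],
                   [5.0, 1.0]])

def euclidean(p, q):
    # dist(p, q) = sqrt(sum_k (p_k - q_k)^2)
    return np.sqrt(np.sum((p - q) ** 2))

# Pairwise distance matrix
n = len(points)
dist = np.array([[euclidean(points[i], points[j]) for j in range(n)]
                 for i in range(n)])
print(np.round(dist, 3))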
Minkowski Distance
● Minkowski Distance is a generalization of Euclidean
Distance
dist(p, q) = ( Σ_{k=1}^{n} |p_k − q_k|^r )^(1/r)
Where r is a parameter, n is the number of dimensions
(attributes), and p_k and q_k are, respectively, the kth attributes
(components) of data objects p and q.
Minkowski Distance: Examples
● r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the
number of bits that are different between two binary vectors
● r = 2. Euclidean distance
● r → ∞. “supremum” (Lmax norm, L∞ norm) distance.
– This is the maximum difference between any component of the vectors
● Do not confuse r with n, i.e., all these distances are defined
for all numbers of dimensions.
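The r = 1, r = 2, and r → ∞ cases can be coded directly from the Minkowski formula; a small sketch with two illustrative points:

import numpy as np

def minkowski(p, q, r):
    """Minkowski distance; r=1 gives Manhattan (L1), r=2 Euclidean (L2)."""
    return np.sum(np.abs(p - q) ** r) ** (1.0 / r)

def supremum(p, q):
    """Limit r -> infinity: the maximum component-wise difference."""
    return np.max(np.abs(p - q))

p = np.array([0.0, 2.0])
q = np.array([3.0, 1.0])
print("L1 :", minkowski(p, q, 1))    # 3 + 1 = 4
print("L2 :", minkowski(p, q, 2))    # sqrt(10) ~ 3.162
print("Loo:", supremum(p, q))        # max(3, 1) = 3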
Similarity Between Binary Vectors
● A common situation is that objects p and q have only binary
attributes
● Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
● Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p= 1000000000
q= 0000001001
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
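The SMC/Jaccard computation above, reproduced in a few lines of NumPy:

import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m01 = np.sum((p == 0) & (q == 1))   # 2
m10 = np.sum((p == 1) & (q == 0))   # 1
m00 = np.sum((p == 0) & (q == 0))   # 7
m11 = np.sum((p == 1) & (q == 1))   # 0

smc = (m11 + m00) / (m01 + m10 + m11 + m00)   # 0.7
jaccard = m11 / (m01 + m10 + m11)             # 0.0
print(smc, jaccard)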
Cosine Similarity
● If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 ∙ d2) / ||d1|| ||d2|| ,
where ∙ indicates vector dot product and || d || is the length of vector d.
● Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 ∙ d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = ?
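Carrying out the computation (the "?" above) in NumPy gives roughly 0.315:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # dot = 5, ||d1|| = 6.481, ||d2|| = 2.449  ->  ~0.3150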
Data Reduction 2: Numerosity
Reduction
■ Reduce data volume by choosing alternative, smaller forms of
data representation
■ Parametric methods (e.g., regression)
■ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
■ Ex.: Log-linear models—obtain the value at a point in m-D
space as the product of appropriate marginal subspaces
■ Non-parametric methods
■ Do not assume models
■ Major families: histograms, clustering, sampling, …
31
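As a sketch of the parametric idea, the code below fits a simple linear regression to synthetic data and keeps only the two model parameters in place of the raw points:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 1000)
y = 3.0 * x + 5.0 + rng.normal(0, 0.5, size=x.size)   # roughly linear data

# Fit y ~ w1 * x + w0 and store only (w1, w0) instead of the 1000 points
w1, w0 = np.polyfit(x, y, deg=1)
print(f"stored parameters: slope={w1:.2f}, intercept={w0:.2f}")

# Any value can later be approximated from the model
print("approximate y at x=4:", w1 * 4 + w0)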
Clustering
■ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
■ Can be very effective if data is clustered but not if data is
“smeared”
■ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms
■ Cluster analysis will be studied in depth in Chapter 10
32
Sampling
■ Sampling: obtaining a small sample s to represent the whole
data set N
■ Allows a mining algorithm to run in complexity that is potentially
sub-linear in the size of the data
■ Key principle: Choose a representative subset of the data
■ Simple random sampling may have very poor performance
in the presence of skew
■ Develop adaptive sampling methods, e.g., stratified
sampling
■ Note: Sampling may not reduce database I/Os (page at a time)
33
Types of Sampling
■ Simple random sampling
■ There is an equal probability of selecting any particular item
■ Sampling without replacement
■ Once an object is selected, it is removed from the population
■ Sampling with replacement
■ A selected object is not removed from the population
■ Stratified sampling:
■ Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of
the data)
■ Used in conjunction with skewed data
34
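A short pandas sketch of the sampling schemes listed above; the DataFrame and the stratum attribute are invented for illustration:

import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "stratum": rng.choice(["young", "middle", "senior"], size=1000, p=[0.6, 0.3, 0.1]),
})

# Simple random sampling without replacement (SRSWOR)
srswor = df.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR)
srswr = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: ~10% from every stratum, so skewed groups stay represented
stratified = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.10, random_state=0))

print(len(srswor), len(srswr), stratified["stratum"].value_counts().to_dict())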
Sampling: With or without Replacement
[Figure: raw data reduced by SRSWOR (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement)]
35
Sampling: Cluster or Stratified Sampling
[Figure: raw data vs. a cluster/stratified sample]
36
Chapter 3: Data Preprocessing
■ Data Preprocessing: An Overview
■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration
■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary
37
Data Transformation
■ A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
■ Methods
■ Smoothing: Remove noise from data
■ Attribute/feature construction
■ New attributes constructed from the given ones
■ Aggregation: Summarization, data cube construction
■ Normalization: Scaled to fall within a smaller, specified range
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
■ Discretization: Concept hierarchy climbing
38
Normalization
■ Min-max normalization: to [new_min_A, new_max_A]
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
■ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.709
■ Z-score normalization (μ: mean, σ: standard deviation): v' = (v − μ) / σ
■ Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000) / 16,000 = 1.188
■ Normalization by decimal scaling: v' = v / 10^j,
where j is the smallest integer such that Max(|v'|) < 1
39
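The three normalizations, applied with NumPy to a small income array (only the $73,000 and $12,000–$98,000 values come from the slide; the rest are made up):

import numpy as np

income = np.array([12000.0, 47000.0, 73000.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
minmax = (income - income.min()) / (income.max() - income.min())
print(np.round(minmax, 3))                      # $73,000 -> ~0.709

# Z-score normalization with the slide's parameters (mu = 54,000, sigma = 16,000)
mu, sigma = 54000.0, 16000.0
print(np.round((income - mu) / sigma, 3))       # $73,000 -> ~1.188

# Decimal scaling: divide by 10^j, with j the smallest integer so max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max())))
print(income / 10 ** j)                         # j = 5, so $73,000 -> 0.73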
Discretization
■ Three types of attributes
■ Nominal—values from an unordered set, e.g., color, profession
■ Ordinal—values from an ordered set, e.g., military or academic rank
■ Numeric—numbers, e.g., integer or real values
■ Discretization: Divide the range of a continuous attribute into intervals
■ Interval labels can then be used to replace actual data values
■ Reduce data size by discretization
■ Supervised vs. unsupervised
■ Split (top-down) vs. merge (bottom-up)
■ Discretization can be performed recursively on an attribute
■ Prepare for further analysis, e.g., classification
40
Data Discretization Methods
■ Typical methods: All the methods can be applied recursively
■ Binning
■ Top-down split, unsupervised
■ Histogram analysis
■ Top-down split, unsupervised
■ Clustering analysis (unsupervised, top-down split or bottom-
up merge)
■ Decision-tree analysis (supervised, top-down split)
■ Correlation (e.g., χ2) analysis (unsupervised, bottom-up
merge)
41
Simple Discretization: Binning
■ Equal-width (distance) partitioning
■ Divides the range into N intervals of equal size: uniform grid
■ if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B − A)/N.
■ The most straightforward, but outliers may dominate presentation
■ Skewed data is not handled well
■ Equal-depth (frequency) partitioning
■ Divides the range into N intervals, each containing approximately the
same number of samples
■ Good data scaling
■ Managing categorical attributes can be tricky
42
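In pandas, equal-width and equal-depth partitioning correspond roughly to cut and qcut; a sketch on the price data used on the next slide:

import pandas as pd

price = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (34 - 4) / 3 = 10 (counts may be uneven)
equal_width = pd.cut(price, bins=3)

# Equal-depth (equal-frequency): 3 bins with ~4 values each
equal_depth = pd.qcut(price, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())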
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
43
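A sketch reproducing the smoothing above: split the sorted prices into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary:

import numpy as np

price = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Three equal-frequency (equi-depth) bins of four values each
bins = np.split(np.sort(price), 3)

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [np.full_like(b, round(b.mean())) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of its bin's min/max
def to_boundary(b):
    lo, hi = b.min(), b.max()
    return np.where(b - lo <= hi - b, lo, hi)

by_bounds = [to_boundary(b) for b in bins]

print([b.tolist() for b in by_means])   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print([b.tolist() for b in by_bounds])  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]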