AI351 Lecture 1
Data Types and Forms
Attribute-value data: a table with attribute columns A1, A2, …, An and a class attribute C
Data types
numeric, categorical
temporal
Other kinds of data
distributed data
images, audio/video
Multi-Dimensional Measure of Data Quality
A well-accepted multi-dimensional view:
Accuracy
Completeness
Consistency
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation that is smaller in volume but produces the same or similar analytical results
Data discretization (for numerical data)
Missing Data
Data is not always available
E.g., many tuples have no recorded values for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the
time of entry
history or changes of the data were not registered
How to Handle Missing Data?
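One common strategy is to fill in a missing numeric value with the attribute mean. A minimal Python sketch, assuming missing values are stored as None (the attribute name and helper function are illustrative):

def fill_missing_with_mean(rows, attr):
    """Replace None entries of `attr` with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[attr] is None:
            r[attr] = mean
    return rows

data = [{"income": 30000}, {"income": None}, {"income": 50000}]
fill_missing_with_mean(data, "income")   # the None becomes 40000.0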
Noisy Data
Noise: random error or variance in a measured
variable.
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
etc.
Other data problems that require data cleaning
duplicate records, incomplete data, inconsistent data
How to Handle Noisy Data?
Binning method (see the sketch after this list):
first sort data and partition into (equi-depth) bins
then smooth by bin means, bin medians, bin boundaries, etc.
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human
(e.g., deal with possible outliers)
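A minimal Python sketch of the binning method, assuming equi-depth bins and smoothing by bin means (the values and bin depth are illustrative):

def smooth_by_bin_means(values, depth):
    """Sort values, cut them into bins of `depth` items, replace each value by its bin mean."""
    values = sorted(values)
    smoothed = []
    for i in range(0, len(values), depth):
        bin_values = values[i:i + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

Smoothing by bin boundaries would instead replace each value by the closer of its bin's minimum and maximum.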
Outlier Removal
Data points inconsistent with the majority of the data
Different kinds of outliers
Valid: e.g., a CEO's salary
Noisy: e.g., a person's age = 200; widely deviated points
Removal methods
Clustering
Curve-fitting
Hypothesis-testing with a given model
Data Transformation: Normalization
min-max normalization:
v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
z-score normalization:
v' = (v - mean_A) / stand_dev_A
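A minimal Python sketch of both normalizations (the income values are illustrative):

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map values linearly from [min_A, max_A] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Center values on the mean and scale by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [30000, 40000, 50000]
print(min_max_normalize(incomes))   # [0.0, 0.5, 1.0]
print(z_score_normalize(incomes))   # approximately [-1.22, 0.0, 1.22]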
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of attributes (features) that is
sufficient for the data mining task.
Heuristic methods (due to the exponential number of choices):
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
etc. (a forward-selection sketch follows)
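A minimal Python sketch of step-wise forward selection; `score` stands for any user-supplied evaluation of a feature subset (e.g., cross-validated accuracy of a model), and the toy scoring function below is purely illustrative:

def forward_selection(features, score):
    """Greedily add the feature that most improves `score`; stop when nothing helps."""
    selected, best = [], float("-inf")
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_f = max((score(selected + [f]), f) for f in candidates)
        if top_score <= best:
            break   # no remaining feature improves the score
        selected.append(top_f)
        best = top_score
    return selected

useful = {"age", "income"}
toy_score = lambda subset: len(set(subset) & useful) - 0.1 * len(subset)
print(forward_selection(["age", "income", "zip", "id"], toy_score))   # ['income', 'age']

Backward elimination runs the same greedy loop in reverse, starting from all attributes and removing the least useful one at each step.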
Histograms
A popular data reduction technique
Divide data into buckets and store the average (or sum) for each bucket
[Figure: histogram with equal-width buckets from 10,000 to 100,000; bucket counts range from 0 to 40]
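A minimal Python sketch of equal-width histogram reduction, storing only a count and mean per bucket (the values and bucket width are illustrative):

def histogram_reduce(values, bucket_width):
    """Group values into equal-width buckets; keep only (count, mean) per bucket."""
    buckets = {}
    for v in values:
        start = (v // bucket_width) * bucket_width
        buckets.setdefault(start, []).append(v)
    return {start: (len(vs), sum(vs) / len(vs)) for start, vs in sorted(buckets.items())}

prices = [12000, 14000, 28000, 31000, 33000, 39000]
print(histogram_reduce(prices, 10000))
# {10000: (2, 13000.0), 20000: (1, 28000.0), 30000: (3, 34333.33...)}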
Clustering
Partition data set into clusters, and one
can store cluster representation only
Can be very effective if data is clustered
but not if data is “smeared”
There are many choices of clustering
definitions and clustering algorithms. We
will discuss them later.
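As a hedged illustration of storing only a cluster representation, a minimal one-dimensional k-means sketch that reduces the data to k centroids (k and the data are illustrative; clustering algorithms themselves are covered later):

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd iterations on 1-D data; returns k centroids as the reduced representation."""
    s = sorted(values)
    centroids = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]   # spread initial guesses
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            groups[nearest].append(v)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

data = [1, 2, 2, 3, 50, 51, 52, 99, 100]
print(kmeans_1d(data, 3))   # [2.0, 51.0, 99.5]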
Sampling
Choose a representative subset of the data
Simple random sampling may have poor
performance in the presence of skew.
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or
subpopulation of interest) in the overall
database
Used in conjunction with skewed data
[Figure: raw data vs. a cluster/stratified sample]
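A minimal Python sketch of stratified sampling, preserving class proportions (the records and class attribute are illustrative):

import random

def stratified_sample(records, class_key, fraction, seed=0):
    """Sample roughly `fraction` of each class so class proportions are preserved."""
    random.seed(seed)
    strata = {}
    for r in records:
        strata.setdefault(r[class_key], []).append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

data = [{"cls": "P"}] * 8 + [{"cls": "N"}] * 2   # skewed: 80% P, 20% N
print(len(stratified_sample(data, "cls", 0.5)))  # 5 records: 4 P and 1 N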
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization
Summary
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into
intervals because some data mining algorithms only
accept categorical attributes.
Some techniques:
Binning methods – equal-width, equal-frequency
Entropy-based methods
Discretization and Concept Hierarchy
Discretization
reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can then
be used to replace actual data values
Concept hierarchies
reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior)
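A minimal Python sketch of a concept hierarchy for age; the labels come from the example above, while the cut points are illustrative assumptions:

def age_concept(age):
    """Replace a numeric age value by a higher-level concept."""
    if age < 40:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in (25, 45, 70)])   # ['young', 'middle-aged', 'senior']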
Binning
Attribute values (for one attribute e.g., age):
0, 4, 12, 16, 16, 18, 24, 26, 28
Equi-width binning – for a bin width of e.g., 10:
Bin 1: 0, 4 in the [-∞, 10) bin
Bin 2: 12, 16, 16, 18 in the [10, 20) bin
Bin 3: 24, 26, 28 in the [20, +∞) bin
(-∞ denotes negative infinity, +∞ positive infinity)
Equi-frequency binning – for a bin density of e.g., 3:
Bin 1: 0, 4, 12 in the [-∞, 14) bin
Bin 2: 16, 16, 18 in the [14, 21) bin
Bin 3: 24, 26, 28 in the [21, +∞) bin
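A minimal Python sketch of both schemes, reproducing the bins above:

def equi_width_bins(values, width):
    """Group values into bins of fixed width (bin i covers [i*width, (i+1)*width))."""
    bins = {}
    for v in sorted(values):
        bins.setdefault(v // width, []).append(v)
    return list(bins.values())

def equi_frequency_bins(values, depth):
    """Group sorted values into bins holding `depth` values each."""
    s = sorted(values)
    return [s[i:i + depth] for i in range(0, len(s), depth)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equi_width_bins(ages, 10))      # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equi_frequency_bins(ages, 3))   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]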
Entropy-based (1)
Given attribute-value/class pairs:
(0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
Entropy-based binning via binarization:
Intuitively, find the best split so that the bins are as pure as possible
Formally characterized by maximal information gain.
Let S denote the above 9 pairs, p=4/9 be fraction
of P pairs, and n=5/9 be fraction of N pairs.
Entropy(S) = - p log2 p - n log2 n (logs are base 2).
Smaller entropy means the set is relatively pure; the smallest value is 0.
Larger entropy means the set is mixed; the largest value (for two classes) is 1.
Entropy-based (2)
Let v be a possible split. Then S is divided into two sets:
S1: value <= v and S2: value > v
Information of the split:
I(S1,S2) = (|S1|/|S|) Entropy(S1)+ (|S2|/|S|) Entropy(S2)
Information gain of the split:
Gain(v,S) = Entropy(S) – I(S1,S2)
Goal: split with maximal information gain.
Possible splits: midpoints between any two consecutive values.
For v = 14: I(S1, S2) = 0 + (6/9) * Entropy(S2) = (6/9) * 0.65 = 0.433
Gain(14, S) = Entropy(S) - 0.433
Since Entropy(S) is fixed, maximizing Gain means minimizing I.
The best split is found after examining all possible splits.
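A minimal Python sketch of the whole procedure, reproducing the computation above:

from math import log2

def entropy(labels):
    """Binary entropy of a list of class labels ('P'/'N')."""
    if not labels:
        return 0.0
    p = labels.count("P") / len(labels)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def best_split(pairs):
    """Try midpoints between consecutive distinct values; return the (split, I) with minimal I."""
    values = sorted({v for v, _ in pairs})
    best = None
    for a, b in zip(values, values[1:]):
        v = (a + b) / 2
        s1 = [c for x, c in pairs if x <= v]
        s2 = [c for x, c in pairs if x > v]
        info = len(s1) / len(pairs) * entropy(s1) + len(s2) / len(pairs) * entropy(s2)
        if best is None or info < best[1]:   # minimal I is maximal Gain
            best = (v, info)
    return best

pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]
print(best_split(pairs))   # (14.0, 0.433...): the best split is at v = 14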
Summary
Data preparation is a big issue for data mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been proposed, but this is still an active area of research