Data Preparation: KIT306/606: Data Analytics A/Prof. Quan Bai University of Tasmania
• Graph-based data:
• Data with relationships among instances
• We can use graphs to capture relationships among data instances
• E.g., social network data
• Ordered data: the attributes of data instances have relationships
which can be ordered, e.g., in time or space.
Today’s lecture:
Data Preparation
Big data is good?
Real-world data: NOT PERFECT
KDD process
Data preparation
[Figure: the KDD process]
Raw Data → (Select datasets) → Relevant Data → (Pre-processing) → Processed Data → (Transformation) → Transformed Data → (Data mining) → Patterns
Tasks for Data Preparation
• Data (set) Selection
• i.e., selection of the relevant observations and the variables associated with them
• Data cleaning
• i.e., removing or correcting missing or wrong variable values
• Data Transformation
• Changing the format of data according to the analysis
• Transforming the data’s dimensionality
• Data reduction and discretization
• Feature selection
• Sampling
Quality is important
• Statistical Imputation: Fill in with most likely value (using regression, decision
trees, most similar records, etc.)
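One common "most likely value" strategy is mean imputation: fill each missing entry with the mean of the observed values. A minimal sketch (the function name and toy data are illustrative, not from the lecture):

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 27, None, 24]
print(impute_mean(ages))  # missing ages filled with 26.25, the mean of the rest
```

Regression or most-similar-record imputation follows the same pattern but predicts a per-record value instead of a single global mean.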
• If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B − A)/N.
• Limitations?
• Example:
• [1, 2.2, 1.8, 3.6, 2.3, 1.9, 99, 3.6, 2.8, 3.2 ]
• Equal-width partitioning, 3 bins, W = (99 − 1)/3 ≈ 32.7:
• Bin 1 [1, 33.7): 9 instances
• Bin 2 [33.7, 66.3): 0 instances
• Bin 3 [66.3, 99]: 1 instance (99)
• Equal-depth partitioning, 3 bins:
• Bin 1 (3): (1, 1.8, 1.9)
• Bin 2 (3): (2.2, 2.3, 2.8)
• Bin 3 (4): (3.2, 3.6, 3.6, 99)
• Mean: Bin 1: 1.57; Bin 2: 2.43; Bin 3: 27.35
• Median: Bin 1: 1.8; Bin 2: 2.3; Bin 3: 3.6
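The two partitioning schemes above can be sketched in code. `equal_width_bins` and `equal_depth_bins` are hypothetical helper names; any leftover instances after even division are placed in the last bins, matching the example:

```python
def equal_width_bins(data, n):
    """Partition data into n intervals of equal width W = (B - A) / N."""
    a, b = min(data), max(data)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in data:
        i = min(int((v - a) / w), n - 1)  # clamp the maximum value into the last bin
        bins[i].append(v)
    return bins

def equal_depth_bins(data, n):
    """Partition sorted data into n bins holding (approximately) equal counts."""
    s = sorted(data)
    k, r = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        size = k + (1 if i >= n - r else 0)  # remainder goes to the last bins
        bins.append(s[start:start + size])
        start += size
    return bins

data = [1, 2.2, 1.8, 3.6, 2.3, 1.9, 99, 3.6, 2.8, 3.2]
print([len(b) for b in equal_width_bins(data, 3)])  # [9, 0, 1]: the outlier 99 dominates
print(equal_depth_bins(data, 3))  # three bins of sizes 3, 3, 4
```

Applied to the exercise data {0, 4, 12, 16, 16, 18, 24, 26, 28}, the same equal-depth helper yields {0, 4, 12}, {16, 16, 18}, {24, 26, 28}.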
Exercise:
Data = {0,4,12,16,16,18,24,26,28}, 3 bins
• Equal-width (distance) partitioning:
Bin1={?}, Bin2={?}, Bin3={?}
• Equal-depth (frequency) partitioning:
Bin1={?}, Bin2={?}, Bin3={?}
Noisy Data Handling (1): Binning
• Data = {0,4,12,16,16,18,24,26,28}
• Equal-width (distance) partitioning, W = (28 − 0)/3 ≈ 9.3:
• Bin1 [0, 9.3): 0, 4
• Bin2 [9.3, 18.7): 12, 16, 16, 18
• Bin3 [18.7, 28]: 24, 26, 28
• Equal-depth (frequency) partitioning:
• Bin1 (−∞, 14): 0, 4, 12
• Bin2 [14, 21): 16, 16, 18
• Bin3 [21, +∞): 24, 26, 28
Binning Methods for Data Smoothing
• E.g., sorted data for price (in dollars): {4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34}
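One standard smoothing option is smoothing by bin means: split the sorted values into equal-depth bins and replace every value with the mean of its bin. A sketch under that assumption (the helper name is illustrative):

```python
def smooth_by_bin_means(sorted_data, n_bins):
    """Replace each value with the mean of its equal-depth bin."""
    size = len(sorted_data) // n_bins
    out = []
    for i in range(n_bins):
        bin_vals = sorted_data[i * size:(i + 1) * size]
        mean = sum(bin_vals) / len(bin_vals)
        out.extend([mean] * len(bin_vals))
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 3))  # bins of 4 smoothed to 9.0, 22.75, 29.25
```

Smoothing by bin medians or by bin boundaries works the same way, only the replacement value differs.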
Outliers
Correcting Inconsistent Data
• Some types of inconsistencies are easy to detect.
–e.g., a person’s height should not be negative
• In other cases, it can be necessary to consult an external
source of information
Data Integration
• Data integration
• combines data from multiple sources into a coherent store
• Schema integration
• integrate metadata from different sources (e.g., A.cust-id ≡ B.cust-#)
• Entity identification problem
• identify real world entities from multiple data sources,
(e.g. UTAS = University of Tasmania)
• Detecting and resolving data value conflicts
• for the same real world entity, attribute values from different sources are different
• possible reasons: different representations, different scales, e.g., metric vs. British
units
Data integration: Redundancy Handling
• Redundant data often occur when integrating multiple databases
• Object identification: The same attribute or object may have different names
in different databases
• Derivable data: One attribute may be a “derived” attribute in another table (e.g., annual revenue derived from monthly revenues)
• Redundant attributes may be able to be detected by correlation
analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality
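Correlation analysis for redundancy can be as simple as computing Pearson's r between pairs of numeric attributes; a value near ±1 suggests one of the two is redundant. A self-contained sketch with toy data (the attribute names are invented for illustration):

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

age = [20, 25, 30, 35, 40]
years_since_birth = [20.1, 25.0, 30.2, 34.9, 40.1]  # near-duplicate attribute
r = pearson_r(age, years_since_birth)
print(round(r, 4))  # close to 1.0, so one of the two attributes is likely redundant
```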
Data Transformation
• Consolidate data into forms suitable for data mining
• Aggregation: (summarisation)
• Generalisation (replace data with higher level concepts, e.g. address details →
city)
• Normalisation (scale to within a specified range)
• Min-max (e.g. into 0...1 interval)
• Z-score or zero-mean (based on mean and standard deviation of an attribute)
• Decimal scaling (move decimal point for all values)
• Important to save normalisation parameters (in meta-data repository)
Aggregation: sometimes less is more
[Figure: average yearly precipitation shows less variability than average monthly precipitation]
Generalisation
• Data generalisation is a process that abstracts a large set of task-relevant data in a
database from a relatively low conceptual level to higher levels.
• Like super-classes and subclasses in OO programming
ID | Address                               ID | City
1  | Bennelong Point, Sydney NSW 2000   →  1  | Sydney
2  | Sydney Harbour Bridge, Sydney NSW  →  2  | Sydney
3  | Hunter Street, Hobart, TAS 7000    →  3  | Hobart
4  | 17 Liverpool St, Hobart            →  4  | Hobart
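A toy sketch of this address-to-city generalisation, assuming a known list of city names stands in for the concept hierarchy (the function name and city list are illustrative):

```python
# Assumed concept hierarchy: a full address generalises to the city it mentions.
KNOWN_CITIES = ["Sydney", "Hobart"]

def generalise_to_city(address):
    """Replace a detailed address with its higher-level concept (the city)."""
    for city in KNOWN_CITIES:
        if city.lower() in address.lower():
            return city
    return "Unknown"

addresses = ["Bennelong Point, Sydney NSW 2000", "17 Liverpool St, Hobart"]
print([generalise_to_city(a) for a in addresses])  # ['Sydney', 'Hobart']
```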
Normalisation
o Min-max normalization: to [new_minA, new_maxA]
  v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
  • Ex. (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
o Z-score normalization, where μ: mean, σ: standard deviation (SD):
  v' = (v − μA) / σA
  • Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
o Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
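The three normalisation formulas can be checked with a short script; the function names are illustrative:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalisation into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Zero-mean normalisation using the attribute's mean and SD."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10^j, j chosen so every scaled value has absolute value < 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))  # 0.716, as in the slide example
print(round(z_score(73600, 54000, 16000), 3))  # 1.225, as in the slide example
print(decimal_scaling([-986, 917]))            # j = 3 gives [-0.986, 0.917]
```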
• Find correlated, redundant or derived attributes (e.g. age and date of birth)
Sampling approaches
• Random sampling
• Progressive (or adaptive) sampling:
• Start from a small sample
• Then increase the sample size until sufficient
[Figure: mining accuracy vs. sample size]
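A sketch of progressive sampling under these assumptions: the stopping criterion is supplied by the caller, and the toy criterion here compares the sample mean against the (normally unknown) population mean purely for illustration:

```python
import random

def progressive_sample(data, accurate_enough, start=10, growth=2, seed=0):
    """Grow a random sample until accurate_enough(sample) holds,
    or the whole dataset has been used."""
    rng = random.Random(seed)
    size = start
    while True:
        sample = rng.sample(data, min(size, len(data)))
        if accurate_enough(sample) or size >= len(data):
            return sample
        size *= growth  # double the sample size and try again

data = list(range(1000))
true_mean = sum(data) / len(data)
# Toy criterion: stop once the sample mean is within 10 of the population mean.
sample = progressive_sample(data, lambda s: abs(sum(s) / len(s) - true_mean) < 10)
print(len(sample))
```

In practice the criterion would be something observable, e.g. the model's accuracy on a validation set no longer improving as the sample grows.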
Feature selection
• Select only a subset of the features
• Remove redundant features
• Remove irrelevant features
[Figure: feature selection loop: Features → Selection strategy → Subset of Features → Evaluation → Stopping criterion → Selected features]
• Feature weighting
• Assign weightings based on domain knowledge
• Many machine learning algorithms can determine the
weightings automatically.
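One simple automatic filter drops near-constant features, since an attribute that barely varies cannot discriminate between instances. A sketch (the function name, threshold, and toy data are illustrative):

```python
import statistics

def select_features(rows, names, min_variance=1e-9):
    """Keep only features whose variance exceeds a threshold;
    a constant column carries no information for the mining step."""
    cols = list(zip(*rows))
    return [n for n, col in zip(names, cols)
            if statistics.pvariance(col) > min_variance]

rows = [(1.0, 5, 0.2), (2.0, 5, 0.4), (3.0, 5, 0.6)]
print(select_features(rows, ["height", "constant_flag", "weight"]))
# 'constant_flag' is dropped; 'height' and 'weight' are kept
```

Redundant (highly correlated) features can be filtered the same way by adding a pairwise correlation check.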
Questions