There are several types of data that require preparation before analysis, including transactional, graph-based, ordered, and structured vs unstructured data. Real-world data often contains issues like incompleteness, inconsistencies, and noise. Common techniques for data preparation include data cleaning to handle missing values through imputation, resolving inconsistencies by integrating data from multiple sources, and smoothing noisy data using binning, regression, or clustering to identify and remove outliers. The goal of data preparation is to produce clean, consistent data suitable for analysis and mining patterns.

Data preparation

KIT306/606: Data Analytics


A/Prof. Quan Bai
University of Tasmania
Last week
Types of attributes
• Which operations can we perform on each attribute type?
  • Distinctness: =, !=  →  Nominal
  • Order: <, >, <=, >=  →  Ordinal
  • Addition: +, −  →  Interval
  • Multiplication: *, /  →  Ratio
Types of data sets
• Transaction data: each record involves a set of items.
  • E.g., shopping basket data

• Graph-based data:
• Data with relationships among instances
• We can use graph to capture relationships
among data instances
• E.g., social network data
• Ordered data: the attributes of data instances have relationships which can be ordered, e.g., in time or space.

• Sequence data
  • A data set that is a sequence of individual entities
  • E.g., natural language data: “London is the capital city of the UK”

• Time series data:
  • A special type of sequential data
  • Each record is a time series, e.g., daily measurements
  • E.g., water level of a river in 2020

• Spatial data: data instances with spatial attributes, i.e., positions or areas.
  • E.g., weather data
Structured vs. unstructured data
• Structured data consists of clearly defined data types whose pattern makes them easily searchable, while unstructured data – “everything else” – consists of data that is usually not as easily searchable, including formats like audio, video, and social media postings. (datamation.com)
• Structured data normally has a pre-defined data schema; unstructured data has internal structure but is not organised via pre-defined data models or schemas.

Today’s lecture:

Data Preparation
Is big data good?
Real-world data is NOT PERFECT.
KDD process

[Figure: the KDD pipeline – Raw Data → (select datasets) → Relevant Data → (pre-processing) → Processed Data → (transformation) → Transformed Data → (data mining) → Patterns. Data preparation covers the selection, pre-processing and transformation steps.]
Tasks for Data Preparation
• Data (set) Selection
  • i.e., selection of relevant observations and the variables associated with them
• Data cleaning
• i.e. remove missing or wrong variable values
• Data Transformation
• Changing the format of data according to the analysis
• Transforming the data’s dimensionality
• Data reduction and discretization
• Feature selection
• Sampling
Quality is important

No quality data, no quality results!


But what is data quality?
Data Quality Issue
• How can we say the data are “dirty” or “clean”?
The following are data quality measures:
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable
• Consistency: some entries modified but others not
• Timeliness: timely updates
• Believability: how much can we trust that the data are correct?
• Interpretability: how easily can the data be understood?
The purpose of Data Preparation
• Real world data is dirty
• Incomplete data:
missing attributes, missing attribute values,
only aggregated data, etc.
• Inconsistent data:
different coding, different naming,
impossible values or out-of-range values
• Noisy data:
data containing errors, outliers, not accurate values
• For quality mining results, quality data is needed
• Preprocessing is an important step for successful data mining
Dirty Data Problems

1. Parsing text into fields (separator issues)
2. Naming conventions, entity resolution (ER): NYC vs New York
3. Missing required field (e.g. key field)
4. Different representations (2 vs Two)
5. Fields too long (get truncated)
6. Redundant records (exact match or other)
7. Formatting issues – especially dates
8. Licensing/privacy issues keep you from using the data as you would like
Data Cleaning
Incomplete (Missing) Data
• Data is not always available
• Missing data may be due to:
  • Equipment malfunction
  • Inconsistency with other recorded data, leading to deletion
  • Data not entered due to misunderstanding
  • Certain data not considered important at the time of entry
  • History or changes of the data not registered
• Missing data may need to be inferred
Incomplete (Missing) Data Handling
• Ignore the tuple (record): usually done when the class label (a value of the classification attribute) is missing (assuming the task is classification)
  • Not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + often infeasible
• Imputation: Fill in the missing value using means or other analysis
(methods to follow)
Imputation: Incomplete (Missing) Data Handling
• Fill in the missing value automatically
• Fill in with a global constant (e.g. “unknown” or “n/a”) – not recommended (a data mining algorithm will treat this as a normal value)
• Cold-deck imputation: fill in with the attribute mean or median

  ID   Age  Height  Gender  Grade
  001  23   180     M       HD
  002  20   165     F       DN
  003  31   --      M       CR
  004  29   167     F       PP
  005  21   175     M       HD
  006  23   178     F       HD
• Hot-deck imputation: identify the most similar case to the case with a missing value and substitute that case’s value for the missing value.

• Statistical imputation: fill in with the most likely value (using regression, decision trees, most similar records, etc.)

• Predictive imputation: use other attributes to predict the value (e.g. if a postcode is missing, use the suburb value and an external look-up table)
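As a concrete illustration, here is a minimal pandas sketch of some of these options, using the small student table from the slide above (record 003 has a missing Height). The group-mean variant is only a simple stand-in for true hot-deck imputation and is an assumption for illustration.

```python
import pandas as pd

# Student table from the slide; record 003 has a missing Height.
df = pd.DataFrame({
    "ID": ["001", "002", "003", "004", "005", "006"],
    "Age": [23, 20, 31, 29, 21, 23],
    "Height": [180, 165, None, 167, 175, 178],
    "Gender": ["M", "F", "M", "F", "M", "F"],
    "Grade": ["HD", "DN", "CR", "PP", "HD", "HD"],
})

# Ignore the tuple: drop any record that has a missing value.
dropped = df.dropna()

# Cold-deck style imputation: fill with the attribute mean (or median).
mean_filled = df.fillna({"Height": df["Height"].mean()})

# A simple stand-in for hot-deck imputation: fill with the mean Height of
# the most similar records (here, records with the same Gender).
hotdeck_like = df.copy()
hotdeck_like["Height"] = df.groupby("Gender")["Height"].transform(
    lambda s: s.fillna(s.mean()))

print(mean_filled)
```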
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
Noisy Data Handling
• Data smoothing: to create an approximating function that attempts
to capture important patterns in the data, while leaving out noise
or other fine-scale structures/rapid phenomena.
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
• Regression
• Smooth by fitting the data into regression functions
• Clustering
• Detect and remove outliers
• Combined computer and human inspection
• Detect suspicious values and check by human
• E.g. Deal with possible outliers
Noisy Data Handling (1): Binning
• Binning method:
• first sort data (values of the attribute we consider) and partition them into
(equal-depth) bins
• then apply one of the methods:
  • smooth by bin means (replace the values in a bin with the bin mean)
  • smooth by bin medians (replace the values in a bin with the bin median)
  • smooth by bin boundaries (replace each value in a bin with the closest bin boundary)
Noisy Data Handling (1): Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid

• If A and B are the lowest and highest values of the attribute, the width
of intervals will be: W = (B – A)/N.

• The most straightforward

• Limitations?
• Example:
  • [1, 2.2, 1.8, 3.6, 2.3, 1.9, 99, 3.6, 2.8, 3.2]
  • Equal-width, 3 bins:
    • Bin 1 [1–33]: 9 instances
    • Bin 2 [34–66]: 0 instances
    • Bin 3 [67–99]: 1 instance (99)
  • Equal-depth, 3 bins:
    • Bin 1 (3 values): 1, 1.8, 1.9
    • Bin 2 (3 values): 2.2, 2.3, 2.8
    • Bin 3 (4 values): 3.2, 3.6, 3.6, 99
    • Mean: Bin 1: 1.57; Bin 2: 2.43; Bin 3: 27.35
    • Median: Bin 1: 1.8; Bin 2: 2.3; Bin 3: 3.6
• Example:
  • [1, 2.2, 1.8, 3.6, 2.3, 1.9, 99, 3.6, 2.8, 3.2]
  • 3 equal-width bins: (1–33), (34–66), (67–99)
  • Outliers may have a large impact! What if we remove 99 (the outlier)?

• Equal-depth (frequency) partitioning: divides the range into N intervals/bins, each containing approximately the same number of samples

Exercise:
Data = {0,4,12,16,16,18,24,26,28}, 3 bins
• Equal-width (distance) partitioning:
Bin1={?}, Bin2={?}, Bin3={?}
• Equal-depth (frequency) partitioning:
Bin1={?}, Bin2={?}, Bin3={?}
Noisy Data Handling (1): Binning
• Data = {0,4,12,16,16,18,24,26,28}
• Equal-width (distance) partitioning
  • Bin 1 [−, 10): 0, 4
  • Bin 2 [10, 20): 12, 16, 16, 18
  • Bin 3 [20, +): 24, 26, 28

• Equal-depth (frequency) partitioning
  • Bin 1 [−, 14): 0, 4, 12
  • Bin 2 [14, 21): 16, 16, 18
  • Bin 3 [21, +): 24, 26, 28
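A small pandas sketch of the two partitioning schemes on the exercise data. `pd.cut` chooses slightly different numeric cut points than the rounded boundaries above, but it produces the same groupings here.

```python
import pandas as pd

data = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-width (distance) partitioning: 3 bins covering ranges of equal size.
equal_width = pd.cut(data, bins=3)
print(data.groupby(equal_width, observed=True).apply(list))
# -> [0, 4], [12, 16, 16, 18], [24, 26, 28]

# Equal-depth (frequency) partitioning: 3 bins with roughly equal counts.
equal_depth = pd.qcut(data, q=3)
print(data.groupby(equal_depth, observed=True).apply(list))
# -> [0, 4, 12], [16, 16, 18], [24, 26, 28]
```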
Binning Methods for Data Smoothing
• E.g.: sorted data for price (in dollars): {4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34}

Partition into (equal-depth) bins:


• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34

Smoothing by bin means:

• Bin 1: 4, 8, 9, 15 → 9, 9, 9, 9
• Bin 2: 21, 21, 24, 25 → 23, 23, 23, 23
• Bin 3: 26, 28, 29, 34 → 29, 29, 29, 29
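A short pandas sketch of smoothing by bin means on the price data above (equal-depth bins of four values each); the exact bin means are 9, 22.75 and 29.25 before the slide’s rounding.

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth partitioning into 3 bins of 4 sorted values each.
bins = pd.qcut(prices, q=3)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = prices.groupby(bins, observed=True).transform("mean")
print(smoothed.tolist())
# -> [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]
```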
Noisy Data Handling (3): Clustering
• Detecting outliers. Similar examples are organized into groups.
Outliers might be very interesting cases or simply noisy examples.

[Figure: clustered data points, with outliers lying far from all clusters]
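As an illustrative sketch (not from the slides), a density-based clustering such as scikit-learn’s DBSCAN groups similar examples and labels points that belong to no dense cluster as noise (label -1), which is one way to flag outlier candidates; the data below is synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),   # dense cluster 1
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),   # dense cluster 2
    [[2.5, 10.0], [-4.0, 6.0]],                     # two far-away points
])

# Points that fall in no dense region are labelled -1 (noise / outliers).
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)
print("outlier candidates:\n", points[labels == -1])
```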
Correcting Inconsistent Data
• Some types of inconsistencies are easy to detect.
  • e.g., a person’s height should not be negative
• In other cases, it can be necessary to consult an external source of information
Data Integration
Data Integration
• Data integration
• combines data from multiple sources into a coherent store
• Schema integration
• integrate metadata from different sources (e.g., A.cust-id ≡ B.cust-#)
• Entity identification problem
• identify real world entities from multiple data sources,
(e.g. UTAS = University of Tasmania)
• Detecting and resolving data value conflicts
• for the same real world entity, attribute values from different sources are different
• possible reasons: different representations, different scales, e.g., metric vs. British
units
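A minimal pandas sketch of schema integration, assuming two hypothetical sources in which A.cust-id and B.cust-# refer to the same real-world attribute.

```python
import pandas as pd

# Source A and source B name the customer key differently.
source_a = pd.DataFrame({"cust-id": [1, 2, 3], "name": ["Ann", "Bob", "Cho"]})
source_b = pd.DataFrame({"cust-#": [1, 2, 3], "balance": [120.0, 80.5, 64.0]})

# Schema integration: align the metadata (cust-# -> cust-id), then merge
# the two sources into one coherent store.
merged = source_a.merge(
    source_b.rename(columns={"cust-#": "cust-id"}), on="cust-id")
print(merged)
```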
Data integration: Redundancy Handling
• Redundant data often occur when integrating multiple databases
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a “derived” attribute in another table
• Redundant attributes may be detected by correlation analysis (see the sketch below)
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
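A small sketch of correlation analysis for redundancy detection, assuming a hypothetical merged table in which age is derivable from birth_year.

```python
import pandas as pd

df = pd.DataFrame({
    "birth_year": [1990, 1985, 2000, 1995, 1978],
    "age":        [35, 40, 25, 30, 47],      # derivable from birth_year
    "income":     [52_000, 61_000, 38_000, 45_000, 70_000],
})

# Pearson correlations between numeric attributes: values close to +1 or -1
# flag candidate redundant attributes (here birth_year vs. age is -1.0).
print(df.corr(numeric_only=True).round(2))
```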
Data Transformation
Data Transformation
• Consolidate data into forms suitable for data mining
• Aggregation: (summarisation)
• Generalisation (replace data with higher level concepts, e.g. address details →
city)
• Normalisation (scale to within a specified range)
• Min-max (e.g. into 0...1 interval)
• Z-score or zero-mean (based on mean and standard deviation of an attribute)
• Decimal scaling (move decimal point for all values)
• Important to save normalisation parameters (in meta-data repository)
Aggregation: sometimes less is more
• Data aggregation is the process in which information is gathered and expressed in a summary form.
• Smaller data = less memory and processing time
• Provides a high-level view of the data
• Example: Australian rainfall
  [Figure: average monthly vs. average yearly precipitation across Australia; the average yearly precipitation has less variability.]
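A brief pandas sketch of the same idea with synthetic monthly precipitation values (not the actual Australian figures): aggregating to yearly averages shrinks the data and reduces its variability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
monthly = pd.DataFrame({
    "year": np.repeat(np.arange(2000, 2010), 12),
    "precip_mm": rng.gamma(shape=2.0, scale=25.0, size=120),  # synthetic
})

# Aggregate (summarise) monthly values into one average per year.
yearly = monthly.groupby("year")["precip_mm"].mean()

print("120 monthly values, std:", round(monthly["precip_mm"].std(), 1))
print("10 yearly averages, std:", round(yearly.std(), 1))
```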
Generalisation
• Data generalisation is a process that abstracts a large set of task-relevant data in a
database from a relatively low conceptual level to higher levels.
• Like super-classes and subclasses in OO programming

Before generalisation:
  ID  Address
  1   Bennelong Point, Sydney NSW 2000
  2   Sydney Harbour Bridge, Sydney NSW
  3   Hunter Street, Hobart, TAS 7000
  4   17 Liverpool St, Hobart

After generalisation:
  ID  City
  1   Sydney
  2   Sydney
  3   Hobart
  4   Hobart
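A small sketch of this generalisation step, assuming a simple keyword lookup; a real pipeline might use a geocoder or an address reference table instead.

```python
import pandas as pd

addresses = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Address": [
        "Bennelong Point, Sydney NSW 2000",
        "Sydney Harbour Bridge, Sydney NSW",
        "Hunter Street, Hobart, TAS 7000",
        "17 Liverpool St, Hobart",
    ],
})

def to_city(address: str) -> str:
    # Replace a low-level address with the higher-level concept (city).
    for city in ("Sydney", "Hobart"):
        if city in address:
            return city
    return "Unknown"

addresses["City"] = addresses["Address"].apply(to_city)
print(addresses[["ID", "City"]])
```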
Normalisation
o Min-max normalization to [new_min_A, new_max_A]:

    v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

  Ex. Let income range from $12,000 to $98,000, normalised to [0.0, 1.0].
  Then $73,600 is mapped to:

    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

o Z-score normalization, where μ_A is the mean and σ_A the standard deviation (SD) of attribute A:

    v' = (v − μ_A) / σ_A

  Ex. Let μ_A = 54,000 and σ_A = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

o Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

  Salary   Formula     Normalised after decimal scaling
  480      480/1000    0.48
  680      680/1000    0.68
  980      980/1000    0.98
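A compact pandas sketch of the three methods, reusing the income example above; the z-score line uses the slide’s parameters (μ = 54,000, σ = 16,000) rather than statistics computed from this tiny sample.

```python
import math
import pandas as pd

income = pd.Series([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalisation to [0, 1]; 73,600 maps to ~0.716.
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalisation with the slide's mean and SD; 73,600 maps to 1.225.
z_score = (income - 54_000) / 16_000

# Decimal scaling: divide by 10^j, the smallest power of ten that pushes
# every |value| below 1 (here j = 5).
j = math.ceil(math.log10(income.abs().max() + 1))
decimal = income / (10 ** j)

print(pd.DataFrame({"income": income, "min_max": min_max.round(3),
                    "z_score": z_score.round(3), "decimal": decimal}))
```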
Data Transformation
• Attribute / feature construction
• Sometimes it is helpful or necessary to construct new attributes or features
• Helpful for understanding and accuracy
• For example: Create attribute volume based on attributes height, depth and width
• Construction is based on mathematical or logical operations
• Attribute construction can help to discover missing information about the
relationships between data attributes
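A tiny sketch of attribute construction using the volume example above; the column values are made up for illustration.

```python
import pandas as pd

boxes = pd.DataFrame({
    "height": [2.0, 1.5, 3.0],
    "depth":  [1.0, 2.0, 1.2],
    "width":  [0.5, 0.8, 1.0],
})

# Construct a new attribute from existing ones (a mathematical operation).
boxes["volume"] = boxes["height"] * boxes["depth"] * boxes["width"]
print(boxes)
```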
Data Reduction
• Why? Databases or data warehouses often contain Terabytes of data,
resulting in (very) long run times for data mining algorithms
• Data Reduction Goal: Obtain a reduced representation of the data set that is
much smaller in volume but yet produce the same (or almost the same) analytical
results
• Data reduction techniques
• Dimensionality reduction
• Data compression
• Numerosity reduction
Data Reduction

• Dimensionality reduction – removal of redundant and unimportant attributes
  • Select a (minimum) subset of the available attributes (with a similar probability distribution of classes compared to the original data)
  • Find correlated, redundant or derived attributes (e.g. age and date of birth)
  • Step-wise forward selection (find and select the best attribute) or backward elimination (find and eliminate the worst attribute)
  • Use decision tree induction to find the minimum attribute subset necessary


Data Reduction
• Data compression
  • Data encoding or transformation
  • Lossless or lossy encoding
  • Examples:
    • String compression (e.g. ZIP; only allows limited manipulation of the data)
    • Audio/video compression
    • Time sequence data is not audio
  • Dimensionality and numerosity reduction may also be considered forms of data compression
Data Reduction
• Numerosity reduction – fit data into models
  • Parametric methods (e.g. regression and log-linear models; can be computationally expensive)
  • Non-parametric methods (histograms/binning, clustering, sampling)
Data Discretisation
• Some data mining algorithms require that the data be in form of
categorical or binary attributes. Thus, it is often necessary to convert
continuous attributes into categorical attributes.
• Discretisation and concept hierarchy generation
• Reduce the number of values for a continuous attribute by dividing the range
into intervals
• Concept hierarchies for numerical attributes can be constructed automatically
• Binning (smoothing, distributing values into bins, then replace each
value with mean, median or boundaries of the bin)
• Clustering
• Segmentation by natural partitioning (partition into 3, 4, or 5 relatively
uniform intervals)
Sampling
• Sampling is for selecting a subset of the data to be analysed.
  • Analysing the entire data set can be expensive, and the data set can be too large

[Figure: learning curve of accuracy vs. sample size]

Sampling approaches
• Random sampling
• Progressive (or adaptive) sampling (see the sketch below):
  • Start from a small sample
  • Then increase the sample size until it is sufficient
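A minimal pandas sketch of random and progressive sampling; the tolerance and the doubling schedule are illustrative assumptions, not part of the slides.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=10_000)})

# Random sampling: a fixed fraction of the records.
random_sample = df.sample(frac=0.1, random_state=0)

# Progressive (adaptive) sampling: start small and grow the sample until an
# estimate of interest (here, the mean of x) stops changing noticeably.
size, prev_mean = 100, None
while True:
    n = min(size, len(df))
    est = df.sample(n=n, random_state=0)["x"].mean()
    if (prev_mean is not None and abs(est - prev_mean) < 0.01) or n == len(df):
        break
    prev_mean, size = est, size * 2

print("random sample size:", len(random_sample), "| progressive sample size:", n)
```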
Feature selection
• Select only a subset of the features
  • Remove redundant features
  • Remove irrelevant features

[Figure: feature selection process – Features → Selection strategy → Subset of Features → Evaluation → Stopping criterion → Selected features]

• Feature weighting
  • Assign weightings based on domain knowledge
  • Many machine learning algorithms can determine the weightings automatically.
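A short sketch of one simple selection strategy (not prescribed by the slides): drop features that are almost perfectly correlated with an earlier feature, i.e. redundant features. The synthetic columns and the 0.95 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
X = pd.DataFrame({"a": rng.normal(size=200)})
X["b"] = 2 * X["a"] + rng.normal(scale=0.01, size=200)   # redundant with a
X["c"] = rng.normal(size=200)                            # independent feature

# Upper triangle of the absolute correlation matrix (each pair counted once).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any feature that is almost perfectly correlated with another one.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("dropped:", to_drop, "| kept:", list(X.drop(columns=to_drop).columns))
```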
Questions
