— Chapter 3 —
Slides Courtesy of Textbook
Chapter 3: Data Preprocessing
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Data Quality: Why Preprocess the Data?
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
Chapter 3: Data Preprocessing
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Data in Real World is Dirty!
- For various reasons, e.g., faulty instruments, human or computer error, transmission errors, etc.
- incomplete: lacking attribute values, lacking certain attributes of interest
- technology limitation
How to Handle Noisy Data?
- Binning (see the sketch after this list)
  - first sort the data and partition it into (equal-frequency) bins
  - then smooth by bin means, bin medians, or bin boundaries
- Clustering
  - detect and remove outliers
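A minimal sketch, in Python with NumPy, of the binning procedure described above: sort, partition into equal-frequency bins, then smooth by bin means. The sample values and bin count are illustrative, not from the slides.

```python
import numpy as np

# Equal-frequency binning with smoothing by bin means (illustrative data).
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3

sorted_data = np.sort(data)                      # 1) sort
bins = np.array_split(sorted_data, n_bins)       # 2) equal-frequency partition

# 3) smooth by bin means: replace every value with its bin's mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```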
Data Cleaning as a Process
- Step 1: Discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
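As an illustration of metadata-driven discrepancy detection, the sketch below checks values against a declared range and domain. The column names and constraints are hypothetical, chosen only to make the idea concrete.

```python
import pandas as pd

# Flag values that violate the metadata (hypothetical constraints).
df = pd.DataFrame({"age": [25, -3, 47, 215], "gender": ["F", "M", "X", "F"]})

metadata = {
    "age": {"min": 0, "max": 120},          # valid range
    "gender": {"domain": {"F", "M"}},       # valid domain
}

bad_age = df[(df["age"] < metadata["age"]["min"]) | (df["age"] > metadata["age"]["max"])]
bad_gender = df[~df["gender"].isin(metadata["gender"]["domain"])]

print(bad_age)      # rows whose age falls outside the declared range
print(bad_gender)   # rows whose gender is outside the declared domain
```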
Chapter 3: Data Preprocessing
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Data Integration
- Data integration:
  - Combines data from multiple sources into a coherent store
- Schema integration: e.g., A.cust-id ≡ B.cust-# (a merge sketch follows this slide)
  - Integrate metadata from different sources
- Entity identification problem:
  - Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
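A minimal schema-integration sketch in Python/pandas for the A.cust-id ≡ B.cust-# example above: map B's column name onto A's schema, then combine the two sources into one store. The tables and values are illustrative.

```python
import pandas as pd

a = pd.DataFrame({"cust-id": [1, 2, 3], "city": ["Oslo", "Lyon", "Kyoto"]})
b = pd.DataFrame({"cust-#": [2, 3, 4], "balance": [120.0, 75.5, 19.9]})

b = b.rename(columns={"cust-#": "cust-id"})       # resolve A.cust-id ≡ B.cust-#
merged = a.merge(b, on="cust-id", how="outer")    # combine into one coherent store
print(merged)
```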
Why Data Integration: Handling Redundancies & Inconsistencies
- Contingency table of play_chess vs. like_science_fiction; numbers in parentheses are expected counts, calculated from the data distribution in the two categories:

                              play chess   not play chess   Sum
    like science fiction        250 (90)      200 (360)      450
    not like science fiction     50 (210)    1000 (840)     1050
    Sum                          300          1200           1500

- Example: Expected count of people who play chess and like science fiction: 450 * 300 / 1500 = 90
- χ² (chi-square) calculation:

    \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

- Degrees of freedom for the 2x2 table: (2-1)*(2-1) = 1 → Looking up the chi-square table, we can reject the hypothesis that like_science_fiction and play_chess are independent with high confidence → they are correlated!
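A small Python sketch that reproduces the χ² calculation above; the contingency table is the one implied by the slide's numbers.

```python
import numpy as np

# rows = like_science_fiction (yes, no), columns = play_chess (yes, no)
observed = np.array([[250, 200],
                     [50, 1000]], dtype=float)

row_sums = observed.sum(axis=1, keepdims=True)   # 450, 1050
col_sums = observed.sum(axis=0, keepdims=True)   # 300, 1200
total = observed.sum()                           # 1500

expected = row_sums @ col_sums / total           # 90, 360, 210, 840
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)   # ≈ 507.94 (the slide rounds to 507.93)
```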
Correlation Analysis (for Numeric Data)
- Correlation coefficient (also called Pearson's product moment coefficient):

    r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}
  where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product.
- If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher r_{A,B}, the stronger the correlation.
- r_{A,B} = 0: A and B are uncorrelated (no linear relationship; independence implies r = 0, but not the converse).
- r_{A,B} < 0: A and B are negatively correlated. (A computational sketch follows this list.)
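A minimal sketch of the formula above in Python/NumPy, using population standard deviations (ddof = 0) as the formula assumes. The two series are illustrative values, not from the slides.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())
print(r)                        # manual computation of r_{A,B}
print(np.corrcoef(a, b)[0, 1])  # NumPy's built-in gives the same value
```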
Visually Evaluating Correlation
[Figure: scatter plots illustrating correlation coefficients ranging from –1 to 1.]
Covariance (for Numeric Data)
- Covariance is similar to correlation:

    Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}

  Correlation coefficient:

    r_{A,B} = \frac{Cov(A,B)}{\sigma_A\,\sigma_B}
- Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
- Question: If the stocks are affected by the same industry trends, will their prices rise or fall together? (Worked out in the sketch below.)
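Working out the stock example above in Python: the covariance of the two price series is positive, so the prices tend to rise and fall together.

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])    # stock A, one value per day
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0]) # stock B

cov = ((a - a.mean()) * (b - b.mean())).mean()   # Cov(A,B) as defined above
print(a.mean(), b.mean(), cov)   # 4.0 9.6 4.0 -> positive, so they rise together
```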
Chapter 3: Data Preprocessing
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Data Reduction Strategies
- Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
- Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
- Data reduction strategies
  - Dimensionality reduction, e.g., remove unimportant attributes
    - Wavelet transforms (a one-level Haar sketch follows this list)
  - Data compression
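Since wavelet transforms are listed as a dimensionality-reduction tool, here is a minimal one-level Haar transform sketch in NumPy; keeping only the approximation coefficients halves the stored data. The input values are illustrative, and a real application would use a library such as PyWavelets with multiple levels.

```python
import numpy as np

# One level of the orthonormal Haar wavelet transform (illustrative data).
x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-frequency averages
detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-frequency differences

# Data reduction: keep only the approximation coefficients (half the values);
# reconstructing with detail set to 0 gives a smoothed approximation of x.
x_approx = np.empty_like(x)
x_approx[0::2] = approx / np.sqrt(2)
x_approx[1::2] = approx / np.sqrt(2)
print(approx)    # 4 coefficients stored instead of 8 values
print(x_approx)  # [2. 2. 1. 1. 4. 4. 4. 4.]
```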
Data Reduction 1: Dimensionality Reduction
- Curse of dimensionality
  - When dimensionality increases, data becomes increasingly sparse
  - Density and distance between points, which are critical to clustering, become less meaningful
- Dimensionality reduction
  - Avoid the curse of dimensionality
- Redundant attributes
  - Duplicate much or all of the information contained in one or more other attributes
  - E.g., purchase price of a product and the amount of sales tax paid (the sketch after this list detects this kind of redundancy)
- Irrelevant attributes
  - Contain no information that is useful for the data mining task at hand
  - E.g., students' ID is often irrelevant to the task of predicting students' GPA
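A minimal sketch of spotting a redundant attribute with NumPy: sales tax is a fixed fraction of price, so the two columns are perfectly correlated and one of them can be dropped. The column names, tax rate, and values are illustrative.

```python
import numpy as np

price = np.array([100.0, 250.0, 80.0, 400.0, 150.0])
tax = 0.08 * price                               # tax is 8% of price -> redundant
units_sold = np.array([3.0, 1.0, 7.0, 2.0, 5.0])

data = np.column_stack([price, tax, units_sold])
corr = np.corrcoef(data, rowvar=False)           # pairwise correlation matrix
print(np.round(corr, 2))
# |corr[0, 1]| = 1.0 flags price/tax as redundant; keep one, drop the other.
```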
Attribute Subset Selection by Heuristic Search
- Domain-specific
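The slide names heuristic search for attribute subset selection without spelling out a method. Below is a hedged sketch of one common heuristic, stepwise forward selection, scoring candidate subsets by the R² of a least-squares fit; the scoring function and data are illustrative, not necessarily what the original slide had in mind.

```python
import numpy as np

def r2(X, y):
    """Coefficient of determination of a least-squares fit of y on X."""
    X1 = np.column_stack([np.ones(len(y)), X])     # add intercept
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, k):
    """Greedy stepwise forward selection: repeatedly add the attribute that
    most improves the fit, until k attributes are chosen."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: r2(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Illustrative data: y depends on columns 0 and 2, column 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=50)
print(forward_select(X, y, 2))   # expected to pick columns 0 and 2
```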
Parametric Data Reduction: Regression Analysis
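A minimal sketch of parametric data reduction by regression: fit a linear model and store only its parameters (slope and intercept) instead of the raw points. The data values are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares fit y ≈ slope*x + intercept
print(slope, intercept)                      # store only these two parameters
print(slope * x + intercept)                 # values regenerated from the model
```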
Non-parametric Data Reduction: Histogram Analysis
[Figure: histogram of values binned from 20,000 to 100,000 in widths of 10,000.]
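A minimal sketch of histogram-based reduction: store bin edges and per-bucket counts instead of the raw values. The price values below are illustrative; the bin edges mirror the figure's axis range.

```python
import numpy as np

prices = np.array([23_500, 31_200, 47_800, 52_100, 55_000, 61_300, 78_900, 94_000])

counts, edges = np.histogram(prices, bins=np.arange(20_000, 110_000, 10_000))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:>6}, {hi:>6}): {c}")   # raw values summarized by bucket counts
```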
Non-parametric Data Reduction: Clustering
- Partition the data set into clusters based on similarity, and store the cluster representations (e.g., centroid and diameter) only
- Can be very effective if data is clustered, but not if data is "smeared"
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in depth in Chapter 10
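A minimal k-means sketch of the idea above: reduce the data set to k stored centroids. The points, k, and iteration count are illustrative; a real application would use a library clustering routine.

```python
import numpy as np

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
k = 2

centroids = points[rng.choice(len(points), k, replace=False)]
for _ in range(20):                                   # a few Lloyd iterations
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)                         # assign to nearest centroid
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)   # 100 points reduced to 2 stored centroids
```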
Non-parametric Data Reduction: Sampling
Types of Sampling
- Stratified sampling:
  - Partition the data set, and draw samples from each partition (see the sketch below)
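A minimal stratified-sampling sketch in Python/pandas: partition on a class attribute, then draw the same fraction from every stratum. The column names and data are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": range(10),
    "segment": ["gold", "gold", "silver", "silver", "silver",
                "bronze", "bronze", "bronze", "bronze", "bronze"],
})

# Draw roughly 40% from each segment (stratum).
sample = df.groupby("segment").sample(frac=0.4, random_state=0)
print(sample)
```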
Sampling: With or without Replacement
[Figure: samples drawn from the raw data with and without replacement.]
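A tiny sketch of simple random sampling with and without replacement; the "raw data" is an illustrative array of record IDs.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = np.arange(20)                               # 20 illustrative records

without = rng.choice(raw, size=5, replace=False)  # without replacement: no duplicates
with_repl = rng.choice(raw, size=5, replace=True) # with replacement: duplicates possible
print(without, with_repl)
```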
Sampling: Cluster or Stratified Sampling
Non-parametric Data Reduction: Data Cube Aggregation
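The data cube aggregation figure did not survive extraction; as a stand-in, here is a minimal sketch of the idea in pandas: roll quarterly sales up to yearly totals so far fewer cells need to be stored. The table values are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 402, 391, 620],
})

yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)   # two aggregated cells replace eight detailed ones
```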
Data Compression
- String compression: typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
[Figure: lossless compression recovers the original data exactly; lossy compression yields only an approximation of the original data.]
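To make the lossless case concrete, a tiny sketch using Python's standard zlib module: the compressed bytes decompress back to exactly the original data (the sample string is illustrative). Lossy audio/video codecs, by contrast, discard detail and reproduce only an approximation.

```python
import zlib

data = b"AAAABBBCC" * 100              # highly repetitive, so it compresses well
compressed = zlib.compress(data)
print(len(data), len(compressed))       # original vs. compressed size in bytes

restored = zlib.decompress(compressed)
assert restored == data                 # lossless: the original is recovered exactly
```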