The chapter discusses data preprocessing techniques. It covers data cleaning to handle missing, noisy, and inconsistent data through techniques like filling in missing values, smoothing noisy data, and resolving inconsistencies. It also discusses data integration to combine multiple data sources and data reduction to reduce data size through dimensionality reduction, numerosity reduction, and data compression. The chapter outlines major tasks in data preprocessing like data cleaning, integration, reduction, and transformation.


Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 3 —
Slides courtesy of the textbook

1
Chapter 3: Data Preprocessing

n Data  Preprocessing:  An  Overview

n Data  Quality

n Major  Tasks  in  Data  Preprocessing

n Data  Cleaning

n Data  Integration

n Data  Reduction

n Data  Transformation  and  Data  Discretization

n Summary
2
Data Quality: Why Preprocess the Data?

n Measures  for  data   quality:   A  multidimensional   view


n Accuracy:   correct   or   wrong,  accurate   or   not
n Completeness:   not   recorded,   unavailable,   …
n Consistency:   some   modified   but   some   not,   dangling,  …
n Timeliness:   timely  update?  
n Believability:   how  trustable   the   data   are  correct?
n Interpretability:   how  easily  the   data   can  be  
understood?
3
Major Tasks in Data Preprocessing
n Data  cleaning
n Fill  in  missing  values,  smooth  noisy  data,  identify  or  remove  

outliers,  and  resolve  inconsistencies


n Data  integration
n Integration  of  multiple  databases,  data  cubes,  or  files

n Data  reduction
n Dimensionality  reduction

n Numerosity reduction

n Data  compression

n Data  transformation   and  data  discretization


n Normalization  

n Concept  hierarchy  generation

4
Chapter 3: Data Preprocessing

n Data  Preprocessing:  An  Overview

n Data  Quality

n Major  Tasks  in  Data  Preprocessing

n Data  Cleaning

n Data  Integration

n Data  Reduction

n Data  Transformation  and  Data  Discretization

n Summary
6
Data in Real World is Dirty!
n From   various  reasons,   e.g.,  instrument  faulty,   human  or  computer  
error,   transmission  error,   etc.
n incomplete:   lacking   attribute   values,  lacking   certain   attributes   of  

interest,   or   containing   only  aggregated   data


n e.g.,  Occupation  =   “  ”  (missing  data)

n noisy:  containing   noise,   errors,  or  outliers

n e.g.,  Salary   =   “ −10”  (an   error)

n inconsistent:  containing   discrepancies  in   codes   or   names,   e.g.,

n Age   =   “ 42”,  B irthday  =   “ 03/07/2010”

n Was   rating   “ 1,  2,  3”,  now   rating   “ A,  B ,  C”

n discrepancy  between   duplicate   records

n Intentional (e.g.,  disguised  missing data)

n Jan.   1  as  everyone’s   birthday?


Incomplete (Missing) Data

n Data  is  not  always  available


n E.g.,  many  tuples  have  no  recorded  value  for  several  
attributes,  such  as  customer  income  in  sales  data
n Missing  data  may  be  due  to  
n equipment   malfunction
n inconsistent  with  other  recorded  data  and  thus  deleted
n data  not  entered   due  to  misunderstanding
n certain  data  may  not  be  considered  important  at  the  time  
of  entry
n not  register  history  or  changes  of  the  data
n Missing  data  may  need  to  be  inferred
How to Handle Missing Data?
n Ignore  the  tuple:  usually  done  when  class  label  is  missing  (when  
doing  classification)—not  effective  when  the  %  of  missing  values  
per  attribute  varies  considerably
n Fill  in  the  missing  value  manually:  usually  tedious  +  infeasible
n Fill  in  it  automatically  with
n a  global  constant  :  e.g.,  “unknown”,   a  new  class?!  
n the  attribute  mean
n the  attribute  mean  for  all  samples  belonging  to  the  same  
class:  smarter
n the  most  probable  value:  inference-­‐based  such  as  Bayesian  
formula  or  decision  tree
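As a rough illustration of the automatic fill-in strategies above, the following minimal pandas sketch (not part of the original slides) imputes a numeric attribute with the overall mean and with the per-class mean; the attribute names income and class are hypothetical.

import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "income": [52000, None, 61000, None, 47000, 58000],
    "class":  ["A", "A", "B", "B", "A", "B"],
})

# Strategy 1: fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 2 (smarter): fill with the mean of samples in the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)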
Noisy Data
n Noise:  random  error  or  variance  in  a  measured  variable
n Incorrect  attribute  values may  be  due  to
n faulty  data  c ollection  instruments

n data  entry  problems

n data  transmission  problems

n technology  limitation

n inconsistency  in  naming  c onvention  

10
How to Handle Noisy Data?

n Binning
n first  sort  data  and  partition  into  (equal-­‐frequency)  bins

n then  one  can  smooth  by  bin  means,  smooth  by  bin  median,  

smooth  by  bin  boundaries,  etc.


n Regression
n smooth  by  fitting  the  data  into  regression  functions

n Clustering
n detect  and  remove  outliers

n Combined  computer  and  human  inspection


n detect  suspicious  values  and  c heck  by  human  (e.g.,  deal  

with  possible  outliers)
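A minimal sketch of binning-based smoothing (not from the slides), using a small made-up price list partitioned into three equal-frequency bins; the values are then smoothed by bin means and by bin boundaries.

import numpy as np

# A small made-up price list, already sorted
prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))

bins = np.array_split(prices, 3)   # three equal-frequency (equal-depth) bins

# Smoothing by bin means: every value becomes its bin's mean
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer bin boundary
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print("bin means:     ", np.concatenate(by_means))
print("bin boundaries:", np.concatenate(by_bounds))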


Other  data  problems requiring   data  cleaning
n Duplicate  records
n Incomplete  data
n Inconsistent  data

12
Data Cleaning as a Process
n Step  1:  Discrepancy  detection
n Use  metadata  (e.g.,   domain,  range,  dependency,   distribution)

n Check  field  overloading  

n Check  uniqueness   rule,  consecutive   rule  and  null  rule

n Use  commercial  tools

n Data  scrubbing:  use  simple  domain  k nowledge  (e.g.,  postal  code,  

spell-­‐check)   to  detect  errors   and  make  corrections


n Data  auditing:  by  analyzing  data  to  discover   rules  and  relationship  to  

detect  violators  (e.g.,   correlation  and  clustering  to  find  outliers)


n Step  2:  Data  transformation  (to  correct  the  discrepancies)
n Data  migration  tools:  allow  transformations   to  be  specified

n ETL  (Extraction/Transformation/Loading)   tools:  allow  users  to  specify  

transformations   through  a  graphical  user  interface


n Data  cleaning  process:   the  two  steps  iterate   and  reinforce
Chapter 3: Data Preprocessing

n Data  Preprocessing:  An  Overview

n Data  Quality

n Major  Tasks  in  Data  Preprocessing

n Data  Cleaning

n Data  Integration

n Data  Reduction

n Data  Transformation  and  Data  Discretization

n Summary
14
Data Integration
n Data  integration:  
n Combines  data  from  multiple  sources  into  a  coherent   store
n Schema  integration:  e.g.,  A.cust-­‐id  ≡ B.cust-­‐#
n Integrate   metadata  from  different   sources
n Entity  identification  problem:  
n Identify  real  world  entities  from  multiple  data  sources,   e.g.,  Bill  Clinton  =  
William  Clinton
n Detecting  and  resolving  data  value  conflicts
n For  the  same  real  world  entity,  attribute   values  from  different   sources  
are  different
n Possible  reasons:  different   representations,   different   scales,  e.g.,  metric  
vs.  British  units
15
Why Data Integration:
Handling Redundancies & Inconsistencies

n Redundant  data  often  occur  when  integrating  multiple  


databases
n Object  identification:    The  same  attribute  or  object  may  
have  different  names  in  different  databases
n Derivable  data: One  attribute  may  be  a  “derived”   attribute  
in  another  table,  e.g.,  annual  revenue
n Redundant  attributes  may  be  able  to  be  detected   by  
correlation  analysis  and covariance  analysis
n Careful  integration  of  the  data  from  multiple  sources  may  help  
reduce/avoid  redundancies  and  inconsistencies  to  improve  
mining  speed  and  quality
16
Correlation Analysis (for Nominal Data)
n Χ2 (chi-­‐square)   test
2
(Observed − Expected )
χ2 = ∑
Expected
n The  larger  the  Χ2 value,  the  more  likely  the  variables  are  
related
n The  cells  that  contribute  the  most  to  the  Χ2 value  are  those  
whose  actual  count  is  very  different  from  the  expected  count
n Correlation  does  not  imply  causality
n #  of  hospitals  and  #  of  car-­‐theft  in  a  city  are  correlated
n Both  are  causally  linked  to  the  third  variable:  population
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)       200 (360)        450
Not like science fiction     50 (210)     1000 (840)       1050
Sum (col.)                  300           1200             1500

n Numbers  in  parenthesis   are  expected  counts  calculated  based  on  the  data  
distribution  in  the  two  categories
n Example:  Expected   count  of  people  playing  chest  and  liking  science  
fiction:  450  *  300  /  1500  =  90
n Χ2 (chi-­‐square)   calculation
2(250 − 90) 2 (50 − 210) 2 (200 − 360) 2 (1000 − 840) 2
χ = + + + = 507.93
90 210 360 840
n Degree   of  freedom   for  the  2x2  table:  (2-­‐1)*(2-­‐1)   =  1  -­‐-­‐>  By  looking  up  the  Chi-­‐
Square  table,  we  can  reject  the  hypothesis  like_science_fiction and  
play_chess are  independent   with  high  confidenceà they  are  correlated!
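For reference, the same statistic can be computed with SciPy (an illustrative sketch, not part of the original slides); Yates' continuity correction is disabled so the result matches the hand calculation above.

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table above
observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof)   # 507.93 with 1 degree of freedom
print(expected)              # [[ 90. 360.] [210. 840.]]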
Correlation Analysis (for Numeric Data)
n Correlation  coefficient  (also  called  Pearson’s  product  moment  
coefficient)
      r(A,B) = Σ (a_i − mean(A))(b_i − mean(B)) / (n σ_A σ_B)
             = (Σ a_i b_i − n · mean(A) · mean(B)) / (n σ_A σ_B)

where n is the number of tuples, mean(A) and mean(B) are the respective means
of A and B, σ_A and σ_B are the respective standard deviations of A and B,
and Σ a_i b_i is the sum of the AB cross-product.
n If  rA,B >  0,  A  and  B  are  positively  correlated  (A’s  values  increase  
as  B’s).    The  higher  rA,B,  the  stronger  correlation.
n rA,B =  0:  independent.
n rAB <  0:  negatively  correlated.
Visually Evaluating Correlation

[Figure: scatter plots showing similarity (correlation) values ranging from −1 to 1.]
Covariance (for Numeric Data)
n Covariance   is  similar  to  correlation

      Cov(A, B) = E[(A − mean(A))(B − mean(B))] = Σ (a_i − mean(A))(b_i − mean(B)) / n

  Correlation coefficient:   r(A,B) = Cov(A, B) / (σ_A σ_B)

where n is the number of tuples, mean(A) and mean(B) are the respective means
(expected values) of A and B, and σ_A and σ_B are the respective standard
deviations of A and B
n Positive covariance: If Cov(A, B) > 0, then A and B both tend to be larger
than their expected values
n Negative covariance: If Cov(A, B) < 0, then if A is larger than its expected
value, B is likely to be smaller than its expected value
n Independence: Cov(A, B) = 0, but the converse is not true:
n Some   pairs   of  random  variables   may   have  a   covariance  of   0  but  are   not  
independent.     Only  under  some  additional  assumptions  (e.g.,  the   data   follow  
multivariate  normal   distributions)  does   a   covariance   of  0  imply   independence
Co-Variance: An Example

n It  can  be  simplified  in  computation  as

n Suppose  two  stocks  A  and  B  have  the  following  values  in  one  week:    (2,  5),  (3,  
8),  (5,  10),  (4,  11),  (6,  14).  

n Question:    If  the  stocks  are  affected   by  the  same  industry  trends,   will  their  
prices  rise  or  fall  together?

n E(A)  =  (2  +  3  +  5  +  4  +  6)/  5 = 20/5 =  4

n E(B)  =  (5  +  8  +  10  +  11  +  14)  /5  = 48/5  = 9.6

n Cov(A,B)   =  (2×5+3×8+5×10+4×11+6×14)/5  −  4  × 9.6  =  4

n Thus,  A  and  B  rise  together   since  Cov(A,   B)  >  0.


Chapter 3: Data Preprocessing

n Data  Preprocessing:  An  Overview

n Data  Quality

n Major  Tasks  in  Data  Preprocessing

n Data  Cleaning

n Data  Integration

n Data  Reduction

n Data  Transformation  and  Data  Discretization

n Summary
23
Data Reduction Strategies
n Data  reduction:  Obtain  a  reduced   representation   of  the  data  set  that  is  much  
smaller  in  volume  but  yet  produces   the  same  (or  almost  the  same)  analytical  
results
n Why  data  reduction?  — A  database/data   warehouse   may  store   terabytes  of  
data.    Complex  data  analysis  may  take  a  very  long  time  to  run  on  the  
complete  data  set.
n Data  reduction  strategies
n Dimensionality  reduction,  e.g., remove   unimportant  attributes

n Wavelet  transforms

n Principal  Components  Analysis  (PCA)

n Feature  subset   selection,  feature   creation

n Numerosity reduction (some  simply  call  it:  Data  Reduction)

n Regression   and  Log-­‐Linear  Models

n Histograms,  clustering,  sampling

n Data  cube  aggregation

n Data  compression
Data Reduction 1: Dimensionality Reduction
n Curse  of  dimensionality
n When  dimensionality  increases,   data  becomes  increasingly  sparse

n Density  and  distance  between   points,  which  is  critical  to  clustering,  

outlier  analysis,  becomes  less  meaningful


n The  possible  combinations  of  subspaces   will  grow  exponentially

n Dimensionality   reduction
n Avoid  the  curse  of  dimensionality

n Help  eliminate  irrelevant   features   and  reduce  noise

n Reduce  time  and  space  required  in  data  mining

n Allow  easier   visualization

n Dimensionality   reduction  techniques


n Wavelet  transforms

n Principal  Component  Analysis

n Supervised   and  nonlinear  techniques  (e.g.,   feature   selection)


Dimensionality Reduction by Attribute Subset Selection

n Redundant  attributes  
n Duplicate  much  or  all  of  the  information  contained  in  one  or  
more  other  attributes
n E.g.,  purchase  price  of  a  product  and  the  amount  of  sales  
tax  paid
n Irrelevant  attributes
n Contain  no  information  that  is  useful  for  the  data  mining  
task  at  hand
n E.g.,  students'  ID  is  often  irrelevant  to  the  task  of  predicting  
students'  GPA
Attribute Subset Selection by Heuristic Search

n There  are  2d possible  attribute  combinations  of  d attributes


n Typical  heuristic  attribute  selection  methods:
n Best  single  attribute   under  the  attribute  independence  

assumption:  choose  by  significance  tests


n Best  step-­‐wise  feature  selection:

n The  best  single-­‐attribute  is  picked  first

n Then  next  best  attribute   condition  to  the  first,  ...

n Step-­‐wise  attribute  elimination:

n Repeatedly   eliminate  the  worst  attribute

n Best  c ombined  attribute  selection  and  elimination

n Optimal  branch  and  bound:

n Use  attribute   elimination  and  backtracking


Attribute Subset Selection by Feature Generation

n Create  new  attributes  (features)  that  can  capture  the  


important  information  in  a  data  set  more  effectively  than  the  
original  ones
n Three  general  methodologies
n Attribute   extraction

n Domain-­‐specific

n Mapping  data  to  new  space  ( see:  data  reduction)

n E.g.,  Fourier  transformation,  wavelet  transformation,  

manifold  approaches  (not  covered)


n Attribute   construction  

n Combining  features  (see:  discriminative  frequent  

patterns  in  Chapter  on  “Advanced   Classification”)


n Data  discretization
28
Data Reduction 2: Numerosity Reduction
n Reduce  data  volume  by  choosing  alternative,   smaller  forms of  
data  representation
n Parametric  methods (e.g.,  regression)
n Assume  the  data  fits  some  model,  estimate  model  

parameters,  store  only  the  parameters,  and  discard  the  


data  (except  possible  outliers)
n Ex.:  L og-­‐linear  models—obtain  value  at  a  point  in  m-­‐D  

space  as  the  product  on  appropriate  marginal  subspaces  


n Non-­‐parametric methods
n Do  not  assume  models

n Major  families:  histograms,  clustering,  sampling,  …  

29
Parametric Data Reduction: Regression Analysis

n Regression   analysis: A  collective  name  for  


techniques  for  the  modeling  and  analysis  of  
y
numerical  data  consisting  of  values  of  a  
dependent   variable (also  called  response   Y1
variable or  measurement)  and  of  one  or  more  
independent  variables (aka.  explanatory   Y1’
y=x+1
variables or  predictors)
n The  parameters   are  estimated  so  as  to  give  a  
"best  fit"  of  the  data X1 x
n Most  commonly  the  best  fit  is  evaluated   by  
Example:  Linear  regression  
using  the  least  squares   method,  but  other   when  data  fit  a  straight  line
criteria  have  also  been  used

30
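A tiny sketch (not from the slides) of parametric reduction by linear regression: only the two fitted parameters are stored in place of the raw points. The data are made up.

import numpy as np

# Made-up points roughly following y = x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.0, 5.9])

# Least-squares fit of y = w1*x + w0; keep only (w1, w0), discard the data
w1, w0 = np.polyfit(x, y, deg=1)
print(round(w1, 3), round(w0, 3))     # close to 1 and 1

y_hat = w1 * x + w0                   # approximate reconstruction from the model
print(y_hat.round(2))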
Non-parametric Data Reduction: Histogram Analysis

n Divide  data  into  buckets  and   40


store  average  (sum)  for  each   35
bucket
30
n Partitioning  rules: 25
n Equal-­‐width:  equal  bucket   20
range
15
n Equal-­‐frequency   (or  equal-­‐ 10
depth)
5
0
10000

20000

30000

40000

50000

60000

70000

80000

90000

100000
31
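A short sketch (not in the slides) contrasting equal-width and equal-frequency bucket boundaries on synthetic prices:

import numpy as np

rng = np.random.default_rng(2)
prices = rng.integers(1, 100001, size=1000)       # synthetic prices

# Equal-width: 10 buckets, each covering the same range of values
width_counts, width_edges = np.histogram(prices, bins=10)

# Equal-frequency (equal-depth): bucket boundaries at the deciles
depth_edges = np.quantile(prices, np.linspace(0, 1, 11))
depth_counts, _ = np.histogram(prices, bins=depth_edges)

print(width_edges.astype(int), width_counts)
print(depth_edges.astype(int), depth_counts)      # roughly 100 values per bucket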
Non-parametric Data Reduction: Clustering

n Partition  data  set  into  clusters  based  on  similarity,  and  store  
cluster  representation   (e.g.,  centroid  and  diameter)  only
n Can  be  very  effective  if  data  is  clustered  but  not  if  data  is  
“smeared”
n Can  have  hierarchical  clustering  and  be  stored  in  multi-­‐
dimensional  index  tree  structures
n There  are  many  choices  of  clustering  definitions  and  
clustering  algorithms
n Cluster  analysis  will  be  studied  in  depth  in  Chapter  10

32
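As a hedged illustration (not from the slides), the sketch below clusters synthetic 2-D data with k-means and keeps only each cluster's centroid, size, and a rough diameter as the reduced representation; scikit-learn's KMeans is one possible choice of algorithm.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Synthetic clustered data: three groups of 2-D points
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Keep only the cluster representation: centroid, size, and rough diameter
for k in range(3):
    members = X[km.labels_ == k]
    diameter = 2 * np.linalg.norm(members - km.cluster_centers_[k], axis=1).max()
    print(km.cluster_centers_[k].round(2), len(members), round(diameter, 2))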
Non-parametric Data Reduction: Sampling

n Sampling:  obtaining  a  small  sample  s to  represent  the  whole  


data  set  N
n Allow  a  mining  algorithm  to  run  in  complexity  that  is  potentially  
sub-­‐linear  to  the  size  of  the  data
n Key  principle:  Choose  a  representative subset  of  the  data
n Simple  random  sampling  may  have  very  poor  performance  
in  the  presence  of  skew
n Develop   adaptive  sampling  methods,  e.g.,  stratified  
sampling:  
n Note:  Sampling  may  not  reduce  database  I/Os  (page  at  a  time)

33
Types of Sampling

n Simple  random   sampling


n There  is  an  equal  probability  of  selecting  any  particular  item

n Sampling  without  replacement


n Once  an  object  is  selected,  it  is  removed  from  the  population

n Sampling  with  replacement


n A  selected  object  is  not  removed   from  the  population

n Stratified  sampling:  
n Partition  the  data  set,  and  draw  samples  from  each  partition  

(proportionally,  i.e.,  approximately  the  same  percentage  of  


the  data)  
n Used  in  c onjunction  with  skewed  data

34
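The following pandas sketch (not part of the slides) shows simple random sampling with and without replacement and a proportional stratified sample; the age_group strata and the 10% fraction are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "age_group": ["young"] * 60 + ["middle"] * 30 + ["senior"] * 10,
    "value": range(100),
})

# Simple random sampling without replacement (each object picked at most once)
srswor = df.sample(n=10, replace=False, random_state=0)

# Simple random sampling with replacement (objects can be picked repeatedly)
srswr = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw ~10% from every age_group partition
stratified = df.groupby("age_group").sample(frac=0.1, random_state=0)
print(stratified["age_group"].value_counts())   # 6 young, 3 middle, 1 senior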
Sampling: With or without Replacement

[Figure: the same raw data sampled with replacement and without replacement]

Sampling: Cluster or Stratified Sampling

[Figure: raw data partitioned into clusters/strata, with a sample drawn from
each partition]
36
Non-parametric Data Reduction: Data Cube Aggregation

n The  lowest  level  of  a  data  cube  (base  cuboid)


n The  aggregated  data  for  an  individual  entity  of  interest
n E.g.,  a  customer  in  a  phone  calling  data  warehouse
n Multiple  levels  of  aggregation  in  data  cubes
n Further  reduce  the  size  of  data  to  deal  with
n Reference  appropriate  levels
n Use  the  smallest  representation   which  is  enough  to  solve  
the  task
n Queries  regarding  aggregated  information  should  be  answered  
using  data  cube,  when  possible
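A small sketch (not from the slides) of aggregating a base cuboid up one level with pandas; the phone-call columns are hypothetical.

import pandas as pd

# Base cuboid: one row per (customer, year, quarter) with total call minutes
calls = pd.DataFrame({
    "customer": ["C1", "C1", "C1", "C2", "C2", "C2"],
    "year":     [2023, 2023, 2024, 2023, 2024, 2024],
    "quarter":  ["Q1", "Q2", "Q1", "Q1", "Q1", "Q2"],
    "minutes":  [120, 95, 130, 40, 55, 60],
})

# Aggregate up to the (customer, year) level: a smaller cuboid that still
# answers yearly queries without scanning the quarterly detail
per_year = calls.groupby(["customer", "year"], as_index=False)["minutes"].sum()
print(per_year)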
Data Reduction 3: Data Compression
n String  compression
n There  are  extensive  theories  and  well-­‐tuned  algorithms

n Typically  lossless,  but  only  limited  manipulation  is  possible  

without  expansion
n Audio/video  compression
n Typically  lossy compression,  with  progressive  refinement

n Sometimes  small  fragments  of  signal  c an  be  reconstructed  

without  reconstructing  the  whole


n Time  sequence  is  not  audio
n Typically  short  and  vary  slowly  with  time

n Dimensionality  and  numerosity reduction  may  also  be  


considered  as  forms  of  data  compression
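For completeness, a tiny demonstration (not from the slides) that lossless string compression reconstructs the original exactly; zlib is used here only as an example codec.

import zlib

text = ("data preprocessing " * 200).encode("utf-8")

compressed = zlib.compress(text, 9)     # level 9 = maximum compression
restored = zlib.decompress(compressed)

print(len(text), "->", len(compressed), "bytes")   # large reduction on repetitive text
assert restored == text                            # lossless: exact reconstruction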
Data Compression

[Figure: lossless compression maps the original data to compressed data and
back exactly, while lossy compression reconstructs only an approximation of
the original data]