
Overview of statistics

Lecture Notes for Chapter 5

Data Science
By
Prof. Chih-Hsuan (Jason) Wang

Data Science, Department of IEM at NYCU, Hsinchu, Taiwan


Data Quality

• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
  – noise and outliers
  – missing values
  – duplicate data


Noise

• Noise refers to modification of original values
  – Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen

[Figure: two sine waves, and the same two sine waves with noise added]


Outliers

• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set


Missing Values
• Reasons for missing values
  – Information is not collected
    (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases
    (e.g., annual income is not applicable to children)

• Handling missing values (see the sketch below)
  – Eliminate data objects
  – Estimate missing values
  – Ignore the missing value during analysis
  – Replace with all possible values (weighted by their probabilities)
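A minimal pandas sketch of the first two strategies; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing age and income values
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50000, 62000, np.nan, 48000]})

dropped = df.dropna()                                 # eliminate incomplete data objects
filled = df.fillna({"age": df["age"].median(),        # estimate missing values from the
                    "income": df["income"].mean()})   # observed distribution
```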
Data Preprocessing

• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Attribute transformation


Aggregation (roll-up)

• Combining two or more attributes (or objects) into a single attribute (or object)

• Purpose (see the groupby sketch below)
  – Data reduction
    · Reduce the number of attributes or objects
  – Change of scale
    · Cities aggregated into regions, states, countries, etc.
  – More "stable" data
    · Aggregated data tends to have less variability
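A minimal pandas groupby sketch of the roll-up idea, on a hypothetical sales table:

```python
import pandas as pd

# Hypothetical daily sales records
sales = pd.DataFrame({"city": ["Hsinchu", "Hsinchu", "Taipei", "Taipei"],
                      "region": ["North"] * 4,
                      "amount": [120, 95, 300, 280]})

by_city = sales.groupby("city")["amount"].sum()      # data reduction: days -> city totals
by_region = sales.groupby("region")["amount"].sum()  # change of scale: cities -> region
```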


Sampling

• Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.


Types of Sampling

• Simple random sampling
  – There is an equal probability of selecting any particular item

• Sampling without replacement
  – As each item is selected, it is removed from the population

• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample
    · In sampling with replacement, the same object can be picked more than once

• Stratified sampling
  – Split the data into several partitions, then draw random samples from each partition

A sketch of these schemes in pandas follows.
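A minimal pandas sketch, assuming a hypothetical two-stratum population:

```python
import pandas as pd

# Hypothetical population with two strata of unequal size
df = pd.DataFrame({"stratum": ["A"] * 800 + ["B"] * 200,
                   "value": range(1000)})

srs = df.sample(n=100, replace=False, random_state=1)   # simple random, without replacement
boot = df.sample(n=100, replace=True, random_state=1)   # with replacement (items can repeat)
strat = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=1))       # stratified: 10% from each partition
```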


Sampling is not needed if you have big data?

• The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative
  – A sample is representative if it has approximately the same statistical property (of interest) as the original set of data


Sample Size

[Figure: the same data set sampled at 8000, 2000, and 500 points]


Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

[Figure: randomly generate 500 points and compute the difference between the max and min distance between any pair of points; the contrast shrinks as dimensionality grows (see the simulation below)]
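A minimal simulation of the experiment described in the figure, assuming points drawn uniformly from the unit cube:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    dists = pdist(rng.random((500, d)))      # all pairwise distances of 500 random points
    contrast = (dists.max() - dists.min()) / dists.min()
    print(d, round(contrast, 2))             # relative spread shrinks as d increases
```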


Dimensionality Reduction

• Purpose:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise

• Techniques (PCA and SVD are called feature extraction; see the PCA sketch below)
  – Principal Component Analysis (PCA)
  – Singular Value Decomposition (SVD)
  – Feature selection (filter, wrapper)
  – Feature transformation (Fourier, wavelet)
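A minimal scikit-learn sketch of PCA on hypothetical data with one deliberately redundant feature:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # hypothetical 10-dimensional data
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # make two features nearly redundant

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # project onto the top-2 principal components
print(pca.explained_variance_ratio_)     # variance captured by each component
```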


Mapping Data to a New Space

• Fourier transform
• Wavelet transform

[Figure: two sine waves, the same waves with noise, and their frequency-domain representation]


Attribute Transformation

• A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
  – Simple functions: x^k, log(x), e^x, |x|
  – Standardization and normalization (see the sketch below)
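A minimal sketch of standardization, normalization, and a simple functional transform, on hypothetical height and income columns:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[180.0, 70000.0],     # hypothetical height (cm) and income columns
              [165.0, 55000.0],
              [172.0, 120000.0]])

X_std = StandardScaler().fit_transform(X)   # standardization: zero mean, unit variance
X_mm = MinMaxScaler().fit_transform(X)      # normalization: rescale each attribute to [0, 1]
X_log = np.log(X)                           # simple functional transform, log(x)
```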


Similarity and Dissimilarity

• Similarity
  – Numerical measure of how alike two data objects are
  – Higher when objects are more alike
  – Often falls in the range [0, 1]

• Dissimilarity
  – Numerical measure of how different two data objects are
  – Lower when objects are more alike
  – Minimum dissimilarity is often 0
  – Upper limit varies

• Proximity refers to either a similarity or a dissimilarity


Similarity/Dissimilarity measures

p and q are the attribute values for two data objects.



Cosine Similarity

• If d1 and d2 are two document vectors, then

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

  where · indicates the vector dot product and ||d|| is the length of vector d.

• Example:

  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2

  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

  cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
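The slide's computation, reproduced with NumPy as a quick check:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315
```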


Similarity Between Binary Vectors

• A common situation is that objects p and q have only binary attributes

• Compute similarities using the following quantities:
  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1

• Simple Matching (SM) and Jaccard coefficients

  SM = number of matches / number of attributes
     = (M11 + M00) / (M01 + M10 + M11 + M00)

  JC = number of 1-1 matches / number of not-both-zero attribute values
     = M11 / (M01 + M10 + M11)


SMC versus Jaccard: Example

p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SM = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

JC = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
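The same counts and coefficients, reproduced with NumPy:

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))   # 0
m00 = np.sum((p == 0) & (q == 0))   # 7
m10 = np.sum((p == 1) & (q == 0))   # 1
m01 = np.sum((p == 0) & (q == 1))   # 2

smc = (m11 + m00) / (m11 + m00 + m10 + m01)   # 0.7
jc = m11 / (m11 + m10 + m01)                  # 0.0
```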


Extended Jaccard Coefficient (Tanimoto)

• Variation of Jaccard for continuous or count attributes
  – Reduces to Jaccard for binary attributes
  – T(p, q) = (p · q) / (||p||^2 + ||q||^2 − p · q)


Euclidean Distance
• Euclidean distance:

  dist(p, q) = \sqrt{ \sum_{k=1}^{n} (p_k - q_k)^2 }

  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

• Standardization is necessary if scales differ.


Examples for Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix


Minkowski Distance
• Minkowski distance is a generalization of Euclidean distance:

  dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

  where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.


Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  – A common example is the Hamming distance, which is the number of bits that differ between two binary vectors

• r = 2. Euclidean distance

• r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
  – This is the maximum difference between any component of the vectors

• Do not confuse r with n; all these distances are defined for any number of dimensions.


Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrices
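A sketch reproducing all three matrices with scipy.spatial.distance.cdist:

```python
import numpy as np
from scipy.spatial.distance import cdist

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4 from the table above

L1 = cdist(pts, pts, "cityblock")    # r = 1, Manhattan
L2 = cdist(pts, pts, "euclidean")    # r = 2
Linf = cdist(pts, pts, "chebyshev")  # r -> infinity, supremum
print(np.round(L2, 3))               # matches the L2 matrix above
```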
Mahalanobis Distance

  mahalanobis(p, q) = (p - q)^T \Sigma^{-1} (p - q)

• \Sigma is the covariance matrix of the input data X:

  \Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)

[Figure: for the red points, the Euclidean distance is 14.7 while the Mahalanobis distance is 6]


Mahalanobis Distance [Chapter 2.4.6, page 82]

Covariance matrix:

  \Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}

Points:
  A: (0.5, 0.5)
  B: (0, 1)
  C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
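A NumPy sketch reproducing the slide's numbers; note that the slide uses the squared form, whereas scipy.spatial.distance.mahalanobis would return its square root:

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])   # covariance matrix from the slide
VI = np.linalg.inv(Sigma)

A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

def mahal(p, q):
    d = p - q
    return d @ VI @ d            # squared form used on the slide (no square root)

print(mahal(A, B), mahal(A, C))  # 5.0 4.0
```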


Basic statistics

• Mean, median, mode, variance, covariance, correlation, skewness, kurtosis
• Discrete vs. continuous random variables
• PDF: probability density function
• CDF: cumulative distribution function
• Normality tests, outlier detection
• Chi-square test, t-test (mean), F-test (variance)
• ANOVA
• Linear regression


Visually evaluating correlation using scatter plot

[Figure: scatter plots showing similarity values from −1 to 1]


Box plot for recognizing outliers



Correlation coefficient

• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects p and q, and then take their scaled dot product:

  p'_k = (p_k - mean(p)) / std(p)
  q'_k = (q_k - mean(q)) / std(q)

  correlation(p, q) = \frac{p' \cdot q'}{n - 1}   (inner product, using the sample standard deviation)
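A NumPy sketch of standardize-then-dot on hypothetical data, checked against np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.normal(size=50)
q = 0.8 * p + 0.6 * rng.normal(size=50)   # hypothetical correlated attribute

ps = (p - p.mean()) / p.std(ddof=1)       # standardize both objects
qs = (q - q.mean()) / q.std(ddof=1)
r = ps @ qs / (len(p) - 1)                # scaled dot product

print(np.isclose(r, np.corrcoef(p, q)[0, 1]))   # True
```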


Covariance

• Correlation is covariance normalized by the respective standard deviations:

  correlation(p, q) = \frac{cov(p, q)}{std(p) \, std(q)}


Discrete probability distribution



Continuous probability distribution



Probability density function (PDF)

[Figure: histogram of a continuous variable with fitted density curve f(x)]

• A PDF is a mathematical function that characterizes the shape of the histogram of a continuous random variable


Normal (Gaussian) distribution

• The normal density is f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}
• N(0, 1) represents a standard normal with mean 0 and unit standard deviation


Continuous distribution
For a continuous distribution such as the Gaussian or the uniform, the CDF is the integral of the PDF, and the PDF is the derivative of the CDF.

[Figure: PDF and CDF of the Gaussian and uniform distributions, linked by integration and differentiation]


Other probability distributions

For a discrete distribution such as the Poisson, the CDF is the accumulated sum of the PMF, and the PMF is the successive difference of the CDF. For a continuous distribution such as the exponential, the CDF is the integral of the PDF, and the PDF is its derivative.

[Figure: PMF and CDF of the Poisson distribution; PDF and CDF of the exponential distribution]


Specific probability distribution

The same integral/derivative relationship links the PDF and CDF of the triangular and Weibull distributions.

[Figure: PDF and CDF of the triangular and Weibull distributions]


Unbiased estimation

• Sample mean (first order): \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, with E[\bar{X}] = \mu (population mean)

• Sample variance (second order): S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2, with E[S^2] = \sigma^2 (population variance)
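A small NumPy reminder: var() divides by n by default, so ddof=1 is needed for the unbiased estimate. The sample values are hypothetical:

```python
import numpy as np

x = np.array([4.1, 5.0, 3.8, 4.6, 5.3])   # hypothetical sample
xbar = x.mean()                           # unbiased estimate of the population mean
s2 = x.var(ddof=1)                        # divide by n-1 so that E[S^2] = sigma^2
```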


Skewness (third order)

Skewness is the standardized third central moment, E[(X - \mu)^3] / \sigma^3; it measures the asymmetry of a distribution.


Kurtosis (fourth order)

Kurtosis is the standardized fourth central moment, E[(X - \mu)^4] / \sigma^4; it measures the heaviness of the tails.


Chi-square Test of Independence

• Goal: determine whether two categorical variables are independent
• Examples: gender vs. purchased car type (sedan, SUV, truck, etc.); machine type vs. material in terms of productivity

[Figure: r × c contingency table with row and column totals]


Chi-square independence test

• Assuming the row variable and the column variable are independent, the expected count in cell (i, j) of the contingency table is

  E_{ij} = n \cdot \frac{O_{i\cdot}}{n} \cdot \frac{O_{\cdot j}}{n} = \frac{O_{i\cdot} O_{\cdot j}}{n},   i = 1, 2, ..., r;  j = 1, 2, ..., c

• Generally, the test is considered valid when every E_{ij} ≥ 5; the larger the chi-square value, the stronger the evidence for the alternative:

  \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
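A minimal scipy.stats sketch with a hypothetical gender-by-car-type table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: gender (rows) x purchased car type (columns)
observed = np.array([[120, 90, 40],
                     [110, 95, 45]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)   # dof = (r-1)(c-1) = 2
print(expected)       # E_ij = row total * column total / n
```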
Statistical assumption

• Chi-square test
  H0: row variable and column variable are independent
  Ha: row variable and column variable are dependent

• The \chi^2 statistic has (r − 1)(c − 1) degrees of freedom

• If \chi^2 > \chi^2_{(r-1)(c-1), \alpha}, then p < \alpha
  – In this case, we reject the null H0 (accept Ha)

• The p-value is the right-tail probability


Acceptance or rejection is based on null hypothesis

If p > α (= 0.05), accept H0 (the null hypothesis holds).
If p < α (= 0.05), reject H0 (the alternative hypothesis Ha holds).


Proportion test between two groups

• Derive point/interval estimates for P1 − P2

  E[\hat{P}_1 - \hat{P}_2] = P_1 - P_2                       Null: P_1 - P_2 = 0
                                                             Alternative: P_1 - P_2 ≠ 0
  \sigma_{\hat{P}_1 - \hat{P}_2} = \sqrt{ \frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2} }

• Example (see the sketch below):
  – Investigate and compare policy support between urban and rural regions: 5000 samples from cities with 2400 supporting votes; 2000 samples from the countryside with 1200 supporting votes
  – A 95% confidence interval for the difference in support between cities and the countryside is (−0.141, −0.099)
  – Because the confidence interval does not contain 0, the support rates are significantly different (the urban rate is lower than the rural rate)
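A sketch reproducing the example; note that the slide's quoted interval appears to correspond to z ≈ 1.645 rather than the usual two-sided 1.96:

```python
import numpy as np
from scipy.stats import norm

p1, n1 = 2400 / 5000, 5000   # urban support rate 0.48
p2, n2 = 1200 / 2000, 2000   # rural support rate 0.60

diff = p1 - p2               # -0.12
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = norm.ppf(0.975)          # 1.96 for a two-sided 95% interval
print(diff - z * se, diff + z * se)   # about (-0.146, -0.094)
# The slide's interval (-0.141, -0.099) matches z ~ 1.645 instead
```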


One-group t-test for male height

H0: \mu = 170
Ha: \mu ≠ 170 (two-sided; one-sided versions use Ha: \mu > 170 or Ha: \mu < 170)


Mean difference between two groups

• X1 and X2 come from two independent groups with sample means \bar{X}_1, \bar{X}_2 and sample variances S_1^2, S_2^2

• Sampling distribution of \bar{X}_1 - \bar{X}_2:

  E[\bar{X}_1 - \bar{X}_2] = \mu_1 - \mu_2 = \mu_{\bar{X}_1 - \bar{X}_2}              Null: \mu_1 - \mu_2 = 0
  Var[\bar{X}_1 - \bar{X}_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = \sigma^2_{\bar{X}_1 - \bar{X}_2}    Alternative: \mu_1 - \mu_2 ≠ 0


Sample distribution for μ1-μ2
• Statistical measures:

  z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}        t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}

• Confidence interval: (\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2} \, \sigma_{\bar{X}_1 - \bar{X}_2}

• z or t distribution: if the standard deviation of the population is known, use the z distribution; otherwise, use the t distribution.

• In practice, when the number of samples exceeds 30, the t result is close to the z result; thus, the t distribution is commonly used in a mean-difference test.
Example for mean-difference test

• Comparing male waists to female waists: are they different?
  – H0: their average waists are equal (null hypothesis)
    (\mu_1 - \mu_2 = 0, i.e., \mu_1 = \mu_2)
  – Ha: their average waists are different (alternative hypothesis)
    (\mu_1 - \mu_2 ≠ 0, i.e., \mu_1 ≠ \mu_2)

• Paired-sample t-test: the samples come from the same group, such as testing whether patients have improved or not after taking a particular treatment (see the sketch below)
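A minimal scipy.stats sketch with hypothetical waist data, covering both the independent and the paired designs:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)
male = rng.normal(85, 8, size=40)     # hypothetical waist measurements (cm)
female = rng.normal(78, 7, size=40)

t, p = ttest_ind(male, female, equal_var=True)   # set equal_var per the F-test (next slide)
print(t, p)                                      # reject H0 when p < alpha

before = rng.normal(90, 5, size=30)              # paired design: same patients
after = before - rng.normal(2, 1, size=30)       # hypothetical improvement after treatment
t2, p2 = ttest_rel(before, after)
```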
Notes
• Prior to conducting a mean-difference test, a variance-ratio test should be run first, because the t-test requires knowing whether the variances of the two groups are equal

• Hypotheses:
  – Null H0: \sigma_1^2 = \sigma_2^2 (\sigma_1^2 / \sigma_2^2 = 1)
  – Alternative Ha: \sigma_1^2 ≠ \sigma_2^2

• F distribution: F = \frac{\chi_1^2 / df_1}{\chi_2^2 / df_2}

  The F value is a ratio of two chi-square values (with degrees of freedom df_1 and df_2, respectively)
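A minimal sketch of the variance-ratio test, assuming a two-sided alternative; the helper name is hypothetical:

```python
import numpy as np
from scipy.stats import f

def variance_ratio_test(x1, x2):
    s1, s2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    F = s1 / s2
    df1, df2 = len(x1) - 1, len(x2) - 1
    p = 2 * min(f.cdf(F, df1, df2), f.sf(F, df1, df2))   # two-sided p-value
    return F, p
```

Applied to the male/female samples above, its p-value decides the equal_var flag of ttest_ind.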
ANOVA (Analysis of Variance)

• ANOVA is a typical application of the F-test

• Hypotheses for three groups:
  – Null: \mu_1 = \mu_2 = \mu_3
  – Alternative: \mu_i ≠ \mu_j (for some pair)

• Three assumptions:
  – Y_i ~ N(\mu_i, \sigma^2), i = 1, 2, ..., k (normality)
  – \sigma_i^2 = \sigma^2 (homogeneity)
  – Each group is independent (independence)


One Way ANOVA
• Total variance (assuming k groups):

  SST = SSF + SSE

• Factorial variance (variance between groups):

  SSF = \sum_{i=1}^{k} n_i (\bar{Y}_i - \bar{Y})^2

• Random variance (variance within groups):

  SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 = \sum_{i=1}^{k} (n_i - 1) S_i^2


ANOVA can handle multiple groups
• Mean factor error (MSF or MSB): MSF = \frac{SSF}{k - 1}

• Mean random error (MSE or MSW): MSE = \frac{SSE}{\sum n_i - k}

• F value: F = \frac{MSF}{MSE} \sim F_{k-1, \, \sum n_i - k}

• Decision making:
  – Accept H0 if F ≤ F_{k-1, \sum n_i - k, \alpha} (the factor is insignificant)
  – Reject H0 if F > F_{k-1, \sum n_i - k, \alpha} (the factor is significant)
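A minimal scipy.stats sketch with three hypothetical groups:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
g1 = rng.normal(10.0, 2, size=20)   # hypothetical measurements from three groups
g2 = rng.normal(10.5, 2, size=20)
g3 = rng.normal(13.0, 2, size=20)

F, p = f_oneway(g1, g2, g3)         # F = MSF / MSE with df (k-1, sum(n_i) - k)
print(F, p)                         # reject mu1 = mu2 = mu3 when p < alpha
```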


F distribution: acceptance and rejection regions

Recall that the F statistic is a ratio of two chi-square values, each divided by its degrees of freedom; the rejection region lies in the right tail.

[Figure: F distribution with acceptance region and right-tail rejection region]


Two-way ANOVA (two factors)

SST = SSA + SSB + SSAB + SSE

• SST: sum of total squared errors
• SSA: sum of squared errors caused by factor A
• SSB: sum of squared errors caused by factor B
• SSAB: sum of squared errors caused by the interaction between factors A and B
• SSE: sum of squared random errors (cannot be avoided or explained)

A statsmodels sketch follows.
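A minimal statsmodels sketch on hypothetical machine-by-material data; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "machine": np.repeat(["M1", "M2"], 9),
    "material": np.tile(np.repeat(["A", "B", "C"], 3), 2),
})
df["productivity"] = rng.normal(50, 3, size=len(df))   # hypothetical response

model = smf.ols("productivity ~ C(machine) * C(material)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # rows correspond to SSA, SSB, SSAB, SSE
```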


Two-way ANOVA without interaction

[Figure: mean productivity of Machine 1 and Machine 2 across materials A, B, and C; the two lines are roughly parallel, indicating no interaction]


Two-way ANOVA with interaction

[Figure: mean productivity of Machine 1 and Machine 2 across materials A, B, and C; the two lines cross, indicating an interaction]


Univariate control chart

Multivariate control chart


Simple Moving Average

• The forecast is the average of a fixed number of past periods
• Useful when demand is not growing or declining rapidly and no seasonality is present
• Removes some of the random fluctuation from the data
• Selecting the period length is important
  – Longer periods provide more smoothing
  – Shorter periods react to variations more quickly


Simple Moving Average Formula

  F_t = \frac{A_{t-1} + A_{t-2} + \cdots + A_{t-n}}{n}

where F_t is the forecast for period t, A_{t-i} are the n most recent actual values, and n is the number of periods averaged.
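A minimal pandas sketch with a hypothetical demand series:

```python
import pandas as pd

demand = pd.Series([42, 40, 43, 40, 41, 39, 46, 44, 45, 38])  # hypothetical demand
n = 3
forecast = demand.rolling(window=n).mean().shift(1)  # F_t = mean of the previous n actuals
print(forecast)
```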


Exponential Smoothing
• A weighted-average method that includes all past data in the forecasting calculation
• More recent results are weighted more heavily
• The most used of all forecasting techniques
• Selecting the smoothing constant is important
  – Smaller smoothing constants emphasize long-term trends
  – Bigger smoothing constants react to variations more quickly
• Well accepted for the following reasons:
  – Exponential models are surprisingly accurate
  – Formulating an exponential model is relatively easy
  – The user can understand how the model works
  – Little computation is required to use the model
  – Computer storage requirements are small


Exponential Smoothing Model

  F_t = \alpha A_{t-1} + (1 - \alpha) F_{t-1}
      = \alpha A_{t-1} + (1 - \alpha)[\alpha A_{t-2} + (1 - \alpha) F_{t-2}] = \cdots

\alpha lies between zero and one and is typically set around 0.3 to avoid overfitting.
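A minimal sketch of the recursion, seeding the first forecast with the first actual (an assumption; other initializations are common):

```python
def exp_smooth(actuals, alpha=0.3):
    """F_t = alpha * A_{t-1} + (1 - alpha) * F_{t-1}, seeded with F_1 = A_1."""
    forecasts = [actuals[0]]
    for t in range(1, len(actuals)):
        forecasts.append(alpha * actuals[t - 1] + (1 - alpha) * forecasts[t - 1])
    return forecasts

print(exp_smooth([42, 40, 43, 40, 41], alpha=0.3))   # hypothetical demand series
```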


Holt-Winters moving average

  L_t = \alpha (y_t - S_{t-s}) + (1 - \alpha)(L_{t-1} + b_{t-1})

  b_t = \beta (L_t - L_{t-1}) + (1 - \beta) b_{t-1}

  S_t = \gamma (y_t - L_t) + (1 - \gamma) S_{t-s}

  F_{t+k} = L_t + k b_t + S_{t+k-s}

The forecast consists of level (L), trend (b), and seasonal (S) components, where s is the season length.
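A minimal sketch of the additive recursions. The initializations (first-season mean for the level, zero trend, within-season deviations for the seasonals) are assumptions for illustration, not part of the slide:

```python
def holt_winters_additive(y, s, alpha=0.3, beta=0.1, gamma=0.1, horizon=4):
    """Additive Holt-Winters; s is the season length, horizon <= s here."""
    L = sum(y[:s]) / s                    # initial level: mean of the first season
    b = 0.0                               # initial trend
    S = [y[i] - L for i in range(s)]      # initial seasonal deviations
    for t in range(s, len(y)):
        L_prev = L
        L = alpha * (y[t] - S[t - s]) + (1 - alpha) * (L + b)
        b = beta * (L - L_prev) + (1 - beta) * b
        S.append(gamma * (y[t] - L) + (1 - gamma) * S[t - s])
    # F_{t+k} = L_t + k * b_t + S_{t+k-s}, forecasting from the last observation
    T = len(y)
    return [L + k * b + S[T - 1 + k - s] for k in range(1, horizon + 1)]

demand = [10, 14, 8, 12, 11, 15, 9, 13, 12, 16, 10, 14]  # hypothetical, season s = 4
print(holt_winters_additive(demand, s=4))
```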
