DS5 Statistics
Data Science
By Prof. Chih-Hsuan (Jason) Wang
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc
– More “stable” data
Aggregated data tends to have less variability
When dimensionality increases, data becomes increasingly sparse in the space that it occupies (the curse of dimensionality).
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
Techniques (PCA & SVD are called feature extraction)
– Principal Component Analysis (PCA)
– Singular Value Decomposition (SVD)
– Feature selection (filter, wrapper)
– Feature transformation (Fourier, Wavelet)
Fourier transform
Wavelet transform
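As a sketch of the feature-extraction route listed above, PCA can be computed from the SVD of the centered data matrix (the 6×3 matrix below is a made-up toy example, not data from the slides):

```python
import numpy as np

# Toy data: 6 samples, 3 attributes (hypothetical values for illustration)
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.2],
    [2.2, 2.9, 0.3],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
    [2.3, 2.7, 0.6],
])

# Center each attribute, then take the SVD; the rows of Vt are the
# principal components (PCA via SVD avoids forming the covariance matrix)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first 2 principal components (dimensionality 3 -> 2)
X_reduced = Xc @ Vt[:2].T

# Fraction of total variance captured by each component
explained = s**2 / np.sum(s**2)
```

In practice the number of retained components is chosen so that the cumulative `explained` fraction reaches some threshold (e.g. 90–95%).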
Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Example (cosine similarity):
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 · d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
||d1|| = (3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) = 0.315
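The arithmetic above is easy to verify in a few lines (note that (6)^0.5 = 2.449):

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

dot = d1 @ d2                    # 3*1 + 2*1 = 5
norm1 = np.linalg.norm(d1)       # sqrt(42) ≈ 6.481
norm2 = np.linalg.norm(d2)       # sqrt(6)  ≈ 2.449
cos_sim = dot / (norm1 * norm2)  # ≈ 0.315
```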
Example (binary vectors, simple matching vs. Jaccard):
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M11 = 0, M10 = 1, M01 = 2, M00 = 7
SMC = (M11 + M00) / (M11 + M10 + M01 + M00) = 7/10 = 0.7
Jaccard = M11 / (M11 + M10 + M01) = 0/3 = 0
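For binary vectors like p and q above, the simple matching coefficient (SMC) and Jaccard coefficient follow directly from the four match counts; a minimal sketch:

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = int(np.sum((p == 1) & (q == 1)))  # both 1
m00 = int(np.sum((p == 0) & (q == 0)))  # both 0
m10 = int(np.sum((p == 1) & (q == 0)))  # p=1, q=0
m01 = int(np.sum((p == 0) & (q == 1)))  # p=0, q=1

smc = (m11 + m00) / len(p)          # counts 0-0 matches
jaccard = m11 / (m01 + m10 + m11)   # ignores 0-0 matches
```

The contrast is the point of the example: SMC is high (0.7) only because both vectors are mostly zeros, while Jaccard is 0 because the vectors share no 1s.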
Euclidean distance:
dist(p, q) = ( Σ_{k=1}^{n} (p_k − q_k)² )^{1/2}
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

(Figure: scatter plot of the four points.)

     p1     p2     p3     p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
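The distance matrix above can be reproduced with NumPy broadcasting:

```python
import numpy as np

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # p1..p4

# Pairwise Euclidean (L2) distance matrix: differences of all point pairs,
# squared, summed over coordinates, then square-rooted
diff = pts[:, None, :] - pts[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(D, 3))
```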
r = 2: Euclidean distance.
Do not confuse the parameter r with n, the number of dimensions: all of these distances are defined for any number of dimensions.
L∞   p1    p2    p3    p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
Distance Matrix
Data Science, Department of IEM at NYCU, Hsinchu, Taiwan
Mahalanobis Distance
mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ
where Σ is the covariance matrix of the input data:
Σ_{j,k} = (1 / (n − 1)) Σ_{i=1}^{n} (X_ij − X̄_j)(X_ik − X̄_k)
Covariance Matrix:
Σ = | 0.3  0.2 |
    | 0.2  0.3 |
A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
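A quick check of the two distances above (note that the values 5 and 4 are squared Mahalanobis distances, i.e. no square root is taken):

```python
import numpy as np

cov = np.array([[0.3, 0.2], [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)  # equals [[6, -4], [-4, 6]]

def mahal_sq(p, q):
    """Squared Mahalanobis distance (p - q) Sigma^{-1} (p - q)^T."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(d @ cov_inv @ d)

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
mab = mahal_sq(A, B)  # ≈ 5
mac = mahal_sq(A, C)  # ≈ 4
```

C is farther from A than B is in Euclidean terms, yet Mahal(A, C) < Mahal(A, B), because the A-to-C direction is the direction of large variance in Σ.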
Figure: scatter plots showing correlation (similarity) values ranging from −1 to 1.
Probability distributions (f(x): probability density):
– The CDF is the integral of the PDF; the PDF is the derivative (differential) of the CDF.
– For a discrete distribution such as the Poisson, the CDF is obtained by summation and the probability mass function by differencing.
– Distributions shown: Gaussian, Uniform, Poisson, Triangular, Weibull.
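The PDF-CDF relationship can be checked numerically; the sketch below integrates the standard Gaussian PDF with the trapezoid rule and compares the result with the closed-form CDF (expressed via the error function):

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    # Gaussian probability density function
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gauss_cdf(x, mu=0.0, sigma=1.0):
    # Closed-form Gaussian CDF via the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def cdf_by_integration(x, n=10000):
    # CDF as the integral of the PDF: trapezoid rule from -8 (≈ -inf) to x
    a = -8.0
    h = (x - a) / n
    s = 0.5 * (gauss_pdf(a) + gauss_pdf(x)) + sum(gauss_pdf(a + i * h) for i in range(1, n))
    return s * h

phi1_exact = gauss_cdf(1.0)           # ≈ 0.8413
phi1_numeric = cdf_by_integration(1.0)
```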
Chi-square statistic: χ² = Σ_{i} Σ_{j} (O_ij − E_ij)² / E_ij,
where the expected count E_ij = (row i total × column j total) / grand total.
Statistical assumption
Chi-Square Test
H0: row variable and column variable are independent
Ha: row variable and column variable are dependent
Figure: sampling distributions when H0 holds vs. when Ha holds, with the "Reject H0" and "Accept H0" regions marked.
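A worked chi-square independence test on a hypothetical 2×2 contingency table (the counts below are invented for illustration):

```python
# Hypothetical 2x2 contingency table of observed counts
observed = [[30, 20],
            [20, 30]]

row_tot = [sum(r) for r in observed]        # row totals
col_tot = [sum(c) for c in zip(*observed)]  # column totals
grand = sum(row_tot)                        # grand total

# chi^2 = sum over cells of (O_ij - E_ij)^2 / E_ij,
# with expected counts E_ij = row_i total * col_j total / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / grand
        chi2 += (o - e) ** 2 / e
```

Here every expected count is 25, giving χ² = 4.0 with df = (2−1)(2−1) = 1; since 4.0 > 3.841 (the critical value at α = 0.05), H0 (independence) is rejected.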
H0: μ = 170 (two-sided test)
H0: μ ≥ 170 (one-sided test)
H0: μ ≤ 170 (one-sided test)
E[X̄1 − X̄2] = μ1 − μ2                            Null H0: μ1 − μ2 = 0
Var(X̄1 − X̄2) = σ²_{X̄1−X̄2} = σ1²/n1 + σ2²/n2    Alternative Ha: μ1 − μ2 ≠ 0
Confidence interval: (X̄1 − X̄2) ± Z_{α/2} · σ_{X̄1−X̄2}
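A worked two-sample z example using made-up summary statistics (continuing the μ = 170 height setting; all numbers below are hypothetical):

```python
import math

# Hypothetical summary statistics for two independent samples
xbar1, sigma1, n1 = 172.0, 6.0, 50
xbar2, sigma2, n2 = 169.0, 5.0, 60

# Standard error of Xbar1 - Xbar2: sqrt(sigma1^2/n1 + sigma2^2/n2)
se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)

# Test statistic for H0: mu1 - mu2 = 0
z = (xbar1 - xbar2) / se

# 95% confidence interval: (Xbar1 - Xbar2) +/- z_{alpha/2} * SE, z_{0.025} = 1.96
lo = (xbar1 - xbar2) - 1.96 * se
hi = (xbar1 - xbar2) + 1.96 * se
```

With these numbers z ≈ 2.81 > 1.96, so H0: μ1 − μ2 = 0 is rejected at α = 0.05; equivalently, the 95% interval for μ1 − μ2 does not contain 0.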
– Alternative Ha: σ1² ≠ σ2²
F distribution: F = (χ1² / df1) / (χ2² / df2)
The F value is a ratio of two chi-square values (with degrees of freedom df1 and df2, respectively).
ANOVA (analysis of variance)
Decision making:
Accept H0 if F ≤ F_{k−1, Σnᵢ−k, α} (the factor is insignificant)
Reject H0 if F > F_{k−1, Σnᵢ−k, α} (the factor is significant)
Figure: F distribution showing the acceptance and rejection regions.
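A worked one-way ANOVA on hypothetical data (three materials, four observations each; the numbers are invented for illustration):

```python
# Hypothetical hourly outputs for three materials (one-way ANOVA, k = 3 groups)
groups = {
    "A": [10, 12, 11, 13],
    "B": [14, 15, 13, 16],
    "C": [9, 8, 10, 9],
}

k = len(groups)
n = sum(len(v) for v in groups.values())
grand_mean = sum(sum(v) for v in groups.values()) / n

# Between-group (factor) sum of squares, df = k - 1
ssb = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())
# Within-group (error) sum of squares, df = n - k
ssw = sum(sum((x - sum(v) / len(v)) ** 2 for x in v) for v in groups.values())

F = (ssb / (k - 1)) / (ssw / (n - k))
```

Here F ≈ 22.75 with df = (2, 9); since this exceeds the critical value F_{2,9,0.05} ≈ 4.26, H0 is rejected and the material factor is significant.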
Figure: two panels plotting the average output per hour of Machine 1 and Machine 2 against materials A, B, and C (Materials on the x-axis, Productivity on the y-axis).
These are the additive Holt-Winters (triple exponential smoothing) equations; α is between zero and one, typically set to 0.3 to avoid overfitting:
Level:    L_t = α(y_t − S_{t−s}) + (1 − α)(L_{t−1} + b_{t−1})
Trend:    b_t = β(L_t − L_{t−1}) + (1 − β)b_{t−1}
Seasonal: S_t = γ(y_t − L_t) + (1 − γ)S_{t−s}
Forecast: F_{t+k} = L_t + k·b_t + S_{t+k−s}
The forecast consists of level (L), trend (b), and seasonal (S) components, where s is the season length.
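A minimal sketch of the smoothing recursions, assuming an additive seasonal pattern (the initialization of L, b, and S below is a simple heuristic chosen for illustration, not from the slides):

```python
def holt_winters_additive(y, s, alpha=0.3, beta=0.1, gamma=0.2, k=1):
    """Additive Holt-Winters smoothing; returns the k-step-ahead forecast (k <= s)."""
    # Naive initialization: level = mean of first season, trend = average
    # per-step change between the first two seasons, seasonal = deviations
    L = sum(y[:s]) / s
    b = (sum(y[s:2 * s]) - sum(y[:s])) / (s * s)
    S = [y[i] - L for i in range(s)]

    for t in range(s, len(y)):
        L_prev = L
        L = alpha * (y[t] - S[t - s]) + (1 - alpha) * (L + b)   # level update
        b = beta * (L - L_prev) + (1 - beta) * b                # trend update
        S.append(gamma * (y[t] - L) + (1 - gamma) * S[t - s])   # seasonal update

    # Forecast: F_{t+k} = L_t + k*b_t + S_{t+k-s}
    return L + k * b + S[len(y) + k - s - 1]
```

For example, on a series with trend 1 per step and a period-4 seasonal pattern, the one-step-ahead forecast should land near the continuation of that pattern.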