
Overview of statistics

Lecture Notes for Chapter 5

Data Science
By
Prof. Chih-Hsuan (Jason) Wang

Data Science, Department of IEM at NYCU, Hsinchu, Taiwan


Data Quality

• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
  – noise and outliers
  – missing values
  – duplicate data


Noise

• Noise refers to modification of original values
  – Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen

[Figure: two sine waves, and the same two sine waves with noise added]


Outliers

• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set


Missing Values
• Reasons for missing values
  – Information is not collected
    (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases
    (e.g., annual income is not applicable to children)

• Handling missing values (see the sketch below)
  – Eliminate data objects
  – Estimate missing values
  – Ignore the missing value during analysis
  – Replace with all possible values (weighted by their probabilities)
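A minimal pandas sketch of the first two strategies; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing age and income values
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50000, 62000, np.nan, 48000]})

dropped = df.dropna()                                 # eliminate incomplete data objects
filled = df.fillna({"age": df["age"].median(),        # estimate missing values from the
                    "income": df["income"].mean()})   # observed distribution
```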
Data Preprocessing

• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Attribute transformation


Aggregation (roll-up)

• Combining two or more attributes (or objects) into a single attribute (or object)

• Purpose (see the groupby sketch below)
  – Data reduction
    · Reduce the number of attributes or objects
  – Change of scale
    · Cities aggregated into regions, states, countries, etc.
  – More "stable" data
    · Aggregated data tends to have less variability
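A minimal pandas groupby sketch of the roll-up idea, on a hypothetical sales table:

```python
import pandas as pd

# Hypothetical daily sales records
sales = pd.DataFrame({"city": ["Hsinchu", "Hsinchu", "Taipei", "Taipei"],
                      "region": ["North"] * 4,
                      "amount": [120, 95, 300, 280]})

by_city = sales.groupby("city")["amount"].sum()      # data reduction: days -> city totals
by_region = sales.groupby("region")["amount"].sum()  # change of scale: cities -> region
```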


Sampling

• Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.


Types of Sampling

• Simple random sampling
  – There is an equal probability of selecting any particular item

• Sampling without replacement
  – As each item is selected, it is removed from the population

• Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample
    · In sampling with replacement, the same object can be picked more than once

• Stratified sampling
  – Split the data into several partitions, then draw random samples from each partition

A sketch of these schemes in pandas follows.
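A minimal pandas sketch, assuming a hypothetical two-stratum population:

```python
import pandas as pd

# Hypothetical population with two strata of unequal size
df = pd.DataFrame({"stratum": ["A"] * 800 + ["B"] * 200,
                   "value": range(1000)})

srs = df.sample(n=100, replace=False, random_state=1)   # simple random, without replacement
boot = df.sample(n=100, replace=True, random_state=1)   # with replacement (items can repeat)
strat = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=1))       # stratified: 10% from each partition
```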


Sampling is not needed if you have big data?

• The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative
  – A sample is representative if it has approximately the same statistical property (of interest) as the original set of data


Sample Size

[Figure: the same data set sampled at 8000, 2000, and 500 points]


Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

[Figure: randomly generate 500 points and compute the difference between the max and min distance between any pair of points; the contrast shrinks as dimensionality grows (see the simulation below)]
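A minimal simulation of the experiment described in the figure, assuming points drawn uniformly from the unit cube:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    dists = pdist(rng.random((500, d)))      # all pairwise distances of 500 random points
    contrast = (dists.max() - dists.min()) / dists.min()
    print(d, round(contrast, 2))             # relative spread shrinks as d increases
```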


Dimensionality Reduction

• Purpose:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise

• Techniques (PCA and SVD are called feature extraction; see the PCA sketch below)
  – Principal Component Analysis (PCA)
  – Singular Value Decomposition (SVD)
  – Feature selection (filter, wrapper)
  – Feature transformation (Fourier, wavelet)
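A minimal scikit-learn sketch of PCA on hypothetical data with one deliberately redundant feature:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # hypothetical 10-dimensional data
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # make two features nearly redundant

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # project onto the top-2 principal components
print(pca.explained_variance_ratio_)     # variance captured by each component
```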


Mapping Data to a New Space

• Fourier transform
• Wavelet transform

[Figure: two sine waves, the same waves with noise, and their frequency-domain representation]


Attribute Transformation

• A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
  – Simple functions: x^k, log(x), e^x, |x|
  – Standardization and normalization (see the sketch below)
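A minimal sketch of standardization, normalization, and a simple functional transform, on hypothetical height and income columns:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[180.0, 70000.0],     # hypothetical height (cm) and income columns
              [165.0, 55000.0],
              [172.0, 120000.0]])

X_std = StandardScaler().fit_transform(X)   # standardization: zero mean, unit variance
X_mm = MinMaxScaler().fit_transform(X)      # normalization: rescale each attribute to [0, 1]
X_log = np.log(X)                           # simple functional transform, log(x)
```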


Similarity and Dissimilarity

• Similarity
  – Numerical measure of how alike two data objects are
  – Higher when objects are more alike
  – Often falls in the range [0, 1]

• Dissimilarity
  – Numerical measure of how different two data objects are
  – Lower when objects are more alike
  – Minimum dissimilarity is often 0
  – Upper limit varies

• Proximity refers to either a similarity or a dissimilarity


Similarity/Dissimilarity measures

p and q are the attribute values for two data objects.



Cosine Similarity

• If d1 and d2 are two document vectors, then

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

  where · indicates the vector dot product and ||d|| is the length of vector d.

• Example:

  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2

  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

  cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
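The slide's computation, reproduced with NumPy as a quick check:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315
```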


Similarity Between Binary Vectors

• A common situation is that objects p and q have only binary attributes

• Compute similarities using the following quantities:
  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1

• Simple Matching (SM) and Jaccard coefficients

  SM = number of matches / number of attributes
     = (M11 + M00) / (M01 + M10 + M11 + M00)

  JC = number of 1-1 matches / number of not-both-zero attribute values
     = M11 / (M01 + M10 + M11)


SMC versus Jaccard: Example

p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SM = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

JC = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
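The same counts and coefficients, reproduced with NumPy:

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))   # 0
m00 = np.sum((p == 0) & (q == 0))   # 7
m10 = np.sum((p == 1) & (q == 0))   # 1
m01 = np.sum((p == 0) & (q == 1))   # 2

smc = (m11 + m00) / (m11 + m00 + m10 + m01)   # 0.7
jc = m11 / (m11 + m10 + m01)                  # 0.0
```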


Extended Jaccard Coefficient (Tanimoto)

• Variation of Jaccard for continuous or count attributes
  – Reduces to Jaccard for binary attributes
  – T(p, q) = (p · q) / (||p||^2 + ||q||^2 − p · q)


Euclidean Distance
• Euclidean distance:

  dist(p, q) = \sqrt{ \sum_{k=1}^{n} (p_k - q_k)^2 }

  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

• Standardization is necessary if scales differ.


Examples for Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix


Minkowski Distance
• Minkowski distance is a generalization of Euclidean distance:

  dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

  where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.


Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  – A common example is the Hamming distance, which is the number of bits that differ between two binary vectors

• r = 2. Euclidean distance

• r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
  – This is the maximum difference between any component of the vectors

• Do not confuse r with n; all these distances are defined for any number of dimensions.


Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrices
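A sketch reproducing all three matrices with scipy.spatial.distance.cdist:

```python
import numpy as np
from scipy.spatial.distance import cdist

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4 from the table above

L1 = cdist(pts, pts, "cityblock")    # r = 1, Manhattan
L2 = cdist(pts, pts, "euclidean")    # r = 2
Linf = cdist(pts, pts, "chebyshev")  # r -> infinity, supremum
print(np.round(L2, 3))               # matches the L2 matrix above
```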
Mahalanobis Distance

  mahalanobis(p, q) = (p - q)^T \Sigma^{-1} (p - q)

• \Sigma is the covariance matrix of the input data X:

  \Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)

[Figure: for the red points, the Euclidean distance is 14.7 while the Mahalanobis distance is 6]


Mahalanobis Distance [Chapter 2.4.6, page 82]

Covariance matrix:

  \Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}

Points:
  A: (0.5, 0.5)
  B: (0, 1)
  C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
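A NumPy sketch reproducing the slide's numbers; note that the slide uses the squared form, whereas scipy.spatial.distance.mahalanobis would return its square root:

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])   # covariance matrix from the slide
VI = np.linalg.inv(Sigma)

A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

def mahal(p, q):
    d = p - q
    return d @ VI @ d            # squared form used on the slide (no square root)

print(mahal(A, B), mahal(A, C))  # 5.0 4.0
```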


Basic statistics

• Mean, median, mode, variance, covariance, correlation, skewness, kurtosis
• Discrete vs. continuous random variables
• PDF: probability density function
• CDF: cumulative distribution function
• Normality tests, outlier detection
• Chi-square test, t-test (mean), F-test (variance)
• ANOVA
• Linear regression


Visually evaluating correlation using scatter plot

[Figure: scatter plots showing similarity values from −1 to 1]


Box plot for recognizing outliers



Correlation coefficient

• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects p and q, and then take their scaled dot product:

  p'_k = (p_k - mean(p)) / std(p)
  q'_k = (q_k - mean(q)) / std(q)

  correlation(p, q) = \frac{p' \cdot q'}{n - 1}   (inner product, using the sample standard deviation)
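A NumPy sketch of standardize-then-dot on hypothetical data, checked against np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.normal(size=50)
q = 0.8 * p + 0.6 * rng.normal(size=50)   # hypothetical correlated attribute

ps = (p - p.mean()) / p.std(ddof=1)       # standardize both objects
qs = (q - q.mean()) / q.std(ddof=1)
r = ps @ qs / (len(p) - 1)                # scaled dot product

print(np.isclose(r, np.corrcoef(p, q)[0, 1]))   # True
```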


Covariance

• Correlation is covariance normalized by the respective standard deviations:

  correlation(p, q) = \frac{cov(p, q)}{std(p) \, std(q)}


Discrete probability distribution



Continuous probability distribution



Probability density function (PDF)

[Figure: histogram of a continuous variable with fitted density curve f(x)]

• A PDF is a mathematical function that characterizes the shape of the histogram of a continuous random variable


Normal (Gaussian) distribution

• The normal density is f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}
• N(0, 1) represents a standard normal with mean 0 and unit standard deviation


Continuous distribution
For a continuous distribution such as the Gaussian or the uniform, the CDF is the integral of the PDF, and the PDF is the derivative of the CDF.

[Figure: PDF and CDF of the Gaussian and uniform distributions, linked by integration and differentiation]


Other probability distributions

For a discrete distribution such as the Poisson, the CDF is the accumulated sum of the PMF, and the PMF is the successive difference of the CDF. For a continuous distribution such as the exponential, the CDF is the integral of the PDF, and the PDF is its derivative.

[Figure: PMF and CDF of the Poisson distribution; PDF and CDF of the exponential distribution]


Specific probability distribution

The same integral/derivative relationship links the PDF and CDF of the triangular and Weibull distributions.

[Figure: PDF and CDF of the triangular and Weibull distributions]


Unbiased estimation

• Sample mean (first order): \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, with E[\bar{X}] = \mu (population mean)

• Sample variance (second order): S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2, with E[S^2] = \sigma^2 (population variance)
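A small NumPy reminder: var() divides by n by default, so ddof=1 is needed for the unbiased estimate. The sample values are hypothetical:

```python
import numpy as np

x = np.array([4.1, 5.0, 3.8, 4.6, 5.3])   # hypothetical sample
xbar = x.mean()                           # unbiased estimate of the population mean
s2 = x.var(ddof=1)                        # divide by n-1 so that E[S^2] = sigma^2
```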


Skewness (third order)

Skewness is the standardized third central moment, E[(X - \mu)^3] / \sigma^3; it measures the asymmetry of a distribution.


Kurtosis (fourth order)

Kurtosis is the standardized fourth central moment, E[(X - \mu)^4] / \sigma^4; it measures the heaviness of the tails.


Chi-square Test of Independence

• Goal: determine whether two categorical variables are independent
• Examples: gender vs. purchased car type (sedan, SUV, truck, etc.); machine type vs. material in terms of productivity

[Figure: r × c contingency table with row and column totals]


Chi-square independence test

• Assuming the row variable and the column variable are independent, the expected count in cell (i, j) of the contingency table is

  E_{ij} = n \cdot \frac{O_{i\cdot}}{n} \cdot \frac{O_{\cdot j}}{n} = \frac{O_{i\cdot} O_{\cdot j}}{n},   i = 1, 2, ..., r;  j = 1, 2, ..., c

• Generally, the test is considered valid when every E_{ij} ≥ 5; the larger the chi-square value, the stronger the evidence for the alternative:

  \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
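A minimal scipy.stats sketch with a hypothetical gender-by-car-type table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: gender (rows) x purchased car type (columns)
observed = np.array([[120, 90, 40],
                     [110, 95, 45]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)   # dof = (r-1)(c-1) = 2
print(expected)       # E_ij = row total * column total / n
```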
Statistical assumption

• Chi-square test
  H0: row variable and column variable are independent
  Ha: row variable and column variable are dependent

• The \chi^2 statistic has (r − 1)(c − 1) degrees of freedom

• If \chi^2 > \chi^2_{(r-1)(c-1), \alpha}, then p < \alpha
  – In this case, we reject the null H0 (accept Ha)

• The p-value is the right-tail probability


Acceptance or rejection is based on null hypothesis

If p > α (= 0.05), accept H0 (the null hypothesis holds).
If p < α (= 0.05), reject H0 (the alternative hypothesis Ha holds).


Proportion test between two groups

• Derive point/interval estimates for P1 − P2

  E[\hat{P}_1 - \hat{P}_2] = P_1 - P_2                       Null: P_1 - P_2 = 0
                                                             Alternative: P_1 - P_2 ≠ 0
  \sigma_{\hat{P}_1 - \hat{P}_2} = \sqrt{ \frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2} }

• Example (see the sketch below):
  – Investigate and compare policy support between urban and rural regions: 5000 samples from cities with 2400 supporting votes; 2000 samples from the countryside with 1200 supporting votes
  – A 95% confidence interval for the difference in support between cities and the countryside is (−0.141, −0.099)
  – Because the confidence interval does not contain 0, the support rates are significantly different (the urban rate is lower than the rural rate)
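A sketch reproducing the example; note that the slide's quoted interval appears to correspond to z ≈ 1.645 rather than the usual two-sided 1.96:

```python
import numpy as np
from scipy.stats import norm

p1, n1 = 2400 / 5000, 5000   # urban support rate 0.48
p2, n2 = 1200 / 2000, 2000   # rural support rate 0.60

diff = p1 - p2               # -0.12
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = norm.ppf(0.975)          # 1.96 for a two-sided 95% interval
print(diff - z * se, diff + z * se)   # about (-0.146, -0.094)
# The slide's interval (-0.141, -0.099) matches z ~ 1.645 instead
```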


One-group t-test for male height

H0: \mu = 170
Ha: \mu ≠ 170 (two-sided; one-sided versions use Ha: \mu > 170 or Ha: \mu < 170)


Mean difference between two groups

• X1 and X2 come from two independent groups with sample means \bar{X}_1, \bar{X}_2 and sample variances S_1^2, S_2^2

• Sampling distribution of \bar{X}_1 - \bar{X}_2:

  E[\bar{X}_1 - \bar{X}_2] = \mu_1 - \mu_2 = \mu_{\bar{X}_1 - \bar{X}_2}              Null: \mu_1 - \mu_2 = 0
  Var[\bar{X}_1 - \bar{X}_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = \sigma^2_{\bar{X}_1 - \bar{X}_2}    Alternative: \mu_1 - \mu_2 ≠ 0


Sample distribution for μ1-μ2
• Statistical measures:

  z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}        t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}

• Confidence interval: (\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2} \, \sigma_{\bar{X}_1 - \bar{X}_2}

• z or t distribution: if the standard deviation of the population is known, use the z distribution; otherwise, use the t distribution.

• In practice, when the number of samples exceeds 30, the t result is close to the z result; thus, the t distribution is commonly used in a mean-difference test.
Example for mean-difference test

• Comparing male waists to female waists: are they different?
  – H0: their average waists are equal (null hypothesis)
    (\mu_1 - \mu_2 = 0, i.e., \mu_1 = \mu_2)
  – Ha: their average waists are different (alternative hypothesis)
    (\mu_1 - \mu_2 ≠ 0, i.e., \mu_1 ≠ \mu_2)

• Paired-sample t-test: the samples come from the same group, such as testing whether patients have improved or not after taking a particular treatment (see the sketch below)
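A minimal scipy.stats sketch with hypothetical waist data, covering both the independent and the paired designs:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)
male = rng.normal(85, 8, size=40)     # hypothetical waist measurements (cm)
female = rng.normal(78, 7, size=40)

t, p = ttest_ind(male, female, equal_var=True)   # set equal_var per the F-test (next slide)
print(t, p)                                      # reject H0 when p < alpha

before = rng.normal(90, 5, size=30)              # paired design: same patients
after = before - rng.normal(2, 1, size=30)       # hypothetical improvement after treatment
t2, p2 = ttest_rel(before, after)
```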
Notes
• Prior to conducting a mean-difference test, a variance-ratio test should be run first, because the t-test requires knowing whether the variances of the two groups are equal

• Hypotheses:
  – Null H0: \sigma_1^2 = \sigma_2^2 (\sigma_1^2 / \sigma_2^2 = 1)
  – Alternative Ha: \sigma_1^2 ≠ \sigma_2^2

• F distribution: F = \frac{\chi_1^2 / df_1}{\chi_2^2 / df_2}

  The F value is a ratio of two chi-square values (with degrees of freedom df_1 and df_2, respectively)
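A minimal sketch of the variance-ratio test, assuming a two-sided alternative; the helper name is hypothetical:

```python
import numpy as np
from scipy.stats import f

def variance_ratio_test(x1, x2):
    s1, s2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    F = s1 / s2
    df1, df2 = len(x1) - 1, len(x2) - 1
    p = 2 * min(f.cdf(F, df1, df2), f.sf(F, df1, df2))   # two-sided p-value
    return F, p
```

Applied to the male/female samples above, its p-value decides the equal_var flag of ttest_ind.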
ANOVA (Analysis of Variance)

• ANOVA is a typical application of the F-test

• Hypotheses for three groups:
  – Null: \mu_1 = \mu_2 = \mu_3
  – Alternative: \mu_i ≠ \mu_j (for some pair)

• Three assumptions:
  – Y_i ~ N(\mu_i, \sigma^2), i = 1, 2, ..., k (normality)
  – \sigma_i^2 = \sigma^2 (homogeneity)
  – Each group is independent (independence)


One Way ANOVA
• Total variance (assuming k groups):

  SST = SSF + SSE

• Factorial variance (variance between groups):

  SSF = \sum_{i=1}^{k} n_i (\bar{Y}_i - \bar{Y})^2

• Random variance (variance within groups):

  SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 = \sum_{i=1}^{k} (n_i - 1) S_i^2


ANOVA can handle multiple groups
• Mean factor error (MSF or MSB): MSF = \frac{SSF}{k - 1}

• Mean random error (MSE or MSW): MSE = \frac{SSE}{\sum n_i - k}

• F value: F = \frac{MSF}{MSE} \sim F_{k-1, \, \sum n_i - k}

• Decision making:
  – Accept H0 if F ≤ F_{k-1, \sum n_i - k, \alpha} (the factor is insignificant)
  – Reject H0 if F > F_{k-1, \sum n_i - k, \alpha} (the factor is significant)
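A minimal scipy.stats sketch with three hypothetical groups:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
g1 = rng.normal(10.0, 2, size=20)   # hypothetical measurements from three groups
g2 = rng.normal(10.5, 2, size=20)
g3 = rng.normal(13.0, 2, size=20)

F, p = f_oneway(g1, g2, g3)         # F = MSF / MSE with df (k-1, sum(n_i) - k)
print(F, p)                         # reject mu1 = mu2 = mu3 when p < alpha
```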


F distribution: acceptance and rejection regions

Recall that the F statistic is a ratio of two chi-square values, each divided by its degrees of freedom; the rejection region lies in the right tail.

[Figure: F distribution with acceptance region and right-tail rejection region]


Two-way ANOVA (two factors)

SST = SSA + SSB + SSAB + SSE

• SST: sum of total squared errors
• SSA: sum of squared errors caused by factor A
• SSB: sum of squared errors caused by factor B
• SSAB: sum of squared errors caused by the interaction between factors A and B
• SSE: sum of squared random errors (cannot be avoided or explained)

A statsmodels sketch follows.
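A minimal statsmodels sketch on hypothetical machine-by-material data; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "machine": np.repeat(["M1", "M2"], 9),
    "material": np.tile(np.repeat(["A", "B", "C"], 3), 2),
})
df["productivity"] = rng.normal(50, 3, size=len(df))   # hypothetical response

model = smf.ols("productivity ~ C(machine) * C(material)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # rows correspond to SSA, SSB, SSAB, SSE
```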


Two-way ANOVA without interaction

[Figure: mean productivity of Machine 1 and Machine 2 across materials A, B, and C; the two lines are roughly parallel, indicating no interaction]


Two-way ANOVA with interaction

[Figure: mean productivity of Machine 1 and Machine 2 across materials A, B, and C; the two lines cross, indicating an interaction]


Univariate control chart

Multivariate control chart


Simple Moving Average

• The forecast is the average of a fixed number of past periods
• Useful when demand is not growing or declining rapidly and no seasonality is present
• Removes some of the random fluctuation from the data
• Selecting the period length is important
  – Longer periods provide more smoothing
  – Shorter periods react to variations more quickly


Simple Moving Average Formula

  F_t = \frac{A_{t-1} + A_{t-2} + \cdots + A_{t-n}}{n}

where F_t is the forecast for period t, A_{t-i} are the n most recent actual values, and n is the number of periods averaged.
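A minimal pandas sketch with a hypothetical demand series:

```python
import pandas as pd

demand = pd.Series([42, 40, 43, 40, 41, 39, 46, 44, 45, 38])  # hypothetical demand
n = 3
forecast = demand.rolling(window=n).mean().shift(1)  # F_t = mean of the previous n actuals
print(forecast)
```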


Exponential Smoothing
• A weighted-average method that includes all past data in the forecasting calculation
• More recent results are weighted more heavily
• The most used of all forecasting techniques
• Selecting the smoothing constant is important
  – Smaller smoothing constants emphasize long-term trends
  – Bigger smoothing constants react to variations more quickly
• Well accepted for the following reasons:
  – Exponential models are surprisingly accurate
  – Formulating an exponential model is relatively easy
  – The user can understand how the model works
  – Little computation is required to use the model
  – Computer storage requirements are small


Exponential Smoothing Model

  F_t = \alpha A_{t-1} + (1 - \alpha) F_{t-1}
      = \alpha A_{t-1} + (1 - \alpha)[\alpha A_{t-2} + (1 - \alpha) F_{t-2}] = \cdots

\alpha lies between zero and one and is typically set around 0.3 to avoid overfitting.
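A minimal sketch of the recursion, seeding the first forecast with the first actual (an assumption; other initializations are common):

```python
def exp_smooth(actuals, alpha=0.3):
    """F_t = alpha * A_{t-1} + (1 - alpha) * F_{t-1}, seeded with F_1 = A_1."""
    forecasts = [actuals[0]]
    for t in range(1, len(actuals)):
        forecasts.append(alpha * actuals[t - 1] + (1 - alpha) * forecasts[t - 1])
    return forecasts

print(exp_smooth([42, 40, 43, 40, 41], alpha=0.3))   # hypothetical demand series
```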


Holt-Winters moving average

  L_t = \alpha (y_t - S_{t-s}) + (1 - \alpha)(L_{t-1} + b_{t-1})

  b_t = \beta (L_t - L_{t-1}) + (1 - \beta) b_{t-1}

  S_t = \gamma (y_t - L_t) + (1 - \gamma) S_{t-s}

  F_{t+k} = L_t + k b_t + S_{t+k-s}

The forecast consists of level (L), trend (b), and seasonal (S) components, where s is the season length.
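A minimal sketch of the additive recursions. The initializations (first-season mean for the level, zero trend, within-season deviations for the seasonals) are assumptions for illustration, not part of the slide:

```python
def holt_winters_additive(y, s, alpha=0.3, beta=0.1, gamma=0.1, horizon=4):
    """Additive Holt-Winters; s is the season length, horizon <= s here."""
    L = sum(y[:s]) / s                    # initial level: mean of the first season
    b = 0.0                               # initial trend
    S = [y[i] - L for i in range(s)]      # initial seasonal deviations
    for t in range(s, len(y)):
        L_prev = L
        L = alpha * (y[t] - S[t - s]) + (1 - alpha) * (L + b)
        b = beta * (L - L_prev) + (1 - beta) * b
        S.append(gamma * (y[t] - L) + (1 - gamma) * S[t - s])
    # F_{t+k} = L_t + k * b_t + S_{t+k-s}, forecasting from the last observation
    T = len(y)
    return [L + k * b + S[T - 1 + k - s] for k in range(1, horizon + 1)]

demand = [10, 14, 8, 12, 11, 15, 9, 13, 12, 16, 10, 14]  # hypothetical, season s = 4
print(holt_winters_additive(demand, s=4))
```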
