Statistics For Data Science by Mihir Patnaik
Accredited by IABA C
• Estimation
– e.g., Estimate the population mean weight using the
sample mean weight
• Hypothesis testing
– e.g., Test the claim that the population mean weight
is 70 kg
Inference is the process of drawing conclusions or making decisions about a population
based on sample results
Vantharr. All Rights Reserved | www.vantharr.com STATISTICS ESSENTIALS
Definitions of Basic Terms
Population: A collection, or set, of individuals or objects or events whose properties are to
be analyzed.
Data (singular): The value of the variable associated with one element of a population or
sample. This value may be a number, a word, or a symbol.
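Random Variable: A variable whose value is determined by the outcome of a random experiment.
(In programming, by contrast, a variable is a placeholder that can store anything: a number, a string, or a sentence.)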
Steps Involved In Descriptive Statistics
Survey: Data are obtained by sampling some of the population of interest. The
investigator does not modify the environment.
Census: A 100% survey. Every element of the population is listed. Seldom used: difficult
and time-consuming to compile, and expensive.
Probability Samples: Samples in which the elements to be selected are drawn on the basis of
probability. Each element in a population has a certain probability of being selected as part of
the sample.
Types Of Sampling
2. Stratified sampling: divide the population into groups called strata and then take a sample
from each stratum.
3. Cluster sampling: divide the population into clusters and then randomly select some of the
clusters. All the members from these clusters are in the cluster sample.
4. Systematic sampling: randomly select a starting point and take every n-th piece of data from
a listing of the population.
5. Multistage random sampling: divide the population into clusters and select some clusters at the first
stage. At each subsequent stage, you further divide up those selected clusters into smaller
clusters, and repeat the process until you get the desired sample size.
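The sampling schemes listed above can be sketched in a few lines of Python. This is only an illustration: the population, the `group` label, and the function names are hypothetical and were chosen for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 units, each tagged with a group label.
population = [{"id": i, "group": "A" if i < 60 else "B"} for i in range(100)]

def stratified_sample(units, key, per_stratum):
    """Split units into strata by `key`, then sample from every stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    return [u for members in strata.values()
            for u in random.sample(members, per_stratum)]

def cluster_sample(units, cluster_size, n_clusters):
    """Split units into clusters, pick some clusters, keep all their members."""
    clusters = [units[i:i + cluster_size]
                for i in range(0, len(units), cluster_size)]
    chosen = random.sample(clusters, n_clusters)
    return [u for cluster in chosen for u in cluster]

def systematic_sample(units, step):
    """Pick a random starting point, then take every `step`-th unit."""
    start = random.randrange(step)
    return units[start::step]

strat = stratified_sample(population, "group", per_stratum=5)
clust = cluster_sample(population, cluster_size=10, n_clusters=2)
syst = systematic_sample(population, step=10)
print(len(strat), len(clust), len(syst))
```

Note the difference the code makes visible: stratified sampling takes a few units from every group, while cluster sampling takes every unit from a few groups.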
Figure 1.2
A demonstration of sampling error. Two samples are
selected from the same population. Notice that the sample
statistics are different from one sample to another, and all
of the sample statistics are different from the corresponding
population parameters. The natural differences that exist,
by chance, between a sample statistic and a population
parameter are called sampling error.
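Sampling error can be seen in a small simulation. The sketch below assumes a hypothetical population of weights (mean around 70 kg) and draws two samples of 25; none of the numbers come from the figure.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 weights in kg.
population = [random.gauss(70, 10) for _ in range(10_000)]
mu = statistics.fmean(population)  # population parameter

# Two samples drawn from the same population.
sample_1 = random.sample(population, 25)
sample_2 = random.sample(population, 25)
mean_1 = statistics.fmean(sample_1)  # sample statistic
mean_2 = statistics.fmean(sample_2)

# Sampling error: sample statistic minus population parameter.
error_1 = mean_1 - mu
error_2 = mean_2 - mu
print(f"population mean = {mu:.2f}")
print(f"sample 1 mean = {mean_1:.2f}, sampling error = {error_1:+.2f}")
print(f"sample 2 mean = {mean_2:.2f}, sampling error = {error_2:+.2f}")
```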
Measures of Central Tendency
• Mean
• Median
• Mode
Range
© 2018 DataMites. All Rights Reserved | www.datamites.com STATISTICS ESSENTIALS
Standard Deviation (σ)
Standard Deviation (SD) is a measure that is used to quantify the amount of
variation or dispersion of a set of data values.
Variance (σ²)
Variance is the average squared difference of the values from the mean. Unlike
the range, the variance includes all values in the calculation by comparing
each value to the mean.
Percentile
• A percentile (or a centile) is a measure used in statistics
indicating the value below which a given percentage of
observations in a group of observations fall.
Python Implementation
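The original code for this slide is not recoverable from the text. As a stand-in, here is a minimal sketch using Python's standard `statistics` module on a small hypothetical data set. It uses the population formulas (dividing by n), which match the "average squared difference" definition of variance given earlier.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Measures of central tendency
mean = statistics.mean(data)      # 5
median = statistics.median(data)  # 4.5
mode = statistics.mode(data)      # 4

# Measures of variability
data_range = max(data) - min(data)     # 7
variance = statistics.pvariance(data)  # 4 (population variance)
std_dev = statistics.pstdev(data)      # 2.0 (square root of the variance)

# Percentiles: the 25th, 50th and 75th percentiles (quartiles).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(mean, median, mode, data_range, variance, std_dev)
print(q1, q2, q3)
```

For sample data, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n.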
Distributions
A distribution shows how often each value (or range of values) occurs among all observations; it is usually represented graphically, for example as a histogram or a curve.
Types Of Distributions
Normal Distribution
Normal distribution, also known as the Gaussian distribution, is a probability distribution that
is symmetric about the mean, showing that data near the mean are more frequent in
occurrence than data far from the mean.
Type of Distribution
Properties Of Normal Distribution
1. Empirical Rule
2. Distortion in Normal Distribution
3. Central Limit Theorem
4. Standard Normal Distribution
5. Outliers
6. QQ plot
7. Log, Sqrt, Box-Cox transformation
Empirical Rule
Distortion in Normal Distribution
The distortion in normally distributed curves can be
quantified in 2 ways
1. Skewness
2. Kurtosis
Skewness in Normal Distribution
Skewness is asymmetry in a statistical distribution, in which the curve appears
distorted or skewed either to the left or to the right. Skewness can be
quantified to define the extent to which a distribution differs from a
normal distribution.
If the skewness is greater than 1 or less than -1, the data is highly skewed.
Kurtosis in Normal Distribution
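In probability theory and statistics, kurtosis is a measure of the "peakedness"
of the probability distribution of a real-valued random variable. More precisely,
it reflects how heavy the tails of the distribution are compared with a normal
distribution.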
How much Skewness and Kurtosis
● If the skewness is between -0.5 and 0.5, the data is fairly symmetrical.
● If the skewness is between -1 and -0.5 or between 0.5 and 1, the data is
moderately skewed.
● If the skewness is greater than 1 or less than -1, the data is highly skewed.
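The thresholds above can be applied with a short sketch. It assumes the moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3); the function names and data are hypothetical, and libraries may use slightly different bias corrections.

```python
import statistics

def central_moment(values, k):
    """k-th central moment: average of (x - mean)**k."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness; 0 for a perfectly symmetric data set."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

def describe_skew(g1):
    """Apply the rule-of-thumb thresholds from the slide."""
    size = abs(g1)
    if size <= 0.5:
        return "fairly symmetrical"
    if size <= 1:
        return "moderately skewed"
    return "highly skewed"

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]  # one large value pulls the tail to the right

for name, values in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    g1 = skewness(values)
    print(name, round(g1, 3), describe_skew(g1))
```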
Which is Best—the Mean, Median, or Mode?
● When you have a symmetrical distribution for continuous data, the mean,
median, and mode are equal. In this case, analysts tend to use the mean
because it includes all of the data in the calculations. However, if you have
a skewed distribution, the median is often the best measure of central
tendency.
● When you have categorical or discrete data, the median or mode is usually
the best choice. For categorical data, you have to use the mode.
Central Limit Theorem
The central limit theorem states that the distribution of sample means approximates a normal
distribution as the sample size gets larger (assuming that all samples are identical in size),
regardless of population distribution shape.
CLT in one sentence: "Even if I'm not normal, the average is normal."
As a rule of thumb, when collecting sample means from any distribution, each sample used for
calculating a mean should contain at least 30 observations.
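A small simulation illustrates the theorem. The sketch assumes an exponential population (strongly right-skewed, mean 1, standard deviation 1), which is a choice made for this example only. The means of many samples of size 30 cluster around the population mean with a spread close to σ/√n.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30    # observations per sample (the rule-of-thumb minimum)
N_SAMPLES = 2000    # how many sample means to collect

def sample_mean(n):
    """Mean of n draws from an exponential population with mean 1."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

means = [sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]

center = statistics.fmean(means)   # close to the population mean, 1
spread = statistics.stdev(means)   # close to sigma / sqrt(n)
expected_spread = 1 / SAMPLE_SIZE ** 0.5

print(f"mean of sample means  = {center:.3f}")
print(f"stdev of sample means = {spread:.3f} (theory: {expected_spread:.3f})")
```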
Correcting The Distortion In Normal Distribution
Transformation is nothing but taking a mathematical function and applying it
to the data.
1. Log Transformation: each data point x is replaced with log(x) to bring the data
closer to a normal distribution (ND). This requires all values to be positive.
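The effect of a log transformation can be checked on a small right-skewed data set. The income values below are hypothetical, and the skewness helper uses the moment-based formula as an assumption.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / len(values)
    m3 = sum((x - mean) ** 3 for x in values) / len(values)
    return m3 / m2 ** 1.5

# Hypothetical incomes: a few very large values create a long right tail.
incomes = [20, 25, 30, 35, 40, 60, 90, 150, 400, 1000]

# Log transformation: replace every data point x with log(x).
logged = [math.log(x) for x in incomes]

skew_before = skewness(incomes)
skew_after = skewness(logged)
print(f"skewness before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

The transformed data is still somewhat skewed in this example; the transformation reduces the distortion rather than guaranteeing a normal shape.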
Outliers
An outlier is an observation point that is distant from other
observations. An outlier may be due to variability in the
measurement or it may indicate experimental error; the latter are
sometimes excluded from the data set.
https://tribe.datamites.com/posts/outliers
Probability Density Function
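Probability density is the relationship between observations and their
probability. For a continuous variable, the probability of a range of values is the area under the PDF over that range.
CDF: It provides a shortcut for calculating many probabilities at once. We integrate the PDF to get the
cumulative probability, i.e. the probability that the variable is less than or equal to a given value.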
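The link between the PDF and the CDF can be verified numerically. This sketch uses `statistics.NormalDist` from the standard library with a standard normal distribution, and approximates the integral with a simple midpoint rule; the step size and the lower bound of -8 are arbitrary choices for the example.

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)

# PDF: height of the density curve at a point (not itself a probability).
density_at_0 = dist.pdf(0)

# CDF: probability that the variable is <= a given value.
p_below_1 = dist.cdf(1)

# Integrating the PDF from far in the left tail up to 1 approximates cdf(1).
step = 0.001
n_steps = int((1 - (-8)) / step)
approx = sum(dist.pdf(-8 + (i + 0.5) * step) * step for i in range(n_steps))

# Probabilities of ranges are differences of CDF values.
p_within_1_sd = dist.cdf(1) - dist.cdf(-1)   # about 0.68
p_within_2_sd = dist.cdf(2) - dist.cdf(-2)   # about 0.95

print(f"pdf(0) = {density_at_0:.4f}")
print(f"cdf(1) = {p_below_1:.4f}, integral of pdf = {approx:.4f}")
print(f"P(-1 <= Z <= 1) = {p_within_1_sd:.4f}")
print(f"P(-2 <= Z <= 2) = {p_within_2_sd:.4f}")
```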
Standard Normal Distribution
The standard normal distribution is a special case of the normal distribution. It is the distribution that
occurs when a normal random variable has a mean of zero and a standard deviation of one.
Z-Score / Z-Value / Standard Score
• z = (X - μ) / σ
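As a worked example of the formula, the sketch below converts a hypothetical weight of 85 kg to a z-score, assuming a population mean of 70 kg and a standard deviation of 10 kg (illustrative values only), and then looks up the cumulative probability that a Z-table would give.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical values: mean weight 70 kg, standard deviation 10 kg.
z = z_score(85, mu=70, sigma=10)     # 1.5

# Equivalent of a Z-table lookup: P(Z <= z) for the standard normal.
p_below = NormalDist().cdf(z)        # about 0.9332

print(f"z = {z}, P(Z <= z) = {p_below:.4f}")
```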
Calculating Z-Score
Z-Table
Outliers
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
Euclidean Distance
Manhattan Distance
Minkowski Distance
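The three distance metrics are closely related: the Minkowski distance with p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance. A minimal sketch, with hypothetical function names and points:

```python
def minkowski(a, b, p):
    """Minkowski distance: (sum of |a_i - b_i|**p) ** (1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)

a = (0, 0)
b = (3, 4)
print(manhattan(a, b))   # 7.0
print(euclidean(a, b))   # 5.0
```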
Use of Distance Metrics
• Outlier Detection
• Clustering Analysis
• Feature Selection
• Data Cleaning
• Dimensionality Reduction
Covariance
• It measures the relationship between a pair of random variables: whether a change in one variable is
accompanied by a change in the other. It does not by itself show that one change causes the other.
• It can take any value between -∞ and +∞, where a negative value represents a
negative relationship and a positive value represents a positive relationship.
Correlation
• It is the scaled version of covariance.
• Correlation is a step ahead of covariance as it quantifies the strength of the relationship between two
random variables. In simple terms, it is a unit-free measure, always between -1 and +1, of how these
variables change with respect to each other (normalized covariance value).
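The difference between the two measures shows up when the units change. The sketch below computes sample covariance and Pearson correlation by hand on hypothetical data; rescaling one variable changes the covariance but leaves the correlation untouched.

```python
import statistics

def covariance(xs, ys):
    """Sample covariance: average product of deviations (divides by n - 1)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

cov = covariance(hours, scores)      # 11.25, depends on the units
r = correlation(hours, scores)       # unit-free, between -1 and +1

# Express the scores on a scale ten times larger.
scaled = [s * 10 for s in scores]
cov_scaled = covariance(hours, scaled)
r_scaled = correlation(hours, scaled)

print(f"covariance: {cov:.2f} -> {cov_scaled:.2f}")
print(f"correlation: {r:.4f} -> {r_scaled:.4f}")
```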
Based on the correlation coefficient, we can select the
important features:
• If the independent feature is not correlated with the target variable, we could
remove that feature.
• Or if two independent features are highly correlated, we could remove any one
feature.
Hypothesis Testing &
Other computational Techniques
Module - 4
Hypothesis Testing
• Hypothesis is a statement, assumption or claim about the value of the parameter (mean,
variance, median etc.).
• A hypothesis is an educated guess about something in the world around you. It should
be testable, either by experiment or observation.
Ex: Suppose we make the statement "Dhoni is the best Indian captain ever." This is an assumption
that we are making based on the average wins and losses the team had under his captaincy. We can
test this statement based on all the match data.
Comparing And Analyzing The Relationships
• Does the treatment with new drug help more patients than the standard
treatment with old drug?
Types of Hypothesis
• When a hypothesis specifies an exact value of a parameter, it is a simple
hypothesis. For example, a motorcycle company claiming that a certain model
gives an average mileage of 100 km per litre is a case of a simple
hypothesis.
Null Hypothesis (H0)
• The null hypothesis is the hypothesis to be tested for possible
rejection under the assumption that it is true.
Alternate Hypothesis (HA)/(H1)
• The alternative hypothesis complements the Null hypothesis.
• It is the opposite of the null hypothesis, such that the alternate and null
hypotheses together cover all the possible values of the population
parameter.
Hypothesis Testing – Case Discussion
• Consider a court of law; the null hypothesis is that the defendant is innocent
• When we collect evidence and try to reject null hypothesis, there are 2 errors
that could potentially occur: Type 1 and Type 2 errors.
Type I and Type II Error
Type I error: the null hypothesis is true, but we reject it.
Type II error: the null hypothesis is false, but we fail to reject it.
Critical Region
Three Cases Of Critical Region Arise
Process of Hypothesis Testing
1. Set up the null hypothesis and the alternate hypothesis.
2. Decide the level of significance (1% or 5%).
3. Select the test as per requirement.
4. Calculate the p-value.
5. If the p-value is less than the level of significance, reject the null hypothesis.
6. If the p-value is more than the level of significance, fail to reject the null
hypothesis (this does not prove that the null hypothesis is true).
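The steps above can be sketched with a two-sided one-sample z-test on the earlier claim that the population mean weight is 70 kg. The sample values, the assumed known population standard deviation of 4 kg, and the function name are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test with a known population standard deviation.

    Returns the z statistic, the p-value, and whether H0 is rejected.
    """
    z = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Step 1: H0: mean weight = 70 kg, H1: mean weight != 70 kg.
# Step 2: level of significance = 5%.
# Step 3: z-test (sigma is assumed known here, purely for illustration).
weights = [72, 75, 71, 78, 74, 73, 77, 76, 70, 79, 75, 74, 73, 76, 72, 78]

# Step 4: calculate the p-value.
z, p_value, reject = z_test(weights, mu0=70, sigma=4, alpha=0.05)

# Steps 5 and 6: compare the p-value with the level of significance.
decision = "reject H0" if reject else "fail to reject H0"
print(f"z = {z:.4f}, p-value = {p_value:.6f}: {decision}")
```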
Applications of Hypothesis Testing
• Feature Selection
• Model Evaluation & Selection
• Hyperparameter Tuning
• Anomaly Detection
Types of Hypothesis Tests
Hypothesis Tests: Parametric
1. Z test
2. T/Student’s T test
3. Paired t Test
4. One Way ANOVA
Hypothesis Tests: Non-Parametric
1. Chi Square Test
2. Mann-Whitney Test
3. Wilcoxon Signed-Rank Test
4. Kruskal-Wallis Test
5. Friedman’s ANOVA
T-test
• A t-test compares population means using sample data. A t-test with two
samples is commonly used with small sample sizes, testing the difference
between the sample means when the variances of the two normal
distributions are not known.
• This helps in finding the association between categorical and
continuous features.
X̄ -> Mean of the sample
μ -> Mean of the population
S -> Standard deviation of the sample
N -> Sample size
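The formula itself did not survive on this slide. Given the symbols listed, it is presumably the one-sample t-statistic, t = (X̄ - μ) / (S / √N); the sketch below assumes that form and uses hypothetical data.

```python
from math import sqrt
from statistics import fmean, stdev

def t_statistic(sample, mu):
    """One-sample t-statistic: (sample mean - mu) / (S / sqrt(N)).

    S is the sample standard deviation (divides by N - 1).
    """
    x_bar = fmean(sample)
    s = stdev(sample)
    n = len(sample)
    return (x_bar - mu) / (s / sqrt(n))

# Hypothetical sample, tested against a claimed population mean of 3.
sample = [2, 4, 4, 6]
t = t_statistic(sample, mu=3)
degrees_of_freedom = len(sample) - 1

print(f"t = {t:.4f} with {degrees_of_freedom} degrees of freedom")
```

The p-value is then read from a t-table for N - 1 degrees of freedom, or computed with a statistics library such as SciPy (`scipy.stats.ttest_1samp`).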
The T-Test helps to determine whether there is a
statistically significant difference in the average
salaries between these two groups.
H0: There is no significant difference between two
groups.
Female Group:
Mean=10,70,000 SD=3,95,995.81
Male Group:
Mean=9,02,000 SD=7,52,807.45
T-Statistic = 4.24
P-Value = 0.001 (from the t-table)
Since the p-value is below the 5% level of significance, H0 is rejected.
Paired T-test
One-Way ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are
any statistically significant differences between the means of three or more
independent (unrelated) groups.
Within-Group Variance - how far the values within a group are from their group mean (small when values within the group are close to each other)
Between-Group Variance - how far the group means are from the overall mean (large when values between groups are not close to each other)
• If the within-group variation is low and the between-group variation is high, then it
means this feature impacts the target variable. Hence it is an important feature.
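The comparison of the two kinds of variation is summarized by the F-statistic: between-group mean square divided by within-group mean square. The sketch below computes it by hand for hypothetical groups; the function name is an assumption.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F-statistic for one-way ANOVA.

    F = (SS_between / (k - 1)) / (SS_within / (n - k)),
    with k groups and n observations in total.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)

    # Between-group variation: group means versus the overall mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: values versus their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Groups that are tight internally and far apart: large F.
separated = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Groups with identical means: F is 0.
overlapping = [[1, 5, 9], [2, 5, 8], [3, 5, 7]]

f_separated = one_way_anova_f(separated)      # 27.0
f_overlapping = one_way_anova_f(overlapping)  # 0.0
print(f_separated, f_overlapping)
```

A large F suggests the grouping feature matters; a p-value for it can be obtained from an F-table or from a library such as SciPy (`scipy.stats.f_oneway`).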
Chi-Square Test
The Chi-square test is used for categorical features in a dataset. We
calculate Chi-square between each feature and the target and select
the desired number of features with the best Chi-square scores. It
determines if the association between two categorical variables of the
sample would reflect their real association in the population. The Chi-square
score is given by χ² = Σ (O - E)² / E, where O is the observed frequency and
E is the expected frequency.
H0: There is no association between two categorical features
Contingency Table: Observed Frequency
Python Implementation
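The original code for this slide is not recoverable from the text. As a stand-in, here is a minimal sketch of the Chi-square test of association computed by hand from a contingency table of observed frequencies. The table values and the function name are hypothetical; 3.841 is the standard 5% critical value for 1 degree of freedom.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table.

    Expected frequency of a cell = row total * column total / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical 2x2 table: rows are feature categories, columns are target classes.
observed = [[30, 10],
            [10, 30]]

statistic, dof = chi_square(observed)   # 20.0 with 1 degree of freedom

CRITICAL_5_PERCENT_DOF_1 = 3.841        # from a Chi-square table
reject_h0 = statistic > CRITICAL_5_PERCENT_DOF_1
print(f"chi-square = {statistic}, dof = {dof}, reject H0: {reject_h0}")
```

Libraries such as SciPy provide the same test with a p-value (`scipy.stats.chi2_contingency`), which by default applies a continuity correction to 2x2 tables and so can report a slightly smaller statistic.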
Thank You.