
Statistics For Data Science by Mihir Patnaik

This presentation is a deep dive into statistics for data science.


STATISTICS ESSENTIALS

Accredited by IABAC

Vantharr. All Rights Reserved | www.vantharr.com STATISTICS ESSENTIALS 1


Overview of Statistics
Module - 1



Statistics
Statistics is the science of collecting, describing, and interpreting data.

Two areas of Statistics:


Descriptive statistics – Methods of organizing, summarizing,
and presenting data in an informative way
Inferential statistics – The methods used to determine
something about a population on the basis of a sample

Descriptive Statistics
Descriptive statistics are methods for organizing and summarizing data.
For example, tables or graphs are used to organize data, and descriptive values
such as the average score are used to summarize data.
A descriptive value for a population is called a parameter and a descriptive value
for a sample is called a statistic.
• Collect data, e.g., a survey
• Present data, e.g., tables and graphs
• Summarize data, e.g., sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Inferential Statistics
• Inferential statistics are methods for using sample data to make general conclusions
(inferences) about populations.
• Because a sample is typically only a part of the whole population, sample data provide
only limited information about the population. As a result, sample statistics are
generally imperfect representatives of the corresponding population parameters.

• Estimation
– e.g., Estimate the population mean weight using the
sample mean weight
• Hypothesis testing
– e.g., Test the claim that the population mean weight
is 70 kg
Inference is the process of drawing conclusions or making decisions about a population
based on sample results.
Definitions of Basic Terms
Population: A collection, or set, of individuals or objects or events whose properties are to
be analyzed.

Two kinds of populations: finite or infinite.

Sample: A subset of the population.

Variable: A characteristic about each individual element of a population or sample.

Data (singular): The value of the variable associated with one element of a population or
sample. This value may be a number, a word, or a symbol.

Definitions of Basic Terms
Data (plural): The set of values collected for the variable from each of the elements
belonging to the sample.

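Random Variable: A variable whose value is determined by the outcome of a random experiment,
e.g., the number shown when a die is rolled. (In programming, by contrast, a variable is a
placeholder that can store anything: a number, a string, or a sentence.)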

Experiment: A planned activity whose results yield a set of data.

Parameter: A numerical value summarizing all the data of an entire population.

Statistic: A numerical value summarizing the sample data.

Examples

Types of Data

Examples of variables
Example: Identify each of the following as examples of qualitative or numerical
variables:
1. The temperature in Barrow, Alaska at 12:00 pm on any
given day.
2. The model of automobile.
3. Whether or not a 6 volt lantern battery is defective.
4. The weight of a lead pencil.
5. The length of time billed for a long distance telephone call.
6. The brand of cereal children eat for breakfast.
7. The type of book taken out of the library by an adult.

Examples of variables…
Example: Identify each of the following as examples of
(1) nominal, (2) ordinal, (3) discrete, or (4) continuous variables:

• The length of time until a pain reliever begins to work.


• The number of chocolate chips in a cookie.
• The number of colors used in a statistics textbook.
• The brand of refrigerator in a home.
• The overall satisfaction rating of a new car.
• The number of files on a computer’s hard disk.
• The pH level of the water in a swimming pool.
• The number of staples in a stapler.

Harnessing Data
Module - 2

Steps Involved In Descriptive Statistics

• Collecting the data

• Presenting the data: visualization using Matplotlib and Seaborn.

• Summarizing the data: covered in Module 3.

Making sense of the data

A sample drawn from a population should have the same characteristics as the population.

Sampling can be:


• with replacement: a member of the population may be chosen more than once
• without replacement: a member of the population may be chosen only
once (e.g., a lottery draw)
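
The two sampling modes above can be sketched with Python's standard `random` module. This is only an illustration: the population of ten numbered members and the fixed seed are assumptions made for the example.

```python
import random

random.seed(42)  # fixed seed (an assumption) so the sketch is repeatable
population = list(range(1, 11))  # hypothetical population: members numbered 1..10

# With replacement: the same member may be drawn more than once.
with_replacement = random.choices(population, k=5)

# Without replacement: each member can appear at most once (like a lottery draw).
without_replacement = random.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```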

Collecting The Data
Step 1: Define the objective or aim of the experiment,
e.g., estimate the average life of an electronic component.

Step 2: Define the variable and the population of interest,
e.g., usage, power rating, battery life.

Step 3: Define the data collection scheme and the data measuring scheme,
e.g., sampling procedure, sample size, data measuring device.

Step 4: Define the appropriate descriptive and inferential analysis
techniques.
Methods Used To Collect Data
Experiment: The investigator controls or modifies the environment and observes the effect on
the variable under study.

Survey: Data are obtained by sampling some of the population of interest. The
investigator does not modify the environment.

Census: A 100% survey. Every element of the population is listed. Seldom used: difficult
and time-consuming to compile, and expensive.

Judgment Samples: A non-probability sampling technique in which the sample members
are chosen only on the basis of the researcher's knowledge and judgment.

Probability Samples: Samples in which the elements to be selected are drawn on the basis of
probability. Each element in a population has a certain probability of being selected as part of
the sample.
Types Of Sampling

Probability Sampling
1. Simple random sampling: each sample of the same size has an equal chance of being selected.

2. Stratified sampling: divide the population into groups called strata and then take a random sample
from each stratum.

3. Cluster sampling: divide the population into groups called clusters and then randomly select some of the
clusters. All the members of the selected clusters are in the cluster sample.

4. Systematic sampling: randomly select a starting point and take every k-th piece of data from
a listing of the population.

5. Multistage random sampling: divide the population into clusters and select some clusters at the first
stage. At each subsequent stage, further divide the selected clusters into smaller
clusters, and repeat the process until you reach the desired sample size.
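
The first four schemes above can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the population of 100 units, the `region` label used as stratum/cluster, and all function names are hypothetical.

```python
import random

random.seed(0)  # fixed seed (an assumption) for repeatability
# Hypothetical population: 100 units, each tagged with one of four regions.
population = [{"id": i, "region": "NSEW"[i % 4]} for i in range(100)]

def simple_random(pop, n):
    """Every subset of size n is equally likely."""
    return random.sample(pop, n)

def stratified(pop, key, n_per_stratum):
    """Group units by `key`, then draw a random sample from every stratum."""
    strata = {}
    for unit in pop:
        strata.setdefault(unit[key], []).append(unit)
    return [u for members in strata.values()
            for u in random.sample(members, n_per_stratum)]

def cluster(pop, key, n_clusters):
    """Pick some clusters at random and keep all of their members."""
    labels = sorted({unit[key] for unit in pop})
    chosen = set(random.sample(labels, n_clusters))
    return [u for u in pop if u[key] in chosen]

def systematic(pop, k):
    """Random start among the first k units, then every k-th unit."""
    start = random.randrange(k)
    return pop[start::k]

print(len(simple_random(population, 10)), len(stratified(population, "region", 3)),
      len(cluster(population, "region", 2)), len(systematic(population, 10)))
```

Note the contrast between stratified and cluster sampling: the former samples from every group, the latter keeps whole groups but only some of them.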



Sampling Example

Example: An employer is interested in the time it takes each employee to commute
to work each morning. A random sample of 35 employees will be selected and their
commuting times will be recorded.

There are 2712 employees.

Each employee is numbered: 0001, 0002, 0003, etc. up to 2712.


Using four-digit random numbers, a sample is identified: 1315, 0987, 1125, etc.
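
As a rough sketch, the selection in this example could be reproduced in Python as follows. The seed is an arbitrary assumption, so the IDs drawn will differ from the 1315, 0987, 1125 shown above.

```python
import random

random.seed(2712)  # arbitrary seed (an assumption) so the draw is repeatable

employee_ids = range(1, 2713)                 # employees numbered 0001 .. 2712
sample_ids = random.sample(employee_ids, 35)  # 35 distinct employees, without replacement
labels = [f"{i:04d}" for i in sample_ids]     # format as four-digit numbers

print(labels[:5])
```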

Sampling Error
• The discrepancy between a sample statistic and
its population parameter is called sampling
error.
• Defining and measuring sampling error is a large
part of inferential statistics

Figure 1.2
A demonstration of sampling error. Two samples are
selected from the same population. Notice that the sample
statistics are different from one sample to another, and all
of the sample statistics are different from the corresponding
population parameters. The natural differences that exist,
by chance, between a sample statistic and a population
parameter are called sampling error.
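
Sampling error can be seen in a small simulation. The sketch below assumes a made-up population of 10,000 scores; the distribution, the sample size of 25, and the seed are all illustrative choices.

```python
import random
import statistics

random.seed(1)  # fixed seed (an assumption) for repeatability
# Hypothetical population of 10,000 scores.
population = [random.gauss(100, 15) for _ in range(10_000)]
mu = statistics.fmean(population)  # population parameter

def sampling_error(n):
    """Sample statistic minus population parameter for one random sample of size n."""
    sample = random.sample(population, n)
    return statistics.fmean(sample) - mu

# Two samples from the same population give two different errors.
errors = [sampling_error(25) for _ in range(2)]
print(errors)
```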



Exploratory Analysis
Module - 3

Measures of Central Tendencies



Central Tendency

Central tendency is the property of data being concentrated around a central value.



Measures of Central Tendency

• Mean
• Median
• Mode

Mean

The mean is the sum of all the values divided by the number of values. It is
sometimes called the arithmetic mean or the average.



Median

The statistical median is the middle number in an ordered
sequence of numbers. To find the median, arrange the
numbers in order of size; the number in the middle is
the median. If the count is even, the median is the average of the two middle numbers.

Mode

The mode is the number that occurs most often within a
set of numbers.
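
The three measures of central tendency can be computed with Python's standard `statistics` module. The scores below are made-up illustrative data.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]  # hypothetical data

mean = statistics.mean(scores)      # (2 + 3 + 3 + 5 + 7 + 10) / 6
median = statistics.median(scores)  # even count: average of the two middle values, 3 and 5
mode = statistics.mode(scores)      # most frequent value

# Adding one extreme value moves the mean a lot but the median only slightly.
with_outlier = scores + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(mean, median, mode, mean_with_outlier, median_with_outlier)
```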



Measure of Spread / Data Variability

Range

The range is the difference between the highest and lowest
values within a set of numbers.

Interquartile Range (IQR)

The interquartile range describes the middle half of the data.
Mathematically, IQR = Q3 − Q1: the width of the interval containing the
50% of data points that fall between Q1 and Q3.

© 2018 DataMites. All Rights Reserved | www.datamites.com STATISTICS ESSENTIALS
Standard Deviation (σ)
Standard deviation (SD) is a measure used to quantify the amount of
variation or dispersion of a set of data values. It is the square root of the variance.

Variance (σ²)
Variance is the average squared difference of the values from the mean:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. Unlike the range and the
interquartile range, the variance includes all values in the
calculation by comparing each value to the mean.

Percentile
• A percentile (or a centile) is a measure used in statistics
indicating the value below which a given percentage of
observations in a group of observations fall.

Python Implementation
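
A minimal sketch of how the measures of spread and percentiles from this module might be computed in Python. It uses only the standard-library `statistics` module; the data list is made up, and the "inclusive" quantile method is one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

data_range = max(data) - min(data)  # highest minus lowest value

# Quartiles Q1, Q2 (median), Q3, interpolating between data points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

variance = statistics.pvariance(data)  # population variance: mean squared deviation
std_dev = statistics.pstdev(data)      # population SD: square root of the variance

# 90th percentile: the value below which about 90% of the observations fall.
p90 = statistics.quantiles(data, n=100, method="inclusive")[89]

print(data_range, iqr, variance, std_dev, p90)
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` divide by n − 1 instead of n.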

Distributions
A distribution describes how often each value, or range of values, occurs among the observations. It is usually represented graphically, e.g., as a histogram.

Types Of Distributions

Normal Distribution
Normal distribution, also known as the Gaussian distribution, is a probability distribution that
is symmetric about the mean, showing that data near the mean are more frequent in
occurrence than data far from the mean.

Type of Distribution

Properties Of Normal Distribution
1. Empirical Rule
2. Distortion in Normal Distribution
3. Central Limit Theorem
4. Standard Normal Distribution
5. Outliers
6. Q-Q plot
7. Log, square-root, and Box-Cox transformations

Empirical Rule

• The empirical rule states that for a normal distribution, nearly all of the data will fall within
three standard deviations of the mean. The empirical rule can be broken down into three
parts:
• About 68% of the data fall within one standard deviation of the mean. (1 sigma)
• About 95% fall within two standard deviations. (2 sigma)
• About 99.7% fall within three standard deviations. (3 sigma)
• Any point lying beyond 3 sigma is commonly treated as an outlier.
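
The three percentages can be checked numerically with `statistics.NormalDist` from the Python standard library. The helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")
```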

Distortion in Normal Distribution
The distortion in normally distributed curves can be
quantified in 2 ways
1. Skewness
2. Kurtosis

Skewness in Normal Distribution
Skewness is asymmetry in a statistical distribution, in which the curve appears
distorted or skewed either to the left or to the right. Skewness can be
quantified to define the extent to which a distribution differs from a
normal distribution.
If the skewness is greater than 1 or less than -1, the data are highly
skewed.
Kurtosis in Normal Distribution

In probability theory and statistics, kurtosis is a measure of the “peakedness"
of the probability distribution of a real-valued random variable; more precisely,
it reflects how heavy the tails are relative to a normal distribution.

How much Skewness and Kurtosis

● If the skewness is between -0.5 and 0.5, the data are fairly symmetrical. If
the skewness is between -1 and -0.5 or between 0.5 and 1, the data are
moderately skewed.

● If the skewness is greater than 1 or less than -1, the data are highly skewed.

● A normal distribution has a kurtosis of 3 and is described as
mesokurtic. An increased kurtosis (>3, leptokurtic) can be visualized as a thin “bell”
with a high peak and heavier tails, whereas a decreased kurtosis (<3, platykurtic)
corresponds to a broadening of the peak and thinner tails.
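
A minimal sketch of computing skewness and kurtosis from their moment definitions and applying the thresholds above. It uses the population (uncorrected) formulas, which is an assumption; statistical packages often apply small-sample corrections and may report excess kurtosis (kurtosis minus 3) instead.

```python
import statistics

def skewness(values):
    """Third standardized moment (population form)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

def kurtosis(values):
    """Fourth standardized moment (population form); equals 3 for a normal distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - m) / s) ** 4 for x in values) / len(values)

def describe_skew(g):
    """Label a skewness value using the rule-of-thumb thresholds."""
    if abs(g) <= 0.5:
        return "fairly symmetrical"
    if abs(g) <= 1:
        return "moderately skewed"
    return "highly skewed"

symmetric = [1, 2, 3, 4, 5]       # hypothetical data
right_skewed = [1, 1, 1, 2, 10]   # hypothetical data with a long right tail

print(describe_skew(skewness(symmetric)), kurtosis(symmetric))
print(describe_skew(skewness(right_skewed)))
```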

Which is Best—the Mean, Median, or Mode?

● When you have a symmetrical distribution for continuous data, the mean,
median, and mode are equal. In this case, analysts tend to use the mean
because it includes all of the data in the calculations. However, if you have
a skewed distribution, the median is often the best measure of central
tendency.

● When you have ordinal or discrete data, the median or mode is usually
the best choice. For nominal (unordered categorical) data, you have to use the mode.

Central Limit Theorem
The central limit theorem states that the distribution of sample means approximates a normal
distribution as the sample size gets larger (assuming that all samples are identical in size),
regardless of population distribution shape.

CLT in one sentence "Even if I'm not normal, the average is normal"

As a rule of thumb, when collecting sample means from any distribution, each sample used to
calculate a mean should contain at least 30 observations.
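
The theorem can be illustrated with a small simulation. This sketch assumes an exponential population (strongly skewed, with mean 1 and standard deviation 1); the sample size of 30, the 2,000 repetitions, and the seed are illustrative choices.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed (an assumption) for repeatability

def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, SD 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

n = 30
means = [sample_mean(n) for _ in range(2000)]

# The sample means cluster around the population mean (1.0), with a spread
# close to sigma / sqrt(n), even though the population itself is not normal.
print(statistics.fmean(means), statistics.stdev(means), 1 / math.sqrt(n))
```

Plotting a histogram of `means` (for example with Matplotlib) would show the approximately bell-shaped curve.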

Correcting the Distortion in Normal Distribution
A transformation is nothing but a mathematical function applied
to the data.
1. Log Transformation [Each data point x is replaced with log(x)
to bring the data closer to a normal distribution (ND)]

2. Square-Root Transformation [Each data point is replaced by its
square root]

3. Reciprocal Transformation [Each data point x is replaced by its inverse, i.e., 1/x]

4. Box-Cox Transformation [Transforms non-normal
dependent variables towards a normal shape]

Reason: To transform the data to either reduce the skewness, to
normalize the data, or simply to make the data easier to understand.
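
The effect of these transformations on skewness can be sketched as follows. The income-like data are made up, the `skewness` helper uses the population moment formula, and the Box-Cox function shown takes a given lambda rather than estimating the best one (libraries such as SciPy can estimate it).

```python
import math
import statistics

def skewness(values):
    """Third standardized moment (population form)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

def box_cox(x, lam):
    """Box-Cox transform of a positive value x for a given lambda."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

incomes = [1, 2, 2, 3, 4, 5, 8, 15, 40, 120]  # hypothetical right-skewed data, all > 0

log_t = [math.log(x) for x in incomes]
sqrt_t = [math.sqrt(x) for x in incomes]
recip_t = [1 / x for x in incomes]
boxcox_t = [box_cox(x, 0.5) for x in incomes]  # lambda = 0.5 is an arbitrary choice

for name, values in [("raw", incomes), ("sqrt", sqrt_t), ("log", log_t)]:
    print(f"{name:5s} skewness = {skewness(values):.2f}")
```

For this data set the square root reduces the skewness somewhat and the log reduces it further, which is typical for strongly right-skewed positive data.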
Q-Q Plots
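Q-Q (quantile-quantile) plots are used to check whether a random variable follows a given
distribution, most often the Gaussian (normal) distribution. The sample quantiles are plotted
against the theoretical quantiles; points lying close to a straight line indicate a good fit.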
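
A rough sketch of how the points of a normal Q-Q plot could be computed by hand. The plotting positions (i − 0.5)/n are one common convention among several, and the function name `qq_points` is hypothetical; the pairs it returns could be passed to a scatter plot in Matplotlib.

```python
from statistics import NormalDist, fmean, pstdev

def qq_points(sample):
    """Return (theoretical quantile, standardized sample quantile) pairs."""
    n = len(sample)
    ordered = sorted(sample)
    m, s = fmean(ordered), pstdev(ordered)
    # Theoretical standard-normal quantiles at plotting positions (i - 0.5) / n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    observed = [(x - m) / s for x in ordered]
    return list(zip(theoretical, observed))

points = qq_points([4.1, 5.0, 5.2, 4.8, 6.3, 5.5, 4.6, 5.1])  # hypothetical sample
for t, o in points:
    print(f"{t:6.2f} {o:6.2f}")
```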

Outliers
An outlier is an observation point that is distant from other
observations. An outlier may be due to variability in the
measurement or it may indicate experimental error; the latter are
sometimes excluded from the data set.

https://tribe.datamites.com/posts/outliers

Probability Density Function
Probability density is the relationship between observations and their
probability.

The overall shape of the probability density is referred to as a probability
distribution, and the calculation of probabilities for specific outcomes of a
random variable is performed by a probability density function, or PDF for short.
The probability density function for the normal distribution with mean μ and standard deviation σ is given as

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

CDF: The cumulative distribution function provides a shortcut for calculating many probabilities at once. We integrate the PDF to get the
cumulative probability: $F(x) = \int_{-\infty}^{x} f(t)\,dt$.
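
The normal PDF and CDF can be explored with a short Python sketch. The hand-written `normal_pdf` simply spells out the standard normal density formula; the height distribution (mean 170, SD 10) is a made-up example.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density written out from its formula."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

heights = NormalDist(mu=170, sigma=10)  # hypothetical distribution of heights in cm

# The hand-written formula agrees with the library implementation.
print(normal_pdf(175, 170, 10), heights.pdf(175))

# The CDF gives cumulative probabilities; differences of CDF values give interval probabilities.
p_below_180 = heights.cdf(180)
p_between_160_and_180 = heights.cdf(180) - heights.cdf(160)
print(p_below_180, p_between_160_and_180)
```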

Standard Normal Distribution
The standard normal distribution is a special case of the normal distribution. It is the distribution that
occurs when a normal random variable has a mean of zero and a standard deviation of one.

Z- Score / Z-Value/ Standard Score

• A z-score (also known as a standard score) indicates how many
standard deviations an element is above or below the
mean. A z-score can be calculated from the following formula.

• z = (X - μ) / σ
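
A small worked example of the formula, with made-up numbers (a score of 85 in a population with mean 70 and standard deviation 10). The standard normal CDF plays the role of a z-table lookup.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z = z_score(85, mu=70, sigma=10)  # hypothetical values
area_below = NormalDist().cdf(z)  # what a z-table lookup would return

print(z, round(area_below, 4))
```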

Calculating Z-Score

Z-Table

Outliers

• Z-scores generally range from -3.0 to +3.0.
• For bell-shaped distributions, the empirical rule says 99.7% of all the data values have
z-scores between -3.0 and +3.0.
• We consider any z-score that is either less than -3.0 or greater than +3.0 to be an
outlier.
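
The rule above translates into a short filter. This sketch uses made-up readings and the population standard deviation; note that in very small samples no point can reach |z| > 3, so the rule only makes sense with enough data.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [x for x in values if abs((x - m) / s) > threshold]

# Hypothetical readings: nineteen values near 50 and one extreme value.
readings = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
            51, 52, 48, 50, 50, 49, 51, 50, 52, 120]

outliers = z_score_outliers(readings)
print(outliers)
```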
Measure of Distance

• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
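
The three distances can be sketched in plain Python using their standard definitions. Minkowski distance with p = 1 reduces to Manhattan distance and with p = 2 to Euclidean distance; the two points are arbitrary examples.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Generalized distance: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)  # hypothetical points
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 1), minkowski(a, b, 2))
```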

Euclidean Distance

Manhattan Distance

Minkowski Distance

Use of Distance Metrics

Statistical distances are used for several important reasons:

• Outlier Detection
• Clustering Analysis
• Feature Selection
• Data Cleaning
• Dimensionality Reduction

Covariance
• It measures the relationship between a pair of random variables, i.e., how a change in one variable
is associated with a change in the other variable. It does not by itself imply that one causes the other.
• It can take any value between -∞ and +∞, where a negative value represents a
negative relationship and a positive value represents a positive relationship.

Correlation
• It is the scaled version of covariance.
• Correlation is a step ahead of covariance as it quantifies the strength of the relationship between two
random variables. In simple terms, it is a unit-free measure of how these variables change
with respect to each other (the normalized covariance value), and it always lies between -1 and +1.
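
A minimal sketch contrasting covariance and correlation, written out by hand with the sample (n − 1) formulas. The data are made up; rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
import statistics

def covariance(xs, ys):
    """Sample covariance: average product of deviations, divided by n - 1."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

hours = [1, 2, 3, 4, 5]          # hypothetical study hours
marks = [2, 4, 6, 8, 10]         # hypothetical marks
marks_scaled = [m * 10 for m in marks]  # same marks on a different scale

print(covariance(hours, marks), correlation(hours, marks))
print(covariance(hours, marks_scaled), correlation(hours, marks_scaled))
```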

Based on the correlation coefficient, we can select the
important features:

• If an independent feature is not correlated with the target variable, we could
remove that feature.
• Or, if two independent features are highly correlated with each other, we could remove either one
of them.

Hypothesis Testing &
Other computational Techniques
Module - 4

Hypothesis Testing
• Hypothesis is a statement, assumption or claim about the value of the parameter (mean,
variance, median etc.).
• A hypothesis is an educated guess about something in the world around you. It should
be testable, either by experiment or observation.

Example: Suppose we make the statement “Dhoni is the best Indian captain ever.” This is an assumption
that we are making based on the average wins and losses the team had under his captaincy. We can
test this statement using all the match data.

Comparing And Analyzing The Relationships
• Does the treatment with new drug help more patients than the standard
treatment with old drug?

• Which of these four methods is the most efficient way of teaching
machine learning?

Types of Hypothesis
• When a hypothesis specifies an exact value of the parameter, it is a simple
hypothesis. E.g., a motorcycle company claiming that a certain model
gives an average mileage of 100 km per litre is a case of a simple
hypothesis.

• If a hypothesis specifies a range of values, it is called a composite
hypothesis. E.g., “the average age of students in a class is greater than 20”
is a composite hypothesis.

Null Hypothesis (H0)
• The null hypothesis is the hypothesis to be tested for possible
rejection under the assumption that it is true.

• The concept of the null is similar to “innocent until proven
guilty”.

Alternate Hypothesis (HA)/(H1)
• The alternative hypothesis complements the null hypothesis.
• It is the opposite of the null hypothesis, such that the alternate and null
hypotheses together cover all the possible values of the population
parameter.

Hypothesis Testing – Case Discussion

• Consider a court of law; the null hypothesis is that the defendant is innocent

• We require evidence to reject the null hypothesis (convict)

• When we collect evidence and try to reject null hypothesis, there are 2 errors
that could potentially occur: Type 1 and Type 2 errors.

Type I and Type II Error

H0 – The defendant is innocent
Ha – The defendant is not innocent

Type I error – The null hypothesis is true, but we reject it (an innocent defendant is convicted).
Type II error – The null hypothesis is false, but we fail to reject it (a guilty defendant goes free).

Critical Region

• The critical region is the region of the sample
space such that, if the calculated value of the test statistic lies in it, we
reject the null hypothesis.
• The critical region lies in one tail or two tails of
the probability distribution curve, according to the
alternative hypothesis.
• The size of the critical region (its probability under the null hypothesis) is denoted by α.
• It is known as the level of significance, i.e., the
passing criterion of the test.

Three Cases Of Critical Region Arise

• If the alternate hypothesis allows values in both
directions (less than and greater than) of the value of
the parameter specified in the null hypothesis, it is called a
two-tailed test.
• e.g., H0: mean = 100, H1: mean ≠ 100.
Here, according to H1, the mean can be greater than or
less than 100, so this is a two-tailed test.
• If the alternate hypothesis allows values in only
one direction (either less than or greater than) of the
value of the parameter specified in the null hypothesis, it
is called a one-tailed test.
• e.g., H0: mean ≥ 100, H1: mean < 100.
Here, according to H1, the mean is less than 100, so this is a one-tailed
(left-tailed) test. Similarly, H0: mean ≤ 100 with H1: mean > 100 gives a
right-tailed test.
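
For a test statistic that follows the standard normal distribution, the boundaries of the three critical regions can be sketched as follows. The level α = 0.05 and the use of a z statistic are assumptions made for illustration.

```python
from statistics import NormalDist

alpha = 0.05  # level of significance (an illustrative choice)
z = NormalDist()

two_tailed_cutoff = z.inv_cdf(1 - alpha / 2)  # reject if |statistic| exceeds this
left_tailed_cutoff = z.inv_cdf(alpha)         # reject if statistic is below this
right_tailed_cutoff = z.inv_cdf(1 - alpha)    # reject if statistic is above this

def reject_two_tailed(statistic):
    """True when the statistic falls in either tail of the critical region."""
    return abs(statistic) > two_tailed_cutoff

print(round(two_tailed_cutoff, 3), round(left_tailed_cutoff, 3), round(right_tailed_cutoff, 3))
```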
p-Value
• The passing criterion (level of significance) of a
test is typically 1% or 5%. We must also
know the test's score, and this
score is known as the p-value.
• Technically, the p-value (probability value) is
the smallest level of significance at
which the null hypothesis can be rejected. Equivalently, it is the probability,
assuming the null hypothesis is true, of observing a result at least as extreme
as the one obtained.
• If the p-value is greater than alpha, we do
not reject the null hypothesis.
• If the p-value is smaller than alpha, we reject
the null hypothesis.

Process of Hypothesis Testing
1. Set up the null hypothesis and the alternate hypothesis.
2. Decide the level of significance (1% or 5%).
3. Select the test as per the requirement.
4. Calculate the p-value.
5. If the p-value is less than the level of significance, reject the
null hypothesis.
6. If the p-value is more than the level of significance, do not reject the null
hypothesis.
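
The steps above can be walked through with a two-tailed z-test of the claim that the population mean weight is 70 kg. All the numbers here are hypothetical: a sample of 36 people with mean 74 kg, and a population standard deviation assumed to be known and equal to 10 kg.

```python
import math
from statistics import NormalDist

def z_test_two_tailed(sample_mean, mu0, sigma, n):
    """Return the z statistic and two-tailed p-value for H0: mean = mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Step 1: H0: mean = 70, H1: mean != 70.  Step 2: alpha = 5%.
alpha = 0.05
# Steps 3 and 4: z-test (sigma assumed known), then compute the p-value.
z, p_value = z_test_two_tailed(sample_mean=74, mu0=70, sigma=10, n=36)
# Steps 5 and 6: compare the p-value with alpha.
decision = "reject H0" if p_value < alpha else "do not reject H0"

print(round(z, 2), round(p_value, 4), decision)
```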
Applications of Hypothesis Testing

• Feature Selection
• Model Evaluation & Selection
• Hyperparameter Tuning
• Anomaly Detection

Types of Hypothesis Tests

• Parametric tests: tests that make assumptions about the
shape of the distribution the sample is drawn from (typically a normal distribution).
• Non-parametric tests: tests that do not
make assumptions about the shape of that distribution.

Hypothesis Tests:-Parametric

1. Z test
2. T/Student’s T test
3. Paired t Test
4. One Way ANOVA

Hypothesis Tests:-Non Parametric
1. Chi Square Test
2. Mann-Whitney Test
3. Wilcoxon Signed-Rank Test
4. Kruskal-Wallis Test
5. Friedman’s ANOVA

T-test
• A t-test compares means: a one-sample t-test compares a sample mean with a hypothesized
population mean, and a two-sample t-test compares the means of two populations. It is commonly used
with small sample sizes, when the variances of the (normally distributed) populations
are not known.
• This helps in finding the association between categorical and
continuous features.

$t = \frac{\bar{X} - \mu}{S / \sqrt{N}}$

$\bar{X}$ -> Mean of sample set
μ -> Mean of population
S -> Standard deviation of sample
N -> Sample size
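
A minimal sketch of the one-sample t statistic built from the four quantities listed above. The nine weights and the hypothesized mean of 70 are made up; the critical value 2.306 is the standard t-table entry for 8 degrees of freedom at a two-tailed 5% level.

```python
import math
import statistics

def one_sample_t(sample, mu):
    """t = (sample mean - mu) / (S / sqrt(N)), with S the sample standard deviation."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    return (x_bar - mu) / (s / math.sqrt(n))

weights = [72, 68, 75, 71, 69, 74, 73, 70, 76]  # hypothetical sample, N = 9
t_stat = one_sample_t(weights, mu=70)

t_critical = 2.306  # t-table value for df = 8, two-tailed alpha = 0.05
decision = "reject H0" if abs(t_stat) > t_critical else "do not reject H0"
print(round(t_stat, 3), decision)
```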

The t-test helps to determine whether there is a
statistically significant difference in the average
salaries between these two groups.
H0: There is no significant difference between the two
groups.
Female Group:
Mean=10,70,000 SD=3,95,995.81
Male Group:
Mean=9,02,000 SD=7,52,807.45

T-Statistic = 4.24
P-Value = 0.001 (From T-Table)

Since the p-value is below the level of significance, there is a significant difference in salaries
between the two groups.

Paired T-test

The paired t-test is performed when the samples
typically consist of matched pairs of similar units, or
when there are cases of repeated measures.

For example, there may be instances of the same
patients being tested repeatedly, before and after
receiving a particular treatment. In such cases, each
patient is being used as a control sample against
themselves.
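
The before-and-after case can be sketched as a one-sample t-test on the per-patient differences. The eight pairs of readings are made up; 2.365 is the standard t-table value for 7 degrees of freedom at a two-tailed 5% level.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic for H0: the mean of the paired differences is zero."""
    diffs = [b - a for b, a in zip(before, after)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical readings for the same eight patients before and after treatment.
before = [120, 122, 143, 100, 109, 130, 125, 118]
after = [115, 120, 135, 98, 108, 124, 121, 113]

t_stat = paired_t(before, after)
t_critical = 2.365  # t-table value for df = 7, two-tailed alpha = 0.05
decision = "reject H0" if abs(t_stat) > t_critical else "do not reject H0"
print(round(t_stat, 3), decision)
```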

One-Way ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are
any statistically significant differences between the means of three or more
independent (unrelated) groups.

Within-Group Variance - How close the values within each group are to each other
Between-Group Variance - How far apart the groups are from each other

• If the within-group variation is less and
between-group variation is high, then it
means this feature impacts the target
variable. Hence it is an important feature.
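
The comparison of between-group and within-group variation is summarized by the F statistic, sketched below with three tiny made-up groups. The critical value 5.14 is the standard F-table entry for (2, 6) degrees of freedom at the 5% level.

```python
from statistics import fmean

def one_way_anova_f(*groups):
    """F = between-group mean square / within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical target values for three categories of one feature.
f_stat = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])

f_critical = 5.14  # F-table value for df = (2, 6), alpha = 0.05
print(round(f_stat, 2), "significant" if f_stat > f_critical else "not significant")
```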

Chi-Square Test
The chi-square test is used for categorical features in a dataset. We
calculate chi-square between each feature and the target and select
the desired number of features with the best chi-square scores. It
determines whether the association between two categorical variables in the
sample reflects their real association in the population. The chi-square
score is given by

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

Observed frequency ($O_i$) = No. of observations of class

Expected frequency ($E_i$) = No. of expected observations of class if there was no relationship between the
feature and the target. Expected Frequency = (Row Total * Column Total)/N

H0: There is no association between two categorical features

Contingency Table: Observed Frequency

Expected Frequency = (row total * column total) / Grand Total

Python Implementation
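
A minimal sketch of the chi-square computation on a contingency table, written in plain Python. The 2x2 table of observed counts is made up; 3.841 is the standard chi-square table value for 1 degree of freedom at the 5% level. In practice a library routine such as SciPy's `chi2_contingency` would typically be used instead.

```python
def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - expected) ** 2 / expected
    return chi2

# Hypothetical contingency table: rows = feature categories, columns = target classes.
observed = [[20, 30],
            [30, 20]]

chi2 = chi_square(observed)
chi2_critical = 3.841  # table value for df = (2 - 1) * (2 - 1) = 1, alpha = 0.05
print(chi2, "reject H0" if chi2 > chi2_critical else "do not reject H0")
```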

Thank You.
