Statistics For Data Science by Mihir Patnaik

This presentation takes a deep dive into statistics for data science.

Uploaded by

Mihir Patnaik


STATISTICS ESSENTIALS

Accredited by IABAC

Vantharr. All Rights Reserved | www.vantharr.com STATISTICS ESSENTIALS 1

Overview of Statistics
Module - 1


Statistics
Statistics is the science of collecting, describing, and interpreting data.

Two areas of Statistics:

Descriptive statistics – Methods of organizing, summarizing,
and presenting data in an informative way
Inferential statistics – The methods used to determine
something about a population on the basis of a sample

Descriptive Statistics
Descriptive statistics are methods for organizing and summarizing data.
For example, tables or graphs are used to organize data, and descriptive values
such as the average score are used to summarize data.
A descriptive value for a population is called a parameter and a descriptive value
for a sample is called a statistic.
Collect data
e.g., Survey
Present data
e.g., Tables and graphs
Summarize data
e.g., Sample mean = (Σ x_i) / n

Inferential Statistics
• Inferential statistics are methods for using sample data to make general conclusions
(inferences) about populations.
• Because a sample is typically only a part of the whole population, sample data provide
only limited information about the population. As a result, sample statistics are
generally imperfect representatives of the corresponding population parameters.

• Estimation
– e.g., Estimate the population mean weight using the
sample mean weight
• Hypothesis testing
– e.g., Test the claim that the population mean weight
is 70 kg
Inference is the process of drawing conclusions or making decisions about a population
based on sample results.
Definitions of Basic Terms
Population: A collection, or set, of individuals or objects or events whose properties are to
be analyzed.

Two kinds of populations: finite or infinite.

Sample: A subset of the population.

Variable: A characteristic about each individual element of a population or sample.

Data (singular): The value of the variable associated with one element of a population or
sample. This value may be a number, a word, or a symbol.

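Definitions of Basic Terms
Data (plural): The set of values collected for the variable from each of the elements
belonging to the sample.

Random Variable: A variable whose value is determined by the outcome of a random
experiment, e.g., the number of heads in three coin tosses. Unlike a programming variable,
which is a placeholder that can store a number, a string, or a sentence, a random variable
takes each of its values with some probability.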

Experiment: A planned activity whose results yield a set of data.

Parameter: A numerical value summarizing all the data of an entire population.

Statistic: A numerical value summarizing the sample data.

Examples

Types of Data

Examples of variables
Example: Identify each of the following as examples of qualitative or numerical
variables:
1. The temperature in Barrow, Alaska at 12:00 pm on any
given day.
2. The model of automobile.
3. Whether or not a 6 volt lantern battery is defective.
4. The weight of a lead pencil.
5. The length of time billed for a long distance telephone call.
6. The brand of cereal children eat for breakfast.
7. The type of book taken out of the library by an adult.

Examples of variables…
Example: Identify each of the following as examples of
(1) nominal, (2) ordinal, (3) discrete, or (4) continuous variables:

• The length of time until a pain reliever begins to work.

• The number of chocolate chips in a cookie.
• The number of colors used in a statistics textbook.
• The brand of refrigerator in a home.
• The overall satisfaction rating of a new car.
• The number of files on a computer’s hard disk.
• The pH level of the water in a swimming pool.
• The number of staples in a stapler.

Harnessing Data
Module - 2

Steps Involved In Descriptive Statistics

• Collecting the data

• Presenting the data: visualization using Matplotlib and Seaborn

• Summarizing the data: see Module 3

Making sense of the data

A sample drawn from the population should have the same characteristics as the population.

Sampling can be:

• with replacement: a member of the population may be chosen more than once
• without replacement: a member of the population may be chosen only
once (lottery ticket)
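
The two sampling modes above can be sketched with Python's standard `random` module. The population of ten numbered tickets and the seed are assumptions made only for illustration.

```python
import random

# Hypothetical population: ten lottery tickets numbered 1..10.
population = list(range(1, 11))
rng = random.Random(42)  # fixed seed so the sketch is repeatable

# With replacement: the same member may be chosen more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: each member can be chosen at most once.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

`choices` may repeat a ticket, while `sample` never does, which mirrors the lottery-ticket case.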

Collecting The Data
Step 1: Define the objective or aim of the experiment,
e.g., estimate the average life of an electronic component.

Step 2: Define the variable and the population of interest,
e.g., usage, power rating, battery life.

Step 3: Define the data collection scheme and the data measuring scheme,
e.g., sampling procedure, sample size, data measuring device.

Step 4: Define the appropriate descriptive and inferential analysis techniques.
Methods Used To Collect Data
Experiment: The investigator controls or modifies the environment and observes the effect on
the variable under study.

Survey: Data are obtained by sampling some of the population of interest. The
investigator does not modify the environment.

Census: A 100% survey. Every element of the population is listed. Seldom used: difficult
and time-consuming to compile, and expensive.

Judgment Samples: A non-probability sampling technique in which the sample members
are chosen only on the basis of the researcher's knowledge and judgment.

Probability Samples: Samples in which the elements to be selected are drawn on the basis of
probability. Each element in a population has a certain probability of being selected as part of
the sample.
Types Of Sampling

Probability Sampling
1. Simple random sampling: each sample of the same size has an equal chance of being selected.

2. Stratified sampling: divide the population into groups called strata and then take a sample
from each stratum.

3. Cluster sampling: divide the population into groups called clusters and then randomly select
some of the clusters. All the members of the selected clusters are in the cluster sample.

4. Systematic sampling: randomly select a starting point and take every n-th piece of data from
a listing of the population.

5. Multistage random sampling: divide the population into clusters and select some clusters at
the first stage. At each subsequent stage, divide the selected clusters into smaller
clusters, and repeat the process until you reach the desired sample size.
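
A minimal sketch of four of these schemes, using only the standard library. The population of 100 records, the group labels "A" and "B", and all function names are hypothetical and chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical population: 100 records, each tagged with a group label.
population = [{"id": i, "group": "A" if i < 60 else "B"} for i in range(100)]

def simple_random(pop, n):
    """Every subset of size n is equally likely."""
    return rng.sample(pop, n)

def stratified(pop, fraction):
    """Split into strata by group, then sample the same fraction from each."""
    strata = {}
    for member in pop:
        strata.setdefault(member["group"], []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

def systematic(pop, step):
    """Random starting point, then every step-th member of the listing."""
    start = rng.randrange(step)
    return pop[start::step]

def cluster(pop, cluster_size, n_clusters):
    """Split the listing into clusters, pick some clusters, keep all their members."""
    clusters = [pop[i:i + cluster_size] for i in range(0, len(pop), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [member for c in chosen for member in c]

print(len(simple_random(population, 10)), len(stratified(population, 0.1)),
      len(systematic(population, 10)), len(cluster(population, 10, 2)))
```

Stratified sampling takes some members from every group, whereas cluster sampling takes every member from only some groups.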


Sampling Example

Example: An employer is interested in the time it takes each employee to commute
to work each morning. A random sample of 35 employees will be selected and their
commuting time will be recorded.

There are 2712 employees.

Each employee is numbered: 0001, 0002, 0003, etc. up to 2712.

Using four-digit random numbers, a sample is identified: 1315, 0987, 1125, etc.
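
The selection in this example could be sketched as follows. Instead of reading four-digit numbers from a random-number table, the sketch swaps in `random.sample`, which also draws without replacement; the seed is an arbitrary assumption.

```python
import random

rng = random.Random(2712)  # arbitrary seed, for repeatability only

# Employee numbers "0001" .. "2712", as in the example.
employee_ids = [f"{i:04d}" for i in range(1, 2713)]

# Draw 35 distinct employees; each employee is equally likely to be chosen.
sample_ids = rng.sample(employee_ids, 35)

print(sample_ids[:5])
```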

Sampling Error
• The discrepancy between a sample statistic and
its population parameter is called sampling
error.
• Defining and measuring sampling error is a large
part of inferential statistics

Figure 1.2
A demonstration of sampling error. Two samples are
selected from the same population. Notice that the sample
statistics are different from one sample to another, and all
of the sample statistics are different from the corresponding
population parameters. The natural differences that exist,
by chance, between a sample statistic and a population
parameter are called sampling error.
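
Sampling error can be made concrete with a small simulation. The population below (10,000 simulated weights with mean near 70 and standard deviation near 10) is an assumption for illustration, not data from the slides.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population of 10,000 weights in kg.
population = [rng.gauss(70, 10) for _ in range(10_000)]
mu = statistics.fmean(population)  # population parameter

# Two samples of 25 drawn from the same population.
sample_a = rng.sample(population, 25)
sample_b = rng.sample(population, 25)

# Sampling error = sample statistic minus population parameter.
error_a = statistics.fmean(sample_a) - mu
error_b = statistics.fmean(sample_b) - mu

print(f"population mean {mu:.2f}, error A {error_a:+.2f}, error B {error_b:+.2f}")
```

The two errors differ from each other and from zero purely by chance, which is the point the figure caption makes.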


Exploratory Analysis
Module - 3

Measures of Central Tendencies


Central Tendency

The property of data being concentrated in the centre.


Measures of Central Tendency

• Mean
• Median
• Mode

Mean

The mean is the average of all numbers, i.e., their sum divided by how many there are,
and is sometimes called the arithmetic mean.


Median

The statistical median is the middle number in a sequence of numbers. To find the
median, arrange the numbers in order by size; the number in the middle is the median.
If there is an even number of values, the median is the average of the two middle numbers.

Mode

The mode is the number that occurs most often within a set of numbers.
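
The three measures of central tendency can be computed with Python's standard `statistics` module. The list of scores is made up for illustration.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]  # hypothetical data

mean_value = statistics.mean(scores)      # (2+3+3+5+7+10) / 6 = 5
median_value = statistics.median(scores)  # average of the two middle values 3 and 5 = 4.0
mode_value = statistics.mode(scores)      # 3 occurs most often

print(mean_value, median_value, mode_value)
```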


Measure of Spread / Data Variability

Range

The range is the difference between the highest and lowest values within a set of numbers.

Interquartile Range (IQR)

The interquartile range is the spread of the middle half of the data.
Mathematically, IQR = Q3 - Q1, so it covers the 50% of data points
that fall between the first quartile (Q1) and the third quartile (Q3).
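
A short sketch of the range and the IQR on made-up data. Note that several conventions exist for computing quartiles; `statistics.quantiles` uses its default ("exclusive") method here, so other tools may report slightly different Q1 and Q3.

```python
import statistics

values = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # hypothetical data

# Range: highest value minus lowest value.
data_range = max(values) - min(values)  # 15 - 1 = 14

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4)

# IQR: width of the middle half of the data.
iqr = q3 - q1

print(f"range={data_range}, Q1={q1}, Q3={q3}, IQR={iqr}")
```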

© 2018 DataMites. All Rights Reserved | www.datamites.com STATISTICS ESSENTIALS 39
Standard Deviation (σ)
Standard Deviation (SD) is a measure that is used to quantify the amount of
variation or dispersion of a set of data values. It is the square root of the variance.

Variance (σ²)
Variance is the average squared difference of the values from the mean.
Unlike the range and the interquartile range, the variance includes all
values in the calculation by comparing each value to the mean.
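
As a worked example on made-up data, the sketch below computes the variance as the average squared deviation and the standard deviation as its square root. It also shows the sample variance, which divides by n - 1 instead of n; that variant is not described on the slide and is included only for comparison.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

mean_value = statistics.fmean(values)

# Squared deviations from the mean: 9, 1, 1, 1, 0, 0, 4, 16 (sum = 32).
population_variance = statistics.pvariance(values)  # 32 / 8 = 4
population_sd = statistics.pstdev(values)           # sqrt(4) = 2.0

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(values)       # 32 / 7

print(mean_value, population_variance, population_sd, sample_variance)
```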

Percentile
• A percentile (or a centile) is a measure used in statistics
indicating the value below which a given percentage of
observations in a group of observations fall.
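
The definition can be read in reverse: given a value, what percentage of observations fall below it? The helper `percentile_rank` and the list of marks are hypothetical names and data used only for this sketch.

```python
def percentile_rank(values, x):
    """Percentage of observations strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

marks = [35, 42, 50, 58, 61, 67, 73, 80, 88, 95]  # hypothetical exam marks

# Six of the ten marks lie below 73, so 73 sits at the 60th percentile.
rank_of_73 = percentile_rank(marks, 73)
print(rank_of_73)
```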

Distributions
A distribution describes how often each value (or range of values) occurs among the observations; it is usually shown graphically, e.g., as a histogram.

Normal Distribution
Normal distribution, also known as the Gaussian distribution, is a probability distribution that
is symmetric about the mean, showing that data near the mean are more frequent in
occurrence than data far from the mean.

Properties Of Normal Distribution
1. Empirical Rule
2. Distortion in Normal Distribution
3. Central Limit Theorem
4. Standard Normal Distribution
5. Outliers
6. QQ plot
7. Log, Sqrt, Box-Cox transformation

• The empirical rule states that for a normal distribution, nearly all of the data
fall within three standard deviations of the mean. The empirical rule can be broken
down into three parts:
• About 68% of the data fall within one standard deviation of the mean. (1 Sigma)
• About 95% fall within two standard deviations. (2 Sigma)
• About 99.7% fall within three standard deviations. (3 Sigma)
• Points lying beyond 3 sigma are commonly treated as outliers.
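
The three percentages can be checked numerically with `statistics.NormalDist` from the standard library. The helper name `share_within` is an assumption for this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within = {k: share_within(k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} sigma: {share:.4f}")
```

The printed shares round to roughly 68%, 95% and 99.7%, in line with the rule.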

Distortion in Normal Distribution
The distortion in normally distributed curves can be
quantified in 2 ways
1. Skewness
2. Kurtosis

Skewness in Normal Distribution
Skewness is asymmetry in a statistical distribution, in which the curve appears
distorted or skewed either to the left or to the right. Skewness can be quantified
to define the extent to which a distribution differs from a normal distribution.
If the skewness is greater than 1 or less than -1, the data are highly skewed.
Kurtosis in Normal Distribution

In probability theory and statistics, kurtosis is a measure of the "peakedness"
of the probability distribution of a real-valued random variable; more precisely,
it reflects how heavy the tails are relative to a normal distribution.

How much Skewness and Kurtosis

● If the skewness is between -0.5 and 0.5, the data are fairly symmetrical. If
the skewness is between -1 and -0.5 or between 0.5 and 1, the data are
moderately skewed.

● If the skewness is greater than 1 or less than -1, the data are highly skewed.

● A normal distribution has a kurtosis of 3 and is called mesokurtic. An increased
kurtosis (>3) can be visualized as a thin "bell" with a high peak and heavier tails,
whereas a decreased kurtosis (<3) corresponds to a broader, flatter peak and
thinner tails.
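
A minimal sketch of how these two numbers can be computed from the central moments of the data (population, i.e. uncorrected, versions). The function names and the two small datasets are assumptions for illustration; statistical packages often apply bias corrections, so their values may differ slightly.

```python
import statistics

def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third central moment divided by the 1.5th power of the variance."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 10]

print(skewness(symmetric), skewness(right_skewed), kurtosis(symmetric))
```

The symmetric list has zero skewness, while the list with one large value has skewness above 1 and would count as highly skewed under the thresholds above.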

Which is Best—the Mean, Median, or Mode?

● When you have a symmetrical distribution for continuous data, the mean,
median, and mode are equal. In this case, analysts tend to use the mean
because it includes all of the data in the calculations. However, if you have
a skewed distribution, the median is often the best measure of central
tendency.

● When you have ordinal or discrete data, the median or mode is usually
the best choice. For nominal (unordered categorical) data, you have to use the mode.

Central Limit Theorem
The central limit theorem states that the distribution of sample means approximates a normal
distribution as the sample size gets larger (assuming that all samples are identical in size),
regardless of population distribution shape.

CLT in one sentence "Even if I'm not normal, the average is normal"

When collecting the means of samples from any distribution, a common rule of thumb is that
each sample used to calculate a mean should contain at least 30 observations.
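
The theorem can be illustrated with a small simulation. The population here is assumed to be exponential with mean 1 (strongly right-skewed); the number of repetitions and the seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(7)

def sample_mean(n):
    """Mean of n draws from an exponential population with mean 1."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

# 2000 sample means, each computed from a sample of size 30.
means = [sample_mean(30) for _ in range(2000)]

center = statistics.fmean(means)  # should be close to the population mean, 1
spread = statistics.stdev(means)  # should be close to 1 / sqrt(30), about 0.18

print(f"mean of sample means {center:.3f}, std of sample means {spread:.3f}")
```

Although individual draws are far from normal, the sample means cluster symmetrically around 1 with a spread close to the population standard deviation divided by the square root of the sample size.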

Correcting The Distortion In Normal Distribution
Transformation is nothing but taking a mathematical function and applying it
to the data.
1. Log Transformation [Each data point x is replaced with log(x) to bring the data closer to a normal distribution (ND)]

2. Square-Root Transformation [Each data point is replaced by its square root]

3. Reciprocal Transformation [It takes the inverse of x, i.e., 1/x]

4. Box-Cox Transformation [Transformation of non-normal dependent variables to a normal shape]

Reason: To transform the data to either reduce the skewness or to
normalize the data or simply make the data easier to understand.
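
The four transformations can be sketched as plain functions. All of them assume strictly positive data. For Box-Cox, the parameter `lam` is passed in by hand here as an assumption; in practice it is usually estimated from the data.

```python
import math

def log_transform(values):
    return [math.log(v) for v in values]

def sqrt_transform(values):
    return [math.sqrt(v) for v in values]

def reciprocal_transform(values):
    return [1 / v for v in values]

def box_cox(values, lam):
    """Box-Cox: (x**lam - 1) / lam, with log(x) as the special case lam = 0."""
    if lam == 0:
        return log_transform(values)
    return [(v ** lam - 1) / lam for v in values]

incomes = [1, 2, 2, 3, 4, 5, 8, 20, 100]  # hypothetical right-skewed data

print([round(v, 2) for v in log_transform(incomes)])
print([round(v, 2) for v in box_cox(incomes, 0.5)])
```

After the log transform the largest value, 100, is pulled in much closer to the rest of the data, which is how the skewness is reduced.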
Q-Q Plots
Q-Q (quantile-quantile) plots compare the quantiles of the data with the quantiles of a
theoretical distribution. They are used to check whether a random variable follows a
Gaussian (normal) distribution or not.
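
The points of a normal Q-Q plot can be computed without any plotting library. The helper `qq_points` is a hypothetical name, and the plotting position (i + 0.5) / n is one common convention among several.

```python
from statistics import NormalDist

def qq_points(values):
    """Pair each sorted observation with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

data = [4.9, 5.1, 5.0, 4.7, 5.4, 5.2, 4.8]  # hypothetical measurements
for theoretical_q, observed in qq_points(data):
    print(f"{theoretical_q:+.2f}  {observed}")
```

If the data are roughly normal, these points fall close to a straight line when plotted, for example with Matplotlib.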

Outliers
An outlier is an observation point that is distant from other observations.
An outlier may be due to variability in the measurement or it may indicate
experimental error; the latter are sometimes excluded from the data set.

https://tribe.datamites.com/posts/outliers

Probability Density Function
Probability density is the relationship between observations and their
probability.

The overall shape of the probability density is referred to as a probability
distribution, and the calculation of probabilities for specific outcomes of a
random variable is performed by a probability density function, or PDF for short.
The probability density function for the Normal distribution with mean μ and
standard deviation σ is given as

f(x) = (1 / (σ √(2π))) · exp(-(x - μ)² / (2σ²))

CDF: The cumulative distribution function provides a shortcut for calculating many
probabilities at once. We integrate the PDF from -∞ up to x to get the
cumulative probability P(X ≤ x).
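
A brief sketch of the normal PDF written out by hand and of the CDF taken from `statistics.NormalDist`. The weights distribution (mean 70 kg, standard deviation 10 kg) is an assumption for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

weights = NormalDist(mu=70, sigma=10)  # hypothetical weights in kg

density_at_mean = normal_pdf(70, 70, 10)

# CDF: probability of a value at or below x.
prob_below_80 = weights.cdf(80)
# Probability of an interval is a difference of two CDF values.
prob_60_to_80 = weights.cdf(80) - weights.cdf(60)

print(f"{density_at_mean:.4f} {prob_below_80:.4f} {prob_60_to_80:.4f}")
```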

Standard Normal Distribution
The standard normal distribution is a special case of the normal distribution. It is the distribution that
occurs when a normal random variable has a mean of zero and a standard deviation of one.

• A z-score (aka a standard score) indicates how many standard deviations an
element is above or below the mean. A z-score can be calculated from the
following formula.

• z = (X - μ) / σ

• Z-scores generally range from -3.0 to +3.0.
• For bell-shaped distributions, the empirical rule says 99.7% of all the data
values have z-scores between -3.0 and +3.0.

• We consider any z-score that is
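
The z-score formula z = (X - μ) / σ can be sketched as below. The data are made up, and μ and σ are taken as the mean and population standard deviation of that same list.

```python
import statistics

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data: mean 5, standard deviation 2
z = z_scores(values)

print(z)  # the value 9 is two standard deviations above the mean
```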

• Euclidean Distance
• Manhattan Distance
• Minkowski Distance

Statistical distances are used for several important reasons:

• Outlier Detection
• Clustering Analysis
• Feature Selection
• Data Cleaning
• Dimensionality Reduction
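
The three distances named above are related: Minkowski distance with p = 1 is the Manhattan distance and with p = 2 is the Euclidean distance. A minimal sketch, with points given as tuples of equal length:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    """Sum of absolute coordinate differences (p = 1)."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Straight-line distance (p = 2)."""
    return minkowski(a, b, 2)

origin, point = (0, 0), (3, 4)
print(euclidean(origin, point), manhattan(origin, point))
```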

Covariance
• It measures the relationship between a pair of random variables, i.e., how a change in one
variable is associated with a change in the other variable (it does not by itself show causation).
• It can take any value between -∞ and +∞, where a negative value represents a
negative relationship and a positive value represents a positive relationship.

Correlation
• It is the scaled version of covariance.
• Correlation is a step ahead of covariance as it quantifies the strength of the relationship
between two random variables. In simple terms, it is a unit-free measure of how these variables
change with respect to each other (normalized covariance value), ranging from -1 to +1.

• If the independent feature is not correlated with the target variable, we could
remove that feature.
• Or if two independent features are highly correlated, we could remove any one
feature.
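
Covariance and its scaled version, the (Pearson) correlation, can be sketched as follows. The sample versions (dividing by n - 1) are used, and the study-hours data are made up for illustration.

```python
import statistics

def covariance(x, y):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance scaled by both standard deviations; always between -1 and +1."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))

hours = [1, 2, 3, 4, 5]        # hypothetical hours studied
score = [52, 55, 61, 64, 70]   # hypothetical exam scores

print(covariance(hours, score), correlation(hours, score))
```

Changing the units of `score` (say, to a 0-1 scale) changes the covariance but leaves the correlation unchanged, which is why correlation is easier to compare across features.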

Hypothesis Testing
• Hypothesis is a statement, assumption or claim about the value of the parameter (mean,
variance, median etc.).
• A hypothesis is an educated guess about something in the world around you. It should
be testable, either by experiment or observation.

Ex: Suppose we make the statement "Dhoni is the best Indian captain ever." This is an assumption
that we are making based on the average wins and losses the team had under his captaincy. We can
test this statement based on all the match data.

Comparing And Analyzing The Relationships
• Does treatment with the new drug help more patients than the standard
treatment with the old drug?

• Which of these four methods is the most efficient way of teaching
machine learning?

Types of Hypothesis
• When a hypothesis specifies an exact value of the parameter, it is a simple
hypothesis. E.g., a motorcycle company claiming that a certain model
gives an average mileage of 100 km per litre is a case of a simple
hypothesis.

• If a hypothesis specifies a range of values, it is called a composite
hypothesis. E.g., "The average age of students in a class is greater than 20."
This statement is a composite hypothesis.

Null Hypothesis (H0)
• The null hypothesis is the hypothesis to be tested for possible
rejection under the assumption that it is true.

• The concept of the null is similar to innocent until proven guilty.

Alternate Hypothesis (HA)/(H1)
• The alternative hypothesis complements the null hypothesis.
• It is the opposite of the null hypothesis, such that the alternative and null
hypotheses together cover all the possible values of the population
parameter.

• Consider a court of law; the null hypothesis is that the defendant is innocent.

• We require evidence to reject the null hypothesis (convict).

• When we collect evidence and try to reject the null hypothesis, there are 2 errors
that could potentially occur: Type 1 and Type 2 errors.

H0: The defendant is innocent
Ha: The defendant is not innocent

Type 1 error: the null hypothesis is true, but we reject it.
Type 2 error: the null hypothesis is false, but we fail to reject it.

• The critical region is the region of the sample space in which, if the calculated
value of the test statistic lies there, we reject the null hypothesis.
• The critical region lies in one tail or two tails of the probability distribution
curve, according to the alternative hypothesis.
• The size (probability) of the critical region is denoted by α.
• It is known as the level of significance, i.e., the passing criterion of the test.

• If the alternative hypothesis allows values in both directions (less than and
greater than) of the parameter value specified in the null hypothesis, it is called
a two-tailed test.
• E.g., H0: mean = 100, H1: mean not equal to 100.
• Here, according to H1, the mean can be greater than or less than 100. This is an
example of a two-tailed test.
• If the alternative hypothesis allows values in only one direction (either less than
or greater than) of the parameter value specified in the null hypothesis, it is
called a one-tailed test.
• E.g., H0: mean >= 100, H1: mean < 100.
• Here, according to H1, the mean is less than 100, so it is a one-tailed test.
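
To make the critical region concrete, the sketch below finds the cut-off values for a test statistic that is assumed to follow the standard normal distribution (a z statistic), at an assumed α of 0.05. The function names are hypothetical.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

# Two-tailed test (H1: mean != 100): alpha is split between both tails.
two_tailed_cutoff = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96

# Left-tailed test (H1: mean < 100): all of alpha sits in the lower tail.
left_tailed_cutoff = std_normal.inv_cdf(alpha)         # about -1.645

def reject_two_tailed(z):
    """Reject H0 when z falls in either tail of the critical region."""
    return abs(z) > two_tailed_cutoff

def reject_left_tailed(z):
    """Reject H0 when z falls in the lower tail of the critical region."""
    return z < left_tailed_cutoff

print(round(two_tailed_cutoff, 3), round(left_tailed_cutoff, 3))
```

A statistic of z = -1.8 would be rejected by the left-tailed test but not by the two-tailed test, because the two-tailed test spreads α over both tails.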
p-Value
• The passing criterion of a test (the level of significance) is typically 1% or 5%.
We must also know the test score to compare against it; this score is known as
the p-value.
• Technically, the p-value (probability value) is the smallest level of significance
at which the null hypothesis can be rejected.
• If the p-value is greater than alpha, we do not reject the null hypothesis.
• If the p-value is smaller than alpha, we reject the null hypothesis.

Process of Hypothesis Testing
1. Set up the null hypothesis and the alternative hypothesis.
2. Decide the level of significance (1% or 5%).
3. Select the test as per requirement.
4. Calculate the p-value.
5. If the p-value is less than the level of significance, reject the null hypothesis.
6. If the p-value is more than the level of significance, fail to reject the null
hypothesis (this is not the same as proving it true).
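
The steps above can be sketched end to end with a two-tailed z-test of the claim that the population mean weight is 70 kg. The sample figures (n = 36, sample mean 72.5, known population standard deviation 6) are assumptions made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(sample_mean, mu0, sigma, n):
    """Return the z statistic and two-tailed p-value for H0: mean = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Step 1: H0: mean = 70, H1: mean != 70.
# Step 2: level of significance.
alpha = 0.05
# Steps 3-4: z-test (sigma assumed known), then the p-value.
z_stat, p_value = z_test_two_tailed(sample_mean=72.5, mu0=70, sigma=6, n=36)
# Steps 5-6: compare the p-value with alpha.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"z = {z_stat:.2f}, p = {p_value:.4f}, decision: {decision}")
```

With these numbers the result is significant at 5% but would not be at 1%, which shows why the level of significance has to be fixed before looking at the p-value.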
Applications of Hypothesis Testing

• Feature Selection
• Model Evaluation & Selection
• Hyperparameter Tuning
• Anomaly Detection

• Parametric tests: tests that make assumptions about the shape of the distribution
(e.g., normality) from which the sample is drawn.
• Non-parametric tests: tests that do not make assumptions about the shape of the
distribution from which the sample is drawn.

Hypothesis Tests:-Parametric
1. Z test
2. T/Student’s T test
3. Paired t Test
4. One Way ANOVA

Hypothesis Tests:-Non Parametric
1. Chi Square Test
2. Mann-Whitney Test
3. Wilcoxon Signed-Rank Test
4. Kruskal-Wallis Test
5. Friedman’s ANOVA

T-test
• A t-test compares the means of two populations through statistical examination;
a t-test with two samples is commonly used with small sample sizes, testing the
difference between the samples when the variances of the two normal distributions
are not known.
• This helps in finding the association between categorical and continuous features.
For a single sample compared with a population mean, the test statistic is
t = (X - μ) / (S / √N)
X -> Mean of sample set
μ -> Mean of Population
S -> Standard deviation of sample
N -> Sample size
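
Using the symbols defined above, the one-sample t statistic is t = (X - μ) / (S / √N) with N - 1 degrees of freedom. A minimal sketch on made-up data; turning t into a p-value needs a t-table or a statistics package and is left out here.

```python
from math import sqrt
import statistics

def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: mean = mu0."""
    n = len(sample)
    x_bar = statistics.fmean(sample)  # X: mean of the sample
    s = statistics.stdev(sample)      # S: sample standard deviation
    t = (x_bar - mu0) / (s / sqrt(n))
    return t, n - 1

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, mean 5
t_stat, dof = one_sample_t(sample, mu0=4)

print(f"t = {t_stat:.4f} with {dof} degrees of freedom")
```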

The T-Test helps to determine whether there is a
statistically significant difference in the average
salaries between these two groups.
H0: There is no significant difference between two
groups.
Female Group:
Mean=10,70,000 SD=3,95,995.81
Male Group:
Mean=9,02,000 SD=7,52,807.45

T-Statistics = 4.24
P-Value = 0.001 (From T-Table)

Since the p-value is below the level of significance, we reject H0: there is a
significant difference in salaries between the two groups.

The paired t-test is performed when the samples typically consist of matched pairs
of similar units, or when there are cases of repeated measures.

For example, there may be instances of the same patients being tested
repeatedly, before and after receiving a particular treatment. In such cases, each
patient is being used as a control sample against themselves.
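
A paired t-test reduces to a one-sample t-test on the per-patient differences. The before and after readings below are made up for illustration, and the function name is hypothetical.

```python
from math import sqrt
import statistics

def paired_t(before, after):
    """t statistic and degrees of freedom for H0: mean difference = 0."""
    diffs = [a - b for b, a in zip(before, after)]  # after minus before, per patient
    n = len(diffs)
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / sqrt(n))
    return t, n - 1

before = [140, 135, 150, 145, 160]  # hypothetical readings before treatment
after = [132, 130, 148, 139, 150]   # the same patients after treatment

t_stat, dof = paired_t(before, after)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The negative t value reflects that every patient's reading went down after treatment.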

One-Way ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are
any statistically significant differences between the means of three or more
independent (unrelated) groups.

Within-group variance: how close the values within each group are to their own group mean.
Between-group variance: how far apart the group means are from each other.

• If the within-group variation is low and the between-group variation is high,
the feature impacts the target variable. Hence it is an important feature.
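
The comparison of between-group and within-group variation is summarized by the F statistic, F = (between-group mean square) / (within-group mean square). A minimal sketch on three made-up groups; the function name is hypothetical.

```python
import statistics

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    k, n = len(groups), len(all_values)

    # Between groups: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each value is from its own group mean.
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

Here the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = 13; a large F means the group means differ by much more than the scatter inside the groups.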

Chi-Square Test
The chi-square test is used for categorical features in a dataset. We
calculate chi-square between each feature and the target and select
the desired number of features with the best chi-square scores. It
determines whether the association between two categorical variables in the
sample reflects their real association in the population. The chi-square
score is given by

χ² = Σ (Observed frequency - Expected frequency)² / Expected frequency

Observed frequency = No. of observations of class

Expected frequency = No. of expected observations of class if there was no relationship between the
feature and the target. Expected Frequency = (Row Total * Column Total) / Grand Total (N)
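
The chi-square score can be computed directly from a table of observed frequencies, using Expected Frequency = (row total * column total) / grand total for each cell. The 2 x 2 table below is made up for illustration.

```python
def chi_square(observed):
    """Chi-square statistic for a table of observed frequencies (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    return statistic

# Hypothetical table: rows = feature categories, columns = target classes.
table = [[30, 10],
         [20, 40]]

print(round(chi_square(table), 3))
```

For this table the expected counts are 20, 20, 30 and 30, so the score is 100/20 + 100/20 + 100/30 + 100/30, about 16.67. A table whose rows are proportional to each other gives a score of 0, meaning no association.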
