Final Correction Basic Statistics Combined Chapter
Statistics
Chapter -1 Stats
Vision
At our educational institute, our mission is to provide an empowering learning
environment that encourages students to excel and reach their full potential. Our
vision is to equip students with the knowledge, skills and values necessary for them to
become responsible and engaged citizens in the global community.
Mission
Our educational institute has a vision of providing students with an
environment for learning and growth. We offer a wide range of courses,
from basic to advanced, to help students gain knowledge and skills that can
be applied in their daily lives.
Objective
Our educational institute is dedicated to providing a world-class learning
experience. We strive to foster a culture of exploration, creativity, and
collaboration in our students. Our faculty and staff are committed to
providing an environment that encourages academic excellence and
personal growth.
Dear Students,
On behalf of Learning Lab, I am delighted to welcome you to our esteemed institution! As
the Director, I am committed to providing you with the best possible education in the field
of data science, Digital Marketing, Cyber Security & many more.
At our institute, we believe in staying at the forefront of the latest advancements and
industry trends. Our experienced faculty, who are experts in the field, will guide you
through a comprehensive curriculum that covers a wide range of topics, including data
analysis, machine learning, statistical modeling, data visualization, Digital marketing and
more.
Through hands-on projects, case studies, and real-world applications, you will gain practical
experience and develop critical thinking skills that will enable you to excel in any role.
In addition to technical skills, we also emphasize soft skills such as communication,
teamwork, and problem-solving. Our collaborative learning environment encourages
discussions, teamwork, and peer-to-peer learning, providing you with a well-rounded
education.
I am excited to witness your growth and success as you embark on this data science journey
with us. Get ready to unlock your potential and thrive in the data-driven world!
Best regards,
- Prashant Biradar
Founder & CEO
Agenda Basic Statistics
1 Data types - Continuous, Discrete, Nominal, Ordinal, Interval, Ratio
www.learninglabb.com Page 1
Statistics
Statistics is the science of data. It involves:
Collecting
Classifying
Summarizing
Analyzing and
Interpreting numerical information.
Statistics is used in several different disciplines (both scientific and non-scientific) to make decisions and draw
conclusions based on data.
Statistics is indeed the science of data that involves various processes such as collecting, classifying, summarizing,
analyzing, and interpreting numerical information. These processes are used to gain insights and make informed
decisions based on the data available. Here's a brief explanation of each step:
Collecting: This involves gathering data from various sources, such as surveys, experiments, observations, or existing
datasets. The data collected should be relevant and representative of the population or phenomenon of interest.
Classifying: Once the data is collected, it needs to be organized and classified into different categories or groups.
This step helps in better understanding the data and identifying patterns or relationships within the dataset.
Summarizing: In this step, the collected data is summarized using various statistical measures such as measures of
central tendency (e.g., mean, median, mode) and measures of dispersion (e.g., range, standard deviation).
Summarizing the data provides a concise representation of its main characteristics.
Analyzing: Statistical analysis involves applying various statistical techniques and methods to explore relationships,
patterns, and trends in the data. This step may involve techniques like hypothesis testing, regression analysis,
correlation analysis, or data mining, depending on the research question or objective.
Interpreting: The final step in the statistical process is interpreting the results obtained from the analysis. This
involves drawing conclusions, making inferences, and providing meaningful insights based on the statistical findings.
Effective interpretation requires an understanding of the context and subject matter expertise.
By following these steps, statisticians can extract valuable information from data and provide evidence-based
insights to inform decision-making in various fields such as business, healthcare, social sciences, and more.
Books and Software
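Sample Statistics
A sample is a set of n observations actually obtained, and a statistic is a numerical value that describes the sample.
Population parameters
A population is a hypothetical set of N observations from which the sample is drawn, and a parameter is a numerical value that describes the population.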
There are two types of statistics that are often referred to when making a statistical decision or working on a statistical
problem.
Descriptive Statistics: Descriptive statistics involve using numerical and graphical methods to examine and summarize
a given data set. The main objective is to describe and understand the characteristics and patterns present within the
data. Descriptive statistics provide measures that summarize the central tendency (such as mean, median, and mode)
and variability (such as range, standard deviation, and variance) of the data. Graphical representations like histograms,
bar charts, pie charts, and scatter plots can also be used to visually depict the data. Descriptive statistics help in
providing a concise summary of the data, enabling individuals to comprehend and interpret the information for
decision-making purposes.
Inferential Statistics: Inferential statistics involve drawing conclusions or making inferences about a larger population
based on a sample of data. It allows you to generalize the findings from the sample to the entire population. Inferential
statistics use probability theory and sampling techniques to estimate population parameters, test hypotheses, make
predictions, or determine relationships between variables. Techniques such as confidence intervals and hypothesis
tests (such as z-tests or t-tests) are commonly used in inferential statistics. The goal is to make reliable and valid
inferences about the population based on the observed sample data.
In summary, descriptive statistics summarize and describe the characteristics of a given data set, while inferential
statistics use sample data to draw conclusions and make inferences about a larger population. Both branches of
statistics are crucial in analyzing and interpreting data to gain insights and support decision-making processes.
Inferential Statistics
The main goal of inferential statistics is to draw a conclusion about a population based on a sample of data from that
population. One of the most commonly used inferential techniques is hypothesis testing.
Data come in many flavors
Type of data | Definition | Example
Ordinal | Can be rank-ordered but not measured | Business school rankings
Interval scale | Intervals are meaningful, not ratios | Temperature in Fahrenheit or Celsius
Experimental | Analyst has good control over data generation | Drug efficacy in clinical trials
Data Types Preliminaries
[Figure: measurement scales (Nominal, Ordinal, Interval, Ratio) and the range of statistical analysis each supports.]
Example (ratio of values): "I paid Rs 1 lakh for HP and Rs 1.25 lakh for Dell"
Data types - Continuous & Discrete
Group the following as either discrete or continuous data (Discrete? Continuous?):
- Volume of a cereal box
- Speed of a car
- Population of a town
- Length of a crocodile
- Shirts
- Number of matches in a box
- Number of goals in a season
- Temperature in oven
Measures of Central Tendency
Mean/Average
Sample Mean Formula
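The sample mean formula is written as
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}
Where x_i is the i-th observation and n is the number of observations in the sample.
Example: for the data 20, 21, 23, 22, 21, 20, 23,
\bar{x} = (20+21+23+22+21+20+23)/7
= 150/7
≈ 21.43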
Sample Median
Median
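The median is the middle value. To calculate the median, sort the given list in ascending or descending order and take the middle value (the average of the two middle values when the count is even).
Example: 4 5 2 6 9 3 10 11 3 8 sorts to 2 3 3 4 5 6 8 9 10 11, so the median is (5 + 6)/2 = 5.5.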
Mode
The mode is the value that is repeated more often than any other.
A unimodal data set is a set of data with only one mode.
The mode of data set A = {14, 15, 16, 17, 15, 18, 15, 19}, for example, is 15, because 15 occurs more often than any
other value. As a result, it's a unimodal data set.
Bimodal Mode
A bimodal data set is a set of data that has two modes. This indicates that two data values share the highest
frequency.
Set A = {2,2,2,3,4,4,5,5,5} has modes 2 and 5, because both 2 and 5 are repeated three times in the provided set.
Trimodal Mode
A trimodal data set is a set of data that has three modes. This indicates that three data values share the highest
frequency.
Set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} has modes 2, 5, and 8, since all three numbers are repeated thrice in the
provided set. As a result, it's a trimodal data set.
Multimodal Mode
A multimodal data set is a set of data that contains four or more modes.
Because all four values in the given set occur twice, the modes of data set A = {100, 80, 80, 95, 95, 100, 90, 90} are
80, 90, 95 and 100. As a result, it's a multimodal data set.
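The three measures of central tendency can be checked with Python's standard `statistics` module. This is a minimal sketch that reuses the example data from the text; the variable names are only illustrative.

```python
import statistics

# Data from the sample-mean example in the text.
sample = [20, 21, 23, 22, 21, 20, 23]
mean_value = statistics.mean(sample)        # 150 / 7

# Data from the median example; median() sorts the values internally.
values = [4, 5, 2, 6, 9, 3, 10, 11, 3, 8]
median_value = statistics.median(values)    # average of the two middle values

# multimode() returns every value tied for the highest frequency.
unimodal = [14, 15, 16, 17, 15, 18, 15, 19]
bimodal = [2, 2, 2, 3, 4, 4, 5, 5, 5]
modes_uni = statistics.multimode(unimodal)
modes_bi = statistics.multimode(bimodal)

print(round(mean_value, 2), median_value, modes_uni, modes_bi)
```

`multimode` is used rather than `mode` because it reports all tied values, which is what the unimodal/bimodal distinction needs.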
Sampling Funnel
Population → Sampling Frame → SRS → Sample
Sampling Frame
A sampling frame is a list or representation of all the individuals or items in a population from which a sample is drawn.
It serves as a reference or basis for selecting the sample. The sampling frame should ideally include all members of the
population, and each member should have a known and non-zero chance of being included in the sample.
A sample is a subset of individuals from a larger population. Sampling means selecting the group that you will
actually collect data from in your research. For example, if you are researching the opinions of students in your
university, you could survey a sample of 100 students.
Standard deviation
Standard deviation is the degree of dispersion. In descriptive statistics, it is the scatter of the data points
relative to its mean.
It represents the way in which the values are spread across the data sample.
It is the measure of the variation of the data points from the mean.
The standard deviation of a sample is therefore the square root of its variance.
Example: suppose 5 pirates own the following numbers of gold coins: 4, 2, 5, 8, 6.
Mean: \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{4 + 2 + 5 + 8 + 6}{5} = 5
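As a rough check of the pirate example, the sketch below computes the variance and standard deviation of the five coin counts with the standard `statistics` module. It shows both the population version (divide by N) and the sample version (divide by n - 1); which one applies depends on whether the five pirates are treated as the whole population or as a sample.

```python
import statistics

coins = [4, 2, 5, 8, 6]  # gold coins owned by the 5 pirates

mean_coins = statistics.mean(coins)

# Population formulas divide the summed squared deviations by N.
pop_variance = statistics.pvariance(coins)
pop_std = statistics.pstdev(coins)

# Sample formulas divide by n - 1 instead.
sample_variance = statistics.variance(coins)
sample_std = statistics.stdev(coins)

print(mean_coins, pop_variance, pop_std, sample_variance, round(sample_std, 3))
```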
Variance
Variance measures how far the data values are spread out from the mean:
If all the data values are identical, the variance is zero. All non-zero variances are positive.
A small variance indicates that the data points are close to the mean, as well as to one another.
If the data points are spread far from the mean and from each other, the variance is high.
Variance can be expressed as the average of the squared distance from every point to the mean.
Variance Formula
The population variance formula can be represented by:
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
where \mu is the population mean and N is the number of values in the population.
Measures of Variability
The mean, mode, and median do a nice job in telling where the center of the data set is, but often we are interested in
more...
For example, a pharmaceutical engineer develops a new drug that regulates sugar in the blood.
Suppose she finds out that the average sugar content after taking the medication is the optimal level.
This does not mean that the drug is effective. There is a possibility that half of the patients have dangerously low sugar
content while the other half has dangerously high content. Instead of the drug being an effective regulator, it is a deadly
poison.
What the engineer needs is a measure of how far the data is spread apart. This is what the variance and standard
deviation do.
Measures of Dispersion
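Dispersion | Population | Sample
Variance | \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 | s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
Standard Deviation | \sigma = \sqrt{\sigma^2} | s = \sqrt{s^2}
We define variance to be the average of the squared deviations from the mean, and standard deviation to be the square root of the variance.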
Computing step by step
Variance and Standard Deviation: Step by Step
Measures of Variability
Deviation is the distance from the mean.
Deviation score = observation - true mean.
Variance = mean or average of squared deviation scores.
Standard Deviation = Square root of variance.
In class exercise
Find the standard deviation and variance of:
44,50,38,96,42,47,40,39,46,50
Range
The "Range" for a data set is the difference between the largest value and smallest value contained in the data set. First,
reorder the data set from smallest to largest, then subtract the first element from the last element.
Data Set = 2, 5, 9, 3, 5, 4, 7
Reordered = 2, 3, 4, 5, 5, 7, 9
Range=(9-2)=7
Basic
Statistics
Chapter -2 Visualization and Stats
Visualization Techniques
GMAT Scores of an MBA Class
Pictorial summary of data: a bar chart
[Figure: bar chart; vertical axis from 0 to 25, horizontal axis from 100 to 500.]
Graphical Techniques - Box plot
Box Plot: This graph shows the distribution of data by dividing the data into four groups with the same number of data points in each group. The box contains the middle 50% of the data points and each of the two whiskers contains 25% of the data points. It displays two common measures of the variability or spread in a data set.
Inter-quartile Range (IQR): The middle half of a data set falls within the inter-quartile range.
Minimum: The minimum value in the given dataset
First Quartile (Q1): The first quartile is the median of the lower half of
the data set.
Median: The median is the middle value of the dataset, which divides
the given dataset into two equal parts. The median is considered as
the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half
of the data.
Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile
and first quartile is known as the interquartile range. (i.e.) IQR = Q3-Q1
Outlier: Data points at the far left or right of the ordered data are tested as possible outliers. Generally, outliers
fall more than a specified distance from the first and third quartiles,
(i.e.) outliers are greater than Q3 + (1.5 × IQR) or less than Q1 - (1.5 × IQR).
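The box plot quantities above can be sketched in a few lines of Python. The helper `five_number_summary` is a hypothetical name, and it follows the definition used here (Q1 and Q3 as medians of the lower and upper halves); libraries that interpolate quartiles may return slightly different values. The data is the list from the variance exercise earlier in the chapter.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # values below the median position
    upper = ordered[half + n % 2:]      # skip the middle value when n is odd
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

data = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
minimum, q1, median, q3, maximum = five_number_summary(data)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr            # below this, a point is flagged
upper_fence = q3 + 1.5 * iqr            # above this, a point is flagged
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(minimum, q1, median, q3, maximum, iqr, outliers)
```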
Graphical techniques - Histogram
A histogram represents the frequency distribution, i.e., how many observations take the value within a certain interval.
[Figure: histogram; vertical axis (frequency) from 0 to 25, horizontal axis from 600 to 750.]
Skewness & Kurtosis
Third and Fourth moments
Skewness
A measure of asymmetry in the distribution.
Mathematically it is given by E\left[\left(\frac{x-\mu}{\sigma}\right)^3\right].
Negative skewness implies the mass of the distribution is concentrated on the right.
Kurtosis
A measure of the "peakedness" of the distribution.
Mathematically it is given by E\left[\left(\frac{x-\mu}{\sigma}\right)^4\right] - 3.
For symmetric distributions, negative kurtosis implies a wider peak and thinner tails.
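The two moment formulas can be written directly in plain Python. This is a minimal sketch: the function names are hypothetical, the population standard deviation is used for σ, and both data sets are made up for illustration rather than taken from the course material.

```python
def standardized_moment(data, k):
    """Average of ((x - mean) / sd) ** k, using the population sd."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return sum(((x - mean) / sd) ** k for x in data) / n

def skewness(data):
    # Third standardized moment.
    return standardized_moment(data, 3)

def excess_kurtosis(data):
    # Fourth standardized moment minus 3, so a normal distribution scores 0.
    return standardized_moment(data, 4) - 3

symmetric = [1, 2, 3, 4, 5]        # illustrative data, symmetric about 3
right_tailed = [1, 1, 1, 2, 10]    # illustrative data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric), skewness(right_tailed))
```

A symmetric data set gives a skewness of about zero, while the long right tail in the second list gives a positive value.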
Basic
Statistics
Chapter -3 Normal distribution
Random Variable
A random variable is a rule that assigns a numerical value to each outcome in a sample space.
A discrete random variable can take only a countable number of distinct values such as 0, 1, 2, 3, 4, … and so on.
A numerically valued variable is said to be continuous if, in any unit of measurement, whenever it can take on
the values a and b, it can also take on any value between a and b.
Ex: time it takes to complete a race or the length of time between arrivals at a hospital clinic.
Z-Scores
A z-score makes use of the mean and the standard deviation of the data set in order to specify the relative location
of a measurement: z = \frac{x - \mu}{\sigma}.
It represents the distance between a given data point and the mean, expressed in standard deviations. Computing the
z-score is also known as "standardizing" the data point.
A large positive z-score tells us that the measurement is larger than almost all other measurements in the data set.
Similarly, a large negative z-score tells us that the measurement is smaller than almost all other measurements.
If a z-score is 0, then the observation lies on the mean.
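A short sketch of standardizing a data set, reusing the gold-coin counts from the standard deviation example. The helper name `z_scores` is hypothetical, and using the population standard deviation is an assumption made for this illustration.

```python
import statistics

def z_scores(data):
    """Standardize each value: z = (x - mean) / standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)   # population sd; an assumption for this sketch
    return [(x - mean) / sd for x in data]

coins = [4, 2, 5, 8, 6]            # mean 5, population sd 2
scores = z_scores(coins)

print(scores)
```

The value 5 sits on the mean and gets a z-score of 0, while 8 lies 1.5 standard deviations above it.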
Probability Distributions
Probability is simply how likely something is to happen. Whenever we're unsure about the outcome of an event,
we can talk about the probabilities of certain outcomes—how likely they are. The analysis of events governed by
probability is called statistics.
A probability distribution is a statistical function that describes the likelihood of obtaining all possible values that
a random variable can take. In other words, the values of the variable vary based on the underlying probability
distribution. Typically, analysts display probability distributions in graphs and tables. There are equations to
calculate probability distributions.
Suppose you draw a random sample and measure the heights of the subjects. As you measure heights, you
create a distribution of heights. This type of distribution is useful when you need to know which outcomes are
most likely, the spread of potential values, and the likelihood of different results.
A discrete distribution has a range of values that are countable. For example, the numbers on birthday
cards have a possible range from 0 to 122 (122 is the age of Jeanne Calment, the oldest person who ever
lived).
Discrete vs. Continuous Probability Distributions
A discrete probability distribution is made up of discrete variables.
Probability properties:
The probabilities in a distribution sum to 1.
Each value has a probability between 0 and 1.
Discrete probability Distribution:
Binomial Distribution
A binomial experiment is a statistical experiment that has the following properties:
The experiment consists of n repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other a
failure (for example, Yes or No).
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
Example:
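Consider the following statistical experiment.
You flip a coin 2 times and count the number of times the coin lands on heads.
This is a binomial experiment because:
The experiment consists of repeated trials: we flip the coin 2 times.
Each trial can result in just two possible outcomes: heads or tails.
The probability of success (heads) is the same on every trial.
The trials are independent: getting heads on one flip does not affect the other flip.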
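For the coin-flip experiment, the probability of each possible count of heads can be sketched with the standard binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The function name `binomial_pmf` is hypothetical, and a fair coin (p = 0.5) is assumed.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two flips of a fair coin: n = 2 trials, P(heads) = 0.5 on each trial.
distribution = {k: binomial_pmf(k, 2, 0.5) for k in range(3)}

print(distribution)   # probabilities for 0, 1 and 2 heads
```

The three probabilities sum to 1, as required of any discrete probability distribution.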
In class exercise: Probability Calculation for discrete P.D
The daily sales of large flat panel TVs at a store (X)
Continuous Probability Distributions
What is a Continuous Probability Distribution?
Probability distributions are either continuous probability distributions or discrete probability distributions. A
continuous distribution has a range of values that is infinite and uncountable: between any two values there are
infinitely many others. For example, time is continuous: between 0 seconds and 1 second lie 0.5 seconds, 0.25 seconds,
0.125 seconds, and so on, forever.
A continuous probability distribution is made up of continuous variables.
Normal Distribution
The normal distribution is the proper term for a probability bell curve. In the standard normal distribution the mean is zero
and the standard deviation is 1. Every normal distribution has zero skew and a kurtosis of 3.
The normal distribution, also known as the Gaussian distribution or bell curve, is a probability distribution that is
widely used in statistics and probability theory. It is characterized by its symmetric bell-shaped curve, with the
mean, median, and mode all being equal and located at the center of the distribution.
Data can be "distributed" (spread out) in different ways. But there are many cases where the data tends to be
around a central value with no bias left or right, and it
gets close to a "Normal Distribution" like this.
Probability Calculation for Continuous Distributions
The probability associated with any single value of the random variable is always zero
Probability of values being in a range = Area under the pdf curve in that range
Area under the entire curve always equals 1
Probability calculation for Normal Distribution
Consider a normally distributed random variable X ~ N(\mu, \sigma^2).
To compute the probability P(X <= x), standardize x: P(X <= x) = P(Z <= (x - \mu)/\sigma), where Z ~ N(0, 1).
Calculating Probabilities: Normal distribution
Ex: a normally distributed random variable X has a mean of 60 and a standard deviation of 10.
Find the probability that X is less than 70.
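One way to work this example is with `NormalDist` from Python's standard `statistics` module, sketched below. The same area can be obtained by converting 70 to a z-score first and using the standard normal curve; `scipy.stats.norm.cdf(70, loc=60, scale=10)` would be an equivalent call if SciPy is available.

```python
from statistics import NormalDist

X = NormalDist(mu=60, sigma=10)

p_less_than_70 = X.cdf(70)          # area under the pdf to the left of 70

z = (70 - X.mean) / X.stdev         # the same point on the standard normal scale
p_via_z = NormalDist().cdf(z)       # standard normal: mean 0, sd 1

print(round(p_less_than_70, 4), z, round(p_via_z, 4))
```

Since 70 is exactly one standard deviation above the mean, the answer is the familiar area to the left of z = 1, roughly 0.84.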
In class exercise
Suppose GMAT scores can be reasonably modeled using a normal distribution with µ = 711 and σ = 29.
What is P(X ≤ 680)?
Exercise
What is P(697≤ X≤740)?
Stock Price
To understand the normal distribution and its application, let us use daily returns of stocks traded on the BSE (Bombay
Stock Exchange). Imagine a scenario where an investor wants to understand the risks and returns associated with
various stocks before investing in them.
For this analysis, we will evaluate two stocks: BENI and GLAXO. The daily trading data (open and close price) for each
stock is taken for the period from 2010 to 2016 from the BSE site (www.bseindia.com).
Data
What questions can be answered?
1. What is the expected daily rate of return of these stocks?
2. Which stock has higher risk or volatility as far as daily returns are concerned?
3. Which stock has higher probability of making a daily return of 2% or more?
4. Which stock has higher probability of making a loss (risk) of 2% or more?
To answer the above questions, we must find out the behavior of daily returns (we will refer to this as gain
henceforth) on these stocks. The gain can be calculated as a percentage change in close price, from the previous day's
close price.
The method pct_change() in Pandas will give the percentage change in a column value shifted by a period, which is
passed as a parameter to periods.
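To make the gain calculation concrete, here is a plain-Python sketch of what a percentage-change computation does. It is a standalone illustration, not the Pandas implementation, and the close prices are hypothetical rather than taken from the BSE data. With Pandas, the equivalent on a Series of close prices would be its `pct_change(periods=1)` method, which leaves the first entry as NaN instead of `None`.

```python
def pct_change(values, periods=1):
    """Fractional change of each value relative to the value `periods` steps earlier."""
    changes = [None] * min(periods, len(values))   # no earlier value to compare with
    for i in range(periods, len(values)):
        previous = values[i - periods]
        changes.append((values[i] - previous) / previous)
    return changes

# Hypothetical close prices, not taken from the BSE data set.
close = [100.0, 102.0, 99.96, 99.96]
gain = pct_change(close)

print(gain)   # first entry is None: day 1 has no previous close
```

A gain of 0.02 corresponds to a 2% rise over the previous day's close, and -0.02 to a 2% fall.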
Mean and Variance
Glaxo : Mean: 0.0004
Standard Deviation: 0.0134
The expected daily rate of return (gain) is around 0%
for both stocks.
To calculate the probability of a gain of 2% or more, we need to find the sum of all probabilities
that the gain can take values of 0.02 (i.e., 2%) or more.
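If the daily gain is assumed to follow a normal distribution, this tail probability is the area under the curve to the right of 0.02. The sketch below uses the Glaxo mean and standard deviation quoted above with the standard library's `NormalDist`; treating the gain as normal is an assumption of the sketch, not something the data guarantees.

```python
from statistics import NormalDist

# Glaxo figures quoted in the text; normality of the gain is an assumption.
glaxo_gain = NormalDist(mu=0.0004, sigma=0.0134)

p_gain_2pct_or_more = 1 - glaxo_gain.cdf(0.02)    # right-tail area above +2%
p_loss_2pct_or_more = glaxo_gain.cdf(-0.02)       # left-tail area below -2%

print(round(p_gain_2pct_or_more, 4), round(p_loss_2pct_or_more, 4))
```

Because the mean is slightly positive, the chance of a 2% gain comes out a little higher than the chance of a 2% loss.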
Normal Distribution codes
Student t- Distribution codes
Basic
Statistics
Chapter -4 CLT & CI
Inferential Statistics
GMAT Scores of an MBA Class
Sampling-How can it go wrong
Collecting data only from volunteers (voluntary response sample)
- e.g. online reviews (yelp.com, maps.google.com, tripadvisor.com)
Picking easily available respondents (convenience sample)
- e.g. choosing to survey in In-Orbit mall
A high rate of non-response (more than 70%)
- e.g. CEO / CIO surveys on some industry trends
Sampling variation
Sample mean varies from one sample to another
Sample mean can be (and most likely is) different from the population mean
Sample mean is a random variable
Central Limit Theorem
The distribution of the sample mean
- will be normal when the distribution of data in the population is normal
- will be approximately normal even if the distribution of data in the population is not normal, if the sample size is
"fairly large"
Mean(\bar{X}) = \mu (the same as the population mean of the raw data)
Standard deviation(\bar{X}) = \frac{\sigma}{\sqrt{n}}, where \sigma is the population standard deviation and n is the sample size
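A small simulation can illustrate both statements about the sample mean. This sketch draws repeated samples from an exponential population (chosen here only because it is clearly not normal, with mean 1 and standard deviation 1) and compares the spread of the sample means with σ/√n. The sample size, number of repetitions and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)                      # fixed seed so the run is repeatable

n = 50                              # sample size
num_samples = 5000                  # number of repeated samples

# Exponential population with rate 1: mean 1, standard deviation 1, not normal.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.stdev(sample_means)
predicted_sd = 1.0 / n ** 0.5       # sigma / sqrt(n)

print(round(observed_mean, 3), round(observed_sd, 3), round(predicted_sd, 3))
```

The average of the sample means should land close to the population mean of 1, and their standard deviation close to the predicted value of about 0.141.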
The distribution of the data itself (e.g. distribution of the individual work experience), which may or may not be normal
Confidence Intervals
Learning objectives
How to provide an interval estimate (confidence interval) for population parameters such as mean?
How to adjust the interval estimate if the population standard deviation is not known?
How to calculate confidence interval for population proportion?
What should be the sample size to collect for a desired width of the interval estimate?
What is the Probability of tomorrow's temperature being 42 degrees exactly ?
Probability is '0'
Sometimes samples don't give quite the right result.
Because we're not dealing with the entire population, all we're doing is giving a best estimate. If the sample we use is
unbiased, then the estimate is likely to be close to the true value of the population.
Credit card
Example: Credit Card Launch
Interval estimates of parameters
Based on sample data
- The point estimate for mean balance = $1990
- Can we trust this estimate?
- What do you think will happen if we took another random sample of 140 alumni?
Because of this uncertainty, we prefer to provide the estimate as an interval (range) and associate a level of
confidence with it.
Confidence Interval for the Population Mean
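Start by choosing a confidence level 100(1-α)% (e.g. 95%, 99%, 90%)
Then, the population mean will be within \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, where z_{\alpha/2} \frac{\sigma}{\sqrt{n}} is the margin of error
Margin of error depends on the underlying uncertainty, confidence level and the sample size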
Credit card: Mean Balance
Based on the survey and past data:
n = 140; σ = $2500; x̄ = $1990
Construct a 95% confidence interval for the mean card balance and interpret it
Construct a 90% confidence interval for the mean card balance and interpret it
from scipy import stats
stats.norm.ppf(.975)
Confidence Interval | Z
80% | 1.282
85% | 1.440
90% | 1.645
95% | 1.960
99% | 2.576
99.5% | 2.807
99.9% | 3.291
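The two intervals asked for here can be sketched with the standard library alone; `NormalDist().inv_cdf` plays the same role as `stats.norm.ppf` in looking up the Z value. The function name `z_confidence_interval` is hypothetical, and the population standard deviation is assumed known, as in this example.

```python
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence):
    """Interval sample_mean ± z * sigma / sqrt(n) for a known population sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)    # e.g. about 1.960 for 95%
    margin = z * sigma / n ** 0.5              # margin of error
    return sample_mean - margin, sample_mean + margin

# Credit card example: n = 140, sigma = $2500, sample mean = $1990.
ci_95 = z_confidence_interval(1990, 2500, 140, 0.95)
ci_90 = z_confidence_interval(1990, 2500, 140, 0.90)

print([round(v) for v in ci_95], [round(v) for v in ci_90])
```

The 90% interval is narrower than the 95% interval: asking for less confidence shrinks the margin of error.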
(Mis) Interpreting the Confidence Interval
Consider the 95% Confidence Interval for the mean balance: [$1576, $2404]
Is it accurate to say that the mean balance of the population falls within this range?
Can we conclude that the mean balance is within this range 95% of the time?
Does it imply that 95% of the alumni have a balance within this range?
What if We Don't Know σ?
Suppose that the alumni of this university are very different and hence population standard deviation from
previous launches cannot be used.
Population SD (σ) Unknown
We replace σ with our best guess (point estimate) s, which is the standard deviation of the sample:
Calculate
Research has shown that the t-distribution is fairly robust to deviations of the population from the normal model
Student's t-distribution
As n → ∞, t_n → N(0,1),
i.e. as the degrees of freedom increase, the t-distribution approaches the standard normal distribution
Confidence Interval for Mean with Unknown σ
Back to Credit Card Balance
1-α = 0.95; n = 140
from scipy import stats
stats.t.ppf(0.975,df=139)
Calculate t_{0.95,139} = 1.98
Our estimate of
Basic
Statistics
Chapter -5 Hypothesis
What is hypothesis testing?
Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data.
A hypothesis is basically an assumption that we make about the population parameter.
Ex: You say the average height of citizens of a city is more than 5.8 ft.
Votes for the politician will be more than 60%.
All such assumptions need a statistical way to prove them: we need a mathematical conclusion that whatever we
are assuming is true.
Example
It is claimed that a process has been improved in yield by bringing a change in an important factor X. Yield data are
collected from old and new processes.
✓ Random samples are drawn from yield data from old process A and improved process B.
Process A | Process B
89.7 | …
81.4 | 86.1
84.5 | 83.2
84.8 | 91.9
87.3 | 86.3
79.7 | 79.3
85.1 | 82.6
81.7 | 89.1
83.7 | 83.7
84.5 | 88.5
Example: Hypothesis Testing
Real Question:
Can we say that the yield of improved Process B is greater than old Process A?
Descriptive Statistics:
Process | N | Mean | StDev
B | 10 | 85.54 | 3.65
Statistical Question:
Is there a statistically significant difference between mean of Process B (85.54) and mean of Process A (84.24)? Or, is
this difference in mean just due to chance?
Hypothesis Testing
Develop the hypothesis for population and make statistical decision by determining the acceptance of hypothesis using sample data.
Example: Medicine B for treating headache that is newly developed by a pharmaceutical company has 30 minutes longer effect than existing Medicine A.
Procedure of Hypothesis Testing
Define the null hypothesis and the alternative hypothesis.
Determine the appropriate test statistic for evaluating the validity of the null hypothesis, such as a Z-test or t-test.
Choose the significance level (Alpha), typically set to 0.05.
Calculate the p-value, which represents the probability of obtaining the observed test statistic value (or more extreme)
under the assumption that the null hypothesis is true. Utilize the functions provided in the scipy.stats module for
p-value calculation.
Make a decision to either reject or fail to reject the null hypothesis based on comparing the p-value to the significance
level (Alpha).
P-value: The p-value is a probability value that measures the evidence against the null hypothesis.
A small p-value suggests strong evidence against the null hypothesis, while a large p-value suggests weak evidence.
Researchers compare the p-value to the significance level (α) to make a decision about rejecting or retaining the null
hypothesis. If the p-value is smaller than α, the null hypothesis is rejected. If the p-value is greater than or equal to α,
the null hypothesis is retained.
How to frame hypothesis
Hypothesis is a starting position that is open to a test and rejection in light of strong adverse evidence
Two Tail Hypothesis Testing
Hypothesis Testing Process
Start with Hypotheses about a Population Parameter
The parameter could be a mean, a proportion, or something else
Example: Hypermarket Loyalty Program
The hypermarket intends to introduce a loyalty program provided that it leads to an average spending per
shopper exceeding 110 per week.
In a pilot program, a sample of 80 shoppers who enrolled in the loyalty program spent an average of 120 per
week, with a standard deviation of 40.
Based on the average spending of 130 per week from a sample of 90 shoppers enrolled in the loyalty program,
along with a standard deviation of 40, should the hypermarket proceed with launching the loyalty program or
not?
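The procedure can be sketched as a right-tailed test on the pilot figures quoted above (threshold 110, sample of 80 shoppers, mean 120, standard deviation 40). The function name `right_tailed_z_test` is hypothetical, and the sketch uses a large-sample z statistic with the sample standard deviation standing in for σ; a t-test would be the stricter choice when σ is unknown.

```python
from statistics import NormalDist

def right_tailed_z_test(sample_mean, mu0, sd, n, alpha=0.05):
    """Test H0: mu <= mu0 against H1: mu > mu0 with a large-sample z statistic."""
    z = (sample_mean - mu0) / (sd / n ** 0.5)
    p_value = 1 - NormalDist().cdf(z)          # area to the right of z
    return z, p_value, p_value < alpha

# Pilot figures from the text: threshold 110, n = 80, mean 120, sd 40.
z, p_value, reject_h0 = right_tailed_z_test(120, 110, 40, 80)

print(round(z, 3), round(p_value, 4), reject_h0)
```

With these inputs the p-value falls below 0.05, so the null hypothesis that average spending is at most 110 would be rejected at that significance level.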
Right-tailed hypothesis test
(sample mean of 130)
Null Hypothesis: The average spending is less than or equal to $120
H0: μ ≤ μ0 (120)
Acceptable level of Type I error is 0.05 (α = 0.05)
If I reject my null hypothesis that μ ≤ 120, I will be wrong with probability 0.01
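The right-tailed test can be sketched from the summary figures alone. This assumes the null value of 120 from the hypothesis statement and the sample of 90 shoppers (mean 130, standard deviation 40) from the question; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

mu0 = 120        # null value from the hypothesis statement
n = 90           # sample size quoted in the question (assumed to be the one used)
x_bar = 130      # sample mean
s = 40           # sample standard deviation
alpha = 0.05     # acceptable Type I error level

# t statistic for a one-sample test computed from summary statistics
t_stat = (x_bar - mu0) / (s / np.sqrt(n))

# Right-tailed p-value: P(T >= t_stat) with n - 1 degrees of freedom
p_value = stats.t.sf(t_stat, df=n - 1)

reject_h0 = p_value < alpha
print(round(t_stat, 2), round(p_value, 3), reject_h0)
```

Under these assumptions the p-value comes out close to 0.01, which is below α = 0.05, so the null hypothesis would be rejected.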
Normal Distribution Codes
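The code for this slide is not reproduced in the text. As a hedged sketch, these are the `scipy.stats.norm` calls most commonly used for normal-distribution areas and cut-offs; the numeric inputs are illustrative assumptions.

```python
from scipy import stats

# P(Z <= 1.96) for a standard normal variable (area to the left)
left_area = stats.norm.cdf(1.96)

# P(Z > 1.96), the right-tail area (survival function)
right_area = stats.norm.sf(1.96)

# Inverse CDF: the z value with 97.5% of the area to its left
z_crit = stats.norm.ppf(0.975)

# Non-standard normal: P(X <= 5) when X has mean 4 and standard deviation 3
area_x = stats.norm.cdf(5, loc=4, scale=3)

print(left_area, right_area, z_crit, area_x)
```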
Student's t-Distribution Codes
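The code for this slide is likewise missing from the text. A minimal sketch of the corresponding `scipy.stats.t` calls follows; the degrees of freedom and input values are illustrative assumptions.

```python
from scipy import stats

df = 49  # degrees of freedom, e.g. a sample of 50 observations

# P(T <= -1.41): area to the left of a t value
left_area = stats.t.cdf(-1.41, df=df)

# P(T > 1.41): right-tail area (equal to left_area by symmetry)
right_area = stats.t.sf(1.41, df=df)

# Two-tailed critical value for alpha = 0.05
t_crit = stats.t.ppf(0.975, df=df)

print(left_area, right_area, t_crit)
```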
Simple Exercise
A prisoner is on trial for a crime, and you're on the jury. The jury's task is to assume the prisoner is innocent, but if there's
enough evidence against him, they need to convict him.
Simple Exercise
In the trial, what's the null hypothesis?
The null hypothesis is that the prisoner is innocent, as that is what we have to assume until there's proof otherwise.
Simple Exercise
In what ways can the jury make a verdict that's correct?
We can make a correct verdict if:
a) The prisoner is innocent, and we find him innocent.
b) The prisoner is guilty, and we find him guilty.
The errors we can make when conducting a hypothesis test are the same sort of errors we could make when putting a prisoner on trial.
Hypothesis tests are basically tests where you take a claim and put it on trial by assessing the evidence against it. If there's sufficient evidence against it, you reject it, but if there's insufficient evidence against it, you accept it (more precisely, you fail to reject it).
You may correctly accept or reject the null hypothesis, but even considering the evidence, it's also possible to make an error. You may reject a valid null hypothesis, or you might accept it when it's actually false.
Statisticians have special names for these types of errors.
A Type I error is when you wrongly reject a true null hypothesis (punishing an innocent person), and
a Type II error is when you wrongly accept a false null hypothesis (letting a guilty person go free).
Errors
You can make two types of errors!
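Actual situation versus decision:
H0 is true and H0 is rejected: Type I error
H0 is true and H0 is not rejected: correct decision
H0 is false and H0 is rejected: correct decision
H0 is false and H0 is not rejected: Type II error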
Example: Process Control at a Call Center
Performance of a call center is monitored by the average call duration
Data from 18 months shows that on the days when the process runs normally:
μ = 4 min, σ = 3 min
Cannot monitor each and every call due to limited resources; so randomly sample 50 calls per day
Day-to-day variation in sample mean
We already know that the sample mean will be different every day (inherent variability).
But when should you be alarmed and conclude that the system is not behaving normally (external variability)?
Let us solve the problem
Mean =4.0
Standard deviation =3
Sample Size =50
Sample mean =4.6
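from scipy import stats
import numpy as np
# t statistic: (4.6 - 4.0) / (3 / np.sqrt(50)) is approximately 1.41
# two-tailed p-value from the t distribution with n - 1 = 49 degrees of freedom
2 * stats.t.cdf(-1.41, df=49)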
Two-tailed hypothesis test
(sample mean of 4.6)
Null Hypothesis: The mean call duration is 4 minutes
H0: µ = µ0 (4)
(α = 0.05)
Two-tailed hypothesis test
(sample mean of 5.3)
Null Hypothesis: The mean call duration is 4 minutes
H0: µ = µ0 (4)
(α = 0.05)
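The two cases (sample means of 4.6 and 5.3) can be compared with a small sketch. It assumes, as the earlier calculation does, a t distribution with 49 degrees of freedom and the standard deviation of 3 minutes; the helper name `two_tailed_p` is hypothetical.

```python
import numpy as np
from scipy import stats

mu0, s, n, alpha = 4.0, 3.0, 50, 0.05

def two_tailed_p(sample_mean):
    """Two-tailed p-value for H0: mu = mu0, computed from summary statistics."""
    t_stat = (sample_mean - mu0) / (s / np.sqrt(n))
    return 2 * stats.t.sf(abs(t_stat), df=n - 1)

p_46 = two_tailed_p(4.6)   # day with sample mean 4.6 minutes
p_53 = two_tailed_p(5.3)   # day with sample mean 5.3 minutes

print(p_46 < alpha, p_53 < alpha)
```

Under these assumptions the p-value for 4.6 is above α = 0.05 (no alarm), while the p-value for 5.3 is below it (the process appears not to be behaving normally).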
Let us do it in Python
scipy.stats.ttest_1samp(array, mu)
Ex. An outbreak of Salmonella-related illness was attributed to ice cream produced at a certain factory. Scientists
measured the level of Salmonella in 9 randomly sampled batches of ice cream. The levels (in MPN/g) were:
0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418
Is there evidence that the mean level of Salmonella in the ice cream is greater than 0.3 MPN/g?
Let μ be the mean level of Salmonella in all batches of ice cream. Here the hypothesis of interest can be expressed as:
H0: μ = 0.3
Ha: μ > 0.3
import pandas as pd
import scipy.stats
data = pd.Series([0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418])
scipy.stats.ttest_1samp(data, 0.3)  # returns the t statistic and a two-sided p-value
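Because the question asks whether the mean is greater than 0.3, a one-sided p-value is the relevant one. The sketch below derives it from the default two-sided result of `ttest_1samp`; the variable names are illustrative. Recent SciPy versions also accept an `alternative` argument for this, but halving the two-sided value works across versions.

```python
import numpy as np
from scipy import stats

levels = np.array([0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418])

# ttest_1samp returns a two-sided p-value by default
t_stat, p_two_sided = stats.ttest_1samp(levels, 0.3)

# For Ha: mu > 0.3, halve the two-sided p-value when the statistic is positive
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(round(t_stat, 3), round(p_one_sided, 3))
```

With these data the one-sided p-value is below 0.05, so at that level there is evidence that the mean Salmonella level exceeds 0.3 MPN/g.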
Two-sample t-tests
Ex. 6 subjects were given a drug (treatment group) and an additional 6 subjects a placebo (control group). Their reaction
time to a stimulus was measured (in ms). We want to perform a two-sample t-test for comparing the means of the
treatment and control groups.
Let Mu1 be the mean of the population taking medicine and Mu2 the mean of the untreated population. Here the
hypothesis of interest can be expressed as:
H0: Mu1 - Mu2 = 0
Ha: Mu1 - Mu2 != 0
Ttest_indResult(statistic=-3.445612673536487, pvalue=0.006272124350809803)
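The measurements behind this result are not reproduced in the text. The sketch below uses illustrative reaction times (an assumption, chosen because they give the same statistic as the output shown) and passes the control group first, which matches the negative sign of the statistic.

```python
from scipy import stats

# Illustrative reaction times in ms (assumed values; the original data are not shown in the text)
control = [91, 87, 99, 77, 88, 91]
treatment = [101, 110, 103, 93, 99, 104]

# Student's two-sample t-test with equal variances assumed (SciPy's default)
result = stats.ttest_ind(control, treatment)

print(result.statistic, result.pvalue)
```

Since the p-value is below 0.05, the null hypothesis of equal means would be rejected at the 5% level.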
Proportion t test
Use case: Is there a significant difference between the population proportions of state 1 and state 2 who report that they were placed immediately after education?
Populations: All students who have completed graduation and post-graduation in both states
Data: 247 students from state 1. 36.8% of students report that they got a job.
308 students from state 2. 38.9% of students report that they got a job.
Hypothesis Definition:
Null Hypothesis: p1 - p2 = 0
Alternative Hypothesis: p1 -p2 ≠ 0
Testing the difference in population proportions calls for a t-test here. Also, each population follows a binomial distribution. We can
just pass the two populations, generated with the appropriate binomial distribution parameters, to the t-test function.
Data Given:
n1 = 247
p1 = .37
n2 = 308
p2 = .39
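A sketch of the approach described above: 0/1 outcomes are simulated from binomial distributions with the given parameters and passed to a t-test. The fixed seed and variable names are assumptions for repeatability. A pooled two-proportion z-test, a common alternative that uses only the summary figures, is included for comparison.

```python
import numpy as np
from scipy import stats

n1, p1 = 247, 0.37
n2, p2 = 308, 0.39

# Approach from the text: simulate 0/1 outcomes and compare them with a t-test
rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
sample1 = rng.binomial(1, p1, size=n1)
sample2 = rng.binomial(1, p2, size=n2)
t_stat, p_t = stats.ttest_ind(sample1, sample2)

# Alternative using only the summary figures: pooled two-proportion z-test
p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z_stat = (p1 - p2) / se
p_z = 2 * stats.norm.sf(abs(z_stat))

print(round(z_stat, 3), round(p_z, 3))
```

The simulated t-test result varies with the random draw, whereas the z-test depends only on the reported proportions. The z-test p-value is well above 0.05, so the null hypothesis of equal proportions is not rejected.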
Basic
Statistics
Chapter -6 ANOVA and Chi square
ANOVA
Analysis of Variance (ANOVA) is a statistical technique used to check whether the means of two or more
groups are statistically different from each other.
Terminologies used
Grand Mean: it is the average of sample means
Variation between groups: For each data point look at the difference between
its group mean and the overall mean
Variation Within Groups: Difference between data point and its sample mean
Example:
A research institute wants to determine the best diet plan for weight loss, and for this they have randomly selected 36
volunteers to analyze three diet plans (Atkins, GM, South Beach). Help the research institute identify the best
diet plan by analyzing the weight lost by the volunteers.
Calculations
Calculations
Deviation of group mean from global mean: (µ1 - µ)²
SS between groups: for each group, the squared deviation of the local (group) mean from the global mean multiplied by the number of data points in the group, summed over all groups
SS within groups (Residual): total squared deviation of the data points from their local (group) mean
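These sums of squares can be sketched in code. The weight-loss figures below are hypothetical (they are not the 36-volunteer data from the example), and the result is cross-checked against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical weight loss (kg) for three diet plans; not the data from the example
groups = {
    "Atkins": np.array([3.1, 2.8, 3.6, 4.0, 3.3]),
    "GM": np.array([2.0, 2.4, 1.9, 2.7, 2.2]),
    "South Beach": np.array([3.0, 2.6, 3.4, 2.9, 3.2]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# SS between: group size times squared deviation of group mean from grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# SS within: squared deviations of each data point from its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_ratio = (ss_between / df_between) / (ss_within / df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_value = stats.f_oneway(*groups.values())
print(f_ratio, f_scipy, p_value)
```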
Conclusion
If the calculated F ratio is greater than the critical F value for the given degrees of freedom, we reject H0.
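The critical F value can be looked up with `scipy.stats.f.ppf` instead of a printed table. This sketch assumes α = 0.05 and the design of the diet example (3 groups, 36 volunteers); the helper `reject_h0` is hypothetical.

```python
from scipy import stats

alpha = 0.05
k, n_total = 3, 36               # three diet plans, 36 volunteers
df_between = k - 1               # numerator degrees of freedom
df_within = n_total - k          # denominator degrees of freedom

# Critical value: the F value with area alpha to its right
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)

def reject_h0(f_ratio):
    """Return True when the calculated F ratio exceeds the critical value."""
    return f_ratio > f_crit

print(round(f_crit, 3))
```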
Chi-Square test
Chi-square tests of independence test whether two qualitative variables are independent, that is, whether
there exists a relationship between two categorical variables
Hypotheses: The Chi-square test of independence is a hypothesis test, so it has a null (H0) and an alternative
hypothesis (H1):
H0: the variables are independent, there is no relationship between the two categorical variables. Knowing
the value of one variable does not help to predict the value of the other variable.
H1: the variables are dependent, there is a relationship between the two categorical variables. Knowing the
value of one variable helps to predict the value of the other variable.
Chi-Square test
The chi-square test is used when the data are categorical; it detects the difference between the observed data
and what we would expect if H0 is true.
How the test works
If the difference between the observed frequencies and the expected frequencies is small, we cannot reject the null
hypothesis of independence, and thus we cannot rule out that the two variables are unrelated.
If the difference between the observed frequencies and the expected frequencies is large, we can reject the null
hypothesis of independence, and thus we can conclude that the two variables are related.
How the test works
We want to determine whether there is a statistically significant association between smoking and being a
professional athlete.
Smoking can only be "yes" or "no" and being a professional athlete can only be "yes" or "no". The two variables of
interest are qualitative variables, so we need to use a Chi-square test of independence, and the data have been
collected on 28 persons.
O: Observed frequency in the data for each type
Test statistic
In the subgroup of athlete and non-smoker:
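The contingency table for this example is not reproduced in the text. The sketch below uses assumed counts that are consistent with the total of 28 persons and with the test statistic of 15.56 quoted later; treat them as illustrative. Each expected count is the row total times the column total divided by the grand total, and the statistic is the sum of (O - E)² / E over all cells.

```python
import numpy as np
from scipy import stats

# Assumed counts (rows: athlete, non-athlete; columns: non-smoker, smoker)
observed = np.array([[14, 4],
                     [0, 10]])

# correction=False gives the plain Pearson statistic, the sum of (O - E)^2 / E
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

# Expected count for athlete and non-smoker: row total * column total / grand total
e_athlete_nonsmoker = observed[0].sum() * observed[:, 0].sum() / observed.sum()

print(round(chi2_stat, 2), dof, e_athlete_nonsmoker)
```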
Critical value
df = (number of rows - 1) × (number of columns - 1)
Chi-square table (α = 0.05 and df = 1): The critical value is 3.84146
test statistic = 15.56 > critical value = 3.84146
As with any statistical test, when the test statistic is larger than the critical value, we can reject the null hypothesis at
the specified significance level.
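The table lookup and the comparison can be reproduced with `scipy.stats.chi2`, as in this minimal sketch. It assumes a 2 × 2 table and takes the test statistic of 15.56 from the text; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05
n_rows, n_cols = 2, 2
df = (n_rows - 1) * (n_cols - 1)      # degrees of freedom

# Critical value: the chi-square value with area alpha to its right
crit = stats.chi2.ppf(1 - alpha, df)

chi2_stat = 15.56                     # test statistic quoted in the text
reject_h0 = chi2_stat > crit

# Equivalent decision through the p-value
p_value = stats.chi2.sf(chi2_stat, df)

print(round(crit, 5), reject_h0)
```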