UNIT-4
Inferential Statistics: Normal distribution, Poisson distribution,
Bernoulli distribution, z-score, p-value, one-tailed and two-tailed
tests, Type I and Type II errors, confidence interval, correlation,
Z-test vs T-test, F-distribution, chi-square distribution, the
chi-square test of independence, ANOVA
Inferential Statistics
• The different kinds of distributions
• Different statistical tests that can be utilized to test a hypothesis
• How to make inferences about the population of a sample from the
data given
• Different kinds of errors that can occur during hypothesis testing
• Defining the confidence interval at which the population mean lies
• The significance of p-value and how it can be utilized to interpret
results
A normal distribution
A normal distribution is the most common and widely used distribution
in statistics.
It is also called a "bell curve" or "Gaussian curve," after the
mathematician Karl Friedrich Gauss.
A normal distribution occurs commonly in nature. Let's take the height
example we saw previously. If you have data for the height of all the
people of a particular gender in a city, and you plot a bar chart where
each bar represents the number of people at a particular height,
then the curve obtained will look very similar to the following
graph. The numbers in the plot are the standard deviations
from the mean, which is zero.
The Normal Distribution is a continuous probability
distribution that is symmetric around its mean,
with data near the mean more frequent in
occurrence than data far from the mean. It is also
known as the Gaussian Distribution.
Properties
Symmetry:
•The normal distribution is perfectly symmetrical around its mean (μ).
•The left and right sides of the curve are mirror images.
Mean, Median, and Mode:
•For a perfectly normal distribution, the mean, median, and mode are all equal and located at the center of the
distribution.
Bell-Shaped Curve:
•The distribution forms a smooth, bell-shaped curve that is highest at the mean.
Asymptotic Nature:
•The tails of the distribution approach the horizontal axis but never touch it, extending infinitely in both directions.
Empirical Rule (68-95-99.7 Rule):
•Approximately:
•68% of data falls within 1 standard deviation of the mean.
•95% of data falls within 2 standard deviations of the mean.
•99.7% of data falls within 3 standard deviations of the mean.
Probability Density Function (PDF):
• The equation for the normal distribution:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
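A minimal sketch, assuming SciPy is available; it evaluates the normal PDF at a point and checks the 68-95-99.7 rule numerically.

from scipy import stats

mu, sigma = 0, 1                                 # standard normal
print(stats.norm.pdf(1.5, loc=mu, scale=sigma))  # density at x = 1.5

# Probability mass within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} sd: {p:.4f}")   # ~0.6827, 0.9545, 0.9973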
A normal distribution from a
binomial distribution
• Let's take a coin and flip it. The probability of getting a head or a tail is 50%. If
you take the same coin and flip it six times, the probability of getting a head
three times can be computed using the following formula:
P(X = k) = C(n, k) · p^k · q^(n−k), so P(X = 3) = C(6, 3) · (0.5)³ · (0.5)³
• In the preceding formula, n is the number of times the coin is flipped, p is the
probability of success, and q is (1 − p), the probability of failure.
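A minimal sketch, assuming SciPy is available, that evaluates the coin-flip example with the binomial PMF.

from scipy import stats

n, p = 6, 0.5                     # six flips, fair coin
print(stats.binom.pmf(3, n, p))   # C(6,3) * 0.5**3 * 0.5**3 = 0.3125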
Properties of Binomial Distribution
• A fixed number of trials, n, each with exactly two possible outcomes (success/failure).
• The trials are independent, and the probability of success, p, is the same for every trial.
• Mean = np and variance = npq, where q = 1 − p.
Use Cases
• A factory produces 100 items per day. The probability of producing a
defective item is 0.05. What is the probability of producing exactly 3
defective items in a day?
• A quiz consists of 10 multiple-choice questions, each with 4 options.
What is the probability of guessing exactly 6 questions correctly?
• In a class of 20 students, the probability of any student passing a test
is 0.75. What is the probability that exactly 15 students pass the test?
A Poisson distribution
A Poisson distribution is the probability distribution of independent
occurrences of an event within an interval. A binomial distribution is used
to determine the probability of binary outcomes, whereas a Poisson
distribution is used for count-based distributions.
If lambda is the mean number of occurrences per interval, then the
probability of having k occurrences within a given interval is given by
the following formula:
P(X = k) = (λ^k · e^(−λ)) / k!
Problem
• A customer service center receives an average of 5 calls per hour. We
want to find the probability of receiving exactly 3 calls in the next
hour.
• Result
• The probability of receiving exactly 3 calls in the next hour is
approximately 0.1404 or 14.04%.
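A minimal sketch, assuming SciPy is available, that reproduces this result with the Poisson PMF.

from scipy import stats

lam = 5                           # mean calls per hour
print(stats.poisson.pmf(3, lam))  # ~0.1404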
Use Cases
• A call center receives an average of 10 calls per hour. What is the
probability of receiving exactly 15 calls in a particular hour?
• A hospital emergency room gets an average of 3 critical cases per
night. What is the probability of receiving 5 or more critical cases in a
single night?
• A website receives an average of 4 complaints per day. What is the
probability that no complaints are received on a given day?
• Here, e is Euler's number, k is the number of occurrences for
which the probability is to be determined, and lambda is the
mean number of occurrences.
• Let's understand this with an example. An average of 20 cars pass
through a bridge in an hour. What would be the probability of exactly
23 cars passing through the bridge in an hour?
A Bernoulli distribution
• You can perform an experiment with two possible outcomes: success
or failure.
• Success has a probability of p, and failure has a probability of 1 − p. A
random variable that takes the value 1 in case of success and 0 in case
of failure is said to follow a Bernoulli distribution. The probability
distribution function can be written as:
P(X = x) = p^x · (1 − p)^(1−x), for x ∈ {0, 1}
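A minimal sketch, assuming SciPy is available, of the Bernoulli PMF with p = 0.8 (e.g., the probability that an item is defect-free).

from scipy import stats

p = 0.8
print(stats.bernoulli.pmf(1, p))   # P(success) = 0.8
print(stats.bernoulli.pmf(0, p))   # P(failure) = 1 - p = 0.2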
Use Cases
• A manufacturing unit produces an item that has a 0.8 probability of
being defect-free. What is the probability that a randomly chosen
item is defective?
• A student has a 70% chance of passing a quiz. What is the probability
that the student fails the quiz?
• A computer server has a 99% uptime probability each day. What is the
probability that it will be down on a randomly selected day?
A z-score
A z-score, in simple terms, is a score that expresses a value of a
distribution in standard deviations with respect to the mean. It is
calculated with the following formula:
z = (X − μ) / σ
Here, X is the value in the distribution, μ is the mean of the distribution,
and σ is the standard deviation of the distribution.
Problem
• The scores of students in a math test are normally distributed with a
mean of 70 and a standard deviation of 10.
• Calculate the z-scores for the following students' scores (a code sketch follows the list):
• Alice: 85
• Bob: 60
• Charlie: 95
• David: 70
• Eve: 50
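A minimal sketch computing all five z-scores from the formula z = (X − μ) / σ.

mu, sigma = 70, 10
scores = {"Alice": 85, "Bob": 60, "Charlie": 95, "David": 70, "Eve": 50}
for name, x in scores.items():
    print(name, (x - mu) / sigma)   # 1.5, -1.0, 2.5, 0.0, -2.0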
P Value
• The p-value measures the strength of evidence against the null hypothesis. The
null hypothesis is a statement that says that there is no difference between two measures.
For example, if the hypothesis is that people who clock in 4 hours of study every day score
more than 90 marks out of 100, the null hypothesis would be that there is no relation
between the number of hours studied and the marks scored. If the p-value is equal to or
less than the significance level (α), then the null hypothesis is rejected.
P Value
• The p-value is the probability of obtaining test results at least as
extreme as the observed results, assuming the null hypothesis is
true.
• Small p-value (≤ α, e.g., 0.05) → strong evidence against H0 → reject H0.
• Large p-value (> α) → weak evidence against H0 → fail to reject H0.
One-tailed and two-tailed tests
• The example in the previous section was an instance of a one-tailed
test, where the null hypothesis is rejected or accepted based on one
direction of the normal distribution. In a two-tailed test, both tails
of the distribution are used to test the hypothesis, as in the code sketch below.
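A minimal sketch, assuming SciPy is available, that converts a z statistic into one-tailed and two-tailed p-values; the value z = 1.8 is illustrative.

from scipy import stats

z = 1.8
p_one_tailed = 1 - stats.norm.cdf(z)             # upper-tail test only
p_two_tailed = 2 * (1 - stats.norm.cdf(abs(z)))  # both tails
print(p_one_tailed, p_two_tailed)                # ~0.0359, ~0.0719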
Sampling Distributions
A sampling distribution is the probability distribution of a statistic (e.g., mean, variance, proportion) calculated from
repeated random samples drawn from a population. It plays a critical role in inferential statistics for estimating
population parameters and conducting hypothesis tests.
Sampling Distributions
The three distributions are sampling distributions because they arise from sample-based statistics:
1. t-Distribution:
• Derived from the sampling distribution of the mean when the population standard deviation (σ) is unknown
and replaced by the sample standard deviation (s).
• Used for small sample sizes.
2. Chi-Square Distribution:
• Arises when estimating the variance of a population from a sample variance.
• Commonly used in hypothesis testing for categorical data (e.g., goodness-of-fit, independence).
3. F-Distribution:
• Derived from the ratio of two variances from independent samples.
• Used to test variance equality and for comparing multiple group means.
t-test
• The t-test is a statistical test procedure that tests whether there is a
significant difference between the means of two groups.
Types of t-test
There are three types: the one-sample t-test, the t-test for independent samples, and the paired-samples t-test.
One sample t-Test
Example of a one sample t-test
• A manufacturer of chocolate bars claims that its chocolate bars weigh
50 grams on average. To verify this, a sample of 30 bars is taken and
weighed. The mean value of this sample is 48 grams.
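A sketch of this check with SciPy's one-sample t-test. The individual bar weights are not given in the slides, so the sample below is simulated with a mean near 48 g (the standard deviation of 5 g is an assumption for illustration).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weights = rng.normal(loc=48, scale=5, size=30)   # assumed sample data

# H0: the true mean weight is 50 g
t_stat, p_value = stats.ttest_1samp(weights, popmean=50)
print(t_stat, p_value)   # reject H0 if p_value < 0.05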
Example of a t-test for independent
samples
• We would like to compare the effectiveness of two painkillers, drug
A and drug B. To do this, we randomly divide 60 test subjects into two
groups. The first group receives drug A, the second group
receives drug B. With an independent t-test we can now test whether
there is a significant difference in pain relief between the two drugs.
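A sketch of this comparison with SciPy's independent-samples t-test; the pain-relief scores below are simulated, not real trial data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
drug_a = rng.normal(loc=6.0, scale=1.5, size=30)   # assumed scores, group A
drug_b = rng.normal(loc=5.2, scale=1.5, size=30)   # assumed scores, group B

# H0: both drugs produce the same mean pain relief
t_stat, p_value = stats.ttest_ind(drug_a, drug_b)
print(t_stat, p_value)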
Paired samples t-Test
• We want to know how effective a diet is. To do this, we weigh 30
people before the diet and exactly the same people after the diet.
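A sketch of the diet example with SciPy's paired-samples t-test; the before/after weights are simulated for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.normal(loc=80, scale=10, size=30)        # assumed weights (kg)
after = before - rng.normal(loc=2, scale=1, size=30)  # same people after diet

# H0: the mean difference between before and after is zero
t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)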
Calculate t-test
For a one-sample t-test, the test statistic is t = (x̄ − μ₀) / (s / √n), where x̄ is the sample mean, μ₀ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.
Problem on One sample t-test
Chi-square test
• The Chi-square test is a hypothesis test used to determine whether there is a
relationship between two categorical variables. The chi-square test checks
whether the frequencies occurring in the sample differ significantly from the
frequencies one would expect. Thus, the observed frequencies are compared
with the expected frequencies and their deviations are examined.
Use Case
The Chi-square test is used to investigate whether there is a relationship between gender and the highest level
of education.
The null hypothesis and the alternative hypothesis then result in:
Null hypothesis: There is no relationship between gender and highest educational attainment.
Alternative hypothesis: There is a relationship between gender and highest educational attainment.
Applications of the Chi-Square Test
• There are various applications of the chi-square test; it can be used to answer the following questions:
1) Independence test
• Are two categorical variables independent of each other? For example, does gender have an impact on
whether a person has a Netflix subscription or not?
2) Distribution test
• Are the observed values of two categorical variables equal to the expected values? One question could
be, is one of the three video streaming services Netflix, Amazon, and Disney subscribed to above
average?
3) Homogeneity test
• Are two or more samples from the same population? One question could be whether the subscription
frequencies of the three video streaming services Netflix, Amazon and Disney differ in different age
groups.
Calculate chi-squared
Are the characteristics of gender and ownership of a Netflix subscription independent of
each other? The test statistic is χ² = Σ (O − E)² / E, where O is an observed frequency and
E is the expected frequency under independence, E = (row total × column total) / grand total.
A sketch of the computation follows.
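A minimal sketch of the computation with NumPy; the contingency-table counts below are hypothetical.

import numpy as np

observed = np.array([[10, 30],    # men:   with / without subscription
                     [20, 40]])   # women: with / without subscription

# Expected frequencies under independence: (row total * col total) / N
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)   # compare against the critical value for 1 degree of freedom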
Analysis of Variance (ANOVA)
• An analysis of variance (ANOVA) tests whether statistically significant differences
exist between more than two samples. For this purpose, the means and variances
of the respective groups are compared with each other.
• Four types of analysis of variance
• One-factor (or one-way) ANOVA
• Two-factors (or two-way) ANOVA
• One-factor ANOVA with repeated measurements
• Two-factors ANOVA with repeated measurements
One-way and Two-way ANOVA
One-way ANOVA
A one-way ANOVA tests whether the means of three or more groups differ with respect to a single factor (independent variable).
Two-Way ANOVA
• A Two-Way ANOVA (Analysis of Variance) is a statistical test used to determine the
effects of two independent variables (or factors) on a dependent variable. The
Two-Way ANOVA is an extension of the one-way ANOVA for situations where there
are two independent factors, and it can also assess the interaction between the
two factors. A basic outline of a Two-Way ANOVA is as follows (a code sketch
follows the outline):
• Factors and Levels:
• The two independent variables are known as factors.
• Each factor can have two or more levels. For example, a Two-Way ANOVA could
examine the effects of gender (male vs. female) and treatment type (treatment A
vs. treatment B vs. treatment C) on some outcome.
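A minimal sketch of a two-way ANOVA, assuming statsmodels and pandas are available; the gender/treatment data below are simulated for illustration.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "gender": np.repeat(["male", "female"], 30),
    "treatment": np.tile(["A", "B", "C"], 20),
    "outcome": rng.normal(loc=10, scale=2, size=60),
})

# Main effects of both factors plus their interaction
model = ols("outcome ~ C(gender) * C(treatment)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))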
Type 1 and Type 2 errors
• A Type 1 error occurs when the null hypothesis is rejected even though it is
actually true. This kind of error is also called an error of the first kind and
is equivalent to a false positive.
• A Type 2 error occurs when the null hypothesis is not rejected even though it
is actually false. It is also called an error of the second kind and is
equivalent to a false negative.
How to Control Errors:
•Reduce α (Type I error) by using a smaller significance level (like 0.01 instead of 0.05).
•Reduce β (Type II error) by increasing sample size or using more powerful tests.
Confusion matrix with statistical terms.
A Confidence interval
• A confidence interval is a type of interval estimate for a population
parameter. The confidence interval helps in determining the interval
within which the population mean lies.
A confidence interval is a range of values that is likely to contain
the true population parameter (like the mean or proportion)
with a certain level of confidence (e.g., 95%).
For example: we are 95% confident that the true population mean lies between 94.12 and 105.88.
• To determine the confidence interval, we'll now define the standard
error of the mean.
• The standard error of the mean is the standard deviation of the sample
mean as an estimate of the population mean. It is defined using the following formula:
SE = s / √n
• Here, s is the standard deviation of the sample, and n is the number
of elements in the sample. The confidence interval is then x̄ ± z* · SE,
where z* is the critical value for the chosen confidence level (1.96 for 95%).
A code sketch follows.
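A minimal sketch, assuming SciPy is available, that reproduces the 94.12-105.88 interval above; the sample values (mean 100, s = 15, n = 25) are chosen so that SE = 3.

import math
from scipy import stats

mean, s, n = 100, 15, 25
se = s / math.sqrt(n)                 # standard error of the mean = 3.0
z = stats.norm.ppf(0.975)             # ~1.96 for a 95% confidence level
print(mean - z * se, mean + z * se)   # ~94.12, ~105.88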
Correlation
• In statistics, correlation measures the degree to which two random
variables move together.
• The most commonly used correlation is the Pearson correlation, and it
is defined by the following:
r = cov(X, Y) / (σ_X · σ_Y)
The Pearson correlation coefficient takes values between −1 and +1
and can be interpreted as follows (a code sketch follows the list):
• The value +1 means that there is an entirely positive linear
relationship (the more, the more).
• The value -1 indicates that an entirely negative linear
relationship exists (the more, the less).
• With a value of 0 there is no linear relationship, i.e. the
variables do not correlate with each other.
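A minimal sketch of the coefficient with NumPy; the x/y values below are illustrative.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])   # roughly increases with x

r = np.corrcoef(x, y)[0, 1]     # Pearson correlation coefficient
print(r)                        # ~0.85, a positive linear relationship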
Positive Correlation
• Definition: As one variable increases, the other also increases (and vice versa).
• Real-time Examples:
• Height vs. Weight – Generally, taller people tend to weigh more.
• Study Time vs. Exam Scores – More hours spent studying often leads to higher test scores.
• Advertising Spend vs. Sales Revenue – Increased spending on advertisements can boost sales.
• Years of Experience vs. Salary – In many jobs, more experience correlates with a higher salary.
Negative Correlation
• Definition: As one variable increases, the other decreases.
• Real-time Examples:
• Speed vs. Travel Time – Higher speed usually means less travel time.
• Exercise Frequency vs. Body Fat Percentage – More exercise often results in lower body fat.
• Screen Time vs. Sleep Quality – More screen time (especially before bed) is linked to poorer
sleep quality.
• Fuel Efficiency vs. Car Weight – Heavier cars generally have lower fuel efficiency.
Zero Correlation
• Definition: No meaningful relationship between the two variables.
• Real-time Examples:
• Shoe Size vs. IQ – These two have no logical or measurable connection.
• Favorite Color vs. Income Level – No correlation between a person’s favorite color and how much
they earn.
• Phone Brand vs. Academic Performance – The brand of smartphone used doesn’t affect grades.
• Number of Siblings vs. Daily Coffee Intake – No consistent pattern between these.
Visualization
T-Distribution (Student's t-distribution)
Used When:
•The sample size is small (n < 30).
•Population standard deviation is
unknown.
F-Distribution
Used When:
•Comparing two variances.
•In ANOVA (Analysis of Variance) for testing if multiple group means are equal.
Chi-Square (χ²) Distribution
Used When:
•Testing independence between categorical variables.
•Goodness-of-fit tests (how well observed data match expected).
• The preceding formula defines the Pearson correlation as
the covariance between X and Y divided by the product of the
standard deviations of X and Y. Equivalently, it is the expected
value of the product of the variables' deviations from their
means, divided by the product of the standard deviations of X and Y.
Z-test vs T-test
• A T-distribution is similar to a Z-distribution: it is centered at zero and has
the same basic bell shape, but it is shorter and flatter around the center than
the Z-distribution.
• The T-distribution's standard deviation is usually proportionally larger than
the Z's, which is why you see fatter tails on each side.
• The t distribution is usually used to analyze the population when the sample is
small.
• The Z-test is used to compare the population mean against a sample or
compare the population mean of two distributions with a sample size greater
than 30.
• An example of a Z-test would be comparing the heights of men from different
ethnicity groups.
• The T-test is used to compare the population mean against a sample, or
compare the population means of two distributions, with a sample size less
than 30. A code comparison follows.
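A minimal sketch comparing the two tests on the same sample, assuming statsmodels (for ztest) and SciPy are available; the height data are simulated.

import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(4)
sample = rng.normal(loc=172, scale=8, size=50)   # assumed heights (cm), n > 30

z_stat, p_z = ztest(sample, value=170)                # H0: mean height = 170
t_stat, p_t = stats.ttest_1samp(sample, popmean=170)
print(p_z, p_t)   # nearly identical p-values for a large sample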
The F distribution
• The F distribution is also known as Snedecor's F distribution or the
Fisher–Snedecor distribution.
• An F statistic is given by the following formula:
F = (s₁² / σ₁²) / (s₂² / σ₂²)
Here, s₁ is the standard deviation of sample 1 of size n₁, s₂ is the
standard deviation of sample 2 of size n₂, σ₁ is the standard deviation
of population 1, and σ₂ is the standard deviation of population 2.
• The distribution of all the possible values of the F statistic is called the F
distribution. The parameters d1 and d2 represent the numerator and denominator
degrees of freedom. A code sketch follows.
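A minimal sketch of the F statistic from the formula above, using NumPy; both samples are simulated and the population standard deviations σ₁ = σ₂ = 1 are assumed.

import numpy as np

rng = np.random.default_rng(5)
s1 = rng.normal(size=20).std(ddof=1)   # sample 1 standard deviation, n1 = 20
s2 = rng.normal(size=25).std(ddof=1)   # sample 2 standard deviation, n2 = 25

f_stat = (s1**2 / 1.0) / (s2**2 / 1.0)  # (s1²/σ1²) / (s2²/σ2²)
print(f_stat)   # under H0 this follows an F distribution with (19, 24) dof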
The chi-square distribution
• The chi-square statistic is defined by the following formula:
Χ² = (n − 1) · s² / σ²
• Here, n is the size of the sample, s is the standard deviation of the sample, and
σ is the standard deviation of the population.
• If we repeatedly take samples and compute the chi-square statistic, then we can
form a chi-square distribution, which is defined by the following probability
density function:
Y = Y₀ · (Χ²)^((v/2) − 1) · e^(−Χ²/2)
• Here, Y₀ is a constant that depends on the number of degrees of freedom, Χ² is
the chi-square statistic, v = n − 1 is the number of degrees of freedom, and e is
the base of the natural logarithm (approximately 2.71828).
Chi-square for the goodness of fit
• The chi-square test can be used to test whether the observed data differs
significantly from the expected data.
• The chi-square test can be performed using the chisquare function in the SciPy
package:
• >>> from scipy import stats
• >>> stats.chisquare(observed, expected)
• The function returns two values: the first is the chi-square statistic and the
second is the p-value. If the p-value is very high, the null hypothesis holds and
the observed values are similar to the expected values. A complete, runnable
version follows.
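A runnable version of the call above; the observed and expected counts are made up for illustration.

import numpy as np
from scipy import stats

observed = np.array([18, 22, 20, 20])
expected = np.array([20, 20, 20, 20])   # equal counts expected under H0

chi2, p = stats.chisquare(observed, expected)
print(chi2, p)   # high p-value -> observed is close to expected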
The chi-square test of
independence
• The chi-square test of independence is a statistical test used to
determine whether two categorical variables are independent of
each other or not.
• The chi-square test of independence can be performed using the
chi2_contingency function in the SciPy package:
• >>> import numpy as np
• >>> from scipy import stats
• >>> men_women = np.array([[100, 120, 60], [350, 200, 90]])
• >>> stats.chi2_contingency(men_women)
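The call returns four values: the chi-square statistic, the p-value, the degrees of freedom, and the table of expected frequencies. A sketch unpacking them:

import numpy as np
from scipy import stats

men_women = np.array([[100, 120, 60], [350, 200, 90]])
chi2, p, dof, expected = stats.chi2_contingency(men_women)
print(chi2, p, dof)   # a small p-value -> reject independence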
ANOVA
• Analysis of Variance (ANOVA) is a statistical method used to test
differences between two or more means.
• This test compares the means between groups and determines
whether any of these means are significantly different from each other.
• ANOVA itself only tells you that at least one group mean differs
significantly; identifying which groups differ requires a post-hoc test
(for example, Tukey's HSD). A code sketch follows.
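A minimal sketch of a one-way ANOVA with SciPy; the three groups' scores are simulated for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
group_a = rng.normal(loc=70, scale=5, size=20)
group_b = rng.normal(loc=72, scale=5, size=20)
group_c = rng.normal(loc=78, scale=5, size=20)

# H0: all three group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # small p-value -> at least one mean differs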