Statistics Questions PDF
Ans: 1
Both t-tests and z-tests are used to compare means of two groups and determine if the
observed difference between the groups is statistically significant.
• Sample Size: The key difference between a t-test and a z-test is the sample size. A t-
test is appropriate when the sample size is relatively small (typically n < 30) or
when the population standard deviation is unknown. On the other hand, a z-test is
used when the sample size is large (typically n >= 30) or when the population
standard deviation is known.
t-test:
Suppose we want to compare the heights of two groups of individuals: Group A and Group B.
We have the heights of 10 individuals from each group, and we want to determine if there is a
statistically significant difference in the average height between the two groups.
import numpy as np
from scipy.stats import ttest_ind
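The test itself was omitted from the extracted listing. A minimal sketch, assuming hypothetical height data for the two groups (the printed output below came from the original data, so the numbers will differ):
# Hypothetical height data (cm) for 10 individuals per group
group_a = [170, 172, 168, 175, 171, 169, 174, 173, 170, 172]
group_b = [165, 163, 167, 162, 166, 164, 161, 165, 163, 166]

# Two-sample t-test (assumes equal variances by default)
t_statistic, p_value = ttest_ind(group_a, group_b)
print("t-statistic:", t_statistic)
print("p-value:", p_value)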
t-statistic: 6.218479840396996
p-value: 7.226808844213945e-06
There is a statistically significant difference in the average height
between Group A and Group B.
z-test:
Now, suppose we have a larger sample size for each group and we know the population
standard deviation. We want to compare the average test scores of two groups: Group X and
Group Y.
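No code survived extraction for this example. A minimal sketch with hypothetical summary statistics (the known population standard deviation is assumed to be 10; the output below is from the original run):
import numpy as np
from scipy.stats import norm

# Hypothetical summary statistics for two large groups (n = 100 each)
mean_x, mean_y = 78, 72
sigma = 10  # known population standard deviation
n_x = n_y = 100

# Two-sample z-test with known population standard deviation
z_score = (mean_x - mean_y) / np.sqrt(sigma**2 / n_x + sigma**2 / n_y)
p_value = 2 * (1 - norm.cdf(abs(z_score)))
print("z-score:", z_score)
print("p-value:", p_value)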
z-score: 4.695742752749559
p-value: 2.656396824285423e-06
There is a statistically significant difference in the average test
scores between Group X and Group Y.
Ans: 2
One-tailed and two-tailed tests are two types of hypothesis tests used in statistical analysis to
assess the significance of an observed effect or difference. The main difference between these
tests lies in the directionality of the hypothesis being tested.
One-tailed test:
In a one-tailed test, the hypothesis being tested is directional, meaning it specifies a particular
direction of the effect or difference. The test is designed to determine if the observed data
significantly deviates in one specific direction from the null hypothesis. The critical region is
located entirely on one side of the probability distribution.
One-tailed tests are typically used when there is a strong prior expectation or theoretical reason
to believe that the effect or difference, if it exists, will occur in a specific direction.
Two-tailed test:
In a two-tailed test, the hypothesis being tested is non-directional, meaning it does not specify a
particular direction of the effect or difference. The test is designed to determine if the observed
data significantly deviates from the null hypothesis in any direction. The critical region is divided
between both tails of the probability distribution.
Two-tailed tests are more conservative because they consider deviations in both directions and
are typically used when there is no prior expectation or theoretical reason to predict the
direction of the effect or difference.
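A quick sketch of how the two p-values relate for the same test statistic, assuming a standard normal reference distribution:
from scipy.stats import norm

z = 1.8  # example test statistic
p_one_tailed = 1 - norm.cdf(z)             # H1: effect in the positive direction only
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))  # H1: effect in either direction
print("one-tailed p:", p_one_tailed)
print("two-tailed p:", p_two_tailed)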
Ans: 3
Type 1 error (False Positive):
A Type 1 error occurs when we incorrectly reject a true null hypothesis. In other words, we
conclude that there is a significant effect or difference when, in reality, there is no such effect or
difference in the population. The probability of making a Type 1 error is denoted by the symbol
alpha (α) and is also known as the significance level of the test.
• Example: Let's say a company claims their new energy drink improves people's memory.
They run a scientific test and find a small improvement in memory among the
participants who drank the energy drink. However, in reality, the drink has no effect on
memory. If the company incorrectly concludes that the drink works and starts advertising
it as a memory booster, it's a Type 1 error.
Type 2 error (False Negative):
A Type 2 error occurs when we incorrectly fail to reject a false null hypothesis. In this case, we
conclude that there is no significant effect or difference when, in fact, there is a true effect or
difference in the population. The probability of making a Type 2 error is denoted by the symbol
beta (β).
• Example: Let's consider the same company testing the memory-boosting energy drink.
Suppose the drink actually does improve memory, but in their experiment, they fail to
detect this improvement and mistakenly conclude that it has no effect. It's a Type 2 error
because they missed the true effect of the drink.
Q4: Explain Bayes's theorem with an example.
Ans: 4
Bayes's theorem is a fundamental concept in probability theory that allows us to update the
probability of an event based on new evidence. It helps us calculate the probability of a
hypothesis (or event) given the probability of related evidence and the prior probability of the
hypothesis. Bayes's theorem is represented mathematically as:
P(A|B) = [P(B|A) × P(A)] / P(B)
Where:
• P(A|B) is the posterior probability of hypothesis A given evidence B.
• P(B|A) is the likelihood of the evidence given the hypothesis.
• P(A) is the prior probability of the hypothesis.
• P(B) is the total probability of the evidence.
Example: Coin Toss Suppose we have an unfair coin, and we want to calculate the probability of
the coin being biased towards heads (H) or tails (T) based on the evidence of three consecutive
tosses: H, H, T.
Now, let's use Bayes's theorem to calculate the posterior probability of the coin being biased
towards heads (H) after observing three consecutive tosses: H, H, T.
# Prior probabilities
p_h = 0.4  # Probability of the coin being biased towards heads (prior)
p_t = 0.6  # Probability of the coin being biased towards tails (prior)

# Likelihoods
p_h_given_h = 0.9  # Probability of getting heads given the coin is biased towards heads
p_t_given_h = 0.1  # Probability of getting tails given the coin is biased towards heads
p_h_given_t = 0.3  # Probability of getting heads given the coin is biased towards tails
p_t_given_t = 0.7  # Probability of getting tails given the coin is biased towards tails
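The posterior computation itself was omitted; a minimal sketch applying Bayes's theorem to the observed sequence H, H, T (assuming the tosses are independent given the bias):
# Likelihood of the sequence H, H, T under each hypothesis
likelihood_h = p_h_given_h * p_h_given_h * p_t_given_h  # 0.9 * 0.9 * 0.1
likelihood_t = p_h_given_t * p_h_given_t * p_t_given_t  # 0.3 * 0.3 * 0.7

# Total probability of the evidence
p_evidence = likelihood_h * p_h + likelihood_t * p_t

# Posterior probability that the coin is biased towards heads
posterior_h = (likelihood_h * p_h) / p_evidence
print("Posterior P(biased towards heads | H, H, T):", posterior_h)
Since the likelihood-weighted prior for a heads-bias (0.4 × 0.081 = 0.0324) is slightly below that for a tails-bias (0.6 × 0.063 = 0.0378), the posterior probability of a heads-bias comes out just under 50%.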
Ans: 5
Confidence interval:
A confidence interval is a range of values that provides an estimate of the unknown population
parameter (such as the population mean) along with a level of confidence. It is a measure of the
uncertainty associated with an estimate based on a sample from the population. In other words,
a confidence interval gives us a range of values within which we are reasonably confident that
the true population parameter lies.
How to calculate the confidence interval: The formula for calculating a confidence interval
depends on the type of data and the distribution of the data. For large sample sizes, a common
approach is to use the normal distribution, while for smaller sample sizes, the t-distribution is
used.
Suppose we want to estimate the average height of a certain population. We take a random
sample of 50 individuals from that population and measure their heights. The sample mean
height is 170 cm, and the sample standard deviation is 5 cm.
# Sample statistics (sample_data holds the 50 measured heights)
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1)  # ddof=1 for sample standard deviation
sample_size = len(sample_data)
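The data and the remaining steps were omitted from the listing; a minimal sketch using the summary statistics given in the scenario (mean 170 cm, s = 5 cm, n = 50) with the t-distribution:
import numpy as np
import scipy.stats as stats

sample_mean = 170
sample_std = 5
sample_size = 50

# 95% CI using the t-distribution (population std is unknown)
standard_error = sample_std / np.sqrt(sample_size)
t_critical = stats.t.ppf(0.975, df=sample_size - 1)
margin_of_error = t_critical * standard_error
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"95% confidence interval: ({ci_lower:.2f}, {ci_upper:.2f})")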
Ans: 6
Sample Problem: Suppose there is a rare disease that affects 1 in 10,000 people in a certain
population. We have a diagnostic test for this disease, but it is not perfect. The test has the
following characteristics:
The probability of a positive test result (indicating the presence of the disease) given that a
person has the disease is 0.98 (P(Positive|Disease) = 0.98). The probability of a negative test
result (indicating the absence of the disease) given that a person does not have the disease is
0.99 (P(Negative|No Disease) = 0.99). Now, a person from the population gets tested, and the
test result is positive. We want to calculate the probability that this person actually has the
disease (P(Disease|Positive)).
Solution: Let's use Bayes' Theorem to calculate the probability of having the disease given a positive test result:
P(Disease|Positive) = [P(Positive|Disease) × P(Disease)] / P(Positive)
Where:
• P(Disease|Positive) is the probability of having the disease given a positive test result.
• P(Positive|Disease) is the probability of a positive test result given that the person has the disease (0.98).
• P(Disease) is the prior probability of having the disease (1 in 10,000 or 0.0001).
• P(Positive) is the probability of a positive test result.
Let's calculate P(Positive) and then use Bayes' Theorem to find P(Disease|Positive) in Python:
# Given data
p_positive_given_disease = 0.98
p_disease = 0.0001
p_positive_given_no_disease = 1 - 0.99
p_no_disease = 1 - p_disease
# Calculate P(Positive) using the law of total probability
p_positive = (p_positive_given_disease * p_disease) + (p_positive_given_no_disease * p_no_disease)
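The final step was omitted; applying Bayes' Theorem with the quantities above:
# Apply Bayes' Theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
print("P(Disease|Positive):", p_disease_given_positive)
Even with a positive test, the posterior probability of disease is below 1%, because the very low base rate (0.0001) dominates the test's accuracy.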
Ans: 7
To calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard
deviation of 5, we need to use the formula for a confidence interval for the population mean
Let's calculate the 95% confidence interval using the given information (mean = 50, standard
deviation = 5) and assume a sample size of 30:
import scipy.stats as st
# Given data
sample_mean = 50
sample_std = 5
confidence_level = 0.95
sample_size = 30
# Calculate the critical value (Z-score) for 95% confidence level
critical_value = st.norm.ppf((1 + confidence_level) / 2)
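The remaining steps were omitted; a minimal sketch completing the interval from the values defined above:
import math

# Margin of error and confidence interval
standard_error = sample_std / math.sqrt(sample_size)
margin_of_error = critical_value * standard_error
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"95% confidence interval: ({ci_lower:.2f}, {ci_upper:.2f})")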
The 95% confidence interval for the population mean (μ) based on the sample data is approximately (48.21, 51.79). This means that we are 95% confident that the true population mean falls within this range.
Ans: 8
The margin of error (MOE) in a confidence interval (CI) is a measure of the uncertainty or
precision associated with the estimate of a population parameter based on a sample. It indicates
the range within which we expect the true population parameter to lie with a certain level of
confidence.
In statistical terms, when we calculate a confidence interval for a population parameter (e.g.,
mean, proportion), we use a sample statistic (e.g., sample mean, sample proportion) to estimate
the true population parameter. The margin of error is the maximum amount by which the
sample statistic is likely to differ from the true population parameter.
The margin of error is directly related to the level of confidence chosen for the interval and the
variability of the data in the sample. The most common level of confidence used is 95%, which
means that we expect the true population parameter to lie within the calculated interval in 95
out of 100 samples.
The formula for calculating the margin of error in a confidence interval is:
MOE = Z * (σ / √n)
Where:
• Z = Z-score associated with the desired level of confidence (e.g., 1.96 for a 95%
confidence level)
• σ = Standard deviation of the population (unknown in most cases, so we often use the
sample standard deviation as an estimate)
• n = Sample size
Now, as for how the sample size affects the margin of error, we can observe that the margin of
error is inversely proportional to the square root of the sample size. In other words, as the
sample size increases, the margin of error decreases.
Here's an example:
Let's say you want to estimate the average height of students at a particular university with a 95% confidence level. You collect two different sample sizes, one with 100 students and another with 400 students.
For the sample with 100 students, assuming σ (population standard deviation) is 4 inches (just for illustration purposes):
MOE = 1.96 * (4 / √100) = 1.96 * 0.4 ≈ 0.78 inches
For the sample with 400 students:
MOE = 1.96 * (4 / √400) = 1.96 * 0.2 ≈ 0.39 inches
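A quick sketch verifying both margins in Python:
import math

def margin_of_error(z, sigma, n):
    return z * (sigma / math.sqrt(n))

print("n = 100:", margin_of_error(1.96, 4, 100))  # ~0.78
print("n = 400:", margin_of_error(1.96, 4, 400))  # ~0.39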
Ans: 9
# Given data
data_point = 75
population_mean = 70
population_std_dev = 5
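Computing the z-score from the given values:
# z = (x - μ) / σ
z_score = (data_point - population_mean) / population_std_dev
print("z-score:", z_score)  # 1.0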
The calculated z-score represents the number of standard deviations the data point (75) is away
from the population mean (70). In this case:
Z = (75 - 70) / 5 = 1
A z-score of 1 means that the data point is 1 standard deviation above the mean. Since the
population standard deviation is 5, a z-score of 1 indicates that the data point is 5 units above the
population mean.
Ans: 10
Let's define the hypotheses:
• Null Hypothesis (H0): The weight loss drug is not significantly effective, and the
population mean weight loss is equal to or less than 0 pounds. ( μ ≤ 0)
• Alternative Hypothesis (H1): The weight loss drug is significantly effective, and the
population mean weight loss is greater than 0 pounds. (μ > 0)
We will use a one-tailed t-test because the alternative hypothesis is directional (μ > 0).
sample_mean = 6
sample_std_dev = 2.5
sample_size = 50
population_mean_null = 0
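The test statistic itself was omitted; a minimal sketch computing the one-sample t-test from the summary statistics above:
import math
import scipy.stats as stats

# One-sample t-test from summary statistics
standard_error = sample_std_dev / math.sqrt(sample_size)
t_statistic = (sample_mean - population_mean_null) / standard_error
# One-tailed p-value (H1: mu > 0)
p_value = 1 - stats.t.cdf(t_statistic, df=sample_size - 1)
print("t-statistic:", t_statistic)
print("p-value:", p_value)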
Ans: 11
To calculate the 95% confidence interval for the true proportion of people who are satisfied with
their job, we can use the formula for the confidence interval for a proportion.
import math
# Given data
sample_proportion = 0.65
sample_size = 500
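The calculation itself was omitted; a minimal sketch using the normal approximation:
# 95% CI for a proportion: p ± z * sqrt(p(1 - p) / n)
z_critical = 1.96
standard_error = math.sqrt(sample_proportion * (1 - sample_proportion) / sample_size)
margin_of_error = z_critical * standard_error
ci_lower = sample_proportion - margin_of_error
ci_upper = sample_proportion + margin_of_error
print(f"95% confidence interval: ({ci_lower:.3f}, {ci_upper:.3f})")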
Ans: 12
To conduct a hypothesis test to determine if there is a significant difference in student
performance between the two teaching methods, we can use a two-sample t-test. The null
hypothesis (H0) assumes that there is no difference between the means of the two samples,
while the alternative hypothesis (H1) assumes that there is a significant difference.
• Null Hypothesis (H0): The two teaching methods have no significant difference in
student performance. (μA - μB = 0)
• Alternative Hypothesis (H1): The two teaching methods have a significant difference
in student performance. (μA - μB ≠ 0)
We will use a two-tailed t-test because the alternative hypothesis is non-directional (it doesn't
specify which mean is larger).
If the absolute value of the calculated t-statistic is greater than the critical t-value, we can reject
the null hypothesis (H0) and conclude that there is a significant difference in student
performance between the two teaching methods. Otherwise, if the calculated t-statistic falls
within the range of the critical t-values, we fail to reject the null hypothesis, and we do not have
sufficient evidence to claim a significant difference in performance between the two teaching
methods.
# Given data
sample_mean = 65
population_mean = 60
population_std_dev = 8
sample_size = 50
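The test itself was omitted; since the population standard deviation is known, a z-test applies. A minimal sketch:
import math
import scipy.stats as stats

# z-test statistic
standard_error = population_std_dev / math.sqrt(sample_size)
z_statistic = (sample_mean - population_mean) / standard_error
# Two-tailed p-value
p_value = 2 * (1 - stats.norm.cdf(abs(z_statistic)))
print("z-statistic:", z_statistic)
print("p-value:", p_value)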
Ans: 14
To conduct a hypothesis test to determine if caffeine has a significant effect on reaction time, we
can use a one-sample t-test. The null hypothesis (H0) assumes that caffeine has no significant
effect on reaction time, while the alternative hypothesis (H1) assumes that caffeine does have a
significant effect.
• Null Hypothesis (H0): Caffeine has no significant effect on reaction time. (μ = 0)
• Alternative Hypothesis (H1): Caffeine does have a significant effect on reaction time. (μ ≠ 0)
We will use a two-tailed t-test because the alternative hypothesis is non-directional.
# Given data
sample_mean = 0.25
population_mean_null = 0
sample_std_dev = 0.05
sample_size = 30
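The test computation was omitted; a minimal sketch at the 90% confidence level (significance level 0.10):
import math
import scipy.stats as stats

# One-sample t-test from summary statistics
standard_error = sample_std_dev / math.sqrt(sample_size)
t_statistic = (sample_mean - population_mean_null) / standard_error
# Critical t-value for a two-tailed test at the 90% confidence level
critical_t = stats.t.ppf(1 - 0.10 / 2, df=sample_size - 1)
print("t-statistic:", t_statistic)
print("critical t-value:", critical_t)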
If the absolute value of the calculated t-statistic is greater than the critical t-value, we can reject
the null hypothesis (H0) and conclude that caffeine has a significant effect on reaction time at
the 90% confidence level. Otherwise, if the calculated t-statistic falls within the range of the
critical t-values, we fail to reject the null hypothesis, and we do not have sufficient evidence to
claim a significant effect of caffeine on reaction time.
Q15. Calculate the 95% confidence interval for a
sample of data with a mean of 50 and a
standard deviation of 5 using Python. Interpret
the results.
Ans: 15
To calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard
deviation of 5, we can use the formula for the confidence interval for a population mean. Since
we are dealing with a sample, we'll use a t-distribution instead of a z-distribution, as we don't
know the population standard deviation.
# Given data
sample_mean = 50
sample_std_dev = 5
sample_size = 30 # Assuming a sample size of 30 for demonstration
degrees_of_freedom = sample_size - 1

# Calculate the critical t-value for a 95% confidence level and the given degrees of freedom
import scipy.stats as stats
confidence_level = 0.95
critical_t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, df=degrees_of_freedom)
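The remaining steps were omitted; a minimal sketch completing the interval:
import math

# Margin of error and confidence interval
standard_error = sample_std_dev / math.sqrt(sample_size)
margin_of_error = critical_t_value * standard_error
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"95% confidence interval: ({ci_lower:.2f}, {ci_upper:.2f})")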
Ans: 16
import numpy as np

# Observed frequencies
observed_frequencies = np.array([22, 18, 20, 11, 13, 16])
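The question text and test call for this block were lost in extraction; assuming a goodness-of-fit test against equal expected frequencies (which is what the observed array suggests), a minimal sketch:
from scipy.stats import chisquare

# Goodness-of-fit test against equal expected frequencies
chi2_statistic, p_value = chisquare(observed_frequencies)
print("Chi-square statistic:", chi2_statistic)
print("p-value:", p_value)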
Ans: 17
Group A Group B
Outcome 1: 20 , 15
Outcome 2: 10 , 25
Outcome 3: 15 , 20
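The test code was omitted; a minimal sketch using scipy's chi2_contingency on the table above:
from scipy.stats import chi2_contingency

# Contingency table: rows are outcomes, columns are the two groups
contingency_table = np.array([[20, 15],
                              [10, 25],
                              [15, 20]])
chi2_statistic, p_value, dof, expected = chi2_contingency(contingency_table)
alpha = 0.05
print("Chi-square statistic:", chi2_statistic)
print("p-value:", p_value)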
# Check if the p-value is less than the significance level and make the conclusion
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant association between the groups and outcomes.")
else:
    print("Fail to reject the null hypothesis. There is no significant association between the groups and outcomes.")
The interpretation of the results is based on the p-value. If the p-value is less than the chosen
significance level (0.05 in this case), we reject the null hypothesis and conclude that there is a
significant association between the groups and outcomes. Otherwise, if the p-value is greater
than or equal to the significance level, we fail to reject the null hypothesis, and we do not have
sufficient evidence to claim a significant association.
Q18. A study of the prevalence of smoking in a
population of 500 individuals found that 60
individuals smoked. Use Python to calculate the
95% confidence interval for the true proportion
of individuals in the population who smoke.
Ans: 18
To calculate the 95% confidence interval for the true proportion of individuals in the population
who smoke, we can use the formula for the confidence interval for a population proportion.
# Given data
sample_proportion = 60 / 500
sample_size = 500
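The calculation itself was omitted; a minimal sketch using the normal approximation:
import math

# 95% CI for a proportion: p ± z * sqrt(p(1 - p) / n)
z_critical = 1.96
standard_error = math.sqrt(sample_proportion * (1 - sample_proportion) / sample_size)
margin_of_error = z_critical * standard_error
ci_lower = sample_proportion - margin_of_error
ci_upper = sample_proportion + margin_of_error
print(f"95% confidence interval: ({ci_lower:.4f}, {ci_upper:.4f})")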
The 95% confidence interval for the true proportion of individuals in the population who smoke is approximately (0.0915 to 0.1485). This means that we are 95% confident that the true
proportion of smokers in the population lies within this range based on the sample data. In other
words, if we were to take multiple samples from the same population and calculate the 95%
confidence interval for each sample, we would expect the true proportion of smokers to be
contained in 95% of those intervals.
Q19. Calculate the 90% confidence interval for
a sample of data with a mean of 75 and a
standard deviation of 12 using Python. Interpret
the results.
Ans: 19
To calculate the 90% confidence interval for a sample of data with a mean of 75 and a standard
deviation of 12, we can use the t-distribution since the sample size is small and the population
standard deviation is unknown.
# Given data
sample_mean = 75
sample_std_dev = 12
sample_size = 30 # Adjust the sample size as needed
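The calculation itself was omitted; a minimal sketch with the t-distribution:
import math
import scipy.stats as stats

# 90% CI using the t-distribution
standard_error = sample_std_dev / math.sqrt(sample_size)
critical_t = stats.t.ppf(1 - (1 - 0.90) / 2, df=sample_size - 1)
margin_of_error = critical_t * standard_error
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
print(f"90% confidence interval: ({ci_lower:.2f}, {ci_upper:.2f})")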
The 90% confidence interval for the population mean is approximately (71.28 to 78.72). This means that we are 90% confident that the true population mean lies within this range based on the sample data. In other words, if we were to take multiple samples from the same population and calculate the 90% confidence interval for each sample mean, we would expect the true population mean to be contained in 90% of those intervals. Note that a 90% interval is narrower than a 95% interval computed from the same data: accepting a lower confidence level buys a more precise (tighter) range.
Q21. A random sample of 1000 people was
asked if they preferred Coke or Pepsi. Of the
sample, 520 preferred Coke. Calculate a 99%
confidence interval for the true proportion of
people in the population who prefer Coke.
Ans: 21
# Given data
sample_proportion = 520 / 1000
sample_size = 1000
# Calculate the critical z-value for a 99% confidence level
import scipy.stats as stats
confidence_level = 0.99
critical_z_value = stats.norm.ppf(1 - (1 - confidence_level) / 2)
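The remaining steps were omitted; a minimal sketch completing the interval:
import math

# Margin of error and confidence interval
standard_error = math.sqrt(sample_proportion * (1 - sample_proportion) / sample_size)
margin_of_error = critical_z_value * standard_error
ci_lower = sample_proportion - margin_of_error
ci_upper = sample_proportion + margin_of_error
print(f"99% confidence interval: ({ci_lower:.4f}, {ci_upper:.4f})")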
The 99% confidence interval for the true proportion of people in the population who prefer Coke is approximately (0.4793 to 0.5607). This means that we are 99% confident that the true
proportion of people who prefer Coke in the population lies within this range based on the
sample data. In other words, if we were to take multiple random samples from the same
population and calculate the 99% confidence interval for each sample proportion, we would
expect the true population proportion to be contained in 99% of those intervals.
Q22. A researcher hypothesizes that a coin is
biased towards tails. They flip the coin 100
times and observe 45 tails. Conduct a chi-
square goodness of fit test to determine if the
observed frequencies match the expected
frequencies of a fair coin. Use a significance
level of 0.05.
Ans: 22
To conduct a chi-square goodness of fit test for the researcher's hypothesis, we need to
compare the observed frequencies (the number of tails observed in 100 coin flips) with the
expected frequencies of a fair coin (50 tails and 50 heads in 100 flips). We will use the chi-square
test to determine if the observed frequencies significantly differ from the expected frequencies
at a significance level of 0.05.
The null hypothesis (H0) for the chi-square goodness of fit test is that there is no significant
difference between the observed and expected frequencies, suggesting that the coin is fair. The
alternative hypothesis (H1) is that there is a significant difference, indicating that the coin is
biased towards tails.
# Given data
observed_tails = 45
total_flips = 100
expected_tails = total_flips / 2
# Observed frequencies
observed_frequencies = [observed_tails, total_flips - observed_tails]
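The test call was omitted; a minimal sketch using scipy's chisquare:
from scipy.stats import chisquare

# Expected frequencies for a fair coin
expected_frequencies = [expected_tails, total_flips - expected_tails]
chi2_statistic, p_value = chisquare(observed_frequencies, f_exp=expected_frequencies)
alpha = 0.05
print("Chi-square statistic:", chi2_statistic)
print("p-value:", p_value)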
# Check if the p-value is less than the significance level and make the conclusion
if p_value < alpha:
    print("Reject the null hypothesis. The coin is biased towards tails.")
else:
    print("Fail to reject the null hypothesis. The coin is fair.")
Interpretation of results:
• If the p-value is less than 0.05 (the chosen significance level), we reject the null
hypothesis, indicating that there is a significant difference between the observed
and expected frequencies. This suggests that the coin is biased towards tails.
• If the p-value is greater than or equal to 0.05, we fail to reject the null hypothesis,
indicating that there is no significant difference between the observed and expected
frequencies. This would imply that the coin is fair.
Q23. A study was conducted to determine if
there is an association between smoking status
(smoker or non-smoker) and lung cancer
diagnosis (yes or no). The results are shown in
the contingency table below. Conduct a chi-
square test for independence to determine if
there is a significant association between
smoking status and lung cancer diagnosis.
Lung Cancer: Yes Lung Cancer: No
Smoker: 60 , 140
Non-smoker: 30 , 170
Ans: 23
To conduct a chi-square test for independence between smoking status and lung cancer
diagnosis, we need to use the provided contingency table and compare the observed
frequencies with the expected frequencies under the assumption of independence. The null
hypothesis (H0) is that there is no association between smoking status and lung cancer
diagnosis, while the alternative hypothesis (H1) is that there is a significant association.
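The code for this test was omitted; a minimal sketch using scipy on the table above:
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows are smoking status, columns are lung cancer diagnosis
contingency_table = np.array([[60, 140],
                              [30, 170]])
chi2_statistic, p_value, dof, expected = chi2_contingency(contingency_table)
alpha = 0.05
print("Chi-square statistic:", chi2_statistic)
print("p-value:", p_value)
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant association between smoking status and lung cancer diagnosis.")
else:
    print("Fail to reject the null hypothesis. There is no significant association.")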
Ans: 24
To conduct a chi-square test for independence between chocolate preference and country of
origin (U.S. or U.K.), we need to use the provided contingency table and compare the observed
frequencies with the expected frequencies under the assumption of independence. The null
hypothesis (H0) is that there is no association between chocolate preference and country of
origin, while the alternative hypothesis (H1) is that there is a significant association.
# Check if the p-value is less than the significance level and make the conclusion
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant association between chocolate preference and country of origin.")
else:
    print("Fail to reject the null hypothesis. There is no significant association between chocolate preference and country of origin.")
Ans: 25
# Check if the p-value is less than the significance level and make the conclusion
if p_value < alpha:
    print("Reject the null hypothesis. The population mean is significantly different from 70.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the population mean and 70.")
t-statistic: nan
P-value: nan
Fail to reject the null hypothesis. There is no significant difference between the population mean and 70.
(A nan result like this usually means the sample array was empty or contained NaN values, so the t-test could not actually be computed; the input data should be checked before drawing any conclusion.)
Ans: 26
ANOVA (Analysis of Variance) is a statistical technique used to compare means of three or more
groups or conditions. It allows us to determine if there are significant differences among the
means of these groups. However, ANOVA comes with certain assumptions that must be met to
ensure the validity and reliability of the results. Violating these assumptions can impact the
accuracy of the ANOVA analysis.
• Normality of Residuals: The residuals (the differences between the observed data and the group means) should follow a normal distribution. This assumption is essential, especially when the sample sizes are small. Violation of this assumption can lead to incorrect p-values and confidence intervals.
• Homogeneity of Variances: The variance of the dependent variable should be roughly equal across all groups.
• Independence: The observations within and across groups should be independent of one another.
Examples of violations and their impact:
Non-Normality: If the residuals are not normally distributed, ANOVA results may not be reliable. For example, if the residuals follow a skewed distribution or have extreme outliers, the ANOVA test may produce inaccurate results.
Heteroscedasticity: Heteroscedasticity occurs when the variances of the groups are not equal. This can lead to unequal influence of different groups on the overall test result. For instance, if one group has much higher variability than others, it may disproportionately affect the overall ANOVA outcome.
Lack of Independence: If the data points within groups are not independent, ANOVA may produce biased estimates of the group means. For example, if repeated measures are used, and data points within each subject are correlated, it violates the independence assumption.
Ans: 27
The three types of ANOVA (Analysis of Variance) are:
• One-Way ANOVA: One-Way ANOVA is used when we want to compare the means of three or more groups that are organized into a single categorical independent variable. It is the most basic and commonly used form of ANOVA. For example, if we want to compare the average test scores of students from three different schools, where the schools are the categorical variable with three levels, we would use One-Way ANOVA.
• Two-Way ANOVA: Two-Way ANOVA is used when there are two categorical independent variables (factors) and we want to study both their individual (main) effects on the dependent variable and their interaction effect. For example, studying how teaching method and student experience level jointly affect test scores calls for a Two-Way ANOVA.
• Repeated Measures ANOVA: Repeated Measures ANOVA is used when the same subjects are measured under several conditions or at several points in time, so observations are correlated within subjects. For example, measuring the same patients' blood pressure before, during, and after a treatment would use a Repeated Measures ANOVA.
Ans: 28
The partitioning of variance in ANOVA is the process of breaking down the total variation in data
into two parts:
Between-Group Variance: It shows how much the group means differ from each other. If it's
large, it means the groups are significantly different.
Within-Group Variance: It shows how much individual data points vary within each group. It
represents random variation within groups.
Understanding this concept helps us know if the independent variable has a significant effect on
the dependent variable and if the model fits the data well. It also helps identify sources of
variability in the data.
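In symbols, the partition is:
SS_Total = SS_Between + SS_Within
so the total sum of squares splits exactly into the between-group and within-group components described above.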
Ans: 29
In a one-way ANOVA, we can calculate the Total Sum of Squares (SST), Explained Sum of Squares (SSE), and Residual Sum of Squares (SSR) to understand the variability in the data and assess the goodness of fit of the model.
import numpy as np
import scipy.stats as stats
# Example data for three groups (replace this with your actual data)
group1 = [5, 8, 7, 6, 10]
group2 = [12, 9, 11, 13, 10]
group3 = [15, 18, 14, 17, 16]
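The calculation itself was omitted; a minimal sketch computing the three sums of squares for the example data (following this answer's naming, where SSE is the explained and SSR the residual sum of squares):
# Combine the groups and compute the grand mean
all_data = np.concatenate([group1, group2, group3])
grand_mean = np.mean(all_data)

# Total Sum of Squares (SST)
sst = np.sum((all_data - grand_mean) ** 2)

# Explained Sum of Squares (SSE): between-group variation
groups = [group1, group2, group3]
sse = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)

# Residual Sum of Squares (SSR): within-group variation
ssr = sum(np.sum((np.array(g) - np.mean(g)) ** 2) for g in groups)

print("SST:", sst, "SSE:", sse, "SSR:", ssr)  # SST = SSE + SSR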
Ans: 30
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
                   df        sum_sq       mean_sq             F    PR(>F)
Factor2           1.0  1.500000e+00  1.500000e+00  1.058824e-01  0.775769
Factor1           1.0  1.000000e+00  1.000000e+00  7.058824e-02  0.815363
Factor1:Factor2   1.0  3.155444e-30  3.155444e-30  2.227372e-31  1.000000
Residual          2.0  2.833333e+01  1.416667e+01           NaN       NaN
The ANOVA table will provide you with the following information:
• Main Effect of Factor1: This represents the significance of the effect of Factor1 on the dependent variable, considering the other variables in the model.
• Main Effect of Factor2: This represents the significance of the effect of Factor2 on the dependent variable, considering the other variables in the model.
• Interaction Effect (Factor1:Factor2): This represents whether the effect of one factor on the dependent variable depends on the level of the other factor.
Ans: 31
In a one-way ANOVA, the F-statistic is used to test whether there are significant differences
among the means of three or more groups. The p-value associated with the F-statistic indicates
the probability of observing the results (or more extreme results) under the assumption that the
null hypothesis is true.
• F-Statistic: The F-statistic of 5.23 represents the ratio of variability between the
group means to variability within the groups. A larger F-statistic indicates that the
variability between group means is relatively large compared to the variability
within each group.
• P-Value: The p-value of 0.02 indicates the probability of observing the obtained F-
statistic (or more extreme values) if the null hypothesis is true. In other words, it
represents the evidence against the null hypothesis. A p-value of 0.02 suggests that
there is a 2% chance of observing such a large F-statistic by random chance alone,
assuming that there are no true differences between the group means (i.e., the null
hypothesis is true).
Interpretation:
Since the p-value (0.02) is less than the significance level (often set at 0.05), we reject the null
hypothesis. This means that there are significant differences among the means of the groups. In
other words, the independent variable (the factor defining the groups) has a statistically
significant effect on the dependent variable.
Q32. In a repeated measures ANOVA, how
would you handle missing data, and what are
the potential consequences of using different
methods to handle missing data?
Ans: 32
Handling missing data in a repeated measures ANOVA is important because the presence of
missing values can affect the validity and reliability of the analysis. There are several methods to
handle missing data, each with its own implications and potential consequences. Some common
approaches include:
• Complete Case Analysis (Listwise Deletion): This method involves excluding any
participant with missing data from the analysis. It is the simplest approach, but it
may lead to a loss of valuable information, reduced statistical power, and potential
bias if the missingness is related to the outcome variable or other factors.
• Mean Imputation: This method replaces missing values with the mean of the
observed values for that variable. While this can preserve the sample size, it can
lead to biased estimates, underestimation of standard errors, and an artificial
increase in the apparent within-subject variability.
• Last Observation Carried Forward (LOCF): This method carries the last observed
value forward for missing data points. LOCF can introduce bias if there is a trend or
systematic change in the data over time.
The potential consequences of using different methods to handle missing data in a repeated
measures ANOVA include:
• Biased Estimates: Some methods, like mean imputation and LOCF, can introduce
bias in the estimated group means and treatment effects if the missing data
mechanism is not missing completely at random (MCAR).
• Loss of Power: Complete case analysis can result in reduced statistical power due
to the loss of participants with missing data, especially if the missingness is related
to the outcome variable.
Ans: 33
After conducting an analysis of variance (ANOVA) and obtaining a significant result indicating
that there are differences among the group means, post-hoc tests are often used to identify
which specific groups differ from each other. Post-hoc tests help to perform multiple pairwise
comparisons and control the family-wise error rate, which is the probability of making at least
one Type I error (false positive) among all the comparisons.
• Tukey's Honestly Significant Difference (HSD): Tukey's HSD test is one of the
most widely used post-hoc tests. It compares all possible pairs of group means and
determines whether their differences are significant. It is appropriate when the
sample sizes are equal across groups and when you want to control the overall Type
I error rate. This test tends to be more conservative than some other post-hoc tests.
• Fisher's Least Significant Difference (LSD): Fisher's LSD test is relatively less
conservative than Tukey's HSD, making it useful when sample sizes are equal, and
variances are equal or approximately equal. It is a good choice when there is a
specific hypothesis about which groups to compare.
Example situation:
Let's say a pharmaceutical company is testing the effectiveness of four different drugs (A, B, C,
and D) in reducing blood pressure. They randomly assign 100 hypertensive patients into four
groups, each receiving one of the four drugs. After the treatment period, they measure the
average reduction in blood pressure for each group and run an ANOVA to determine if there are
any significant differences among the drugs.
The ANOVA results show a significant difference among the group means (p < 0.05). To identify
which specific drug(s) are significantly different from others, they decide to use a post-hoc test.
Since they have equal sample sizes and want to control the overall Type I error rate, they opt for
Tukey's HSD test.
Tukey's HSD test reveals that drug A and drug B show no significant difference in their
effects on blood pressure. However, both drug A and drug B produce significantly different
results compared to drug C and drug D. Drug C and drug D also show no significant difference
between them.
With this analysis, the pharmaceutical company can now confidently identify which drugs have a
meaningful impact on reducing blood pressure and make informed decisions regarding further
development and marketing strategies.
Q34. A researcher wants to compare the mean
weight loss of three diets: A, B, and C. They
collect data from 50 participants who were
randomly assigned to one of the diets. Conduct
a one-way ANOVA using Python to determine if
there are any significant differences between
the mean weight loss of the three diets. Report
the F-statistic and p-value, and interpret the
results.
Ans: 34
import numpy as np
from scipy.stats import f_oneway
diet_B = [3.8, 3.5, 4.2, 4.0, 4.1, 3.9, 4.3, 3.6, 4.0, 4.1,
3.7, 3.9, 3.8, 4.2, 3.6, 3.9, 4.0, 4.1, 4.3, 4.2,
3.7, 3.8, 4.2, 3.6, 3.9, 4.0, 4.1, 4.3, 4.2, 3.7,
3.8, 3.6, 3.9, 3.8, 4.2, 3.6, 4.0, 4.1, 3.7, 3.9,
3.8, 4.2, 3.6, 3.9, 4.0, 4.1, 4.3, 4.2]
diet_C = [1.9, 1.8, 1.7, 1.5, 1.9, 2.0, 1.6, 1.7, 1.8, 2.1,
1.5, 1.6, 2.0, 1.7, 1.8, 1.9, 1.6, 1.5, 2.0, 1.7,
1.9, 1.8, 1.5, 1.7, 1.6, 1.9, 2.0, 1.8, 2.1, 1.5,
1.6, 1.7, 1.8, 1.9, 2.0, 1.7, 1.5, 2.0, 1.8, 1.6,
1.9, 1.8, 1.7, 1.5, 1.6, 1.7, 1.8]
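The listing omits diet_A's 50 values and the test call itself; assuming diet_A is defined analogously to the other two groups, the ANOVA runs as follows (the printed output below is from the original data):
# One-way ANOVA across the three diets (diet_A holds the omitted Group A values)
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)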
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 755.8582766092261
p-value: 4.334728192105363e-76
The F-statistic is the test statistic of the ANOVA. It measures the variability between group
means relative to the variability within groups. The larger the F-statistic, the more evidence
there is for the existence of significant differences among the group means.
The p-value represents the probability of observing such an extreme F-statistic under the
assumption that there are no significant differences among the group means. In other words, it
indicates the likelihood of obtaining the observed results due to random chance alone.
Interpretation:
If the p-value is less than the chosen significance level (e.g., 0.05), we reject the null hypothesis
and conclude that there are significant differences among the mean weight loss of the three
diets. If the p-value is greater than the significance level, we fail to reject the null hypothesis,
and we cannot conclude that there are significant differences among the mean weight loss of
the three diets.
Q35. A company wants to know if there are any
significant differences in the average time it
takes to complete a task using three different
software programs: Program A, Program B, and
Program C. They randomly assign 30
employees to one of the programs and record
the time it takes each employee to complete the
task. Conduct a two-way ANOVA using Python
to determine if there are any main effects or
interaction effects between the software
programs and employee experience level
(novice vs. experienced). Report the F-statistics
and p-values, and interpret the results.
Ans: 35
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Sample data
data = pd.DataFrame({
    "Time": [12.3, 13.1, 11.5, 10.8, 13.5, 12.9, 9.7, 11.2, 10.4, 11.6,
             14.2, 13.9, 15.1, 14.8, 15.9, 16.3, 17.0, 16.2, 13.8, 14.4,
             11.9, 11.6, 10.5, 9.8, 10.1, 12.7, 12.4, 14.5, 13.7, 15.2],
    "Program": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
                "C", "C", "C", "C", "C", "C", "A", "A", "A", "A",
                "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"],
    "Experience": ["Novice", "Novice", "Experienced", "Experienced", "Novice", "Novice",
                   "Experienced", "Experienced", "Novice", "Experienced",
                   "Novice", "Novice", "Experienced", "Experienced", "Experienced",
                   "Experienced", "Experienced", "Experienced", "Experienced", "Experienced",
                   "Novice", "Novice", "Novice", "Experienced", "Experienced", "Experienced",
                   "Novice", "Experienced", "Experienced", "Experienced"]
})
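The model-fitting step was omitted from the listing; a minimal sketch using statsmodels' formula API with an interaction term (this is what produces the F-statistics and p-values discussed below):
# Fit a two-way ANOVA model with interaction
model = ols("Time ~ C(Program) * C(Experience)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)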
• The F-statistic for the main effect of software program (F_program) tests whether
there are significant differences in the average time it takes to complete the task
across the three different software programs.
• The p-value for the main effect of software program (p_program) represents the
probability of observing such an extreme F-statistic under the assumption that there
are no significant differences among the programs.
• If p_program is less than the chosen significance level (e.g., 0.05), we reject the null
hypothesis and conclude that there are significant differences in the average time
across the software programs.
• The F-statistic for the main effect of employee experience (F_experience) tests
whether there are significant differences in the average time it takes to complete the
task between novice and experienced employees.
• The p-value for the main effect of employee experience (p_experience) represents
the probability of observing such an extreme F-statistic under the assumption that
there are no significant differences between novice and experienced employees.
• If p_experience is less than the chosen significance level (e.g., 0.05), we reject the
null hypothesis and conclude that there are significant differences in the average
time between novice and experienced employees.
• The F-statistic for the interaction effect (F_interaction) tests whether there is a
significant interaction between the software program used and the employee
experience level. This means that the effect of one variable depends on the levels of
the other variable.
• The p-value for the interaction effect (p_interaction) represents the probability of
observing such an extreme F-statistic under the assumption that there is no
interaction between software program and employee experience.
• If p_interaction is less than the chosen significance level (e.g., 0.05), we reject the
null hypothesis and conclude that there is a significant interaction effect.
Q36. An educational researcher is interested in
whether a new teaching method improves
student test scores. They randomly assign 100
students to either the control group (traditional
teaching method) or the experimental group
(new teaching method) and administer a test at
the end of the semester. Conduct a two-sample
t-test using Python to determine if there are any
significant differences in test scores between
the two groups. If the results are significant,
follow up with a post-hoc test to determine
which group(s) differ significantly from each
other.
Ans: 36
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import MultiComparison
experimental_scores = [80, 82, 79, 85, 81, 83, 78, 82, 79, 84,
82, 80, 83, 79, 80, 81, 84, 82, 79, 81]
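The control-group scores and the test call were omitted from the listing; assuming control_scores holds the 20 control-group values, the test is run as (the output below is from the original data):
# Two-sample t-test (control_scores defined as in the original study)
t_stat, p_value = ttest_ind(control_scores, experimental_scores)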
print("T-statistic:", t_stat)
print("p-value:", p_value)
T-statistic: -14.393098704677211
p-value: 5.755655840190321e-17
There is a significant difference in test scores between the control
and experimental groups.
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=========================================================
group1 group2 meandiff p-adj lower upper reject
---------------------------------------------------------
Control Experimental 10.4 0.0 8.9372 11.8628 True
---------------------------------------------------------
Q37. A researcher wants to know if there are
any significant differences in the average daily
sales of three retail stores: Store A, Store B, and
Store C. They randomly select 30 days and
record the sales for each store on those days.
Conduct a repeated measures ANOVA using
Python to determine if there are any significant
differences in sales between the three stores. If
the results are significant, follow up with a
post- hoc test to determine which store(s) differ
significantly from each other.
Ans: 37
For the given scenario, we should use a one-way ANOVA for independent samples to compare
the average daily sales of the three retail stores. If the results are significant, we can follow up
with a post-hoc test (e.g., Tukey's HSD) to identify which store(s) have significantly different
sales.
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import MultiComparison
store_B_sales = [1300, 1250, 1350, 1200, 1100, 1400, 1150, 1300, 1250, 1350,
                 1200, 1100, 1400, 1150, 1300, 1250, 1350, 1200, 1100, 1400,
                 1150, 1300, 1250, 1350, 1200, 1100, 1400, 1150, 1300, 1250]

store_C_sales = [950, 1000, 900, 1050, 1100, 950, 1000, 900, 1050, 1100,
                 950, 1000, 900, 1050, 1100, 950, 1000, 900, 1050, 1100,
                 950, 1000, 900, 1050, 1100, 950, 1000, 900, 1050, 1100]
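The listing omits store_A_sales and the test call; assuming store_A_sales holds the 30 recorded values for Store A, the ANOVA is run as (the output below is from the original data):
# One-way ANOVA across the three stores (store_A_sales holds the omitted Store A values)
f_statistic, p_value = f_oneway(store_A_sales, store_B_sales, store_C_sales)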
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 67.72352941176479
p-value: 1.839617085900212e-18
There is a significant difference in average daily sales between the
three stores.
Multiple Comparison of Means - Tukey HSD, FWER=0.05
==========================================================
group1 group2 meandiff p-adj lower upper reject
----------------------------------------------------------
Store A Store B 206.6667 0.0 151.672 261.6613 True
Store A Store C -45.0 0.1307 -99.9946 9.9946 False
Store B Store C -251.6667 0.0 -306.6613 -196.672 True
----------------------------------------------------------
Ans: 38
import numpy as np
from scipy.stats import f_oneway

def variance_ratio_test(arr1, arr2):
    """
    Perform a variance ratio test (F-test) on two arrays of data.

    Parameters:
    arr1 (array-like): First array of data.
    arr2 (array-like): Second array of data.

    Returns:
    F_value (float): The F-value from the variance ratio test.
    p_value (float): The corresponding p-value for the test.
    """
    # Convert the input arrays to numpy arrays
    arr1 = np.array(arr1)
    arr2 = np.array(arr2)
    # Compute the F-value and p-value (via scipy's one-way ANOVA F-test,
    # which is what the original listing used)
    F_value, p_value = f_oneway(arr1, arr2)
    return F_value, p_value
# Example usage:
data1 = [10, 15, 12, 18, 20]
data2 = [5, 8, 11, 14, 16]
F_value, p_value = variance_ratio_test(data1, data2)
print("F-value:", F_value)
print("p-value:", p_value)
F-value: 2.4032697547683926
p-value: 0.15967812288374558
Ans: 39
from scipy.stats import f

def critical_f_value(alpha, df_num, df_den):
    """
    Return the critical F-value for a two-tailed test.

    Parameters:
    alpha (float): The significance level (e.g., 0.05 for 5% significance).
    df_num (int): Degrees of freedom for the numerator.
    df_den (int): Degrees of freedom for the denominator.

    Returns:
    crit_f_value (float): The critical F-value for the two-tailed test.
    """
    # Calculate the critical F-value for a two-tailed test
    crit_f_value = f.ppf(1 - alpha / 2, df_num, df_den)
    return crit_f_value
# Example usage:
significance_level = 0.05
degrees_of_freedom_num = 3
degrees_of_freedom_den = 20
crit_f_value = critical_f_value(significance_level, degrees_of_freedom_num, degrees_of_freedom_den)
print("Critical F-value:", crit_f_value)
Ans: 40
def variance_ratio_test(arr1, arr2):
    """
    Parameters:
    arr1 (array-like): First array of data.
    arr2 (array-like): Second array of data.

    Returns:
    F_value (float): The F-value from the variance ratio test.
    df_num (int): Degrees of freedom for the numerator.
    df_den (int): Degrees of freedom for the denominator.
    p_value (float): The corresponding p-value for the test.
    """
    # Perform the variance ratio test (F-test)
    F_value, p_value = f_oneway(arr1, arr2)
    df_num = len(arr1) - 1
    df_den = len(arr2) - 1
    return F_value, df_num, df_den, p_value

def main():
    # Set seed for reproducibility
    np.random.seed(42)
    # The original omits the sample-generation code; two normal samples
    # of size 50 are assumed here for illustration
    sample1 = np.random.normal(loc=0, scale=np.sqrt(10), size=50)
    sample2 = np.random.normal(loc=0, scale=np.sqrt(15), size=50)
    F_value, df_num, df_den, p_value = variance_ratio_test(sample1, sample2)
    print("F-value:", F_value)
    print("Degrees of freedom (numerator):", df_num)
    print("Degrees of freedom (denominator):", df_den)
    print("p-value:", p_value)

if __name__ == "__main__":
    main()
F-value: 1.5143904526080045
Degrees of freedom (numerator): 49
Degrees of freedom (denominator): 49
p-value: 0.22141591563741264
Q41.The variances of two populations are
known to be 10 and 15. A sample of 12
observations is taken from each population.
Conduct an F-test at the 5% significance level
to determine if the variances are significantly
different.
Ans: 41
# Reusing the variance_ratio_test() function defined in Ans: 40 above.

def main():
    # Set seed for reproducibility
    np.random.seed(42)
    # Known population variances and sample size from the question
    variance1 = 10
    variance2 = 15
    sample_size = 12
    # The original omits the sample-generation code; normal samples with
    # the stated variances are assumed here for illustration
    sample1 = np.random.normal(loc=0, scale=np.sqrt(variance1), size=sample_size)
    sample2 = np.random.normal(loc=0, scale=np.sqrt(variance2), size=sample_size)
    F_value, df_num, df_den, p_value = variance_ratio_test(sample1, sample2)
    print("F-value:", F_value)
    print("Degrees of freedom (numerator):", df_num)
    print("Degrees of freedom (denominator):", df_den)
    print("p-value:", p_value)

if __name__ == "__main__":
    main()
F-value: 6.08206374265242
Degrees of freedom (numerator): 11
Degrees of freedom (denominator): 11
p-value: 0.021924080753184683
Ans: 42
# Reusing the variance_ratio_test() function defined in Ans: 40 above.

def main():
    # Set seed for reproducibility
    np.random.seed(42)
    # Sample size
    sample_size = 25
    # The original omits the code that defines `sample`; it is assumed
    # to hold the 25 observations drawn for the test
    # Sample variance
    sample_variance = np.var(sample, ddof=1)  # ddof=1 for unbiased sample variance
    # The remaining test code was omitted from the original listing

if __name__ == "__main__":
    main()
Ans: 43
def f_distribution_mean_variance(df_num, df_den):
    """
    Parameters:
    df_num (int): Degrees of freedom for the numerator.
    df_den (int): Degrees of freedom for the denominator.

    Returns:
    mean (float): The mean of the F-distribution.
    variance (float): The variance of the F-distribution.
    """
    # Calculate the mean and variance of the F-distribution
    # (the mean requires df_den > 2 and the variance df_den > 4)
    mean = df_den / (df_den - 2)
    variance = (2 * df_den**2 * (df_num + df_den - 2)) / (df_num * (df_den - 2)**2 * (df_den - 4))
    return mean, variance
# Example usage:
degrees_of_freedom_num = 3
degrees_of_freedom_den = 20
mean, variance = f_distribution_mean_variance(degrees_of_freedom_num, degrees_of_freedom_den)
print("Mean of F-distribution:", mean)
print("Variance of F-distribution:", variance)
Ans: 44
# Reusing the variance_ratio_test() function defined in Ans: 40 above.

def main():
    # Set seed for reproducibility
    np.random.seed(42)
    # Sample variances of two populations
    sample_variance1 = 25
    sample_variance2 = 20
    # Sample sizes
    sample_size1 = 10
    sample_size2 = 15
    # The original omits the sample-generation code; normal samples with
    # these variances are assumed here for illustration
    sample1 = np.random.normal(loc=0, scale=np.sqrt(sample_variance1), size=sample_size1)
    sample2 = np.random.normal(loc=0, scale=np.sqrt(sample_variance2), size=sample_size2)
    F_value, df_num, df_den, p_value = variance_ratio_test(sample1, sample2)
    print("Sample Variance 1:", sample_variance1)
    print("Sample Variance 2:", sample_variance2)
    print("F-value:", F_value)
    print("Degrees of freedom (numerator):", df_num)
    print("Degrees of freedom (denominator):", df_den)
    print("p-value:", p_value)

if __name__ == "__main__":
    main()
Sample Variance 1: 25
Sample Variance 2: 20
F-value: 9.385483069468613
Degrees of freedom (numerator): 9
Degrees of freedom (denominator): 14
p-value: 0.005501846884736198
Q45. The following data represent the waiting
times in minutes at two different restaurants on
a Saturday night: Restaurant A: 24, 25, 28, 23,
22, 20, 27; Restaurant B: 31, 33, 35, 30, 32, 36.
Conduct an F-test at the 5% significance level
to determine if the variances are significantly
different.
Ans: 45
# Reusing the variance_ratio_test() function defined in Ans: 40 above.

def main():
    # Waiting times (minutes) from the question
    waiting_times_restaurant_a = [24, 25, 28, 23, 22, 20, 27]
    waiting_times_restaurant_b = [31, 33, 35, 30, 32, 36]
    # Sample variances
    sample_variance_a = np.var(waiting_times_restaurant_a, ddof=1)  # ddof=1 for unbiased sample variance
    sample_variance_b = np.var(waiting_times_restaurant_b, ddof=1)
    F_value, df_num, df_den, p_value = variance_ratio_test(waiting_times_restaurant_a, waiting_times_restaurant_b)
    print("Sample variance A:", sample_variance_a)
    print("Sample variance B:", sample_variance_b)
    print("F-value:", F_value)
    print("p-value:", p_value)

if __name__ == "__main__":
    main()
Ans: 46
# Reusing the variance_ratio_test() function defined in Ans: 40 above.

def main():
    # The original listing omits the group score data; group_a_scores and
    # group_b_scores are assumed to hold the two groups' test scores
    # Sample variances
    sample_variance_a = np.var(group_a_scores, ddof=1)  # ddof=1 for unbiased sample variance
    sample_variance_b = np.var(group_b_scores, ddof=1)
    F_value, df_num, df_den, p_value = variance_ratio_test(group_a_scores, group_b_scores)
    print("Sample variance A:", sample_variance_a)
    print("Sample variance B:", sample_variance_b)
    print("F-value:", F_value)
    print("p-value:", p_value)

if __name__ == "__main__":
    main()