Statistics is a branch of mathematics that focuses on the collection, organization, analysis, and interpretation of data. It helps in understanding patterns, drawing conclusions and making informed decisions across various fields such as business, healthcare, economics and research.
1. Define Mean, Median and Mode with examples.
- Mean: The average of numbers. Example: (2+4+6)/3 = 4.
- Median: The middle value when numbers are ordered. Example: In [1, 3, 5], median = 3.
- Mode: The most frequently occurring value. Example: In [2, 2, 3, 4], mode = 2.
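The three examples above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen only for illustration.

```python
import statistics

# Same numbers as in the three examples above.
mean_value = statistics.mean([2, 4, 6])       # (2 + 4 + 6) / 3
median_value = statistics.median([1, 3, 5])   # middle value of the sorted data
mode_value = statistics.mode([2, 2, 3, 4])    # most frequent value

print(mean_value, median_value, mode_value)
```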
2. What is a Range in Statistics?
The range is the difference between the maximum and minimum values in a dataset. For example, in [4, 6, 9, 15], range = 15 - 4 = 11. It gives an idea of data spread but can be affected by outliers.
3. What is Variance?
Variance measures how far the data points are spread out from the mean. It is calculated as the average of squared deviations from the mean. A high variance means data points are widely spread.
4. What is Standard Deviation?
Standard deviation is the square root of variance, showing roughly how far values typically lie from the mean. Unlike variance, it is in the same unit as the data. Example: If test scores have low standard deviation, most students scored close to the average.
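As a rough illustration, the range, variance and standard deviation of the dataset [4, 6, 9, 15] from the range example can be computed as follows. Note the assumption made explicit here: `pvariance`/`pstdev` divide by n (population formulas), while `variance` divides by n − 1 (sample formula).

```python
import statistics

data = [4, 6, 9, 15]

data_range = max(data) - min(data)           # 15 - 4
population_var = statistics.pvariance(data)  # mean of squared deviations (divide by n)
sample_var = statistics.variance(data)       # divide by n - 1 instead
population_sd = statistics.pstdev(data)      # square root of the population variance

print(data_range, population_var, sample_var, population_sd)
```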
5. What are Outliers and how are they detected?
Outliers are extreme values that differ significantly from the rest of the data. They can be detected using statistical methods like the IQR rule (values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where IQR = Q3 − Q1) or the standard deviation (for example, values more than 3 standard deviations from the mean). Example: In [5, 6, 7, 50], the value 50 is an outlier.
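A minimal sketch of the IQR rule is shown below. The helper name `iqr_outliers` is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: with very small datasets such as the one in the example, different quartile conventions can give different fences.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: quartiles use the 'inclusive' interpolation method.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

outliers = iqr_outliers([5, 6, 7, 50])
print(outliers)
```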
6. Difference between Descriptive and Inferential Statistics.
- Descriptive statistics summarize and describe data (e.g., mean, median, charts).
- Inferential statistics use sample data to make predictions or generalizations about a population (e.g., hypothesis testing, confidence intervals).
7. Define Independent and Dependent Events in Probability.
- Independent events: The outcome of one event does not affect the other (e.g., coin toss and rolling a die).
- Dependent events: The outcome of one event affects the other (e.g., drawing cards from a deck without replacement).
8. What is Correlation?
Correlation measures the strength and direction of the linear relationship between two variables, with a coefficient ranging from -1 to +1. A positive correlation means that as one variable increases, the other tends to increase (e.g., height and weight). A negative correlation means that as one increases, the other tends to decrease (e.g., exercise time and weight). A value near 0 indicates little or no linear relationship.
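The Pearson correlation coefficient can be sketched in a few lines. The height and weight numbers below are made up purely for illustration, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

heights = [150, 160, 170, 180, 190]   # made-up data (cm)
weights = [52, 58, 69, 77, 85]        # made-up data (kg)

r_positive = pearson_r(heights, weights)
r_negative = pearson_r(heights, [-w for w in weights])  # flipping the sign flips r
print(r_positive, r_negative)
```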
9. What is the difference between Population and Sample?
A population is the complete set of items or individuals we want to study (e.g., all students in a school). A sample is a smaller subset taken from the population for analysis (e.g., 50 students chosen randomly). Samples are used because studying the entire population is often impractical.
10. What is a Normal Distribution?
A normal distribution is a bell-shaped curve where most data points lie near the mean. It is symmetric, with mean = median = mode. Many natural phenomena like heights and exam scores approximately follow this distribution.
11. What is a Uniform Distribution?
In uniform distribution, all outcomes are equally likely. Example: Rolling a fair die gives equal probability (1/6) for each side. Unlike normal distribution, it has no peak.
12. What is Skewness in data?
Skewness measures the asymmetry of data distribution.
- Positive skew: Tail is longer on the right (e.g., income distribution).
- Negative skew: Tail is longer on the left.
- A skewness of 0 means perfectly symmetric distribution.
13. What is Kurtosis?
Kurtosis measures the "tailedness" of a distribution: how heavy or light the tails are compared to a normal distribution. High kurtosis means more extreme values (outliers). Low kurtosis means fewer outliers.
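Both skewness and kurtosis can be computed from central moments. The sketch below uses the simple population (moment-based) formulas, which is an assumption; statistical packages often apply small-sample corrections. It reports excess kurtosis, i.e. kurtosis minus 3, so that a normal distribution scores 0.

```python
def central_moment(data, k):
    """k-th central moment: mean of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Subtract 3 so that a normal distribution has excess kurtosis 0.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]   # made-up data with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```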
14. What is a Frequency Distribution?
Frequency distribution is a table or graph showing how often each value occurs in a dataset. Example: In test scores, 5 students scored 50, 10 students scored 60, etc. It helps visualize patterns in data.
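A frequency table like the one described can be built with `collections.Counter`. The scores below follow the example (five 50s, ten 60s); the three 70s are an added assumption to fill out the data.

```python
from collections import Counter

# 5 students scored 50, 10 scored 60; the 70s are hypothetical extra data.
scores = [50] * 5 + [60] * 10 + [70] * 3

frequency = Counter(scores)
for score, count in sorted(frequency.items()):
    print(score, count, "#" * count)   # crude text histogram
```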
15. Define Percentiles and Quartiles.
- Percentiles: Divide data into 100 equal parts. Example: 90th percentile means a value greater than 90% of data.
- Quartiles: Divide data into 4 equal parts (Q1 = 25%, Q2 = median, Q3 = 75%).
16. What is a Histogram?
A histogram is a bar graph showing the frequency of data within intervals (bins). Example: Heights of students grouped in 5 cm ranges. It is useful for visualizing data distribution.
17. What is a Boxplot used for?
A boxplot (or box-and-whisker plot) shows the distribution of data using the minimum, Q1, median, Q3 and maximum values. It also highlights outliers as individual points beyond the whiskers. Boxplots are useful for comparing distributions between groups.
18. What is the Central Limit Theorem (CLT)?
The CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the observations are independent and the population has finite variance. This is important for hypothesis testing and confidence intervals.
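A small simulation can make the CLT concrete. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (a strongly skewed distribution chosen only for illustration) and compares sample means for sample sizes 2 and 50. Theory suggests the spread of the sample means should shrink roughly like σ/√n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples=2000):
    """Means of n_samples samples drawn from an exponential(1) population."""
    means = []
    for _ in range(n_samples):
        sample = [random.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

means_small = sample_means(2)
means_large = sample_means(50)

# The centre stays near the population mean; the spread shrinks as n grows.
print(statistics.fmean(means_large))
print(statistics.stdev(means_small), statistics.stdev(means_large))
```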
19. What is a Time Series in Statistics?
A time series is a sequence of data points collected over time, usually at regular intervals. Example: Daily stock prices or monthly rainfall data. Time series analysis helps identify trends and seasonality.
20. What is a Sampling Bias?
Sampling bias occurs when the selected sample is not representative of the entire population. Example: Surveying only city residents to estimate national opinions. This leads to inaccurate results.
21. What is the usage of Box plot in Statistical Analysis?
Box plots are a standard visualization in statistics. They summarize the measure of center, measure of dispersion and measure of position of a data distribution. Box plots help with:
- Finding the median
- Detecting outliers
- Identifying data skewness
- Visualizing the data distribution
22. What is Hypothesis Testing?
Hypothesis testing is a statistical process used to make decisions or inferences about a population based on sample data. It evaluates whether there is enough evidence to reject a claim about the population.
Steps in hypothesis testing:
- Formulate hypotheses (null and alternative).
- Choose a significance level (α, often 0.05).
- Collect and analyze sample data.
- Compute a test statistic (like z, t or chi-square).
- Compare with critical value or p-value to decide.
Example: A company claims the average battery life of its phone is 12 hours. We collect a sample, test it and determine whether to reject or fail to reject this claim.
23. What is Null and Alternative Hypothesis?
Null Hypothesis (H0): The assumption that there is no effect or difference. It represents the "status quo."
Alternative Hypothesis (H1): The statement that contradicts the null hypothesis, suggesting there is an effect or difference.
Example:
- H0: The average exam score of students = 70.
- H1: The average exam score of students ≠ 70.
The test helps decide whether sample evidence supports rejecting H0 in favor of H1.
24. What is a p-value?
The p-value is the probability of observing the sample results (or something more extreme) assuming the null hypothesis is true. It measures the strength of evidence against H0.
- A low p-value (below the chosen α, commonly 0.05) means strong evidence against H0 -> reject it.
- A high p-value (above α) means weak evidence -> fail to reject H0.
- Example: If a drug trial shows a p-value of 0.01, it means there’s only a 1% chance of seeing results at least as extreme as those observed if the drug had no effect.
25. What is a Confidence Interval?
A confidence interval (CI) is a range of values, derived from sample data, that is likely to contain the true population parameter.
- Example: If we calculate a 95% CI for average weight as [60, 65], we are 95% confident the true mean weight of the population lies between 60 and 65, in the sense that intervals built this way capture the true mean in about 95% of repeated samples.
- Wider intervals = more uncertainty.
- CI depends on sample size (larger samples give narrower CIs) and variability.
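A confidence interval for a mean can be sketched as mean ± z × (s / √n). The version below uses a normal critical value, which is an assumption that is reasonable for larger samples; for small samples a t critical value would normally be used instead. The weights are made-up data and `mean_confidence_interval` is a hypothetical helper.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation CI for the population mean."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * stdev(sample) / math.sqrt(n)        # z times the standard error
    centre = mean(sample)
    return centre - margin, centre + margin

weights = [61.2, 63.5, 62.1, 64.0, 60.8, 65.3, 62.7, 63.1, 61.9, 64.4]  # made-up
low95, high95 = mean_confidence_interval(weights, 0.95)
low99, high99 = mean_confidence_interval(weights, 0.99)
print((low95, high95), (low99, high99))
```

Asking for more confidence (99% instead of 95%) widens the interval, which matches the point above that wider intervals reflect more uncertainty.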
26. One-tailed vs Two-tailed Test?
- One-tailed test: Looks for an effect in only one direction.
Example: Testing if a new medicine increases life expectancy (only “greater than” is tested).
- Two-tailed test: Looks for an effect in both directions.
Example: Testing if a medicine changes life expectancy (could be “greater” or “less”).
The choice depends on the research question.
27. What is a Z-test?
A z-test is used to determine if there is a significant difference between sample and population means when:
- The population variance is known.
- The sample size is large (n > 30).
- Example: A factory claims the average weight of a packet is 1 kg with σ = 0.1. From 50 packets, we test if the average weight significantly differs from 1 kg using a z-test.
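The packet-weight example can be worked through as a two-sided z-test. The claimed mean (1 kg), σ = 0.1 and n = 50 come from the example; the observed sample mean of 1.03 kg is a hypothetical value added for illustration.

```python
import math
from statistics import NormalDist

def z_test_two_sided(sample_mean, mu0, sigma, n):
    """Return the z statistic and the two-sided p-value."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# sample_mean=1.03 is an assumed observation, not part of the original example.
z_stat, p_value = z_test_two_sided(sample_mean=1.03, mu0=1.0, sigma=0.1, n=50)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(z_stat, p_value, decision)
```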
28. What is a T-test?
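A t-test is used to compare means when the population variance is unknown and must be estimated from the sample. It matters most when the sample size is small (n < 30); for large samples its results are close to those of a z-test.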
Types:
- One-sample t-test: Compare sample mean vs population mean.
- Independent two-sample t-test: Compare means of two independent groups.
- Paired t-test: Compare means before and after treatment on the same group.
- Example: Comparing average marks of students in two classes with small sample sizes.
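As a sketch of the simplest case, the one-sample t statistic is t = (x̄ − μ0) / (s / √n) with n − 1 degrees of freedom. The scores and the hypothesized mean of 70 below are made up. The standard library has no t distribution, so the statistic is compared against the tabulated two-sided 5% critical value for 4 degrees of freedom (2.776).

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the t statistic and its degrees of freedom."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.mean(sample) - mu0) / standard_error
    return t, n - 1

scores = [72, 75, 68, 80, 77]            # hypothetical exam scores
t_stat, df = one_sample_t(scores, mu0=70)

T_CRITICAL_DF4 = 2.776                   # table value: two-sided, alpha = 0.05, df = 4
reject_h0 = abs(t_stat) > T_CRITICAL_DF4
print(t_stat, df, reject_h0)
```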
29. What is ANOVA?
Analysis of Variance (ANOVA) is used to compare means of three or more groups. It tests whether at least one group mean is significantly different. Example: A teacher wants to compare average test scores of students from three different teaching methods. ANOVA tells if there’s a significant difference among groups. If ANOVA shows significance, post-hoc tests (like Tukey’s test) identify which groups differ.
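The F statistic behind one-way ANOVA is the ratio of between-group to within-group mean squares. The sketch below computes it by hand for three small made-up groups of test scores; obtaining a p-value would additionally require the F distribution, which is not covered here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up scores for three teaching methods.
methods = [[71, 72, 73], [72, 73, 74], [75, 76, 77]]
f_stat, df_between, df_within = one_way_anova_f(methods)
print(f_stat, df_between, df_within)
```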
30. What is a Chi-Square Test?
The chi-square test is used for categorical data to examine whether there is an association between two variables or whether observed data fits expected data.
- Example 1 (Goodness of Fit): Checking if dice are fair by comparing observed vs expected frequencies.
- Example 2 (Independence): Checking if gender is related to choice of shopping preference.
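For the goodness-of-fit case, the chi-square statistic is Σ (observed − expected)² / expected. The die-roll counts below are hypothetical (60 rolls), and the statistic is compared with the tabulated 5% critical value for 5 degrees of freedom (11.07).

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [8, 9, 12, 11, 10, 10]   # hypothetical counts for faces 1..6
expected = [10] * 6                 # a fair die over 60 rolls

chi2 = chi_square_statistic(observed, expected)

CHI2_CRITICAL_DF5 = 11.07           # table value: alpha = 0.05, df = 6 - 1
looks_unfair = chi2 > CHI2_CRITICAL_DF5
print(chi2, looks_unfair)
```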
31. What are Sampling Techniques?
Sampling techniques decide how we select a subset of data from a population.
- Simple Random Sampling: Each item has an equal chance of being selected.
- Stratified Sampling: Divide population into strata (e.g., age groups), then sample from each.
- Cluster Sampling: Divide into clusters (e.g., cities) and randomly select clusters.
- Systematic Sampling: Select every k-th element from a list.
- Example: For a college survey, stratified sampling ensures fair representation from each department.
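The four techniques can be sketched with the standard `random` module. The population of 100 numbered students, the two departments and the cluster size are all assumptions made for illustration.

```python
import random

rng = random.Random(42)             # seeded generator for repeatable output
population = list(range(1, 101))    # hypothetical student IDs 1..100

# Simple random sampling: every student is equally likely to be picked.
simple_random = rng.sample(population, 10)

# Systematic sampling: random start, then every k-th student.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample 10% from each (hypothetical) department.
strata = {"science": population[:60], "arts": population[60:]}
stratified = {name: rng.sample(members, len(members) // 10)
              for name, members in strata.items()}

# Cluster sampling: split into clusters of 20 and keep two whole clusters.
clusters = [population[i:i + 20] for i in range(0, len(population), 20)]
cluster_sample = [x for cluster in rng.sample(clusters, 2) for x in cluster]

print(simple_random, systematic, stratified, len(cluster_sample))
```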
32. Parametric vs Non-Parametric Tests?
- Parametric Tests: Assume data follows a known distribution (usually normal). Examples: t-test, z-test, ANOVA.
- Non-Parametric Tests: Do not assume a specific distribution, useful for ordinal or non-normal data. Examples: Mann-Whitney U test, Kruskal-Wallis test.
- Example: If exam scores are normally distributed -> use a t-test. If the data is skewed or consists of ranks -> use the Mann-Whitney U test.
33. What is a Type I Error?
Type I Error happens when we reject a true null hypothesis (false positive).
- Probability of making Type I error = α (significance level).
- Example: Declaring a medicine effective when it is actually not.
34. What is a Type II Error?
Type II Error occurs when we fail to reject a false null hypothesis (false negative).
- Probability of making Type II error = β.
- The Power of a test is 1 − β, which is the probability of correctly rejecting a false H0.
- Example: Failing to detect that a medicine works when it actually does.
35. What are Degrees of Freedom?
- Degrees of Freedom (df) represent the number of independent values that can vary in a statistical calculation.
- Formula for the sample variance of n observations: df = n − 1
- Example: If we have 5 numbers whose mean is fixed, only 4 can vary freely because the 5th is determined by the mean.
36. What is Maximum Likelihood Estimation (MLE)?
MLE is a method of estimating parameters of a statistical model by maximizing the likelihood function. It chooses parameter values that make the observed data most probable.
Formula: L(θ) = Π f(xᵢ | θ), the product taken over all observations xᵢ
MLE: θ_MLE = arg max_θ L(θ)
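A minimal sketch of MLE for a Bernoulli (coin-flip) model follows. The data are made up, and a coarse grid search stands in for a proper optimizer. The log-likelihood is maximized instead of the likelihood, which gives the same answer because the logarithm is increasing. For this model the MLE is known to equal the sample proportion, which the grid search should recover.

```python
import math

def bernoulli_log_likelihood(p, data):
    """log L(p) = sum of log f(x_i | p) for 0/1 observations."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # made-up data: 7 successes in 10 trials

# Grid search over candidate values of p in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))

print(p_hat, sum(data) / len(data))
```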
37. What is the difference between Bayesian and Frequentist Statistics?
- Frequentist: Probability is the long-run frequency of events. Parameters are fixed, unknown constants.
- Bayesian: Probability represents belief. Parameters are treated as random variables with prior distributions, updated using Bayes’ theorem.
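The contrast can be illustrated with a coin whose success probability is unknown. The sketch assumes 7 successes in 10 trials (made-up data) and a uniform Beta(1, 1) prior. With a Beta prior and binomial data, Bayes' theorem gives a Beta posterior whose parameters are the prior parameters plus the observed counts.

```python
def beta_posterior(prior_a, prior_b, successes, failures):
    """Beta prior + binomial data -> Beta posterior (conjugate update)."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    return a / (a + b)

successes, failures = 7, 3   # hypothetical observations

# Frequentist point estimate: the observed proportion.
frequentist_estimate = successes / (successes + failures)

# Bayesian estimate: posterior mean under a uniform Beta(1, 1) prior.
post_a, post_b = beta_posterior(1, 1, successes, failures)
bayesian_estimate = beta_mean(post_a, post_b)

print(frequentist_estimate, (post_a, post_b), bayesian_estimate)
```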
38. What is Multicollinearity in Regression?
Multicollinearity occurs when independent variables in regression are highly correlated, making it hard to estimate coefficients accurately.
- Detected using Variance Inflation Factor (VIF).
- Problem: Inflated standard errors, unstable coefficients.
- Fix: Remove correlated features or use regularization (Ridge regression).
39. What is Heteroscedasticity?
Heteroscedasticity occurs when the variance of residuals in regression is not constant across values of independent variables.
- Violates the constant-variance (homoscedasticity) assumption of linear regression.
- Detected using plots (residuals vs fitted) or the Breusch-Pagan test.
- Fixed using log transformation or weighted least squares.
40. What is Autocorrelation?
Autocorrelation means residuals in regression/time series are correlated with each other.
- Detected using the Durbin-Watson test.
- Problem: Invalid standard errors, misleading significance tests.
- Fix: Use time-series models like ARIMA.
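The Durbin-Watson statistic is DW = Σ (eₜ − eₜ₋₁)² / Σ eₜ². Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation and values toward 4 suggest negative autocorrelation. The residual sequences below are made up to show the two extremes.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

runs = [1, 1, 1, -1, -1, -1]      # neighbours tend to agree: positive autocorrelation
alternating = [1, -1, 1, -1]      # neighbours flip sign: negative autocorrelation

print(durbin_watson(runs), durbin_watson(alternating))
```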
41. Explain Stationarity in Time Series.
A time series is stationary if its statistical properties (mean, variance, autocorrelation) do not change over time.
- Non-stationary series often have trends or seasonality.
- Stationarity is required for many time series models like ARIMA.
- Achieved by differencing or detrending.
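Differencing replaces each value with its change from the previous value. As a small sketch, first-order differencing of a made-up series with a linear trend leaves a constant series, so the trend in the mean is removed.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every valid t."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

trend_series = [3 * t + 5 for t in range(10)]   # hypothetical series with a linear trend
differenced = difference(trend_series)

print(trend_series)
print(differenced)
```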
42. What is the AIC and BIC criterion in model selection?
AIC (Akaike Information Criterion): balances model fit and complexity.
Formula: AIC = 2k − 2ln(L)
BIC (Bayesian Information Criterion): applies a stronger penalty for complexity than AIC once ln(n) > 2 (n ≥ 8).
Formula: BIC = k ln(n) − 2ln(L)
Where
- k = number of estimated parameters,
- L = maximized value of the likelihood,
- n = sample size.
Lower AIC/BIC = better model, when comparing models fitted to the same data.
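Both criteria are simple to compute once the log-likelihood ln(L) is known. In the sketch below the two candidate models, their parameter counts and their log-likelihoods are hypothetical numbers chosen to show the penalty at work.

```python
import math

def aic(k, log_likelihood):
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    return k * math.log(n) - 2 * log_likelihood

n = 100
# Hypothetical fits: model B fits slightly better but uses twice the parameters.
model_a = {"k": 3, "log_likelihood": -120.0}
model_b = {"k": 6, "log_likelihood": -118.0}

aic_a, aic_b = aic(model_a["k"], model_a["log_likelihood"]), aic(model_b["k"], model_b["log_likelihood"])
bic_a, bic_b = bic(model_a["k"], n, model_a["log_likelihood"]), bic(model_b["k"], n, model_b["log_likelihood"])

print(aic_a, aic_b)   # the simpler model has the lower AIC here
print(bic_a, bic_b)   # BIC penalizes the extra parameters even more
```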
43. What is Logistic Regression and when is it used?
Logistic Regression is used for binary classification. It predicts probability using the logistic (sigmoid) function.
Formula: P(Y = 1 | X) = 1 / (1 + e^(−(β₀ + β₁X)))
It outputs values between 0 and 1, interpreted as probabilities.
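The formula translates directly into code. In this sketch the coefficient values are arbitrary placeholders, not fitted values, and `predict_probability` is a hypothetical helper name.

```python
import math

def predict_probability(x, beta0, beta1):
    """P(Y = 1 | X = x) under a one-feature logistic model."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * x)))

beta0, beta1 = -1.0, 0.5   # placeholder coefficients for illustration only

for x in (-4, 0, 2, 8):
    p = predict_probability(x, beta0, beta1)
    label = 1 if p >= 0.5 else 0   # 0.5 is a common, but not mandatory, threshold
    print(x, round(p, 3), label)
```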
44. What is the ROC Curve and AUC?
- ROC Curve (Receiver Operating Characteristic): plots True Positive Rate (TPR) vs False Positive Rate (FPR) at different thresholds.
- AUC (Area Under Curve): measures classifier’s ability to distinguish between classes. AUC = 0.5 (random), AUC = 1 (perfect).
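AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting as half). The sketch below computes it that way by comparing all positive/negative pairs, which is simple but quadratic in the data size. The labels and scores are made up.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical classifier scores

print(auc_score(labels, scores))
```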
45. What is Regularization in regression?
Regularization prevents overfitting by adding a penalty term to the loss function.
- Ridge (L2): penalty = λ Σ β²
- Lasso (L1): penalty = λ Σ |β| (can shrink coefficients to zero).
- Elastic Net: combination of L1 + L2.
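The penalty terms themselves are short to write down. In this sketch the Elastic Net mixing parameter `l1_ratio` follows one common convention and is an assumption; libraries differ in how they scale and combine the two parts.

```python
def ridge_penalty(betas, lam):
    return lam * sum(b ** 2 for b in betas)      # L2: sum of squares

def lasso_penalty(betas, lam):
    return lam * sum(abs(b) for b in betas)      # L1: sum of absolute values

def elastic_net_penalty(betas, lam, l1_ratio=0.5):
    # Assumed convention: a weighted mix of the L1 and L2 sums.
    l1 = sum(abs(b) for b in betas)
    l2 = sum(b ** 2 for b in betas)
    return lam * (l1_ratio * l1 + (1 - l1_ratio) * l2)

betas = [3.0, -4.0]   # hypothetical coefficients
print(ridge_penalty(betas, 0.1), lasso_penalty(betas, 0.1), elastic_net_penalty(betas, 0.1))
```

The penalized loss is then the ordinary loss (for example, the sum of squared errors) plus one of these terms.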
46. What is Principal Component Analysis (PCA)?
PCA is a dimensionality reduction technique that transforms correlated variables into uncorrelated components (principal components).
- First component explains maximum variance.
- Achieved via eigen decomposition or singular value decomposition.
47. What is Multivariate Analysis of Variance (MANOVA)?
MANOVA extends ANOVA to multiple dependent variables. It tests whether mean differences among groups exist across multiple response variables simultaneously.
48. What is Survival Analysis?
Survival analysis studies time until an event occurs (e.g., death, failure).
- Survival function (S(t)): probability of surviving beyond time t.
- Hazard function (h(t)): risk of event at time t given survival until t.
- Common model: Cox Proportional Hazards Model.
49. What is Bootstrapping in Statistics?
Bootstrapping is a resampling method where multiple samples are drawn with replacement from the data to estimate statistics (mean, variance, CI).
- Useful when theoretical distribution is unknown.
- Provides standard errors and confidence intervals without relying on a closed-form formula.
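A percentile bootstrap confidence interval for the mean can be sketched as follows. The data, the number of resamples and the helper name `bootstrap_ci` are assumptions for illustration; other bootstrap interval types exist.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=5000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap interval for the statistic `stat`."""
    rng = random.Random(seed)
    # Resample with replacement, same size as the data, and recompute the statistic.
    estimates = sorted(stat(rng.choices(data, k=len(data)))
                       for _ in range(n_resamples))
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]

observations = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]   # made-up data
ci_low, ci_high = bootstrap_ci(observations)
print(ci_low, ci_high)
```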
50. What is the Difference Between Parametric and Non-Parametric Bootstrapping?
- Parametric Bootstrapping: assumes the data follows a specific distribution, fits its parameters to the data, and draws resamples from the fitted distribution.
- Non-Parametric Bootstrapping: does not assume any distribution; resamples are drawn directly from observed data.
51. What is meant by standardization? Why do we sometimes standardize a normal distribution?
Standardization is the process of transforming data onto a standard scale. It is done by subtracting the mean and then dividing by the standard deviation: z = (x − μ) / σ. The result is centered at 0 and has a standard deviation of 1.
A normally distributed variable is often standardized so that it follows the standard normal distribution. This is done because:
- values measured on different scales become comparable, since a z-score states how many standard deviations a data point lies from the mean
- probabilities and critical values can be read from a single standard normal table, which is what test statistics such as the z-statistic rely on
- it helps in outlier detection (for example, flagging points with a large |z|).
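Standardization is a one-line transformation. The sketch below uses the population standard deviation (an assumption; the sample version divides by n − 1 instead) and made-up data.

```python
import statistics

def standardize(data):
    """Subtract the mean, then divide by the standard deviation."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)   # population SD; an assumption of this sketch
    return [(x - mu) / sigma for x in data]

raw = [5, 6, 7, 50]                   # made-up data with one extreme value
z_scores = standardize(raw)

print(z_scores)                       # the extreme value gets the largest |z|
```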