Pre-T1 Assignment 1
Topics: Re-sampling methods: Bias–Variance Trade-off; Hypothesis Testing and Variable
Selection; Subsampling and Upsampling, SMOTE; Cross Validation (validation set, Leave-One-Out
(LOO), k-fold strategies) and the bootstrap; Evaluation measures: error functions, Confusion
Matrix, Accuracy, Precision and Recall, F1 Score.
1. Briefly differentiate between bias and variance in machine learning. How do they relate to
the model's capacity to generalize?
2. Discuss the impact of bias and variance on the performance of machine learning models,
emphasizing their role in the trade-off for supervised learning. Illustrate your explanation
with a real-world example showcasing scenarios of underfitting and overfitting.
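For reference in questions 1, 2, and 10, the standard decomposition of expected squared
prediction error is

    E[(y - \hat{f}(x))^2] = (E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma^2
                          = Bias^2 + Variance + Irreducible Error

High bias corresponds to underfitting (the model is too simple to capture the signal); high
variance corresponds to overfitting (the model tracks noise in the training set).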
3. Define subsampling, oversampling, and Synthetic Minority Over-sampling Technique
(SMOTE) in the context of addressing imbalanced datasets.
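For reference, a minimal SMOTE sketch, assuming the imbalanced-learn package (its SMOTE
class and fit_resample API) is available; the dataset below is a synthetic stand-in:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # A deliberately imbalanced toy dataset (roughly 95% / 5%).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    print("before:", Counter(y))

    # SMOTE synthesizes new minority examples by interpolating between a
    # minority point and one of its k nearest minority-class neighbours.
    X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
    print("after: ", Counter(y_res))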
4. What is cross-validation? Briefly explain Stratified k-Fold Cross-Validation and Time
Series Cross-Validation.
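A short scikit-learn sketch of both strategies (toy data; the fold counts are arbitrary
choices, not requirements):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

    X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
    model = LogisticRegression(max_iter=1000)

    # Stratified k-fold: every fold preserves the overall class ratio.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    print(cross_val_score(model, X, y, cv=skf))

    # Time Series split: training indices always precede test indices,
    # so no "future" observations leak into training.
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        print(f"train [0..{train_idx[-1]}] -> test [{test_idx[0]}..{test_idx[-1]}]")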
5. What is bootstrap resampling? Explain its purpose and briefly describe its working
mechanism.
6. Discuss the advantages of using bootstrap resampling for estimating the confidence
intervals of a machine learning model's performance metrics. Why is this technique
particularly useful for small datasets?
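A minimal NumPy sketch covering questions 5 and 6 together: resample a (hypothetical) test
set with replacement many times and read a percentile confidence interval off the bootstrap
distribution:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical per-example correctness (1 = correct) of some model
    # on a small 20-point test set.
    correct = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1])

    # Bootstrap: draw samples of the same size, with replacement, and
    # recompute the metric on each resample.
    boot = np.array([rng.choice(correct, size=correct.size, replace=True).mean()
                     for _ in range(10_000)])

    # 95% percentile confidence interval.
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"accuracy = {correct.mean():.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")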
7. What do you understand by overfitting and underfitting of a machine learning model?
Provide concise definitions with examples.
8. Define the terms: True Positive (TP), True Negative (TN), False Positive (FP), and False
Negative (FN).
9. What is a confusion matrix? Why is it used?
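For questions 8 and 9 (and the calculations in questions 14 and 15), the four counts and the
metrics derived from them fit in a few lines of plain Python:

    def metrics(tp, tn, fp, fn):
        accuracy  = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)   # of all predicted positives, how many are real
        recall    = tp / (tp + fn)   # of all actual positives, how many are found
        f1        = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f1

    # e.g. the matrix in question 14: TP = 80, TN = 170, FP = 30, FN = 20
    print(metrics(80, 170, 30, 20))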
10. A supervised learning model gives the following error components for a dataset:
Bias = 0.6, Variance = 0.4, Irreducible Error = 0.2.
What is the total error of the model? If the bias then decreases to 0.5 while the variance
increases to 0.6, calculate the new total error. How will these changes affect model
performance? Suggest approaches to reduce the overall error, considering both high
variance and high bias scenarios.
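The arithmetic in question 10 depends on a convention worth noting: in the decomposition
given after question 2, bias enters squared, but figures like these are often quoted as the
already-additive components. Both readings, as a quick check:

    # Reading 1: the figures are the additive components themselves.
    print(f"{0.6 + 0.4 + 0.2:.2f}")     # 1.20 (original)
    print(f"{0.5 + 0.6 + 0.2:.2f}")     # 1.30 (after the change)

    # Reading 2: 0.6 and 0.5 are raw bias values, squared in the decomposition.
    print(f"{0.6**2 + 0.4 + 0.2:.2f}")  # 0.96 (original)
    print(f"{0.5**2 + 0.6 + 0.2:.2f}")  # 1.05 (after the change)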
11. A healthcare organization is developing a machine learning model to predict rare diseases
using patient records. The dataset contains 500,000 patient records, but only 2,500
records are labelled as patients diagnosed with the rare disease. Explain why the
imbalance in the dataset poses a significant problem for training a machine learning model
in this rare disease prediction scenario. Describe how SMOTE can be applied to balance
the dataset, providing a step-by-step explanation.
12. A data analyst is building a predictive model for stock price movements using a dataset
with 50,000 records. The goal is to fine-tune the model’s hyper-parameters while
considering the sequential nature of the data to avoid data leakage. Recommend an
appropriate cross-validation strategy and justify your choice. Discuss the pros and cons of
using Time Series Cross-Validation for this scenario, focusing on computational efficiency,
variance of performance estimates, and its suitability for time-dependent data.
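A sketch of the setup question 12 points toward: hyper-parameter search with
forward-chaining time-series folds (the estimator, grid, and synthetic stand-in data below
are illustrative assumptions, not the only valid choices):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))        # stand-in for time-ordered features
    y = X[:, 0] + rng.normal(size=500)   # stand-in target

    # Each fold trains only on data that precedes its test window,
    # which is what prevents look-ahead leakage.
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_grid={"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
        cv=TimeSeriesSplit(n_splits=5),
        scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    print(search.best_params_)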
13. A machine learning model is built to predict student performance in exams. The model
shows high accuracy on the training dataset but performs poorly on the test dataset.
Identify whether the model suffers from high bias or high variance. What steps would you
take to address the issue and improve test accuracy? Justify your suggestions.
14. A healthcare organization is using a machine learning model to predict whether a patient
has a specific disease (Positive) or not (Negative). The following confusion matrix is
obtained after testing the model on a dataset:
                        Predicted Positive    Predicted Negative
    Actual Positive             80                    20
    Actual Negative             30                   170
Based on the given confusion matrix, explain the meaning of each value (80, 20, 30, and
170) in the context of the healthcare diagnosis system. Calculate the model's accuracy,
precision, recall, and F1 Score. If the model's primary goal is to minimize False Negatives
(e.g., to avoid missing disease cases), discuss whether this model performs well. Suggest
ways to improve the model if necessary.
15. A company has developed a spam detection model. On evaluating the model with a test
dataset, the following confusion matrix is obtained:
                        Predicted Spam    Predicted Not Spam
    Actual Spam              120                  30
    Actual Not Spam           50                 300
Interpret the meaning of the values 120, 30, 50, and 300 in the context of spam detection.
Compute the accuracy, precision, recall, and F1 Score of the spam detection model. If the
company prioritizes minimizing False Positives (to ensure legitimate emails are not marked
as spam), assess whether the model is suitable. Provide recommendations to adjust the
model if improvements are needed.