Table of Contents
Question One.
I. Explain in detail the stages of the Machine Learning life cycle.
II. Explain in detail the difference between overfitting and underfitting in Machine Learning and ways to overcome them.
III. What is the difference between regression and classification?
IV. Write a pseudo algorithm for the K-means clustering.
V. Using examples and mathematical equations indicate the difference between Entropy and Gini Impurity in a Decision Tree?
VI. Calculate the following performance metrics: accuracy, precision, recall, F1-score, and specificity and interpret these metrics in the context of the Covid-19 disease prediction problem.
VII. Discuss the implications of changing the decision threshold of your model. How would increasing or decreasing the threshold affect the confusion matrix and the derived metrics.
VIII. Which type of error (false positive or false negative) do you think is more critical to minimize, and why? Propose a strategy to mitigate this type of error.
IX. If the prevalence of the disease is very low (minority) and you suspect the dataset
QUESTION TWO.
I. What is the trade-off between bias and variance in Machine Learning?
II. How can a dataset without the target variable be utilized in supervised learning algorithms?
III. Why is accuracy not always the ideal metric for model evaluation?
IV. Given a dataset and a variety of Machine learning algorithms, how do you decide which algorithm to use?
V. You are a data scientist at a real estate company. Your task is to build a model to predict house prices based on various features such as location, size, number of bedrooms, and age of the house.
a) Describe the steps you would take to clean and preprocess the dataset.
b) Explain, using an example, how you would encode categorical variables.
c) Justify which machine learning algorithms would you consider for this problem.
d) Describe the process of hyperparameter tuning. Which method would you use in this context and why?
BIBLIOGRAPHY.
Question One.
I. Explain in detail the stages of the Machine Learning life cycle.
According to Hanen (2024), the machine learning life cycle refers to the series of
stages involved in developing, deploying, and maintaining a machine learning
model. It ensures the systematic and iterative refinement of models to solve a
specific problem effectively.
• Problem Definition.
The objective is to define the business problem and goals clearly,
understanding the requirements, constraints, and the desired outcome.
An example is predicting customer churn for a subscription service to
improve retention strategies.
• Data Collection.
The objective is to gather relevant data to train the model; this can
include structured or unstructured data. Examples include historical
customer interaction logs, subscription data, and customer support
tickets.
• Exploratory Data Analysis.
The objective is to understand the data and identify patterns and
relationships. Examples include analysing the correlation between
customer engagement and churn rates through heatmaps and
scatterplots.
• Feature Engineering.
The objective is to select and transform variables into the features that
are most relevant for the model. Examples include creating new features
such as response times to tickets.
• Model Selection.
The objective is to choose an appropriate algorithm based on the
problem type and the data. For churn prediction, for example,
classification algorithms such as logistic regression, random forests, or
gradient boosting are suitable.
• Model Training and Evaluation.
The objective is to fit the chosen model to the training data and assess it
on held-out data with appropriate metrics before it is trusted.
• Deployment and Monitoring.
The objective is to put the model into production and monitor its
performance over time, retraining as the data drifts; this covers the
"maintaining" part of the life cycle.
II. Explain in detail the difference between overfitting and underfitting in Machine Learning and
ways to overcome them.
• According to GeeksforGeeks (2025), overfitting occurs when a machine
learning model performs exceptionally well on training data but poorly on
unseen data because it memorises patterns and noise.
• Underfitting happens when a model is too simple to capture the
underlying trends in the data.
• According to Calinescu (2019), to reduce overfitting one can use
regularisation methods, dropout, and cross-validation, increase the
amount of training data, or simplify the model.
• To overcome underfitting, the model's complexity can be increased,
more relevant features can be incorporated, and training can continue for
more iterations.
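As a small illustration of the regularisation remedy, the sketch below (NumPy only; the synthetic data, the penalty values, and the function name are invented for illustration) uses closed-form ridge regression to show that a larger penalty shrinks the learned weights, constraining model complexity:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=50)

w_unreg = ridge_fit(X, y, lam=0.0)   # ordinary least squares
w_reg = ridge_fit(X, y, lam=10.0)    # heavily regularised

# The penalised weights have a smaller norm: the model is "simpler".
print(np.linalg.norm(w_reg) < np.linalg.norm(w_unreg))  # True
```

The same shrinking effect is what L1/L2 penalties and dropout achieve inside more complex models.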
III. What is the difference between regression and classification?
According to TutorialsPoint (n.d.), regression predicts continuous values, such as
house prices or sales forecasts, while classification predicts discrete categories, such
as spam versus non-spam emails. In regression the output variable is numeric; in
classification it is categorical.
IV. Write a pseudo algorithm for the K-means clustering.
According to GeeksforGeeks (2025), the algorithm can be implemented in Python in
the following steps.
Step 1: Importing the necessary libraries
Importing the following libraries will be the first step:
• NumPy: for numerical operations.
• Matplotlib: for plotting data and results.
• Scikit-learn: to create a synthetic dataset using make_blobs.
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
Step 2: Creating Custom Dataset
Generating a synthetic dataset with make_blobs will be the second step.
• make_blobs(n_samples=500, n_features=2, centers=3): generates 500 data
points in a 2D space, grouped into 3 clusters.
• plt.scatter(X[:, 0], X[:, 1]): plots the dataset in 2D, showing all the points.
• plt.show(): displays the plot.
Example:
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
Step 3: Initializing Random Centroids
Initialize the centroids for K-means clustering.
• np.random.seed(23): ensures reproducibility by fixing the random seed.
• The for loop initializes k random centroids, with values between -2 and 2, for a
2D dataset.
Example:
k = 3
clusters = {}
np.random.seed(23)

for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster
Step 4: Plotting Random Initialized Center with Data Points
We will now plot the data points and the initial centroids.
• plt.grid(True): plots a grid.
• plt.scatter(center[0], center[1], marker='*', c='red'): plots the cluster
center as a red star (* marker).
Example:
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()
Step 5: Defining Euclidean Distance
To assign data points to the nearest centroid, we define a distance function:
• np.sqrt(): computes the square root of a number or array element-wise.
• np.sum(): sums all elements in an array or along a specified axis.
Example:
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))
Step 6: Creating Assign and Update Functions
Define functions to assign points to the nearest centroid and to update the centroids
based on the average of the points assigned to each cluster.
• dist.append(dis): appends the calculated distance to the list dist.
• curr_cluster = np.argmin(dist): finds the index of the closest cluster by
selecting the minimum distance.
• new_center = points.mean(axis=0): calculates the new centroid by taking
the mean of the points in the cluster.
Example:
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
        clusters[i]['points'] = []
    return clusters
Step 7: Predicting the Cluster for the Data Points
Create a function to predict the cluster for each data point based on the final
centroids.
• pred.append(np.argmin(dist)): appends the index of the closest cluster
(the one with the minimum distance) to pred.
Example:
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
Step 8: Assigning, Updating and Predicting the Cluster Centers
Assign points to clusters, update the centroids, and predict the final cluster labels.
• assign_clusters(X, clusters): assigns data points to the nearest centroids.
• update_clusters(X, clusters): recalculates the centroids.
• pred_cluster(X, clusters): predicts the final clusters for all data points.
Example:
clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)
Step 9: Plotting Data Points with Predicted Cluster Centers
Plot the data points, coloured by their predicted clusters, along with the updated
centroids.
• center = clusters[i]['center']: retrieves the center (centroid) of the current
cluster.
• plt.scatter(center[0], center[1], marker='^', c='red'): plots the cluster
center as a red triangle (^ marker).
Example:
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
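The steps above perform a single assignment/update pass. Full K-means repeats those two steps until the centroids stop moving; a compact, self-contained sketch of that loop (NumPy only, with our own variable names and synthetic data rather than make_blobs) is:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=23):
    rng = np.random.default_rng(seed)
    # Initialise centroids at k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]])
centers, labels = kmeans(X, k=3)
print(centers.shape, labels.shape)
```

The convergence check (np.allclose) is what turns the one-pass pseudo algorithm above into the standard iterative procedure.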
V. Using examples and mathematical equations indicate the difference between Entropy and Gini
Impurity in a Decision Tree?
According to GeeksforGeeks (2025), Entropy measures impurity using the formula
Entropy = -Σ p_i log2(p_i), while Gini Impurity measures impurity with
Gini = 1 - Σ p_i², where p_i is the proportion of class i. Entropy is more sensitive to
changes in class probabilities and penalizes mixed nodes more heavily, whereas Gini
Impurity is computationally cheaper because it avoids the logarithm. For binary
classification, Gini Impurity ranges from 0 to 0.5, while Entropy ranges from 0 to 1.
For example, a node with a 50/50 class split has Entropy = -(0.5 log2 0.5 +
0.5 log2 0.5) = 1 and Gini = 1 - (0.5² + 0.5²) = 0.5, while a pure node scores 0 on
both measures.
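The two measures can be compared numerically with a few lines of Python (a sketch; the function names are our own):

```python
import math

def entropy(probs):
    """Entropy = -sum(p_i * log2(p_i)), skipping zero-probability classes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity = 1 - sum(p_i^2)."""
    return 1 - sum(p * p for p in probs)

# Maximally impure binary node: entropy hits its cap of 1, Gini its cap of 0.5.
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5
```

A pure node ([1.0, 0.0]) scores 0 on both measures, consistent with the ranges stated above.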
VI. Calculate the following performance metrics: accuracy, precision, recall, F1-score, and specificity
and interpret these metrics in the context of the Covid-19 disease prediction problem.
According to Brownlee (2021), this is how we calculate performance metrics in the
context of the Covid-19 disease prediction problem.
We have the confusion matrix:

                    Predicted Positive    Predicted Negative
Actual Positive     TP = 40               FN = 10
Actual Negative     FP = 20               TN = 30
From this:
1. Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (40 + 30) / (40 + 10 + 20 + 30) = 70/100 = 0.70
70% of all patients are correctly classified (either as Covid or not).
2. Precision (Positive Predictive Value)
Precision = TP / (TP + FP) = 40 / (40 + 20) = 40/60 ≈ 0.67
When the model predicts Covid Positive, it is correct 67% of the time, so 33% of
predicted positives may be false alarms.
3. Recall (Sensitivity, True Positive Rate)
Recall = TP / (TP + FN) = 40 / (40 + 10) = 40/50 = 0.80
The model detects 80% of the actual Covid-positive patients and misses 20%.
4. F1-score
F1 = 2 · (Precision × Recall) / (Precision + Recall) = 2 · (0.67 × 0.80) / (0.67 + 0.80) ≈ 0.73
About 73%; balances precision and recall.
5. Specificity (True Negative Rate)
Specificity = TN / (TN + FP) = 30 / (30 + 20) = 30/50 = 0.60
The model correctly identifies 60% of non-Covid patients as negative.
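These calculations can be reproduced in a few lines of Python (a sketch; the variable names are our own):

```python
# Confusion-matrix counts from the Covid-19 example above.
TP, FN, FP, TN = 40, 10, 20, 30

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} specificity={specificity:.2f}")
# accuracy=0.70 precision=0.67 recall=0.80 f1=0.73 specificity=0.60
```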
VII. Discuss the implications of changing the decision threshold of your model. How would increasing
or decreasing the threshold affect the confusion matrix and the derived metrics.
According to Géron (2023), most classification models output a probability score for
"Positive". You pick a threshold above which you call a case positive and below which
negative. Changing that threshold shifts the balance between FP and FN and thus
changes the confusion matrix and all derived metrics.
Increasing the threshold:
• The model becomes stricter about labelling someone “Positive.”
• FP decreases — fewer healthy people are wrongly called positive → precision
up, specificity up.
• But FN increases — more actual positives are missed → recall down.
• Accuracy may go up or down depending on class balance and
misclassifications.
• F1 may worsen if recall decreases too much.
Decreasing the threshold:
• The model is more liberal in calling “Positive.”
• FN decreases — fewer infected people missed → recall up.
• But FP increases — more false alarms → precision down, specificity down.
• F1 might improve if the gain in recall offsets the loss in precision.
According to Brownlee (2021), the impact on the confusion matrix is:
• As threshold increases: TP might go down, FN increases; FP goes down, TN may
go up.
• As threshold decreases: TP goes up, FN goes down; FP goes up, TN goes down.
In disease detection, adjusting the threshold is a key lever to manage the trade-off
between missing cases vs over-flagging.
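The effect can be demonstrated on a handful of probability scores (the scores and labels below are invented for illustration):

```python
# Hypothetical model scores and true labels (1 = Covid-positive).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]

def confusion(threshold):
    """Return (tp, fp, fn, tn) for predictions at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    return tp, fp, fn, tn

print(confusion(0.50))  # (3, 1, 1, 3): stricter threshold, one positive missed
print(confusion(0.25))  # (4, 2, 0, 2): looser threshold removes the FN but adds an FP
```

Lowering the threshold from 0.50 to 0.25 raises recall (FN drops to 0) while lowering precision and specificity (FP rises), exactly the trade-off described above.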
VIII. Which type of error (false positive or false negative) do you think is more critical to minimize,
and why? Propose a strategy to mitigate this type of error.
More critical error: False Negative (FN)
Goodfellow, Bengio and Courville (2016) discuss error types and model trade-offs;
Géron (2023) covers practical methods for handling false negatives and threshold
tuning; and Han, Kamber and Pei (2022) provide foundational data mining strategies
for improving classification models.
A false negative means an infected person is classified as healthy: the model fails to
detect Covid in someone who has it.
In the context of Covid-19:
• That person may not be isolated or treated, risking onward transmission.
• Delay in treatment may worsen health outcomes.
• It undermines public health efforts to contain spread.
Strategy to mitigate false negatives:
• Lower the decision threshold so that the model becomes more sensitive and
catches more positives, at the cost of more false positives.
• Use cost-sensitive learning or weighted loss functions: assign a higher cost
penalty to false negatives during training, so the model is biased to avoid missing
positives.
• Oversample the positive class or generate synthetic positive examples to give
the model more examples of positive cases so it learns their patterns better.
• Ensemble methods / stacking: combine multiple models to improve
robustness and detection.
• Use a two-stage system: a sensitive screening model first, followed by a more
precise confirmatory model to reduce false positives.
The goal is to tilt the model toward higher sensitivity/recall, sacrificing some
specificity or precision if needed, because in a disease context, missing a sick person is
riskier than a false alarm.
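One simple way to encode "false negatives cost more" is to choose the decision threshold that minimises a weighted cost rather than the raw error rate (a sketch; the scores, labels, and the 10:1 cost ratio are invented for illustration):

```python
# Hypothetical scores/labels; a false negative is taken to cost 10x a false positive.
scores = [0.9, 0.7, 0.6, 0.45, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,    0,   0,   0]
COST_FN, COST_FP = 10.0, 1.0

def cost(threshold):
    """Total weighted cost of classifying at the given threshold."""
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return COST_FN * fn + COST_FP * fp

# Sweep candidate thresholds and keep the cheapest one.
best = min([0.05, 0.25, 0.5, 0.65, 0.8], key=cost)
print(best)  # 0.25: the high FN cost pushes the chosen threshold down
```

The same idea appears inside cost-sensitive training, where the weighted penalty is applied to the loss function rather than to the threshold.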
IX. If the prevalence of the disease is very low (minority) and you suspect the dataset
Chawla et al. (2002) introduce SMOTE for learning from imbalanced data; He and
Garcia (2009) provide a comprehensive survey of learning from imbalanced data; and
Sun, Wong and Kamel (2009) review imbalanced classification methods.
When Covid cases are rare in your dataset (the positive class is the minority), the
model may bias toward predicting "negative" to achieve high accuracy. Here are five
standard strategies to deal with this:
Resampling (Oversampling / Undersampling)
• Oversample the minority class so the model sees more positive examples.
• Undersample the majority class to reduce negative examples to balance.
• Often a hybrid of oversampling and undersampling is used to avoid extreme imbalance.
Synthetic Data Generation
• SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic positive
examples by interpolating between existing ones.
• Helps the model generalize better rather than memorizing duplicates.
Use Class Weights / Cost-Sensitive Learning
• During training, assign a larger penalty or weight to misclassifying positive class,
so the model “pays more attention” to positives.
• Many algorithms, such as logistic regression, SVMs, and tree models, support
class weights.
Use Appropriate Evaluation Metrics
• Instead of relying solely on accuracy, which is misleading on imbalanced data,
use metrics sensitive to the minority class, such as precision, recall, F1-score,
ROC-AUC, PR-AUC, and balanced accuracy.
• Also consider balanced accuracy = (sensitivity + specificity) / 2 as a measure
that treats both classes fairly.
Ensemble Methods / Specialized Algorithms
• Use boosting, which focuses more on "hard" cases.
• Use balanced random forests or algorithms designed for imbalanced data.
• Use hybrid methods combining sampling and ensembles.
• Tune the threshold via validation to optimize the trade-off.
An additional step:
Cross-validation with Stratified Splits
• Ensure each fold has a similar proportion of positive and negative examples;
this prevents folds with zero or too few positives, which would distort the metrics.
By combining these strategies, you reduce the adverse effects of imbalance and make
your model more robust in detecting the minority class (Covid-positive patients).
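The core idea behind SMOTE, interpolating between a minority point and one of its neighbours, can be sketched in a few lines of NumPy (a simplified illustration of the technique, not the full algorithm; the function name and toy data are our own):

```python
import numpy as np

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating between random
    minority samples and their nearest minority neighbour."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        p = minority[i]
        # Nearest neighbour among the *other* minority points.
        dists = np.linalg.norm(minority - p, axis=1)
        dists[i] = np.inf
        q = minority[dists.argmin()]
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(p + lam * (q - p))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.4]])
new_points = smote_like(minority, n_new=5)
print(new_points.shape)  # (5, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new examples stay inside the minority region rather than being exact duplicates.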
QUESTION TWO.
I. What is the trade-off between bias and variance in Machine Learning?
According to Hastie, Tibshirani and Friedman (2017), the bias-variance trade-off
describes the balance between a model's ability to fit training data and its ability to
generalize to unseen data.
• Bias is the error due to overly simplistic assumptions in the model, leading to
underfitting.
• Variance is the error due to the model being too sensitive to small fluctuations
in the training set, leading to overfitting.
A model with high bias performs poorly on both training and test data, while a high
variance model performs well on training data but poorly on new data. The goal is to
find a balance that minimizes total error by controlling model complexity.
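The trade-off can be seen by fitting polynomials of different degrees to noisy data (a sketch; the sine data and degrees are invented for illustration). A higher-degree model always fits the training set at least as well, but that does not mean it generalises better:

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=20)

def train_error(degree):
    """Mean squared training error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    preds = np.polyval(coeffs, x_train)
    return np.mean((preds - y_train) ** 2)

# High-bias (underfit) vs high-variance (overfit) candidates: the degree-15
# polynomial drives training error far lower than the straight line, yet it
# oscillates wildly between and beyond the training points on new data.
print(train_error(1), train_error(15))
```

The straight line shows high bias (poor on training and test data); the degree-15 fit shows high variance (excellent on training data, unreliable elsewhere). Controlling the degree is the complexity knob mentioned above.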
II. How can a dataset without the target variable be utilized in supervised learning algorithms?
According to Ng (2020), supervised learning requires a target variable to train the
model. If the dataset lacks one, it can still be used in the following ways:
• Data Labelling: The use of manual or semi-automated labelling to assign target
values.
• Proxy Variables: Identifying a related variable that can serve as an approximate
target.
• Unsupervised Pretraining: Using the data for unsupervised learning to improve
supervised models later.
• Transfer Learning: using pre-trained models on similar labelled datasets and
fine-tuning them on the unlabelled data.
• Active Learning: querying domain experts or systems to label the most
informative samples.
III. Why is accuracy not always the ideal metric for model evaluation
According to Sammut and Webb (2017), accuracy measures the proportion of
correctly classified instances, but it can be misleading in cases such as:
• Imbalanced Datasets: When one class dominates, a model that predicts only
the majority class can still achieve high accuracy.
• Different Costs of Errors: In medical or fraud detection, false negatives can be
more costly than false positives.
• Lack of Insight: Accuracy does not show how well a model performs per class
or on edge cases.
Alternative metrics such as Precision, Recall, F1-score, and AUC provide better
insight into model performance.
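A tiny numeric example of the imbalanced-dataset failure mode: a classifier that always predicts the majority class scores 95% accuracy yet never detects a single positive (the 100-patient split is invented for illustration):

```python
# 100 patients, 5 actually positive; the model predicts "negative" for everyone.
labels = [1] * 5 + [0] * 95
preds = [0] * 100

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = tp / sum(labels)

print(accuracy, recall)  # 0.95 0.0
```

Recall (and therefore F1) immediately exposes the model as useless, which accuracy alone hides.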
IV. Given a dataset and a variety of Machine learning algorithms, how do you decide which algorithm
to use.
According to Géron (2019), choosing an algorithm depends on:
• Data Type: structured vs. unstructured.
• Dataset Size: smaller datasets favour simpler models; larger datasets support
more complex models.
• Interpretability Needs: linear models are easier to interpret than neural networks.
• Computational Resources: deep learning requires more computational power.
• Performance on Validation Data: compare models using cross-validation and pick
the one with the best generalization.
V. You are a data scientist at a real estate company. Your task is to build a model to predict house
prices based on various features such as location, size, number of bedrooms, and age of the house.
a) Describe the steps you would take to clean and preprocess the dataset.
According to Géron (2019), these are the steps one should take to clean and preprocess
the dataset.
• Handling Missing Values: Use mean/median for numeric, and mode or “unknown”
category for categorical.
• Outlier Detection: Remove or cap extreme values in features like price or size.
• Feature Scaling: Apply normalization or standardization to ensure fair comparison.
• Removing Duplicates and Errors: Drop repeated or inconsistent entries.
• Train-Test Split: Divide data into training and testing sets for validation.
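A minimal sketch of the first few steps with pandas (the column names and values are invented for illustration; a real pipeline would also split the data before fitting anything):

```python
import pandas as pd

df = pd.DataFrame({
    "size_sqm": [120.0, None, 80.0, 80.0, 5000.0],
    "location": ["Urban", "Rural", None, None, "Urban"],
    "price":    [300_000, 150_000, 180_000, 180_000, 9_000_000],
})

# Missing values: median for numeric, a sentinel category for categorical.
df["size_sqm"] = df["size_sqm"].fillna(df["size_sqm"].median())
df["location"] = df["location"].fillna("unknown")

# Outliers: cap extreme prices at the 95th percentile.
df["price"] = df["price"].clip(upper=df["price"].quantile(0.95))

# Duplicates: drop repeated rows.
df = df.drop_duplicates()

print(len(df))  # 4: one duplicate row removed
```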
b) Explain, using an example, how you would encode categorical variables.
According to Brownlee (2020), these are the ways one would encode categorical variables.
One-Hot Encoding:
• Example: “Location” = {Urban, Suburban, Rural} → three binary columns: Urban(1,0,0),
Suburban (0,1,0), Rural(0,0,1).
• Works well for nominal variables without order.
Label Encoding:
• Example: “Condition” = {Poor, Fair, Good} → {0, 1, 2}.
• Suitable for ordinal variables with natural order.
Target Encoding:
• Replaces a category with the average target value per category. Useful for
high-cardinality features.
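The Location and Condition examples can be encoded without any library (a plain-Python sketch; in practice pandas' get_dummies or scikit-learn's OneHotEncoder would usually do this):

```python
categories = ["Urban", "Suburban", "Rural"]

def one_hot(value):
    """Map a nominal category to a binary vector, one column per category."""
    return [1 if value == c else 0 for c in categories]

# Label encoding for an ordinal variable with a natural order.
ordinal_map = {"Poor": 0, "Fair": 1, "Good": 2}

print(one_hot("Suburban"))   # [0, 1, 0]
print(ordinal_map["Good"])   # 2
```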
c) Justify which machine learning algorithms would you consider for this problem.
Based on Hastie, Tibshirani and Friedman (2017), these are the algorithms I would
consider.
• Linear Regression: Simple and interpretable baseline model.
• Decision Tree Regressor: Captures non-linear relationships.
• Random Forest or Gradient Boosting: Handle complex patterns and feature
interactions effectively.
• Support Vector Regressor: useful for smaller datasets with non-linear boundaries.
d) Describe the process of hyperparameter tuning. Which method would you use in this context and
why?
According to Raschka and Mirjalili (2020), hyperparameter tuning involves finding
the optimal values for the parameters that control the learning process.
These are the steps that should be followed in this context.
Define the model and hyperparameters to tune.
Define the model and hyperparameters to tune.
Choose a search strategy:
• Grid Search: Exhaustively tries all parameter combinations.
• Random Search: Randomly samples parameter combinations.
• Bayesian Optimization: Uses past results to choose next parameters more
intelligently.
Use Cross-validation to evaluate each configuration.
Select the model with the best performance on validation data.
Recommended method:
For house price prediction, Randomized Search CV is efficient as it balances time and
performance with large parameter spaces.
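The random-search idea can be sketched without scikit-learn (whose RandomizedSearchCV is the usual tool) by randomly sampling a ridge penalty and scoring each candidate on a held-out split; the data, penalty range, and function name below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.3, size=120)
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

def val_error(lam):
    """Fit ridge regression on the training split, score on validation."""
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(4), X_tr.T @ y_tr)
    return np.mean((X_val @ w - y_val) ** 2)

# Random search: sample candidate penalties (log-uniform) instead of a full grid.
candidates = 10 ** rng.uniform(-3, 3, size=20)
best_lam = min(candidates, key=val_error)
print(best_lam, val_error(best_lam))
```

With many hyperparameters, sampling a fixed budget of random configurations explores the space far more cheaply than an exhaustive grid, which is why random search suits the house-price setting.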
BIBLIOGRAPHY.
Question One.
GeeksforGeeks (2025) Explain in detail the difference between overfitting and
underfitting in Machine Learning and ways to overcome them. Available at: [Link]
(Accessed: 16 September 2025).
GeeksforGeeks (2025) Write a pseudo algorithm for the K-means clustering. Available
at: [Link] (Accessed: 16 September 2025).
Brownlee, J. (2021) Performance Metrics for Classification Models. Machine Learning
Mastery. Available at: [Link] (Accessed: 21 September 2025).
Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002) 'SMOTE: Synthetic
Minority Over-sampling Technique', Journal of Artificial Intelligence Research, 16, pp.
321–357. Available at: [Link] (Accessed: 21 September 2025).
Géron, A. (2023) Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.
3rd ed. Sebastopol: O'Reilly Media. Available at: [Link]
(Accessed: 22 September 2025).
Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. Cambridge, MA: MIT
Press. Available at: [Link] (Accessed: 15 October 2025).
Han, J., Kamber, M. and Pei, J. (2022) Data Mining: Concepts and Techniques. 4th ed.
Burlington: Morgan Kaufmann. Available at: [Link] (Accessed: 23 September 2025).
He, H. and Garcia, E.A. (2009) 'Learning from Imbalanced Data', IEEE Transactions on
Knowledge and Data Engineering, 21(9), pp. 1263–1284. Available at:
[Link] (Accessed: 23 September 2025).
Sun, Y., Wong, A. and Kamel, M. (2009) 'Classification of Imbalanced Data: A Review',
International Journal of Pattern Recognition and Artificial Intelligence, 23(4), pp. 687–
719. Available at: [Link] (Accessed: 24 September 2025).
Google Developers (2023) Accuracy, Precision, and Recall. Available at: [Link]
(Accessed: 25 September 2025).
Question Two.
Hastie, T., Tibshirani, R. and Friedman, J. (2017) The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. 2nd ed. New York: Springer. Available at:
[Link] (Accessed: 01 October 2025).
Ng, A. (2020) Machine Learning Yearning: Technical Strategy for AI Engineers.
Available at: [Link] (Accessed: 01 October 2025).
Sammut, C. and Webb, G.I. (2017) Encyclopedia of Machine Learning and Data Mining.
New York: Springer. Available at: [Link] (Accessed: 03 October 2025).
Géron, A. (2019) Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.
2nd ed. Sebastopol: O'Reilly Media. Available at: [Link]
(Accessed: 03 October 2025).
Brownlee, J. (2020) Data Preparation for Machine Learning: Data Cleaning, Feature
Selection, and Data Transforms in Python. Machine Learning Mastery. Available at:
[Link] (Accessed: 04 October 2025).
Raschka, S. and Mirjalili, V. (2020) Python Machine Learning. 3rd ed. Birmingham:
Packt Publishing. Available at: [Link] (Accessed: 05 October 2025).