Table of Contents
Question One.
I. Explain in detail the stages of the Machine Learning life cycle.
II. Explain in detail the difference between overfitting and underfitting in Machine Learning and ways to overcome them.
III. What is the difference between regression and classification?
IV. Write a pseudo algorithm for the K-means clustering.
V. Using examples and mathematical equations indicate the difference between Entropy and Gini Impurity in a Decision Tree?
VI. Calculate the following performance metrics: accuracy, precision, recall, F1-score, and specificity and interpret these metrics in the context of the Covid-19 disease prediction problem.
VII. Discuss the implications of changing the decision threshold of your model. How would increasing or decreasing the threshold affect the confusion matrix and the derived metrics.
VIII. Which type of error (false positive or false negative) do you think is more critical to minimize, and why? Propose a strategy to mitigate this type of error.
IX. If the prevalence of the disease is very low (minority) and you suspect the dataset
QUESTION TWO.
I. What is the trade-off between bias and variance in Machine Learning?
II. How can a dataset without the target variable be utilized in supervised learning algorithms?
III. Why is accuracy not always the ideal metric for model evaluation?
IV. Given a dataset and a variety of Machine learning algorithms, how do you decide which algorithm to use?
V. You are a data scientist at a real estate company. Your task is to build a model to predict house prices based on various features such as location, size, number of bedrooms, and age of the house.
a) Describe the steps you would take to clean and preprocess the dataset.
b) Explain, using an example, how you would encode categorical variables.
c) Justify which machine learning algorithms would you consider for this problem.
d) Describe the process of hyperparameter tuning. Which method would you use in this context and why?
BIBLIOGRAPHY.
Question One.
I. Explain in detail the stages of the Machine Learning life cycle.
According to Hanen (2024), the machine learning life cycle refers to the series of
stages involved in developing, deploying, and maintaining a machine learning
model. It ensures the systematic and iterative refinement of models to solve a
specific problem effectively.
• Problem Definition.
The objective is to define the business problem and goals clearly,
understanding the requirements, constraints, and the desired outcome.
An example is predicting customer churn for a subscription service to
improve retention strategies.
• Data Collection.
The objective is to gather relevant data to train the model; this can
include structured or unstructured data. Examples include historical
customer interaction logs, subscription data, and customer support
tickets.
• Exploratory Data Analysis.
The objective is to understand the data and identify patterns and
relationships. Examples include analysing the correlation between
customer engagement and churn rates through heatmaps and
scatterplots.
• Feature Engineering.
The objective is to select and transform variables into the features that
are most relevant for the model. Examples include creating new features
such as response times to tickets.
• Model Selection.
The objective is to choose an appropriate algorithm based on the
problem type and the data. For churn prediction, for example,
classification algorithms such as logistic regression, random forests, or
gradient boosting are suitable.
• Model Training and Evaluation.
The objective is to fit the chosen model to the training data and assess it
on held-out data with appropriate metrics before it is trusted.
• Deployment and Monitoring.
The objective is to put the model into production and monitor its
performance over time, retraining as the data drifts; this covers the
"maintaining" part of the life cycle.
II. Explain in detail the difference between overfitting and underfitting in Machine Learning and
ways to overcome them.
• According to GeeksforGeeks (2025), overfitting occurs when a machine
learning model performs exceptionally well on training data but poorly on
unseen data because it memorises patterns and noise.
• Underfitting happens when a model is too simple to capture the
underlying trends in the data.
• According to Calinescu (2019), to reduce overfitting one can use
regularisation methods, dropout, and cross-validation, increase the
amount of training data, or simplify the model.
• To overcome underfitting, the model's complexity can be increased,
more relevant features can be incorporated, and training can continue for
more iterations.
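As a small illustration of the regularisation remedy, the sketch below (NumPy only; the synthetic data, the penalty values, and the function name are invented for illustration) uses closed-form ridge regression to show that a larger penalty shrinks the learned weights, constraining model complexity:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=50)

w_unreg = ridge_fit(X, y, lam=0.0)   # ordinary least squares
w_reg = ridge_fit(X, y, lam=10.0)    # heavily regularised

# The penalised weights have a smaller norm: the model is "simpler".
print(np.linalg.norm(w_reg) < np.linalg.norm(w_unreg))  # True
```

The same shrinking effect is what L1/L2 penalties and dropout achieve inside more complex models.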
III. What is the difference between regression and classification?
According to TutorialsPoint (n.d.), regression predicts continuous values, such as
house prices or sales forecasts, while classification predicts discrete categories, such
as spam versus non-spam emails. In regression the output variable is numeric; in
classification it is categorical.
IV. Write a pseudo algorithm for the K-means clustering.
According to GeeksforGeeks (2025), the algorithm can be implemented in Python in
the following steps.
Step 1: Importing the necessary libraries
Importing the following libraries will be the first step:
• NumPy: for numerical operations.
• Matplotlib: for plotting data and results.
• Scikit-learn: to create a synthetic dataset using make_blobs.
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
Step 2: Creating Custom Dataset
Generating a synthetic dataset with make_blobs will be the second step.
• make_blobs(n_samples=500, n_features=2, centers=3): generates 500 data
points in a 2D space, grouped into 3 clusters.
• plt.scatter(X[:, 0], X[:, 1]): plots the dataset in 2D, showing all the points.
• plt.show(): displays the plot.
Example:
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
Step 3: Initializing Random Centroids
Initialize the centroids for K-means clustering.
• np.random.seed(23): ensures reproducibility by fixing the random seed.
• The for loop initializes k random centroids, with values between -2 and 2, for a
2D dataset.
Example:
k = 3
clusters = {}
np.random.seed(23)

for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster
Step 4: Plotting Random Initialized Center with Data Points
We will now plot the data points and the initial centroids.
• plt.grid(True): plots a grid.
• plt.scatter(center[0], center[1], marker='*', c='red'): plots the cluster
center as a red star (* marker).
Example:
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()
Step 5: Defining Euclidean Distance
To assign data points to the nearest centroid, we define a distance function:
• np.sqrt(): computes the square root of a number or array element-wise.
• np.sum(): sums all elements in an array or along a specified axis.
Example:
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))
Step 6: Creating Assign and Update Functions
Define functions to assign points to the nearest centroid and to update the centroids
based on the average of the points assigned to each cluster.
• dist.append(dis): appends the calculated distance to the list dist.
• curr_cluster = np.argmin(dist): finds the index of the closest cluster by
selecting the minimum distance.
• new_center = points.mean(axis=0): calculates the new centroid by taking
the mean of the points in the cluster.
Example:
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
        clusters[i]['points'] = []
    return clusters
Step 7: Predicting the Cluster for the Data Points
Create a function to predict the cluster for each data point based on the final
centroids.
• pred.append(np.argmin(dist)): appends the index of the closest cluster
(the one with the minimum distance) to pred.
Example:
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
Step 8: Assigning, Updating and Predicting the Cluster Centers
Assign points to clusters, update the centroids, and predict the final cluster labels.
• assign_clusters(X, clusters): assigns data points to the nearest centroids.
• update_clusters(X, clusters): recalculates the centroids.
• pred_cluster(X, clusters): predicts the final clusters for all data points.
Example:
clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)
Step 9: Plotting Data Points with Predicted Cluster Centers
Plot the data points, coloured by their predicted clusters, along with the updated
centroids.
• center = clusters[i]['center']: retrieves the center (centroid) of the current
cluster.
• plt.scatter(center[0], center[1], marker='^', c='red'): plots the cluster
center as a red triangle (^ marker).
Example:
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
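The steps above perform a single assignment/update pass. Full K-means repeats those two steps until the centroids stop moving; a compact, self-contained sketch of that loop (NumPy only, with our own variable names and synthetic data rather than make_blobs) is:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=23):
    rng = np.random.default_rng(seed)
    # Initialise centroids at k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]])
centers, labels = kmeans(X, k=3)
print(centers.shape, labels.shape)
```

The convergence check (np.allclose) is what turns the one-pass pseudo algorithm above into the standard iterative procedure.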
V. Using examples and mathematical equations indicate the difference between Entropy and Gini
Impurity in a Decision Tree?
According to GeeksforGeeks (2025), Entropy measures impurity using the formula
Entropy = -Σ p_i log2(p_i), while Gini Impurity measures impurity with
Gini = 1 - Σ p_i², where p_i is the proportion of class i. Entropy is more sensitive to
changes in class probabilities and penalizes mixed nodes more heavily, whereas Gini
Impurity is computationally cheaper because it avoids the logarithm. For binary
classification, Gini Impurity ranges from 0 to 0.5, while Entropy ranges from 0 to 1.
For example, a node with a 50/50 class split has Entropy = -(0.5 log2 0.5 +
0.5 log2 0.5) = 1 and Gini = 1 - (0.5² + 0.5²) = 0.5, while a pure node scores 0 on
both measures.
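The two measures can be compared numerically with a few lines of Python (a sketch; the function names are our own):

```python
import math

def entropy(probs):
    """Entropy = -sum(p_i * log2(p_i)), skipping zero-probability classes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity = 1 - sum(p_i^2)."""
    return 1 - sum(p * p for p in probs)

# Maximally impure binary node: entropy hits its cap of 1, Gini its cap of 0.5.
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5
```

A pure node ([1.0, 0.0]) scores 0 on both measures, consistent with the ranges stated above.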
VI. Calculate the following performance metrics: accuracy, precision, recall, F1-score, and specificity
and interpret these metrics in the context of the Covid-19 disease prediction problem.
According to Brownlee (2021), this is how we calculate performance metrics in the
context of the Covid-19 disease prediction problem.
We have the confusion matrix:

                    Predicted Positive    Predicted Negative
Actual Positive     TP = 40               FN = 10
Actual Negative     FP = 20               TN = 30
From this:
1. Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (40 + 30) / (40 + 10 + 20 + 30) = 70/100 = 0.70
70% of all patients are correctly classified (either as Covid or not).
2. Precision (Positive Predictive Value)
Precision = TP / (TP + FP) = 40 / (40 + 20) = 40/60 ≈ 0.67
When the model predicts Covid Positive, it is correct 67% of the time, so 33% of
predicted positives may be false alarms.
3. Recall (Sensitivity, True Positive Rate)
Recall = TP / (TP + FN) = 40 / (40 + 10) = 40/50 = 0.80
The model detects 80% of the actual Covid-positive patients and misses 20%.
4. F1-score
F1 = 2 · (Precision × Recall) / (Precision + Recall) = 2 · (0.67 × 0.80) / (0.67 + 0.80) ≈ 0.73
About 73%; balances precision and recall.
5. Specificity (True Negative Rate)
Specificity = TN / (TN + FP) = 30 / (30 + 20) = 30/50 = 0.60
The model correctly identifies 60% of non-Covid patients as negative.
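These calculations can be reproduced in a few lines of Python (a sketch; the variable names are our own):

```python
# Confusion-matrix counts from the Covid-19 example above.
TP, FN, FP, TN = 40, 10, 20, 30

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} specificity={specificity:.2f}")
# accuracy=0.70 precision=0.67 recall=0.80 f1=0.73 specificity=0.60
```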
VII. Discuss the implications of changing the decision threshold of your model. How would increasing
or decreasing the threshold affect the confusion matrix and the derived metrics.
According to Géron (2023), most classification models output a probability score for
"Positive". You pick a threshold above which you call a case positive and below which
negative. Changing that threshold shifts the balance between FP and FN and thus
changes the confusion matrix and all derived metrics.
Increasing the threshold:
• The model becomes stricter about labelling someone “Positive.”
• FP decreases — fewer healthy people are wrongly called positive → precision
up, specificity up.
• But FN increases — more actual positives are missed → recall down.
• Accuracy may go up or down depending on class balance and
misclassifications.
• F1 may worsen if recall decreases too much.
Decreasing the threshold:
• The model is more liberal in calling “Positive.”
• FN decreases — fewer infected people missed → recall up.
• But FP increases — more false alarms → precision down, specificity down.
• F1 might improve if the gain in recall offsets the loss in precision.
According to Brownlee (2021), the impact on the confusion matrix is:
• As threshold increases: TP might go down, FN increases; FP goes down, TN may
go up.
• As threshold decreases: TP goes up, FN goes down; FP goes up, TN goes down.
In disease detection, adjusting the threshold is a key lever to manage the trade-off
between missing cases vs over-flagging.
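The effect can be demonstrated on a handful of probability scores (the scores and labels below are invented for illustration):

```python
# Hypothetical model scores and true labels (1 = Covid-positive).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]

def confusion(threshold):
    """Return (tp, fp, fn, tn) for predictions at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    return tp, fp, fn, tn

print(confusion(0.50))  # (3, 1, 1, 3): stricter threshold, one positive missed
print(confusion(0.25))  # (4, 2, 0, 2): looser threshold removes the FN but adds an FP
```

Lowering the threshold from 0.50 to 0.25 raises recall (FN drops to 0) while lowering precision and specificity (FP rises), exactly the trade-off described above.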
VIII. Which type of error (false positive or false negative) do you think is more critical to minimize,
and why? Propose a strategy to mitigate this type of error.
More critical error: False Negative (FN)
Goodfellow, Bengio and Courville (2016) discuss error types and model trade-offs;
Géron (2023) covers practical methods for handling false negatives and threshold
tuning; and Han, Kamber and Pei (2022) provide foundational data mining strategies
for improving classification models.
A false negative means an infected person is classified as healthy: the model fails to
detect Covid in someone who has it.
In the context of Covid-19:
• That person may not be isolated or treated, risking onward transmission.
• Delay in treatment may worsen health outcomes.
• It undermines public health efforts to contain spread.
Strategy to mitigate false negatives:
• Lower the decision threshold so that the model becomes more sensitive and
catches more positives, at the cost of more false positives.
• Use cost-sensitive learning or weighted loss functions: assign a higher cost
penalty to false negatives during training, so the model is biased to avoid missing
positives.
• Oversample the positive class or generate synthetic positive examples to give
the model more examples of positive cases so it learns their patterns better.
• Ensemble methods / stacking: combine multiple models to improve
robustness and detection.
• Use a two-stage system: a sensitive screening model first, followed by a more
precise confirmatory model to reduce false positives.
The goal is to tilt the model toward higher sensitivity/recall, sacrificing some
specificity or precision if needed, because in a disease context, missing a sick person is
riskier than a false alarm.
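One simple way to encode "false negatives cost more" is to choose the decision threshold that minimises a weighted cost rather than the raw error rate (a sketch; the scores, labels, and the 10:1 cost ratio are invented for illustration):

```python
# Hypothetical scores/labels; a false negative is taken to cost 10x a false positive.
scores = [0.9, 0.7, 0.6, 0.45, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,    0,   0,   0]
COST_FN, COST_FP = 10.0, 1.0

def cost(threshold):
    """Total weighted cost of classifying at the given threshold."""
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return COST_FN * fn + COST_FP * fp

# Sweep candidate thresholds and keep the cheapest one.
best = min([0.05, 0.25, 0.5, 0.65, 0.8], key=cost)
print(best)  # 0.25: the high FN cost pushes the chosen threshold down
```

The same idea appears inside cost-sensitive training, where the weighted penalty is applied to the loss function rather than to the threshold.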
IX. If the prevalence of the disease is very low (minority) and you suspect the dataset
Chawla et al. (2002) introduce SMOTE for learning from imbalanced data; He and
Garcia (2009) provide a comprehensive survey of learning from imbalanced data; and
Sun, Wong and Kamel (2009) review imbalanced classification methods.
When Covid cases are rare in your dataset (the positive class is the minority), the
model may bias toward predicting "negative" to achieve high accuracy. Here are five
standard strategies to deal with this:
Resampling (Oversampling / Undersampling)
• Oversample the minority class so the model sees more positive examples.
• Undersample the majority class to reduce negative examples to balance.
• Often a hybrid of oversampling and undersampling is used to avoid extreme imbalance.
Synthetic Data Generation
• SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic positive
examples by interpolating between existing ones.
• Helps the model generalize better rather than memorizing duplicates.
Use Class Weights / Cost-Sensitive Learning
• During training, assign a larger penalty or weight to misclassifying positive class,
so the model “pays more attention” to positives.
• Many algorithms, such as logistic regression, SVMs, and tree models, support
class weights.
Use Appropriate Evaluation Metrics
• Instead of relying solely on accuracy, which is misleading on imbalanced data,
use metrics sensitive to the minority class, such as precision, recall, F1-score,
ROC-AUC, PR-AUC, and balanced accuracy.
• Also consider balanced accuracy = (sensitivity + specificity) / 2 as a measure
that treats both classes fairly.
Ensemble Methods / Specialized Algorithms
• Use boosting, which focuses more on "hard" cases.
• Use balanced random forests or algorithms designed for imbalanced data.
• Use hybrid methods combining sampling and ensembles.
• Tune the threshold via validation to optimize the trade-off.
An additional step:
Cross-validation with Stratified Splits
• Ensure each fold has a similar proportion of positive and negative examples;
this prevents folds with zero or too few positives, which would distort the metrics.
By combining these strategies, you reduce the adverse effects of imbalance and make
your model more robust in detecting the minority class (Covid-positive patients).
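The core idea behind SMOTE, interpolating between a minority point and one of its neighbours, can be sketched in a few lines of NumPy (a simplified illustration of the technique, not the full algorithm; the function name and toy data are our own):

```python
import numpy as np

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating between random
    minority samples and their nearest minority neighbour."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        p = minority[i]
        # Nearest neighbour among the *other* minority points.
        dists = np.linalg.norm(minority - p, axis=1)
        dists[i] = np.inf
        q = minority[dists.argmin()]
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(p + lam * (q - p))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.4]])
new_points = smote_like(minority, n_new=5)
print(new_points.shape)  # (5, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new examples stay inside the minority region rather than being exact duplicates.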
QUESTION TWO.
I. What is the trade-off between bias and variance in Machine Learning?
According to Hastie, Tibshirani and Friedman (2017), the bias-variance trade-off
describes the balance between a model's ability to fit training data and its ability to
generalize to unseen data.
• Bias is the error due to overly simplistic assumptions in the model, leading to
underfitting.
• Variance is the error due to the model being too sensitive to small fluctuations
in the training set, leading to overfitting.
A model with high bias performs poorly on both training and test data, while a high
variance model performs well on training data but poorly on new data. The goal is to
find a balance that minimizes total error by controlling model complexity.
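The trade-off can be seen by fitting polynomials of different degrees to noisy data (a sketch; the sine data and degrees are invented for illustration). A higher-degree model always fits the training set at least as well, but that does not mean it generalises better:

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=20)

def train_error(degree):
    """Mean squared training error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    preds = np.polyval(coeffs, x_train)
    return np.mean((preds - y_train) ** 2)

# High-bias (underfit) vs high-variance (overfit) candidates: the degree-15
# polynomial drives training error far lower than the straight line, yet it
# oscillates wildly between and beyond the training points on new data.
print(train_error(1), train_error(15))
```

The straight line shows high bias (poor on training and test data); the degree-15 fit shows high variance (excellent on training data, unreliable elsewhere). Controlling the degree is the complexity knob mentioned above.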
II. How can a dataset without the target variable be utilized in supervised learning algorithms?
According to Ng (2020), supervised learning requires a target variable to train the
model. If the dataset lacks one, it can still be used in the following ways:
• Data Labelling: The use of manual or semi-automated labelling to assign target
values.
• Proxy Variables: Identifying a related variable that can serve as an approximate
target.
• Unsupervised Pretraining: Using the data for unsupervised learning to improve
supervised models later.
• Transfer Learning: using pre-trained models on similar labelled datasets and
fine-tuning them on the unlabelled data.
• Active Learning: querying domain experts or systems to label the most
informative samples.
III. Why is accuracy not always the ideal metric for model evaluation
According to Sammut and Webb (2017), accuracy measures the proportion of
correctly classified instances, but it can be misleading in cases such as:
• Imbalanced Datasets: When one class dominates, a model that predicts only
the majority class can still achieve high accuracy.
• Different Costs of Errors: In medical or fraud detection, false negatives can be
more costly than false positives.
• Lack of Insight: Accuracy does not show how well a model performs per class
or on edge cases.
Alternative metrics such as Precision, Recall, F1-score, and AUC provide better
insight into model performance.
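A tiny numeric example of the imbalanced-dataset failure mode: a classifier that always predicts the majority class scores 95% accuracy yet never detects a single positive (the 100-patient split is invented for illustration):

```python
# 100 patients, 5 actually positive; the model predicts "negative" for everyone.
labels = [1] * 5 + [0] * 95
preds = [0] * 100

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = tp / sum(labels)

print(accuracy, recall)  # 0.95 0.0
```

Recall (and therefore F1) immediately exposes the model as useless, which accuracy alone hides.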
IV. Given a dataset and a variety of Machine learning algorithms, how do you decide which algorithm
to use.
According to Géron (2019), choosing an algorithm depends on:
• Data Type: structured vs. unstructured.
• Dataset Size: smaller datasets favour simpler models; larger datasets support
more complex models.
• Interpretability Needs: linear models are easier to interpret than neural networks.
• Computational Resources: deep learning requires more computational power.
• Performance on Validation Data: compare models using cross-validation and pick
the one with the best generalization.
V. You are a data scientist at a real estate company. Your task is to build a model to predict house
prices based on various features such as location, size, number of bedrooms, and age of the house.
a) Describe the steps you would take to clean and preprocess the dataset.
According to Géron (2019), these are the steps one should take to clean and preprocess
the dataset.
• Handling Missing Values: Use mean/median for numeric, and mode or “unknown”
category for categorical.
• Outlier Detection: Remove or cap extreme values in features like price or size.
• Feature Scaling: Apply normalization or standardization to ensure fair comparison.
• Removing Duplicates and Errors: Drop repeated or inconsistent entries.
• Train-Test Split: Divide data into training and testing sets for validation.
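A minimal sketch of the first few steps with pandas (the column names and values are invented for illustration; a real pipeline would also split the data before fitting anything):

```python
import pandas as pd

df = pd.DataFrame({
    "size_sqm": [120.0, None, 80.0, 80.0, 5000.0],
    "location": ["Urban", "Rural", None, None, "Urban"],
    "price":    [300_000, 150_000, 180_000, 180_000, 9_000_000],
})

# Missing values: median for numeric, a sentinel category for categorical.
df["size_sqm"] = df["size_sqm"].fillna(df["size_sqm"].median())
df["location"] = df["location"].fillna("unknown")

# Outliers: cap extreme prices at the 95th percentile.
df["price"] = df["price"].clip(upper=df["price"].quantile(0.95))

# Duplicates: drop repeated rows.
df = df.drop_duplicates()

print(len(df))  # 4: one duplicate row removed
```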
b) Explain, using an example, how you would encode categorical variables.
According to Brownlee (2020), these are the ways one would encode categorical variables.
One-Hot Encoding:
• Example: “Location” = {Urban, Suburban, Rural} → three binary columns: Urban(1,0,0),
Suburban (0,1,0), Rural(0,0,1).
• Works well for nominal variables without order.
Label Encoding:
• Example: “Condition” = {Poor, Fair, Good} → {0, 1, 2}.
• Suitable for ordinal variables with natural order.
Target Encoding:
• Replaces a category with the average target value per category. Useful for
high-cardinality features.
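The Location and Condition examples can be encoded without any library (a plain-Python sketch; in practice pandas' get_dummies or scikit-learn's OneHotEncoder would usually do this):

```python
categories = ["Urban", "Suburban", "Rural"]

def one_hot(value):
    """Map a nominal category to a binary vector, one column per category."""
    return [1 if value == c else 0 for c in categories]

# Label encoding for an ordinal variable with a natural order.
ordinal_map = {"Poor": 0, "Fair": 1, "Good": 2}

print(one_hot("Suburban"))   # [0, 1, 0]
print(ordinal_map["Good"])   # 2
```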
c) Justify which machine learning algorithms would you consider for this problem.
Based on Hastie, Tibshirani and Friedman (2017), these are the algorithms I would
consider.
• Linear Regression: Simple and interpretable baseline model.
• Decision Tree Regressor: Captures non-linear relationships.
• Random Forest or Gradient Boosting: Handle complex patterns and feature
interactions effectively.
• Support Vector Regressor: useful for smaller datasets with non-linear boundaries.
d) Describe the process of hyperparameter tuning. Which method would you use in this context and
why?
According to Raschka and Mirjalili (2020), hyperparameter tuning involves finding
the optimal values for the parameters that control the learning process.
These are the steps that should be followed in this context.
Define the model and hyperparameters to tune.
Define the model and hyperparameters to tune.
Choose a search strategy:
• Grid Search: Exhaustively tries all parameter combinations.
• Random Search: Randomly samples parameter combinations.
• Bayesian Optimization: Uses past results to choose next parameters more
intelligently.
Use Cross-validation to evaluate each configuration.
Select the model with the best performance on validation data.
Recommended method:
For house price prediction, Randomized Search CV is efficient as it balances time and
performance with large parameter spaces.
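The random-search idea can be sketched without scikit-learn (whose RandomizedSearchCV is the usual tool) by randomly sampling a ridge penalty and scoring each candidate on a held-out split; the data, penalty range, and function name below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.3, size=120)
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

def val_error(lam):
    """Fit ridge regression on the training split, score on validation."""
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(4), X_tr.T @ y_tr)
    return np.mean((X_val @ w - y_val) ** 2)

# Random search: sample candidate penalties (log-uniform) instead of a full grid.
candidates = 10 ** rng.uniform(-3, 3, size=20)
best_lam = min(candidates, key=val_error)
print(best_lam, val_error(best_lam))
```

With many hyperparameters, sampling a fixed budget of random configurations explores the space far more cheaply than an exhaustive grid, which is why random search suits the house-price setting.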
BIBLIOGRAPHY.
Question One.
GeeksforGeeks (2025) Explain in detail the difference between overfitting and
underfitting in Machine Learning and ways to overcome them. Available at: [Link]
(Accessed: 16 September 2025).
GeeksforGeeks (2025) Write a pseudo algorithm for the K-means clustering. Available
at: [Link] (Accessed: 16 September 2025).
Brownlee, J. (2021) Performance Metrics for Classification Models. Machine Learning
Mastery. Available at: [Link] (Accessed: 21 September 2025).
Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002) 'SMOTE: Synthetic
Minority Over-sampling Technique', Journal of Artificial Intelligence Research, 16, pp.
321–357. Available at: [Link] (Accessed: 21 September 2025).
Géron, A. (2023) Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.
3rd ed. Sebastopol: O'Reilly Media. Available at: [Link]
(Accessed: 22 September 2025).
Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. Cambridge, MA: MIT
Press. Available at: [Link] (Accessed: 15 October 2025).
Han, J., Kamber, M. and Pei, J. (2022) Data Mining: Concepts and Techniques. 4th ed.
Burlington: Morgan Kaufmann. Available at: [Link] (Accessed: 23 September 2025).
He, H. and Garcia, E.A. (2009) 'Learning from Imbalanced Data', IEEE Transactions on
Knowledge and Data Engineering, 21(9), pp. 1263–1284. Available at:
[Link] (Accessed: 23 September 2025).
Sun, Y., Wong, A. and Kamel, M. (2009) 'Classification of Imbalanced Data: A Review',
International Journal of Pattern Recognition and Artificial Intelligence, 23(4), pp. 687–
719. Available at: [Link] (Accessed: 24 September 2025).
Google Developers (2023) Accuracy, Precision, and Recall. Available at: [Link]
(Accessed: 25 September 2025).
Question Two.
Hastie, T., Tibshirani, R. and Friedman, J. (2017) The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. 2nd ed. New York: Springer. Available at:
[Link] (Accessed: 01 October 2025).
Ng, A. (2020) Machine Learning Yearning: Technical Strategy for AI Engineers.
Available at: [Link] (Accessed: 01 October 2025).
Sammut, C. and Webb, G.I. (2017) Encyclopedia of Machine Learning and Data Mining.
New York: Springer. Available at: [Link] (Accessed: 03 October 2025).
Géron, A. (2019) Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.
2nd ed. Sebastopol: O'Reilly Media. Available at: [Link]
(Accessed: 03 October 2025).
Brownlee, J. (2020) Data Preparation for Machine Learning: Data Cleaning, Feature
Selection, and Data Transforms in Python. Machine Learning Mastery. Available at:
[Link] (Accessed: 04 October 2025).
Raschka, S. and Mirjalili, V. (2020) Python Machine Learning. 3rd ed. Birmingham:
Packt Publishing. Available at: [Link] (Accessed: 05 October 2025).