Machine Learning 700 Study Guide
Registered with the Department of Higher Education as a Private Higher Education Institution under the Higher Education Act. Certificate No. 2000/HE07/008
LEARNER GUIDE
PREPARED ON BEHALF OF
All rights reserved; no part of this publication may be reproduced in any form or by any means, including photocopying
machines, without the written permission of the Institution.
Grus, J. (2019). Data Science from Scratch: First Principles with Python (2nd ed.). O'Reilly. ISBN: 9781492041139
The Information Technology (IT) qualification at Richfield College is a dynamic and future-focused
program designed to equip students with advanced technical, analytical, and problem-solving skills. At
the core of the qualification is a commitment to academic excellence, industry alignment, and innovation,
fostering graduates who are proficient in addressing modern technological challenges. This qualification
strategically integrates theoretical knowledge with practical applications, preparing students for various
roles in the IT sector. The IT program is structured to address the growing complexity of the evolving
technological landscape.
The Higher Certificate in Information Technology (HCIT) program is a foundational stepping-stone for students who wish to pursue further studies or enter the workforce. Graduates of this program are well-prepared to articulate to the Diploma in IT (DIT) or the Bachelor of Science in IT (BSc IT) qualifications, providing a seamless transition for those seeking to deepen their knowledge and skills in specialized IT areas. Additionally, the program equips students with the essential competencies for entry-level IT roles such as IT Support Technician, Junior Web/System Developer, and IT Administrator.
The Diploma in Information Technology (DIT) is a comprehensive and practical program designed to build
a strong foundation in IT principles while equipping students with the hands-on skills required to meet
industry demands. Focused on both theoretical knowledge and applied learning, this qualification
prepares students for intermediate-level roles in IT and serves as a stepping-stone for further academic
progression or specialization. Graduates of this program are well-prepared to articulate to the Bachelor
of Science in IT (BSc IT) qualification. The curriculum covers programming, networking, database management, systems analysis, and related areas, ensuring graduates possess the competencies to solve real-world IT challenges effectively.
The Bachelor of Science in IT (BSc IT) program is structured to address the growing complexity of the
evolving technological landscape. Through carefully curated modules, students gain a deep
understanding of software development, database management, cloud computing, cybersecurity, IT
management, artificial intelligence, machine learning, networking etc. Graduates of this program are
well-prepared to articulate to the Bachelor of Science Honours in IT qualification. The curriculum is
designed to bridge the gap between academic learning and real-world applications, thus fostering
innovation and an entrepreneurial mindset. Students are encouraged to participate in research and
practical learning.
The focus on emerging technologies within the IT qualification highlights a commitment to academic excellence and industry relevance. By integrating advanced concepts such as Artificial Intelligence, Machine Learning, Data Science, Cloud Computing, Big Data, and IoT, the program equips students with the skill set needed to navigate and innovate in a rapidly evolving technological landscape. The program's alignment with industry courses from globally recognized technology leaders such as Oracle, AWS, and IBM ensures that graduates possess the credentials to validate their expertise in emerging and disruptive technologies.
Machine Learning 700 provides an in-depth exploration of key machine learning concepts, techniques,
and applications. The module covers supervised, unsupervised, and reinforcement learning, with hands-
on experience using Python libraries such as NumPy, Pandas, and Scikit-Learn. Students learn data
preprocessing, feature engineering, and core machine learning algorithms, including regression and
classification techniques. The course also delves into model evaluation, optimization, and
hyperparameter tuning. Additionally, it introduces deep learning concepts, covering neural networks and
frameworks like TensorFlow and PyTorch, along with large language models (LLMs) and their applications
in natural language processing. By the end of the module, students will be equipped to build, assess, and
optimize advanced AI-driven models.
1 Chapter 1: Introduction to Machine Learning
LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:
Machine learning (ML) is a subfield of artificial intelligence (AI) that focuses on developing models
that learn from data to make predictions without explicit programming. It is widely applied in
industries such as healthcare, finance, marketing, and autonomous systems. Machine learning
models are trained to recognize patterns and make predictions by minimizing error and optimizing
performance. While ML is often considered the heart of data science, it is just one component of a
broader workflow that includes data collection, cleaning, and transformation.
1.2 Modeling
Before delving into machine learning algorithms, it is crucial to understand the concept of modeling.
A model is a mathematical framework that defines a relationship between different variables. For
example, in finance, a model may predict future revenue based on advertising spend, while in
medicine, a model may estimate a patient's risk of developing a disease. Machine learning models
are different from these traditional models because they learn from data rather than being explicitly
programmed. These models can improve their predictions as more data becomes available.
Familiar examples of models include:
• Business Models: A formula that predicts profit based on revenue and expenses.
• Recipe Models: A structured relationship between the number of servings and ingredient quantities.
• Poker Models: Probabilistic estimations of a player's chances of winning based on revealed cards.
There exists a myriad of ML algorithms, but the choice of algorithm depends on the type of output required from the system. ML algorithms are commonly stratified into two broad categories, supervised and unsupervised learning, with reinforcement learning forming a third paradigm.
In supervised learning, the model is trained using labeled data, meaning that each input is associated with an output. The goal is to learn a mapping function from input to output. Supervised learning works with a set of labeled input-output pairs called the training dataset; every observation in the training dataset must have both an input object and an output value. The supervised ML algorithm uses this training set to predict or classify unseen observations during the testing phase, and observations that were not included in the training dataset are treated as unknown instances. The two main supervised learning tasks are classification and regression, described below.
Classification can be either binary (distinguishing between two classes) or multiclass (distinguishing
between more than two classes). For instance, binary classification involves tasks like identifying
spam versus non-spam emails, where the question is "Is this email spam?" The iris classification is a
multiclass problem, as is predicting a website’s language based on its text.
Regression tasks, on the other hand, involve predicting a continuous value, such as a real number.
An example of regression is predicting a person’s annual income based on factors like education, age,
and location. The predicted value is a continuous number that can vary within a range. Another
example is forecasting the yield of a corn farm based on previous yields and weather conditions. A
simple way to distinguish between classification and regression is to consider whether the possible
outcomes are continuous. In regression, the output is continuous (e.g., annual income), while in
classification, the outputs are discrete (e.g., identifying the language of a website).
In unsupervised learning, the model identifies patterns in data without predefined labels. With unsupervised learning, the ML algorithm detects hidden patterns in a given dataset; there is no right or wrong answer, it is simply a case of running the algorithm and seeing which patterns and outcomes emerge. Unsupervised learning can be thought of as a learner without a teacher, because no labeled training dataset is required. The unsupervised learning algorithm groups observations based on the similarity and dissimilarity of the hidden patterns it discovers (see Figure 1.3).
Reinforcement learning (RL) involves training agents to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Common examples include game-playing agents and robotic control systems.
1.4 Overfitting and Underfitting
A major challenge in machine learning is ensuring that a model generalizes well to unseen data. Two
common issues are:
✓ Overfitting: The model learns noise instead of the underlying pattern, performing well on training
data but poorly on new data.
✓ Underfitting: The model is too simple to capture patterns in the data, leading to poor
performance even on the training set.
Consider a dataset of student exam scores based on study hours. A model that is too simple (e.g., a
straight line) underfits the data, while a highly complex polynomial model that fits every data point
overfits.
Fig 1.1 Overfitting vs. Underfitting
Fig 1.1 visually illustrates the concepts of underfitting and overfitting, two common pitfalls in machine
learning model training. The code generates sample data that follows a sine wave pattern with some
added random noise, mimicking a realistic dataset. It then attempts to fit two different types of
models to this data. First, it fits a simple linear regression model, which assumes a straight-line
relationship.
This model demonstrates underfitting, as the straight line is too simple to capture the true, non-
linear pattern of the data. In the resulting graph, this is represented by a green line that poorly
follows the blue data points. The model's simplicity prevents it from learning the fundamental
relationships within the data. Next, the code fits a highly complex polynomial regression model with
a degree of 10. This model attempts to fit a curve that passes through, or extremely close to, every
single data point, thus representing overfitting. In the image, this is represented by the red line,
which is overly complex and traces every minor fluctuation of the training data. While it fits the
training data almost perfectly, it does so by learning the noise within the dataset, making it unlikely
to predict new, unseen data points accurately.
Finally, the figure suggests that if an appropriate degree had been chosen, somewhere between a straight line and a degree-10 polynomial, the model would have generalized better and fit the data more faithfully. The blue dots in the image represent the data that the models are fit on. The three concepts at play are underfitting, when the model is too simple; overfitting, when it is too complex; and generalization, when the model is neither. These examples show the critical importance of finding the right level of model complexity: a model should strike a balance between fitting the training data and generalizing to new, unseen data.
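A figure like Fig 1.1 can be reproduced with a short script along the following lines (a minimal sketch: the sample size, noise level, and random seed are illustrative assumptions, not the exact values used to create the figure):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Noisy sine-wave data
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 6, 30)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)

# Underfitting: a straight line is too simple for the sine pattern
linear_model = LinearRegression().fit(X, y)

# Overfitting: a degree-10 polynomial chases the noise in the training data
poly_model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression()).fit(X, y)

X_plot = np.linspace(0, 6, 200).reshape(-1, 1)
plt.scatter(X, y, color='blue', label='Training data')
plt.plot(X_plot, linear_model.predict(X_plot), color='green', label='Linear model (underfit)')
plt.plot(X_plot, poly_model.predict(X_plot), color='red', label='Degree-10 polynomial (overfit)')
plt.legend()
plt.show()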
1.5 Correctness
Evaluating the correctness of a machine learning model requires more than simply measuring its
accuracy. Accuracy, defined as the proportion of correct predictions over the total number of
predictions, is often an inadequate measure, especially when dealing with imbalanced datasets. For example, a test that predicts leukemia based solely on a child's name might achieve over 98% accuracy simply because leukemia is rare, yet such a model would be meaningless because it does not use any relevant medical features.
To better understand correctness, machine learning models are evaluated using a confusion matrix, which categorizes predictions into four groups:
✓ True Positive (TP): The model correctly predicts a positive case (e.g., detecting spam correctly).
✓ False Positive (FP): The model incorrectly predicts a positive case when it is actually negative (Type I error).
✓ False Negative (FN): The model incorrectly predicts a negative case when it is actually positive (Type II error).
✓ True Negative (TN): The model correctly predicts a negative case.
From these four counts, several evaluation metrics are derived:
✓ Precision: The proportion of correctly predicted positive cases among all predicted positive cases, TP / (TP + FP).
✓ Recall: The proportion of actual positive cases that were correctly predicted, TP / (TP + FN).
✓ F1 Score: The harmonic mean of precision and recall, balancing both measures.
The tradeoff between precision and recall is crucial when designing machine learning models. For
example, in medical diagnosis, a model with high recall (capturing most positive cases) may be
preferred to minimize false negatives, whereas in spam detection, a model with high precision
(minimizing false positives) might be more important.
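As an illustration, these metrics can be computed with scikit-learn (a minimal sketch with made-up labels, where 1 marks the positive class):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))              # rows: actual class, columns: predicted class
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))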
The bias-variance tradeoff is a fundamental concept that helps explain the balance between model
complexity and generalization. It describes two types of errors that a machine learning model can
make.
Bias
The error introduced when a model is too simplistic to capture the underlying structure of the data.
High-bias models (e.g., linear regression with very few features) tend to underfit the data, leading to
systematically incorrect predictions.
Variance
The error due to the model being too sensitive to small fluctuations in the training data. High-
variance models (e.g., deep neural networks with too many parameters) tend to overfit, capturing
noise rather than meaningful patterns.
If a model has high bias, it performs poorly even on the training data, indicating underfitting. One
way to address this is by adding more features or using a more complex model. Conversely, if a model
has high variance, it performs well on training data but poorly on unseen data, indicating overfitting.
To reduce variance, we can:
• Use regularization techniques such as L1/L2 penalties to constrain the model complexity.
• Gather more training data, so the model is less sensitive to individual observations.
• Reduce the number of features or choose a simpler model.
A well-balanced machine learning model must manage both bias and variance to achieve a low total error. This balance is crucial when designing models for real-world applications, where both prediction accuracy and generalization are important.
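As a quick illustration of how an L2 penalty can reduce variance (a sketch on synthetic data; the dataset size and alpha value are arbitrary choices for demonstration):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data with many features relative to the number of samples
X, y = make_regression(n_samples=60, n_features=40, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain linear regression (prone to high variance here) versus L2-regularized Ridge
for name, model in [("Linear", LinearRegression()), ("Ridge(alpha=10)", Ridge(alpha=10))]:
    model.fit(X_train, y_train)
    print(name, "train R^2:", round(model.score(X_train, y_train), 2),
          "test R^2:", round(model.score(X_test, y_test), 2))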
Feature extraction
Feature extraction involves transforming raw data into numerical representations that machine
learning models can process effectively. This transformation is particularly important when working
with unstructured data such as text or images. For instance, in text data, words and phrases can be
converted into numerical vectors using techniques such as word embeddings, which capture
semantic relationships between words. Similarly, in image data, meaningful features such as pixel
intensities, edges, or texture patterns can be extracted to facilitate classification tasks. Feature
extraction ensures that raw data is transformed into a structured format that enhances the learning
process of machine learning algorithms.
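As a small illustration of feature extraction from text, a TF-IDF vectorizer turns raw sentences into a numerical matrix (a sketch; the example documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # one row per document, one column per word

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))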
Feature selection
Feature selection focuses on identifying the most relevant features from a given dataset to improve
model performance and generalizability. The presence of too many irrelevant or redundant features
can lead to overfitting, where the model learns noise rather than meaningful patterns. To mitigate
this issue, various feature selection techniques are employed. Filter methods rely on statistical tests,
such as correlation coefficients or mutual information, to assess the relevance of each feature
independently of the learning algorithm. In contrast, wrapper methods evaluate subsets of features
based on their impact on model performance, often using iterative techniques such as recursive
feature elimination or forward selection. By selecting only the most informative features, feature
selection enhances model efficiency, reduces computational complexity, and improves
interpretability.
These two processes are essential components of the machine learning pipeline, as they enable
models to learn effectively from data while avoiding unnecessary complexity.
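As a small illustration of a filter-based feature selection method, the sketch below keeps the ten features of the Breast Cancer dataset with the highest mutual information scores (the choice of k=10 is arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)           # (569, 30)
print("Reduced shape:", X_selected.shape)   # (569, 10)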
Machine learning has significantly improved healthcare by enabling early disease detection,
personalized treatment plans, and predictive analytics. ML models analyze medical images, such as
X-rays and MRIs, to detect diseases like cancer with high accuracy. In genomics, ML is used for drug
discovery and precision medicine, tailoring treatments to individual patients based on genetic
profiles. Additionally, wearable health devices use ML to monitor vital signs and detect abnormalities
in real-time, assisting in preventive care.
The financial sector leverages machine learning for fraud detection, algorithmic trading, credit
scoring, and customer service automation. ML models identify fraudulent transactions by detecting
anomalies in user behavior. In investment banking, ML-driven algorithms analyze market trends and
execute high-frequency trades, optimizing returns. Credit scoring models assess borrowers' risk levels
by analyzing historical data and predicting default probabilities. Additionally, chatbots and virtual
assistants powered by ML enhance customer interactions in banking services.
Transportation
The transportation industry benefits from ML in traffic management, predictive maintenance, and
autonomous vehicles. Ride-hailing services like Uber use ML to optimize route selection, reduce wait
times, and set dynamic fares. Autonomous vehicles, such as those developed by Tesla and Waymo,
rely on deep learning models to interpret sensor data, recognize obstacles, and make driving
decisions in real-time. ML also aids airlines in flight scheduling, fuel optimization, and predictive
aircraft maintenance.
Machine learning plays a crucial role in predictive maintenance, quality control, and supply chain
optimization in manufacturing. ML models analyze sensor data to detect early signs of equipment
failure, preventing costly downtime. In quality control, computer vision-based ML systems inspect
products for defects with higher accuracy than human inspectors. Additionally, ML optimizes supply
chain logistics by predicting demand fluctuations and improving inventory management.
Natural Language Processing (NLP) and Chatbots
Natural Language Processing, a subset of ML, powers applications such as chatbots, virtual assistants,
and sentiment analysis. Voice assistants like Siri, Alexa, and Google Assistant leverage ML to
understand and process spoken commands. Businesses use AI-powered chatbots to handle customer
queries, automate responses, and enhance user engagement. Sentiment analysis tools analyze social
media posts and customer reviews to gauge public opinion and brand perception.
In the education sector, ML enables personalized learning experiences, automated grading, and
student performance prediction. Adaptive learning platforms tailor educational content based on
students’ learning styles and progress. ML models analyze student engagement and performance
data to identify areas requiring improvement. Additionally, plagiarism detection tools leverage ML to
identify academic misconduct in research papers and assignments.
Machine learning revolutionizes agriculture by optimizing crop yields, detecting plant diseases, and
automating irrigation systems. ML-powered drones and satellite imagery help farmers monitor soil
health, assess crop conditions, and detect pest infestations. Smart irrigation systems use ML to
analyze weather data and soil moisture levels, ensuring efficient water usage. Additionally, predictive
analytics helps farmers determine the best planting and harvesting times.
The entertainment industry benefits from ML in content recommendation, video editing, and music
generation. Streaming platforms like Netflix and Spotify use ML algorithms to suggest personalized
content based on user preferences. AI-powered tools assist in video editing by automating scene
transitions, color correction, and special effects. Additionally, generative models, such as OpenAI’s
GPT and DeepMind’s WaveNet, create human-like text and music compositions.
Machine learning continues to revolutionize industries by enabling automation, enhancing decision-
making, and improving efficiency. As advancements in ML and artificial intelligence (AI) continue,
their applications will further expand, driving innovation across diverse fields. Whether in healthcare,
finance, cybersecurity, or entertainment, ML remains a powerful tool shaping the future of
technology and society.
NumPy
NumPy is one of the fundamental packages for scientific computing in Python. It contains
functionality for multidimensional arrays, high-level mathematical functions such as linear algebra
operations and the Fourier transform, and pseudorandom number generators.
Pandas
Pandas introduces additional data structures for managing datasets in Python. Its primary data structure is the DataFrame, which is conceptually similar to a table in a relational database or a spreadsheet (it plays the role of the NotQuiteABase Table class built in the prescribed textbook), but with far more functionality and better performance. If you need to clean, slice, group, and manipulate datasets in Python, pandas is an indispensable tool.
Scikit-learn
Scikit-learn is arguably the most popular Python library for machine learning. It includes implementations of all the models discussed in this guide and many more. In real-world applications, you wouldn't build a decision tree or an optimization algorithm from scratch; instead, you'd rely on scikit-learn to handle the heavy lifting. Its documentation is filled with examples showcasing its capabilities and providing a deeper understanding of machine learning.
The goal of this exercise is to familiarize yourself with fundamental Python libraries used in machine learning, such as NumPy, Pandas, Scikit-learn, and Matplotlib. First, install these libraries using pip install numpy pandas scikit-learn matplotlib. Next, load and explore a sample dataset using Pandas by reading the famous Iris dataset from a URL and displaying basic statistical summaries with df.describe(). Then, train a simple machine learning model using Scikit-learn's LogisticRegression.
The dataset is split into training and testing sets with train_test_split(), and the model is trained using
the fit() function. Finally, the accuracy of the model is evaluated on the test set using the score()
method. This exercise provides a hands-on introduction to essential steps in machine learning: data
loading, exploration, model training, and evaluation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset (the commonly used seaborn-data copy of the CSV is assumed here)
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
print(df.describe())

# Prepare data
X = df.drop(columns=['species'])
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate accuracy
print("Test accuracy:", model.score(X_test, y_test))
1. What is machine learning, and how does it differ from traditional programming?
2. Explain the differences between supervised, unsupervised, and reinforcement learning. Provide
examples of each.
4. Define the bias-variance tradeoff and explain why it is important in machine learning.
5. What are feature extraction and feature selection? Why are they important?
2 Chapter Two: Data Preprocessing and Feature Engineering
LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:
Before applying machine learning (ML) algorithms, it is crucial to understand the nature of the data
being used. The success of an ML model depends on the quality, structure, and type of data available.
Datasets can generally be classified into two broad categories: structured and unstructured data.
Handling these different types of data requires specific techniques and tools to ensure accurate
analysis and meaningful predictions.
Structured Data
Structured data refers to information that is highly organized and stored in a well-defined format,
such as tables in relational databases, spreadsheets, or structured text files (CSV, JSON, XML). In
structured datasets, each row represents an observation (data instance), and each column
corresponds to a specific attribute (feature). The structured nature of this data makes it easy to
process using database management systems (DBMS), SQL queries, and data manipulation tools such
as Pandas in Python. Fig 2.1 below is an example of a structured database.
Fig 2.1 Example of a structured dataset used for customer purchase analysis.
Unstructured Data
Unstructured data refers to information that does not follow a predefined format or schema. It
includes a variety of data types such as text, images, audio, and video, which cannot be easily stored
in tabular form. Handling unstructured data requires advanced machine learning techniques such as
Natural Language Processing (NLP) for textual data and Convolutional Neural Networks (CNNs) for
image analysis. Fig 2.2 below shows the different types of unstructured data.
Understanding the structure of data is essential before applying machine learning models. Structured
data is well-organized and easily processed using traditional algorithms, while unstructured data
requires advanced techniques such as deep learning and NLP for meaningful analysis. By selecting
the appropriate preprocessing methods, machine learning practitioners can extract valuable insights
from both structured and unstructured datasets, leading to more accurate and efficient models.
In real-world applications, datasets are rarely perfect; they often contain missing values,
inconsistencies, errors, or duplicate records. Data cleaning is a critical preprocessing step in machine
learning that ensures the dataset is accurate, reliable, and suitable for model training. Poor-quality
data can lead to inaccurate predictions and unreliable models, making effective data cleaning
essential for improving model performance and generalizability.
Data cleaning involves several key tasks, including handling missing values, identifying and removing
outliers, and eliminating duplicate records. These steps help maintain data integrity and ensure that
machine learning algorithms can extract meaningful patterns from the dataset.
Missing values occur when certain observations lack data for specific attributes. These gaps in the
dataset can arise due to human error, sensor failures, incomplete data collection, or data corruption.
If not addressed properly, missing values can introduce bias, reduce model accuracy, and lead to
incorrect conclusions. Missing values are typically categorized into three types:
1. Missing Completely at Random (MCAR): The missing values occur independently of any
factor, making them unpredictable.
2. Missing at Random (MAR): The missing values depend on observed data but not on the
missing data itself.
3. Missing Not at Random (MNAR): The missing values are dependent on the missing data itself,
making imputation more complex.
Handling missing values is an essential step in data preprocessing, as unaddressed missing data can
lead to misleading results and poor model performance. The choice of method depends on the extent
of missing data, the nature of the dataset, and the machine learning task. While simple approaches
like mean imputation work well for numerical data, more sophisticated techniques, such as
predictive imputation, are necessary when dealing with complex missing data patterns.
Effective data cleaning ensures that machine learning models can learn from high-quality, well-
structured data, leading to more reliable and accurate predictions.
If only a small percentage of data points contain missing values, removing these rows can
prevent contamination of the dataset. This approach is suitable when missing values occur
randomly and their removal does not introduce bias. However, removing data can lead to
loss of valuable information if the dataset is already small. The code below demonstrates
how to remove the rows with missing values.
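A minimal sketch of row removal with pandas (the sample DataFrame below is assumed for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, np.nan, 45],
                   'City': ['Cape Town', None, 'Durban', 'Pretoria']})
df_clean = df.dropna()     # drop every row that contains at least one missing value
print(df_clean)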
✓ Mean imputation (suitable for numerical data that follows a normal distribution).
✓ Mode imputation (suitable for categorical data).
The code below demonstrates how to fill in missing data using the mean (for a numerical column) and the mode (for a categorical column).
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data
data = {'Age': [25, 30, None, 45, 50], 'Salary': [50000, None, 60000, 80000, 90000]}
df = pd.DataFrame(data)

# Replace every missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
df.iloc[:, :] = imputer.fit_transform(df)
print(df)
3. Predicting Missing Values Using Machine Learning
In cases where missing values are extensive, machine learning models can predict them using
existing features. k-Nearest Neighbors (k-NN) or regression models can estimate missing
values by analyzing patterns in the available data. The code below uses k-NN Imputer to fill
in the missing values.
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)      # estimate each missing value from its 5 nearest neighbours
df_filled = imputer.fit_transform(df)
Outliers are data points that are significantly different from the rest of the dataset. They can occur
due to measurement errors or genuine extreme values. Outlier detection is crucial for improving
model accuracy and robustness. Common techniques for handling outliers:
✓ Remove outliers: Discard values that fall outside a statistical range, such as 1.5 × IQR beyond the quartiles or three standard deviations from the mean.
✓ Cap extreme values: Replace extreme values with a predefined threshold (e.g., the 5th and 95th percentiles).
The code below caps a small sample array at the 5th and 95th percentiles (the sample values are illustrative):
import numpy as np
data = np.array([10, 12, 11, 13, 500, 12, 11])                # sample values with one obvious outlier
filtered_data = np.clip(data, *np.percentile(data, [5, 95]))  # cap extreme values
print(filtered_data)
2.2.3 Removing Duplicates
Duplicate records can cause bias in the dataset, leading to misleading model predictions. Removing
duplicates ensures that each observation is unique. Cleaning redundant data prevents models from
being skewed by repeated observations. The drop_duplicates() method removes duplicate rows
from the dataset in-place, modifying the original DataFrame.
df.drop_duplicates(inplace=True)
Feature scaling ensures that all variables contribute equally to model training. If one feature has
values in the range of thousands and another in decimal points, models like Support Vector Machines
(SVM) and k-Nearest Neighbors (k-NN) may become biased. Feature scaling improves the
performance of distance-based algorithms.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample dataset: two features on very different scales
data = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 5000.0]])

scaler = MinMaxScaler()                      # rescale each feature to the [0, 1] range
scaled_data = scaler.fit_transform(data)

standardizer = StandardScaler()              # zero mean, unit variance per feature
standardized_data = standardizer.fit_transform(data)
2.4 Principal Component Analysis (PCA)
In machine learning and data analysis, dimensionality reduction is a crucial preprocessing step used to
reduce the number of input features in a dataset while retaining as much relevant information as
possible. High-dimensional data can pose several challenges, collectively referred to as the curse of
dimensionality. These challenges include:
✓ Increased computational cost – As the number of features grows, so does the complexity of the
model, requiring more processing power and memory.
✓ Overfitting risk – A model with too many features may learn noise instead of meaningful patterns,
leading to poor generalization on new data.
✓ Reduced interpretability – When a dataset has a large number of features, understanding the
relationships between variables becomes increasingly difficult.
Dimensionality reduction techniques address these problems by simplifying the dataset while preserving
its core structure and information. Among these techniques, Principal Component Analysis (PCA) is one
of the most widely used methods.
Principal Component Analysis (PCA) is a statistical technique that transforms a high-dimensional dataset
into a lower-dimensional space while maintaining as much variance (information) as possible. Rather
than simply removing features, PCA creates new features that are linear combinations of the original
ones. These new features, called principal components, capture the directions of maximum variance in
the dataset. The primary objectives of PCA are to retain as much of the original variance as possible in fewer dimensions, to remove redundancy (correlation) between features, and to improve computational efficiency, interpretability, and visualization.
In the example below, Principal Component Analysis (PCA) is applied to a dataset containing two highly
correlated features to demonstrate how PCA identifies the direction of maximum variance and
transforms the dataset into a new coordinate system. The dataset consists of Feature X, which is
randomly generated, and Feature Y, which is a linear function of Feature X with added noise, creating a
strong correlation between the two features. The first scatter plot (blue) represents the original dataset,
where the elongated cluster of points suggests redundancy in the data, as both features contain nearly
identical information. To extract the most meaningful insights, the dataset is standardized to ensure all
features contribute equally, followed by the application of PCA, which finds the principal components—
new axes that maximize variance. The second scatter plot (red) illustrates the transformed dataset, where
Principal Component 1 (PC1) captures the majority of the variance, while Principal Component 2 (PC2)
contributes only minimally (0.59%). The explained variance ratio confirms that PC1 captures 99.4% of the
variance, making PC2 largely insignificant. This transformation enables dimensionality reduction while
preserving essential patterns in the data, leading to more efficient machine learning models with reduced
computational complexity and improved interpretability.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)

# Feature Y is a noisy linear function of Feature X (reconstructed to match the description above)
x = np.random.rand(100) * 10
y = 2 * x + np.random.randn(100)

# Create a DataFrame
df = pd.DataFrame({'Feature X': x, 'Feature Y': y})

# Standardize so both features contribute equally
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply PCA
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(df['Feature X'], df['Feature Y'], color='blue')
plt.title("Original Correlated Data")
plt.xlabel("Feature X")
plt.ylabel("Feature Y")
plt.grid()
plt.subplot(1, 2, 2)
plt.scatter(df_pca[:, 0], df_pca[:, 1], color='red')
plt.title("Data After PCA")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid()
plt.show()

# Proportion of variance captured by each component
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
2.4.3 Choosing the Right Number of Principal Components
A key challenge in PCA is determining how many principal components to retain. The explained variance
ratio helps in making this decision by indicating the proportion of total variance captured by each
component.
1. Explained Variance Ratio
The sum of all eigenvalues represents the total variance in the dataset. Each principal component's eigenvalue represents the fraction of total variance it explains. The cumulative explained variance helps determine how many components are necessary to retain meaningful information.
2. Scree Plot
✓ A scree plot displays the explained variance ratio for each principal component.
✓ The "elbow point" in the plot indicates the optimal number of components.
✓ Example: If the first three components explain 92% of the variance, the rest can be discarded.
3. Cross-Validation
✓ Test different numbers of components and evaluate model performance to determine the
best balance between accuracy and complexity.
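A short sketch of this decision process on the Breast Cancer dataset (the 95% variance threshold used here is an arbitrary illustrative choice):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and examine the cumulative explained variance
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Components needed for 95% of the variance:", np.argmax(cumulative >= 0.95) + 1)

# Alternatively, let scikit-learn choose the number of components for a target variance
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print("Number of components retained:", pca_95.n_components_)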
Fig 2.4 illustrates the effect of PCA on a synthetic two-dimensional dataset. The first plot (top left) shows
the original data points, colored to distinguish between them. The algorithm proceeds by first finding the
direction of maximum variance, labeled “Component 1.” This is the direction (or vector) in the data that
contains most of the information, or in other words, the direction along which the features are most
correlated with each other. Then, the algorithm finds the direction that contains the most information
while being orthogonal (at a right angle) to the first direction. In two dimensions, there is only one
possible orientation that is at a right angle, but in higher-dimensional spaces there would be (infinitely)
many orthogonal directions. Although the two components are drawn as arrows, it doesn’t really matter
where the head and the tail are; we could have drawn the first component from the center up to the top left instead of down to the bottom right. The directions found using this process are called principal
components, as they are the main directions of variance in the data. In general, there are as many
principal components as original features.
The second plot (top right) shows the same dataset but with a rotation applied, such that the first
principal component aligns with the x-axis and the second principal component aligns with the y-axis.
Before this rotation, the data had the mean subtracted to center it around zero. After the rotation, the
two axes become uncorrelated, meaning the correlation matrix of the data in this new configuration has
zeros off the diagonal. PCA can be used for dimensionality reduction by retaining only a subset of the
principal components. In this case, we could keep just the first principal component, as shown in the
third panel (bottom left), which reduces the data from two dimensions to one. It’s important to note that
instead of keeping just one of the original features, we’ve identified and preserved the most important
direction (from top left to bottom right in the first panel) — the first principal component. Finally, by
reversing the rotation and adding the mean back to the data, we get the last panel. The points are now
back in the original feature space, but only the information from the first principal component is
preserved. This transformation is often used to remove noise or to visualize the retained information
through the principal components.
One of the most common applications of PCA is to visualize high-dimensional datasets. As discussed in
Chapter 1, visualizing data with more than two features can be challenging. For the Iris dataset, we could
use a pair plot (Figure 1-3 in Chapter 1) to show combinations of two features for a partial view of the
data. However, for the Breast Cancer dataset, a pair plot becomes unmanageable because it has 30
features, resulting in 435 scatter plots (30 × 29 / 2). Analyzing all of these plots in detail would be impossible.
A simpler visualization method is to create histograms for each feature, comparing the two classes:
benign and malignant cancers.
Before applying PCA, we scale the data using the StandardScaler to ensure each feature has unit
variance. Applying the PCA transformation is as straightforward as any other preprocessing step. We start
by creating a PCA object, fitting it to the data to identify the principal components, and then applying the
rotation and dimensionality reduction by calling the transform method. By default, PCA rotates (and
shifts) the data without reducing its dimensionality. To reduce the dimensionality, we must specify how
many components to retain when initializing the PCA object:
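A minimal sketch of these steps on the Breast Cancer dataset, keeping the first two principal components:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)   # unit variance for every feature

# Keep only the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Original shape:", X_scaled.shape)   # (569, 30)
print("Reduced shape:", X_pca.shape)       # (569, 2)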
Eigenfaces for feature extraction
Another application of PCA that we mentioned earlier is feature extraction. The idea behind feature
extraction is that it is possible to find a representation of your data that is better suited to analysis than
the raw representation you were given. A great example of an application where feature extraction is
helpful is with images. Images are made up of pixels, usually stored as red, green, and blue (RGB)
intensities. Objects in images are usually made up of thousands of pixels, and only together are they
meaningful. We will give a very simple application of feature extraction on images using PCA, by working
with face images from the Labeled Faces in the Wild dataset. This dataset contains face images of
celebrities downloaded from the Internet, and it includes faces of politicians, singers, actors, and athletes
from the early 2000s. We use grayscale versions of these images, and scale them down for faster
processing.
There are 3,023 images, each 87×65 pixels large, belonging to 62 different people:
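The dataset can be loaded with scikit-learn (a sketch; the min_faces_per_person and resize settings shown here are assumptions chosen to match the figures quoted above):

from sklearn.datasets import fetch_lfw_people

# Downloads the Labeled Faces in the Wild data on first use
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
print("Image array shape:", people.images.shape)      # (3023, 87, 65)
print("Number of people:", len(people.target_names))  # 62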
This is where PCA comes in. Computing distances in the original pixel space is quite a bad way to measure
similarity between faces. When using a pixel representation to compare two images, we compare the
grayscale value of each individual pixel to the value of the pixel in the corresponding position in the other
image. This representation is quite different from how humans would interpret the image of a face, and
it is hard to capture the facial features using this raw representation. For example, using pixel distances
means that shifting a face by one pixel to the right corresponds to a drastic change, with a completely
different representation. We hope that using distances along principal components can improve our
accuracy. Here, we enable the whitening option of PCA, which rescales the principal components to have
the same scale. This is the same as using StandardScaler after the transformation. Reusing the data again,
whitening corresponds to not only rotating the data, but also rescaling it so that the center panel is a
circle instead of an ellipse:
We fit the PCA object to the training data and extract the first 100 principal components. Then we
transform the training and test data:
The new data has 100 features, the first 100 principal components. Now, we can use the new
representation to classify our images using a one-nearest-neighbors classifier:
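A sketch of the whitened-PCA pipeline with a one-nearest-neighbor classifier (for brevity this omits the per-person subsampling used in the textbook, so the exact accuracies will differ from the figures quoted below):

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X = people.data / 255.0        # scale pixel intensities to the 0-1 range
y = people.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Whitened PCA: rotate the data and rescale the components to the same scale
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train_pca, y_train)
print("Test accuracy:", knn.score(X_test_pca, y_test))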
Our accuracy improved quite significantly, from 26.6% to 35.7%, confirming our intuition that the
principal components might provide a better representation of the data.
For image data, we can also easily visualize the principal components that are found. Remember that
components correspond to directions in the input space. The input space here consists of 87×65-pixel grayscale images, so directions within this space are also 87×65-pixel grayscale images.
Although we cannot fully comprehend all the details captured by these components, we can make
educated guesses about what aspects of the face images they represent. For example, the first
component seems to primarily capture the contrast between the face and its background, while the
second component appears to reflect lighting differences between the left and right sides of the face,
and so on. While this representation is somewhat more interpretable than the raw pixel values, it still
differs significantly from how a human would perceive a face. Since PCA is based on pixel values, factors
such as the alignment of facial features (like the eyes, chin, and nose) and the lighting conditions strongly
influence how similar two images are in terms of their pixel representations. However, alignment and
lighting are not typically the primary factors humans use to judge facial similarity. People tend to focus
on attributes like age, gender, facial expression, and hairstyle, which are difficult to derive from pixel
intensity alone. It's important to remember that algorithms often interpret data—especially visual data
like images—in ways that differ from human perception.
Let’s come back to the specific case of PCA, though. We introduced the PCA transformation as rotating
the data and then dropping the components with low variance. Another useful interpretation is to try to
find some numbers (the new feature values after the PCA rotation) so that we can express the test points
as a weighted sum of the principal components, roughly x_new ≈ x0 · component_0 + x1 · component_1 + x2 · component_2 + ….
In this context, x0, x1, and so on represent the coefficients of the principal components for a given data
point, essentially providing the image's representation in the transformed space. Another way to gain
insight into the workings of a PCA model is by examining the reconstructions of the original data using
only a subset of the components. As demonstrated in the figure above, after removing the second
component and reaching the third panel, we reversed the rotation and reintroduced the mean to obtain
new points in the original space with the second component excluded, as shown in the final panel. A
similar transformation can be applied to face images by reducing the data to a smaller number of
principal components, then reversing the transformation back into the original space. This return to the
original feature space can be accomplished using the inverse_transform method. In this case, we visualize
the reconstruction of several faces using 10, 50, 100, 500, or 2,000 components.
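A sketch of how such a reconstruction can be produced with inverse_transform (using 50 components, one of the values listed above, and the X_train/X_test arrays from the earlier face example):

from sklearn.decomposition import PCA

pca_50 = PCA(n_components=50, whiten=True, random_state=0).fit(X_train)
X_test_50 = pca_50.transform(X_test)

# Map the 50-component representation back into the original 87x65 pixel space
reconstructed = pca_50.inverse_transform(X_test_50)
print(reconstructed.shape)    # (number of test images, 5655 pixels)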
You can see that when we use only the first 10 principal components, only the essence of the picture,
like the face orientation and lighting, is captured. By using more and more principal components, more
and more details in the image are preserved. This corresponds to extending the sum to include more and
more terms. Using as many components as there are pixels would mean that we would not discard any
information after the rotation, and we would reconstruct the image perfectly. We can also try to use PCA
to visualize all the faces in the dataset in a scatter plot using the first two principal components, with
classes given by who is shown in the image, similarly to what we did for the cancer dataset:
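A short sketch of that scatter plot, continuing from the PCA-transformed face data above:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=y_train, cmap='viridis', s=8)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()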
2.6 Summary of Key Concepts
Concept Key Takeaways
6. How does PCA work, and when should you use it?
8. How do you decide how many principal components to retain?
Practical Questions
1. Load the Iris dataset, preprocess it, and apply PCA to reduce the dimensionality.
2. Apply PCA on the Breast Cancer dataset and check if model accuracy improves after reducing
dimensions.
3. Generate synthetic high-dimensional data using NumPy and visualize it after PCA transformation.
1. A hospital has thousands of patient records with various medical measurements. They want to predict
heart disease risk while reducing redundant features. Would you recommend PCA? Why or why not?
3. A financial analyst is using 100+ stock market indicators to predict prices. Some indicators are
correlated. How might PCA improve the model?
3 Chapter 3: Supervised Learning – Regression Algorithms
LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:
• Understand supervised learning and the role of regression in predicting continuous values.
• Deeply explore common regression algorithms: Linear Regression, Polynomial Regression, and
Decision Tree Regression.
• Break down key mathematical concepts underlying these models.
• Implement a practical house price prediction model using code, visualizations, and interpret
the results.
Supervised learning is a fundamental branch of machine learning where an algorithm learns from
labeled data. Each data point consists of input features (X) and a corresponding output (Y). The goal
of the algorithm is to learn a mapping function f: X → Y that accurately predicts outputs for unseen
data. The training process involves adjusting model parameters to minimize the error between
predicted and actual values.
✓ Learning Function: The model approximates the relationship between inputs and outputs
using a mathematical function.
This framework is the foundation for many applications, such as predicting house prices based on
features like square footage, location, and age of the property. Supervised learning is the foundation
for various real-world applications such as predicting house prices, stock trends, and sales revenue
forecasting. Recent studies (Zhang & Lee, 2021) highlight that the performance of supervised
learning models significantly depends on the quality of labeled data and the appropriate selection of
hypothesis functions.
3.2 Regression
Regression is a supervised learning technique specifically used for predicting continuous numerical
outcomes. In regression, the model learns a mapping from a set of features to a continuous output.
Real-World Examples:
✓ House Prices: Predicting the selling price of a home based on features such as square footage,
number of bedrooms, and location.
✓ Stock Prices: Forecasting future stock prices using historical price trends and financial indicators.
✓ Sales Revenue: Estimating future sales based on advertising expenditure and market trends.
Common Regression Algorithms include Linear, Polynomial and Decision Tree Regression.
Linear Regression assumes a linear relationship between the independent variable X and the dependent variable Y. It is represented by the equation:
Y = mX + b
Where:
• m is the slope (the change in Y for a one-unit change in X)
• b is the intercept (the value of Y when X = 0)
Linear regression operates under several key assumptions, including linearity, independence,
homoscedasticity (constant variance of errors), and normally distributed errors (Nguyen & Chen,
2019). The model parameters, typically represented as 𝑚 (slope) and b (intercept), are estimated
using the least squares method, which minimizes the sum of the squared differences between actual
and predicted values. One of the primary advantages of linear regression is its interpretability; each
coefficient directly represents the change in the output variable for a one-unit change in the
corresponding input variable, making it a valuable tool for understanding relationships within data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data: square footage vs. price in $1000s (illustrative values)
X = np.array([[1000], [1500], [1800], [2200], [2500], [3000]])
y = np.array([200, 280, 320, 400, 450, 520])

model = LinearRegression()
model.fit(X, y)

# Predict the price of a 2,000 sq ft house
new_house = np.array([[2000]])
predicted_price = model.predict(new_house)
print("Predicted price ($1000s):", predicted_price[0])

# Visualizing the regression line
plt.scatter(X, y, color='blue', label='Actual prices')
plt.plot(X, model.predict(X), color='red', label='Regression line')
plt.xlabel("Square Footage")
plt.ylabel("Price ($1000s)")
plt.legend()
plt.show()
3.4 Polynomial Regression
When the relationship between X and Y is non-linear, Polynomial Regression can be used to model the data. It fits a polynomial equation of the form:
Y = b0 + b1X + b2X^2 + … + bnX^n
Polynomial regression offers greater flexibility by capturing complex relationships through the
inclusion of higher-degree terms. However, this flexibility comes with the risk of overfitting, as
higher-degree polynomials may fit the training data exceptionally well but fail to generalize to unseen
data. To achieve an effective model, careful selection of the polynomial degree is essential, balancing
bias and variance to avoid both overfitting and underfitting.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Degree-2 polynomial model (continuing with X, y, new_house, and plt from the previous example)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

# Predict the house price for 2000 sqft using polynomial regression
predicted_price_poly = poly_model.predict(new_house)

# Visualizing the polynomial fit
plt.scatter(X, y, color='blue', label='Actual prices')
plt.plot(X, poly_model.predict(X), color='green', label='Polynomial fit')
plt.xlabel("Square Footage")
plt.ylabel("Price ($1000s)")
plt.legend()
plt.show()
3.5 Decision Tree Regression
Decision Tree Regression is a non-parametric model that splits the data into branches based on
feature values. Each terminal node (leaf) represents a decision rule, and the prediction is typically
the average of the outcomes in that node.
Decision trees structure their model by recursively partitioning the feature space, where each node
selects the feature and threshold that best minimizes prediction error. One of their key advantages
is interpretability, as the decision-making process at each split is transparent and can be easily
visualized. However, decision trees are prone to overfitting, especially if they are too deep or not
properly pruned. Recent advancements in machine learning have led to the development of
ensemble methods such as Random Forests and Gradient Boosting, which build upon decision trees
to enhance performance and reduce variance.
Decision tree regressors split data based on decision rules that are easy to follow. The splitting
criterion (often the mean squared error) is computed at each node to decide the best split. While
decision trees are intuitive and interpretable, they are also sensitive to small changes in the data.
Ensemble methods are commonly used to mitigate this instability.
from sklearn.tree import DecisionTreeRegressor

# Continuing with X, y, and new_house from the earlier examples
tree_model = DecisionTreeRegressor()
tree_model.fit(X, y)
predicted_price_tree = tree_model.predict(new_house)
3.5.1 Decision Trees
Decision trees are popular for both classification and regression tasks. They work by learning a
sequence of if/else questions that guide the decision-making process. These questions are like those
asked in a game of 20 Questions. For example, if you need to distinguish between bears, hawks,
penguins, and dolphins, you might begin by asking if the animal has feathers, which would narrow
down the options to two possibilities. If the answer is "yes," you could further ask if the animal can
fly, helping you differentiate between hawks and penguins. If the answer is "no" (indicating the
animal does not have feathers), you’d still need another question to distinguish between dolphins
and bears, such as whether the animal has fins.
In this example, each node in the decision tree represents either a question or a terminal (leaf) node,
which holds the final answer. The edges connect the answers to the next question that needs to be
asked based on the current answer.
In machine learning terms, we have developed a model to classify four different animal types (hawks,
penguins, dolphins, and bears) using three features: "has feathers," "can fly," and "has fins." Rather
than manually creating these models, we can employ supervised learning to learn them directly from
data.
Building Decision Trees
Now, let's go through the process of constructing a decision tree for a 2D classification dataset. The
dataset, named two_moons, consists of two half-moon shapes, with each class containing 75 data
points.
Building a decision tree involves figuring out the sequence of if/else questions that will lead to the
correct classification in the most efficient manner. In machine learning, these questions are referred
to as tests (not to be confused with the test set, which is used to assess the model's ability to
generalize). While data typically is not structured with simple binary yes/no features like in the
animal example, it's usually represented with continuous features. The tests for continuous data are
framed as "Is feature i greater than value a?"
When constructing a decision tree, if it grows until all leaves are pure (i.e., until every leaf node
perfectly classifies the training data), the resulting model tends to be too complex and is at risk of
overfitting. A pure leaf indicates that the tree has achieved perfect accuracy on the training set,
where each data point in a leaf correctly predicts the majority class. This overfitting is visible on the
left side of the figure, where regions of class 1 are located within areas that should belong to class 0.
On the far right, a narrow band predicted as class 0 around a point from class 0 appears unnatural,
influenced by outlier points far from the main cluster in that class.
To avoid overfitting, two main strategies are commonly employed: halting the tree-building process
early (pre-pruning) or building the tree fully and then pruning away nodes that contribute little value
(post-pruning). Pre-pruning methods involve limiting the tree’s maximum depth, restricting the
number of leaves, or enforcing a minimum number of data points required in a node before it can
be split further.
In scikit-learn, decision trees are implemented using the DecisionTreeRegressor and
DecisionTreeClassifier classes. It’s worth noting that scikit-learn supports only pre-pruning and not
post-pruning.
Let us explore the impact of pre-pruning in more detail using the Breast Cancer dataset. First, we
import the dataset and split it into training and test sets. Then, we create a model using the default
configuration, where the tree grows until all leaves are pure. We also fix
the random_state parameter to ensure consistent results for tie-breaking during tree construction.
As expected, the accuracy on the training set is 100% because the leaves are pure, and the tree has
grown deep enough to memorize all the labels in the training data. However, the accuracy on the
test set is slightly lower than the approximately 95% accuracy observed with the linear models we
previously discussed.
If the depth of a decision tree is not restricted, it can become excessively deep and complex. As a
result, unpruned trees are more prone to overfitting and may struggle to generalize effectively to
new data. To mitigate this, we can apply pre-pruning to prevent the tree from perfectly fitting the
training data. One method is to halt the tree's growth once it reaches a certain depth. For example,
by setting max_depth=4, the tree can only ask up to four questions. Limiting the tree’s depth helps
reduce overfitting, which results in a decrease in training set accuracy but an improvement in
accuracy on the test set.
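A sketch of both experiments (an unrestricted tree versus max_depth=4) on the Breast Cancer dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Unpruned tree: grows until all leaves are pure, so training accuracy is 100%
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Unpruned    train:", tree.score(X_train, y_train), " test:", tree.score(X_test, y_test))

# Pre-pruned tree: at most four levels of questions
tree4 = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("max_depth=4 train:", tree4.score(X_train, y_train), " test:", tree4.score(X_test, y_test))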
We can visualize the tree using the export_graphviz function from the tree module. This writes a
file in the .dot file format, which is a text file format for storing graphs. We set an option to color
the nodes to reflect the majority class in each node and pass the class and features names so the
tree can be properly labeled:
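For example (a sketch continuing from the pruned tree above):

from sklearn.tree import export_graphviz

export_graphviz(tree4, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)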
The visualization of the decision tree offers a clear view of how the algorithm makes predictions and
serves as an excellent example of a machine learning model that can be easily explained to non-
experts. However, even with a tree of depth four, as shown here, the visualization can still become
somewhat intricate. Deeper trees, especially those with depths of 10 or more, are even more
challenging to interpret. One effective method for understanding the tree is to focus on the path that
most of the data follows. The n_samples displayed in each node represents the number of samples
in that node, while the value indicates the number of samples in each class. By tracing the branches
to the right, we see that a split based on "worst radius <= 16.795" leads to a node with 8 benign and
134 malignant samples. Further splits break down the 8 benign samples, and out of the 142 samples
that initially followed the right path, 132 end up in the far-right leaf. Following the left path at the
root, where "worst radius > 16.795," we find 25 malignant and 259 benign samples. Most of the
benign samples end up in the second leaf from the right, with only a few remaining in the other
leaves.
Rather than attempting to interpret the entire tree, which can be overwhelming, we can use valuable
metrics to summarize the tree’s behavior. One of the most widely used metrics is feature
importance, which indicates how significant each feature is in the tree’s decision-making process.
This value ranges from 0 to 1 for each feature, with 0 meaning the feature is not used at all and 1
indicating the feature perfectly predicts the target. The sum of all feature importances will always
equal 1.
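As a sketch, the values can be read from the fitted tree's feature_importances_ attribute (using the pruned tree from the earlier sketch):

import numpy as np

importances = tree_pruned.feature_importances_
print("Sum of feature importances:", importances.sum())  # always 1
# The five most important features, in decreasing order
for i in np.argsort(importances)[::-1][:5]:
    print(cancer.feature_names[i], round(float(importances[i]), 3))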
In this case, we can see that the feature used in the first split, "worst radius," is by far the most
important feature. This aligns with our previous observation during the tree analysis, where the first
level effectively separates the two classes.
However, a low feature importance doesn't necessarily mean the feature is uninformative. It simply means the tree didn't select that feature, possibly because another feature provides similar information.
Unlike coefficients in linear models, feature importances are always positive and do not indicate which class a feature predicts. The feature importances reveal that "worst radius" is important, but they do not specify whether a higher radius indicates a benign or malignant sample. In fact, the relationship between features and classes may not be so straightforward.
Although our focus here has been on decision trees for classification, the same principles apply to
decision trees used for regression, as seen in the DecisionTreeRegressor. The process of using and
analyzing regression trees is very similar to that of classification trees. However, one key difference
with tree-based models for regression is important to note: the DecisionTreeRegressor, like other
tree-based regression models, cannot extrapolate. This means it is unable to make predictions
beyond the range of the training data.
To illustrate this, let’s examine a dataset containing historical computer memory (RAM) prices. The
figure below shows this dataset, with the date on the x-axis and the price per megabyte of RAM for
each year on the y-axis.
Note the logarithmic scale on the y-axis. When plotted on a logarithmic scale, the relationship
appears fairly linear, making it relatively easy to predict, though with some fluctuations.
We will use historical data up to the year 2000 to forecast future RAM prices, treating the date as
the sole feature. Two simple models will be compared: a DecisionTreeRegressor and a
LinearRegression. The prices are rescaled using a logarithmic transformation to make the
relationship more linear. While this transformation doesn't affect the DecisionTreeRegressor, it has
a significant impact on the LinearRegression model. After training both models and making
predictions, we apply the exponential function to reverse the logarithmic transformation. We will
visualize predictions on the entire dataset, but for a proper evaluation, only the test dataset should
be considered.
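A sketch of this comparison is given below. The file name ram_price.csv and its column names (date, price) are assumptions about how the historical prices are stored; the overall flow follows the description above: log-transform the prices, train both models on the pre-2000 data, predict for all dates, and undo the transformation with np.exp.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

ram_prices = pd.read_csv("ram_price.csv")            # assumed file with 'date' and 'price' columns
train = ram_prices[ram_prices.date < 2000]            # historical data up to the year 2000
X_train = train.date.to_numpy()[:, np.newaxis]        # the date is the sole feature
y_train = np.log(train.price)                         # log-transform the target

tree = DecisionTreeRegressor().fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

X_all = ram_prices.date.to_numpy()[:, np.newaxis]     # predict over the entire date range
price_tree = np.exp(tree.predict(X_all))              # undo the log transformation
price_linear = np.exp(linear.predict(X_all))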
The contrast between the two models is quite clear. The linear model fits the data with a line, as
expected, and provides a solid forecast for the test data (years after 2000), though it overlooks
some of the finer details in both the training and test data. In contrast, the tree model perfectly
predicts the training data since it was allowed to grow without complexity restrictions and
essentially memorized the dataset. However, when the model encounters data beyond the training
range, it simply predicts the last known value. This limitation arises because the tree model cannot
extrapolate or make predictions outside the scope of the training data. This issue is common to all
tree-based models.
Strengths, weaknesses, and parameters
As discussed earlier, the parameters that control model complexity in decision trees are the pre-pruning parameters that stop the building of the tree before it is fully developed. Usually, picking one of the pre-pruning strategies (setting either max_depth, max_leaf_nodes, or min_samples_leaf) is sufficient to prevent overfitting.
Decision trees offer two significant advantages over many of the algorithms we've covered so far:
the resulting model is easy to visualize and interpret, especially for smaller trees, and they are
completely unaffected by the scaling of the data. Since each feature is processed independently,
and the splits do not depend on the scaling of the data, decision trees do not require preprocessing
steps such as normalization or standardization. This makes them particularly effective when dealing
with features on different scales or a combination of binary and continuous features.
However, the main drawback of decision trees is that they are prone to overfitting, even with pre-
pruning, which leads to poor generalization performance. As a result, ensemble methods are often
preferred over single decision trees in most applications.
Ensemble methods combine multiple machine learning models to create more robust models.
While there are several ensemble techniques, two have proven particularly effective for both
classification and regression tasks across various datasets, using decision trees as their core
components: random forests and gradient boosted decision trees.
Task
Steps:
5. Visualize predictions vs. actual values.
Conceptual Questions
1. What is supervised learning, and how does it differ from unsupervised learning?
4. How does polynomial regression address non-linear relationships, and what are its limitations?
5. What is the bias-variance tradeoff, and how can it impact model performance?
6. How do decision tree regressors differ from linear models in terms of interpretability and
overfitting?
Coding-Based Questions
1. Implement a linear regression model using a dataset of your choice and interpret the model
coefficients.
2. Compare the performance of linear, polynomial, and decision tree regression on the same
dataset.
3. Visualize the decision boundaries or regression curves for each model and explain any differences
you observe.
1. A real estate company wants to predict house prices in a new market. What factors would you
consider when selecting a regression model?
3. Given a dataset with several hundred features, discuss how you would apply dimensionality reduction
techniques before performing regression analysis.
4 Chapter Four: Supervised learning - Classification Algorithms
LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:
4.1 Classification
Classification is a type of supervised learning where the goal is to predict a categorical outcome based
on input features. Unlike regression, which predicts continuous values, classification models categorize
inputs into discrete labels. According to Zhang & Wang (2021), classification algorithms are widely used
in automated decision-making, fraud detection, and medical diagnostics, proving their significance in
real-world applications.
Formally, a classifier learns a mapping f : X → Y, where X is the feature set and Y is a categorical label. The decision boundary of a classification model separates the different categories in the feature space.
Common Classification Algorithms
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

np.random.seed(42)
# Synthetic data (reconstructed; the original feature/label definitions were lost)
X = np.random.rand(100, 2)  # Two features
y = (X[:, 0] + X[:, 1] + np.random.normal(0, 0.3, 100) > 1).astype(int)  # Binary target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Output:
The model achieves 85% accuracy in classifying the data into two classes.
4.3 k-Nearest Neighbors (k-NN)
The k-Nearest Neighbors (k-NN) algorithm classifies a data point based on the majority class of its k
closest neighbors in the feature space. k-NN performs well when data is well-distributed but struggles
with high-dimensional datasets due to the curse of dimensionality.
Key Properties:
• Distance metric: Typically uses Euclidean distance to measure closeness between points.
The k-NN algorithm is one of the simplest machine learning methods. Building the model simply involves
storing the training dataset. When making a prediction for a new data point, the algorithm finds the
closest data points in the training set, known as its “nearest neighbors.” In its most basic form, the k-NN
algorithm considers only one neighbor: the closest training point to the data point we are predicting for. The prediction output is then the same as the output of this nearest neighbor.
Here, we introduced three new data points, represented by stars. For each of these points, we identified
the closest point in the training set. The prediction made by the one-nearest-neighbor algorithm is the
label of that closest point, indicated by the color of the cross. Instead of considering just the nearest
neighbor, we can also factor in a chosen number, k, of neighbors. This is where the term "k-nearest
neighbors" originates. When looking at multiple neighbors, we use a voting process to determine the
label. For each test point, we count how many neighbors belong to class 0 and how many belong to class
1, then assign the label of the majority class—the most frequent class among the k-nearest neighbors.
The example below illustrates this using the three closest neighbors:
Once again, the prediction is represented by the color of the cross. You can observe that the prediction
for the new data point at the top left differs from the one made using only a single neighbor. Although
this example is for a binary classification problem, the same approach can be used for datasets with
multiple classes. In such cases, we count how many neighbors belong to each class and predict the most
frequent class. Now, let’s explore how we can implement the k-nearest neighbors algorithm using scikit-
learn. First, we divide our data into training and test sets so we can assess the model's ability to
generalize.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Fit a k-NN classifier with five neighbors and evaluate on the test set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Output:
Imagine a hobbyist botanist who wants to identify the species of iris flowers she has found. She has
recorded measurements for each iris, including the length and width of both the petals and the sepals,
all in centimeters. Additionally, she has measurements for some irises that have already been identified
by an expert botanist as belonging to the species setosa, versicolor, or virginica. For these, she is certain
of the species. Let's assume these are the only species the botanist will encounter in the wild. Our objective is to develop a machine learning model that can learn from the known iris measurements so that it can predict the species of new irises. We can now begin building the machine learning model. We will use a k-nearest neighbors (KNN) classifier, which is simple to understand. The process of
building this model mainly involves storing the training set. To predict the label for a new data point, the
algorithm identifies the training point that is closest to the new one and assigns its label to the new point.
The "k" in k-nearest neighbors means that, rather than just using the closest neighbor, we can take into
account a set number of closest neighbors (for example, three or five). The prediction is then based on
the majority class of these neighbors. We will discuss this in more detail, but for now, we will consider
just one neighbor. In scikit-learn, machine learning models are implemented in classes known as
Estimator classes. The K-nearest neighbors classification algorithm is provided by the
KNeighborsClassifier class in the neighbors module. To use the model, we need to create an instance of
this class and configure its parameters. The most crucial parameter for KNeighborsClassifier is the
number of neighbors, which we will set to 1.
The knn object encapsulates the algorithm that will be used to build the model from the training data,
as well the algorithm to make predictions on new data points. It will also hold the information that the
algorithm has extracted from the training data. In the case of KNeighborsClassifier, it will just store the
training set.
The fit method returns the KNN object itself and modifies it in place, providing a string representation of
the classifier. This representation displays the parameters used to create the model. Most of these
parameters are set to their default values, but it also includes n_neighbors=1, which is the value we
specified. While scikit-learn models have many parameters, the majority are related to optimization or
specialized use cases, so you don’t need to worry about the other parameters shown in the output.
Printing a scikit-learn model can produce lengthy strings, but don’t be discouraged by this. We will go
over all the key parameters. From here on, we won’t display the output of fit since it doesn’t provide any
new information.
Note that we made the measurements of this single flower into a row in a two-dimensional NumPy array,
as scikit-learn always expects two-dimensional arrays for the data.
Making Predictions
We can now make predictions using this model on new data for which we might not know the correct
labels. Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal
length of 1 cm, and a petal width of 0.2 cm. What species of iris would this be? We can put this data into
a two-dimensional NumPy array, whose shape is given by the number of samples (1) times the number of features (4):
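A sketch of this step, assuming the iris data and train/test split from the earlier snippet; here we use a single neighbor, as described in the text.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One-neighbor classifier, fit on the training split from the earlier snippet
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# The single flower becomes one row of a two-dimensional array: shape (1, 4)
X_new = np.array([[5, 2.9, 1, 0.2]])
print("Predicted species:", iris.target_names[knn.predict(X_new)][0])
print("Test set accuracy:", knn.score(X_test, y_test))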
For this model, the accuracy of the test set is around 0.97, meaning the model correctly predicted 97%
of the irises in the test set. Under certain mathematical assumptions, this suggests the model will be
accurate 97% of the time when applied to new irises. In our hobby botanist scenario, this high accuracy
indicates that the model is likely reliable enough for practical use. In upcoming chapters, we will explore
ways to improve the model's performance and discuss the challenges of fine-tuning it.
The task involved three species: setosa, versicolor, and virginica, making it a multiclass classification
problem. In classification, the categories or species are called classes, and each iris has a species label.
The Iris dataset consists of two NumPy arrays: one for the features (denoted as X in scikit-learn) and one
for the correct labels (denoted as y). X is a two-dimensional array where each row corresponds to a data
point, and each column represents a feature, while y is a one-dimensional array containing the class
labels for each sample. We split the dataset into a training set to build the model and a test set to evaluate
how well the model generalizes to new data. We selected the k-nearest neighbors algorithm for
classification, which makes predictions based on the closest neighbors in the training set. We
implemented this using the KNeighborsClassifier class, setting the parameters accordingly. After fitting
the model with the fit method and passing in the training data and labels, we evaluated its performance
with the score method, which returned an accuracy of 97%. This indicates the model made correct
predictions 97% of the time on the test set, giving us confidence that it will predict new iris
measurements with similar accuracy.
The decision boundary of a linear SVM is the hyperplane defined by w · x + b = 0, where w is the coefficient vector and b is the intercept.
For non-linearly separable data, SVM uses the kernel trick (e.g., RBF kernel) to transform data into a
higher-dimensional space.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
X, y = digits.data, digits.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Train an SVM with an RBF kernel and evaluate it on the test set
svm_model = SVC(kernel='rbf')
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
We can apply both Logistic Regression and LinearSVC models to the forge dataset and visualize the
decision boundary.
In this figure, the first feature of the forge dataset is plotted on the x-axis, and the second feature is
plotted on the y-axis, as in the previous example. The decision boundaries for both LinearSVC and
LogisticRegression are shown as straight lines, dividing the area classified as class 1 (above the line) from
the area classified as class 0 (below the line). This means any new data point above the black line will be
classified as class 1 by the respective classifier, while any point below the black line will be classified as
class 0.
Both models produce similar decision boundaries, though they misclassify two points. By default, both
models use L2 regularization, similar to how Ridge regression applies regularization.
For both LogisticRegression and LinearSVC, the trade-off between model complexity and regularization
strength is controlled by the parameter C. Higher values of C result in less regularization, meaning the
models try to fit the training data as accurately as possible. In contrast, lower values of C encourage the
models to find coefficient vectors (w) that are closer to zero, thus imposing more regularization.
An interesting characteristic of the C parameter is how it influences the model's focus. Lower values of C
lead the algorithms to focus on adjusting to the majority of data points, while higher values of C make
the models prioritize correctly classifying each individual data point.
On the left side, a very small value of C results in a model with strong regularization. Most of the class 0
points are at the top, and most of the class 1 points are at the bottom. The highly regularized model
chooses a nearly horizontal decision boundary, misclassifying two points. In the middle plot, with a
slightly higher C value, the model places more focus on the two misclassified samples, causing the
decision boundary to tilt. On the far right, with a very high C value, the model tilts the decision boundary
significantly, correctly classifying all class 0 points. However, one class 1 point remains misclassified, as
it's not possible to separate all points in this dataset using a straight line. This model, which attempts to
correctly classify every point, may not capture the overall structure of the data well, indicating it is likely
overfitting.
Similar to regression models, linear models for classification can seem restrictive in low-dimensional
spaces, as they only allow for decision boundaries that are straight lines or planes. However, in higher-
dimensional spaces, linear classification models become much more powerful, and the risk of overfitting
increases as more features are added.
Now, let's dive deeper into analyzing LogisticRegression using the Breast Cancer dataset.
The default value of C=1 provides quite good performance, with 95% accuracy on both the training and
the test set. But as training and test set performance are very close, it is likely that we are underfitting.
Let’s try to increase C to fit a more flexible model:
Using C=100 results in higher training set accuracy, and also a slightly increased test set accuracy,
confirming our intuition that a more complex model should perform better. We can also investigate what
happens if we use an even more regularized model than the default of C=1, by setting C=0.01:
As anticipated, when moving further left on the scale in the figure, starting from an underfit model, both
the training and test set accuracies decrease compared to the default parameters.
Finally, let's examine the coefficients learned by the models using the three different settings of the
regularization parameter C:
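A sketch of these experiments is shown below; it reproduces both the accuracy comparison and the coefficient plot described here. The data split, the max_iter setting, and the plot styling are assumptions added so the snippet runs cleanly.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

for C, marker in [(0.01, 'v'), (1, 'o'), (100, '^')]:
    logreg = LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train)
    print(f"C={C}: train {logreg.score(X_train, y_train):.3f}, test {logreg.score(X_test, y_test):.3f}")
    plt.plot(logreg.coef_.T, marker, label=f"C={C}")  # one coefficient per feature

plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])
plt.ylabel("Coefficient magnitude")
plt.legend()
plt.show()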
Since LogisticRegression uses L2 regularization by default, the resulting plot is similar to the one produced
by Ridge regression. As the regularization strength increases, the coefficients are gradually reduced,
though they never reach zero. Upon closer examination of the plot, an intriguing pattern appears with
the third coefficient, related to "mean perimeter." When C=100 and C=1, the coefficient is negative, but
at C=0.01, it becomes positive, with a magnitude even larger than at C=1. Interpreting a model like this
might lead one to assume that a feature's coefficient directly corresponds to the class it’s associated with.
For example, one might think that a high "texture error" feature is linked to a "malignant" sample.
However, the sign change in the "mean perimeter" coefficient indicates that, depending on the model
used, a high "mean perimeter" could suggest either "benign" or "malignant." This example underscores
the importance of being careful when interpreting the coefficients of linear models.
If we desire a more interpretable model, using L1 regularization might help, as it limits the model to using
only a few features. Here is the coefficient plot and the classification accuracies for L1 regularization:
Many linear classification models are designed for binary classification and don't naturally extend to
multiclass problems, with logistic regression being a notable exception. To adapt a binary classifier for
multiclass classification, the one-vs.-rest strategy is commonly used. In this approach, a separate binary
model is trained for each class to distinguish that class from all others, leading to as many binary models
as there are classes. To make a prediction, all binary classifiers are applied to a test point, and the
classifier with the highest score for its class "wins," assigning that class label to the test point.
With one classifier per class, there is a separate coefficient vector (w) and intercept (b) for each class.
The class with the highest classification score, computed for each class as w · x + b using that class's coefficient vector and intercept, is chosen as the predicted label. While the mathematical details of multiclass logistic regression differ
slightly from the one-vs.-rest approach, both methods result in a separate coefficient vector and intercept
for each class, with the same prediction method applied. Let’s now apply the one-vs.-rest approach to a
simple three-class classification dataset, where each class consists of data sampled from a Gaussian
distribution.
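A minimal sketch of such an experiment; make_blobs is assumed as the source of the Gaussian-sampled classes, and LinearSVC as the binary classifier.

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Three classes, each sampled from its own Gaussian blob, with two features
X, y = make_blobs(n_samples=150, centers=3, random_state=42)

linear_svm = LinearSVC(max_iter=10000).fit(X, y)        # one binary classifier per class (one-vs.-rest)
print("Coefficient shape:", linear_svm.coef_.shape)      # (3, 2): one row of coefficients per class
print("Intercept shape:", linear_svm.intercept_.shape)   # (3,): one intercept per class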
We see that the shape of the coef_ is (3, 2), meaning that each row of coef_ contains the coefficient
vector for one of the three classes and each column holds the coefficient value for a specific feature
(there are two in this dataset). The intercept_ is now a one-dimensional array, storing the intercepts for
each class. Let’s visualize the lines given by the three binary classifiers:
You can see that all the points belonging to class 0 in the training data are above the line corresponding
to class 0, which means they are on the “class 0” side of this binary classifier. The points in class 0 are
above the line corresponding to class 2, which means they are classified as “rest” by the binary classifier
for class 2. The points belonging to class 0 are to the left of the line corresponding to class 1, which means
the binary classifier for class 1 also classifies them as “rest.” Therefore, any point in this area will be
classified as class 0 by the final classifier (the result of the classification confidence formula for classifier
0 is greater than zero, while it is smaller than zero for the other two classes). But what about the triangle
in the middle of the plot? All three binary classifiers classify points there as “rest.” Which class would a
point be assigned to? The answer is the one with the highest value for the classification formula: the
class of the closest line.
4.4.1 Strengths, Weaknesses, and Parameters
The main parameter in linear models is the regularization parameter, which is called alpha in regression
models and C in LinearSVC and LogisticRegression. Higher values of alpha or smaller values of C result in
simpler models. Tuning these parameters is especially important for regression models, and it’s common
to search for them on a logarithmic scale. Additionally, you must decide between using L1 or L2
regularization. If you believe that only a few features are truly significant, L1 regularization is preferable.
Otherwise, L2 regularization is usually the default. L1 regularization is also helpful for interpretability, as
it selects only a few important features, making it easier to explain how those features affect the
predictions.
Linear models are fast to train and predict, and they scale well to large datasets. They also handle sparse
data well. For datasets with hundreds of thousands or millions of samples, the solver='sag' option in
LogisticRegression and Ridge can be more efficient. Other scalable options
include SGDClassifier and SGDRegressor, which offer even more efficient versions of the linear models
discussed.
One advantage of linear models is that they are easy to understand, as they rely on simple formulas for both regression and classification. However, the coefficients can be difficult to interpret, especially when features are highly correlated.
Linear models tend to perform well when the number of features significantly exceeds the number of
samples. They are often used with very large datasets because training more complex models may not
be practical. However, for lower-dimensional datasets, other models might provide better generalization
performance.
Key Metrics:
5 Chapter 5: Unsupervised Learning – Clustering Algorithms
LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:
5.1 Introduction
Unsupervised learning is a category of machine learning where models analyze and learn patterns from
unlabeled data. Unlike supervised learning, where models require labeled outputs (e.g., spam vs. non-
spam emails), unsupervised learning identifies hidden structures in data without explicit supervision.
Unsupervised learning is crucial in fields like finance, marketing, and healthcare, where vast amounts of
unstructured data must be analyzed without human annotation.
5.2 Clustering
Clustering is an unsupervised learning technique that groups similar data points into clusters based on
their feature similarities. It is widely used for:
Image segmentation – Dividing images into meaningful regions.
Mathematical Representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

np.random.seed(42)
df = pd.read_csv('Mall_Customers.csv')  # assumed file name; the original loading step was lost

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['Annual_Income', 'Spending_Score']])  # scale the features used for clustering

kmeans = KMeans(n_clusters=5, random_state=42)  # the number of clusters is an assumed value
df['Cluster'] = kmeans.fit_predict(df_scaled)

# Visualize clusters
plt.scatter(df['Annual_Income'], df['Spending_Score'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Annual Income ($)')
plt.ylabel('Spending Score')
plt.show()
Output:
Agglomerative (Bottom-Up) – Each data point starts as its own cluster, and clusters merge iteratively.
Divisive (Top-Down) – The entire dataset starts as a single cluster and splits iteratively.
from scipy.cluster.hierarchy import dendrogram, linkage

# Plot dendrogram (Ward linkage on the scaled features; the original linkage call was lost)
plt.figure(figsize=(10, 5))
dendrogram(linkage(df_scaled, method='ward'))
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.show()
Expected Output
5.5 Density-Based Clustering (DBSCAN)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies dense clusters and
separates noise points. DBSCAN is widely used for anomaly detection and geospatial analysis.
Advantages of DBSCAN:
DBSCAN Clustering
from sklearn.cluster import DBSCAN

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
df['DBSCAN_Cluster'] = dbscan.fit_predict(df_scaled)

# Visualize clusters
plt.scatter(df['Annual_Income'], df['Spending_Score'], c=df['DBSCAN_Cluster'], cmap='viridis')
plt.xlabel('Annual Income ($)')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation using DBSCAN')
plt.show()
Output:
✓ Elbow Method (for k-Means) – Determines the optimal number of clusters.
Task:
1 Load a real-world dataset (e.g., Mall Customer Segmentation dataset on Kaggle).
2 Apply k-Means, Hierarchical Clustering, and DBSCAN.
3 Compare the clusters using visualizations and silhouette scores.
As we described earlier, clustering is the task of partitioning the dataset into groups, called clusters. The
goal is to split up the data in such a way that points within a single cluster are very similar and points in
different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or
predict) a number to each data point, indicating which cluster a particular point belongs to.
k-Means Clustering
k-means clustering is one of the simplest and most commonly used clustering
algorithms. It tries to find cluster centers that are representative of certain regions of the data. The
algorithm alternates between two steps: assigning each data point to the closest cluster center, and then
setting each cluster center as the mean of the data points that are assigned to it. The algorithm is finished
when the assignment of instances to clusters no longer changes. The following example illustrates the
algorithm on a synthetic dataset:
Cluster centers are shown as triangles, while data points are shown as circles. Colors indicate cluster
membership. We specified that we are looking for three clusters, so the algorithm was initialized by
declaring three data points randomly as cluster centers (see “Initialization”). Then the iterative algorithm
starts. First, each data point is assigned to the cluster center it is closest to (see “Assign Points (1)”). Next,
the cluster centers are updated to be the mean of the assigned points (see “Recompute Centers (1)”).
Then the process is repeated two more times. After the third iteration, the assignment of points to cluster
centers remained unchanged, so the algorithm stops. Given new data points, k-means will assign each to
the closest cluster center. The next example shows the boundaries of the cluster centers:
Applying k-means with scikit-learn is quite straightforward. Here, we apply it to the synthetic data that
we used for the preceding plots. We instantiate the KMeans class, and set the number of clusters we are
looking for. Then we call the fit method with the data:
During the algorithm, each training data point in X is assigned a cluster label. You can find these labels in
the kmeans.labels_ attribute:
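A sketch of these steps on a synthetic dataset (make_blobs is assumed as the data source):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(random_state=1)               # synthetic two-dimensional data

kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

print(kmeans.labels_[:10])                       # cluster label assigned to each training point
print(kmeans.cluster_centers_)                   # one center per cluster
print(kmeans.predict(X[:5]))                     # new points are assigned to the closest center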
You can see that clustering is somewhat similar to classification, in that each item gets a label. However,
there is no ground truth, and consequently the labels themselves have no a priori meaning. Let’s go back
to the example of clustering face images that we discussed before. It might be that the cluster 3 found
by the algorithm contains only faces of your friend Bela. You can only know that after you look at the
pictures, though, and the number 3 is arbitrary. The only information the algorithm gives you is that all
faces labeled as 3 are similar. For the clustering we just computed on the two-dimensional toy dataset,
that means that we should not assign any significance to the fact that one group was labeled 0 and
another one was labeled 1. Running the algorithm again might result in a different numbering of clusters
because of the random nature of the initialization. Here is a plot of this data again. The cluster centers
are stored in the cluster_centers_ attribute, and we plot them as triangles:
We can also use more or fewer cluster centers:
Even if you know the “right” number of clusters for a given dataset, k-means might not always be able to
recover them. Each cluster is defined solely by its center, which means that each cluster is a convex shape.
As a result of this, k-means can only capture relatively simple shapes. k-means also assumes that all
clusters have the same “diameter” in some sense; it always draws the boundary between clusters to be
exactly in the middle between the cluster centers. That can sometimes lead to surprising results:
One might have expected the dense region in the lower left to be the first cluster, the dense region in the
upper right to be the second, and the less dense region in the center to be the third. Instead, both
cluster 0 and cluster 1 have some points that are far away from all the other points in these clusters that
“reach” toward the center. k-means also assumes that all directions are equally important for each
cluster. The following plot shows a two-dimensional dataset where there are three clearly separated
parts in the data. However, these groups are stretched toward the diagonal. As k-means only considers
the distance to the nearest cluster center, it can’t handle this kind of data:
k-means also performs poorly if the clusters have more complex shapes:
Vector quantization, or seeing k-means as decomposition
Even though k-means is a clustering algorithm, there are interesting parallels between k-means and the decomposition methods like PCA and NMF that we discussed earlier. You might remember that PCA tries to find directions of maximum variance in the data, while NMF tries to find additive components, which often correspond to “extremes” or “parts” of the data. Both methods tried to express the data points as a sum over some components. k-means, on the other hand, tries to represent each data point using a cluster center. You can think of that as each
point being represented using only a single component, which is given by the cluster center. This view of
k-means as a decomposition method, where each point is represented using a single component, is called
vector quantization.
Let’s do a side-by-side comparison of PCA, NMF, and k-means, showing the components extracted, as
well as reconstructions of faces from the test set using 100 components. For k-means, the reconstruction
is the closest cluster center found on the training set:
An interesting aspect of vector quantization using k-means is that we can use many more clusters than
input dimensions to encode our data. Let’s go back to the two_moons data. Using PCA or NMF, there is
nothing much we can do to this data, as it lives in only two dimensions. Reducing it to one dimension
with PCA or NMF would completely destroy the structure of the data. But we can find a more expressive
representation with k-means, by using more cluster centers:
We used 10 cluster centers, which means each point is now assigned a number between 0 and 9. We can
see this as the data being represented using 10 components (that is, we have 10 new features), with all
features being 0, apart from the one that represents the cluster center the point is assigned to. Using
this 10-dimensional representation, it would now be possible to separate the two half-moon shapes using a linear model, which would not have been possible using the original two features. It is also
possible to get an even more expressive representation of the data by using the distances to each of the
cluster centers as features. This can be accomplished using the transform method of kmeans:
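A sketch of this idea on the two_moons data; the sample size, noise level, and random_state are assumptions.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X)

print("Cluster memberships:", kmeans.labels_[:10])        # each point gets one of 10 labels
distance_features = kmeans.transform(X)                    # distance of every point to every center
print("Distance feature shape:", distance_features.shape)  # (200, 10)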
K-means is a widely used clustering algorithm, favored for its simplicity, ease of implementation, and fast
execution. It scales well to large datasets, and scikit-learn offers an even more scalable version called
MiniBatchKMeans, which is capable of handling very large datasets. However, one limitation of k-means
is that it depends on a random initialization, meaning the results can vary depending on the random
seed. To address this, scikit-learn runs the algorithm 10 times with different random initializations and
selects the best outcome. Other drawbacks include the algorithm's restrictive assumptions about cluster
shapes and the need to specify the number of clusters beforehand, which may not always be known in
real-world scenarios.
Agglomerative clustering encompasses a group of algorithms that follow similar principles: the algorithm
begins by treating each data point as its own cluster, then iteratively merges the two most similar clusters
until a predefined stopping condition is met. In scikit-learn, this stopping condition is set by the desired
number of clusters, and clusters are merged until that number is reached. The similarity between clusters
is measured using different linkage criteria, each defining how to evaluate the proximity between two
clusters.
• Ward: The default option, Ward merges clusters in such a way that the increase in the overall
variance within all clusters is minimized. This typically results in clusters of similar size.
• Average: This method merges clusters with the smallest average distance between all points in
the two clusters.
• Complete: Also known as maximum linkage, this criterion merges clusters based on the smallest
maximum distance between any two points in the clusters.
Ward is suitable for most datasets, and will be used in the examples here. However, if the clusters have
significantly different sizes, such as when one cluster is much larger than the others, average or complete
linkage may produce better results.
The following plot demonstrates the process of agglomerative clustering applied to a two-dimensional
dataset, where the goal is to identify three clusters.
Initially, each point is its own cluster. Then, in each step, the two clusters that are closest are merged. In
the first four steps, two single-point clusters are picked and these are joined into two-point clusters. In
step 5, one of the two-point clusters is extended to a third point, and so on. In step 9, there are only
three clusters remaining. As we specified that we are looking for three clusters, the algorithm then stops. Let's have a look at how agglomerative clustering performs on the simple three-cluster data we used here. Because of the way the algorithm works, agglomerative clustering cannot make predictions for new data points. Therefore, AgglomerativeClustering has no predict method. To build the model and get the cluster memberships on the training set, use the fit_predict method instead.
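A sketch of fit_predict on a three-cluster toy dataset (make_blobs is again assumed as the data source):

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)

agg = AgglomerativeClustering(n_clusters=3)   # Ward linkage by default
assignment = agg.fit_predict(X)               # fit the model and return cluster labels in one step
print(assignment[:10])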
As expected, the algorithm recovers the clustering perfectly. While the scikit-learn implementation of
agglomerative clustering requires you to specify the number of clusters you want the algorithm to find,
agglomerative clustering methods provide some help with choosing the right number, which we will
discuss next.
Agglomerative clustering produces what is known as a hierarchical clustering. The clustering proceeds
iteratively, and every point makes a journey from being a single point cluster to belonging to some final
cluster. Each intermediate step provides a clustering of the data (with a different number of clusters). It
is sometimes helpful to look at all possible clusterings jointly. The next example shows an overlay of all
the possible clusterings shown before, providing some insight into how each cluster breaks up into
smaller clusters:
While this visualization provides a very detailed view of the hierarchical clustering, it relies on the two-
dimensional nature of the data and therefore cannot be used on datasets that have more than two
features. There is, however, another tool to visualize hierarchical clustering, called a dendrogram, that
can handle multidimensional datasets.
Unfortunately, scikit-learn currently does not have the functionality to draw dendrograms. However, you
can generate them easily using SciPy. The SciPy clustering algorithms have a slightly different interface to
the scikit-learn clustering algorithms. SciPy provides a function that takes a data array X and computes a
linkage array, which encodes hierarchical cluster similarities. We can then feed this linkage array into the
scipy dendrogram function to plot the dendrogram:
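A sketch of this workflow with SciPy's ward and dendrogram functions; the twelve-point toy dataset is an assumption chosen to match the description that follows.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=0, n_samples=12)

linkage_array = ward(X)        # linkage array encoding the hierarchical cluster merges
dendrogram(linkage_array)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()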
The dendrogram illustrates the data points as numbered from 0 to 11 at the bottom. A tree structure is
then created with these points, each representing individual clusters, and new parent nodes are added
as pairs of clusters are merged. Starting from the bottom and moving upward, the first merger involves
points 1 and 4. Next, points 6 and 9 form a cluster, and this merging continues. At the top level, there are
two main branches: one includes points 11, 0, 5, 10, 7, 6, and 9, while the other consists of points 1, 4,
3, 2, and 8. These two branches represent the two largest clusters on the left side of the plot.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another powerful clustering
algorithm. Its key advantages are that it does not require specifying the number of clusters beforehand,
it can identify clusters with complex shapes, and it can detect outliers or points that don't belong to any
cluster. Although DBSCAN is slower than both agglomerative clustering and k-means, it can still handle
relatively large datasets.
DBSCAN operates by identifying dense regions in the feature space, where many data points are located
close to each other. These dense regions are considered potential clusters, separated by areas that are
relatively sparse. Points within a dense region are termed core samples (or core points). Two key
parameters control DBSCAN: min_samples and eps. If there are at least min_samples points within a
distance of eps from a given point, it is labeled as a core sample.
The algorithm starts by selecting an arbitrary point and finds all points within a distance of eps. If fewer
than min_samples points are found within this radius, the point is marked as noise, meaning it does not
belong to any cluster. If more than min_samples points are within eps, the point is labeled as a core
sample and assigned to a new cluster. The algorithm then visits all neighboring points within eps. If those
neighbors haven't been assigned a cluster, they are given the same cluster label. If any of the neighbors
are core samples, their neighbors are recursively visited, and the cluster grows until there are no more
core samples within eps. The process repeats with an unvisited point, and the algorithm continues until
all points have been processed.
In the end, there are three kinds of points: core points, points that are within distance eps of core points
(called boundary points), and noise. When the DBSCAN algorithm is run on a particular dataset multiple
times, the clustering of the core points is always the same, and the same points will always be labeled as
noise. However, a boundary point might be neighbor to core samples of more than one cluster. Therefore,
the cluster membership of boundary points depends on the order in which points are visited. Usually there are only a few boundary points, and this slight dependence on the order of points is not important.
Let’s apply DBSCAN on the synthetic dataset we used to demonstrate agglomerative clustering. Like
agglomerative clustering, DBSCAN does not allow predictions on new test data, so we will use the
fit_predict method to perform clustering and return the cluster labels in one step:
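A sketch, assuming the same kind of small toy dataset used above:

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, y = make_blobs(random_state=0, n_samples=12)

dbscan = DBSCAN()                          # default eps=0.5, min_samples=5
clusters = dbscan.fit_predict(X)
print("Cluster memberships:", clusters)    # all -1 (noise) with the default settings on this tiny dataset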
As you can see, all data points were assigned the label -1, which stands for noise. This is a consequence
of the default parameter settings for eps and min_samples, which are not tuned for small toy datasets.
The cluster assignments for different values of min_samples and eps are shown below, and visualized:
In this plot, points belonging to clusters are represented by solid markers, while noise points are shown
in white. Core samples are indicated by larger markers, and boundary points are represented by smaller
ones. As eps increases (moving from left to right in the figure), the clusters expand to include more
points, but this may also lead to the merging of distinct clusters into one. Conversely,
increasing min_samples (moving from top to bottom) results in fewer points being classified as core
samples, and more points being labeled as noise.
The eps parameter is typically more influential, as it sets the proximity threshold for points to be
considered as part of the same cluster. If eps is too small, no points may qualify as core samples, and all
points could be labeled as noise. On the other hand, if eps is too large, all points may end up in a single
cluster.
The min_samples parameter mainly impacts whether points in less dense areas are treated as outliers or
included in their own clusters. Increasing min_samples causes smaller groups to be labeled as noise. For
instance, when min_samples is set to 3, three clusters are formed: one with four points, one with five
points, and one with three points. However, when min_samples is increased to 5, the smaller clusters
(with three and four points) are considered noise, leaving only the cluster with five points.
Although DBSCAN does not require specifying the number of clusters directly, adjusting eps affects the
number of clusters identified. Finding the optimal eps value can be easier when the data is scaled using
methods like StandardScaler or MinMaxScaler, as these techniques ensure that all features are on a
similar scale. The outcome of running DBSCAN on the two_moons dataset is shown here, where the
algorithm successfully identifies the two half-circles and separates them using the given settings.
One of the challenges in applying clustering algorithms is that it is very hard to assess how well an
algorithm worked, and to compare outcomes between different algorithms. After talking about the
algorithms behind k-means, agglomerative clustering, and DBSCAN, we will now compare them on some
real-world datasets.
The adjusted rand index provides intuitive results, with a random cluster assignment having a score of 0
and DBSCAN (which recovers the desired clustering perfectly) having a score of 1.
A common mistake when evaluating clustering in this way is to use accuracy_score instead of
adjusted_rand_score, normalized_mutual_info_score, or some other clustering metric. The problem in
using accuracy is that it requires the assigned cluster labels to exactly match the ground truth. However,
the cluster labels themselves are meaningless—the only thing that matters is which points are in the
same cluster:
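The short sketch below illustrates the point: the two label vectors describe exactly the same grouping, only with the cluster numbers swapped.

from sklearn.metrics import accuracy_score, adjusted_rand_score

# The same partition of four points, written with the cluster numbers swapped
clusters1 = [0, 0, 1, 1]
clusters2 = [1, 1, 0, 0]

print("Accuracy:", accuracy_score(clusters1, clusters2))    # 0.0, misleadingly bad
print("ARI:", adjusted_rand_score(clusters1, clusters2))    # 1.0, the groupings are identical

Accuracy punishes the arbitrary label numbering, while the adjusted rand index only looks at which points are grouped together.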
Evaluating clustering without ground truth
Although we have just shown one way to evaluate clustering algorithms, in practice, there is a big problem with using measures like ARI. When applying clustering algorithms, there is usually no ground truth to which to compare the results. If we knew the right clustering of the data, we could use this information to build a supervised model like a classifier. Therefore, using metrics like ARI and NMI usually only helps in developing algorithms, not in assessing success in an application. There are scoring metrics for clustering that don't require ground truth, like the silhouette coefficient. However, these often don't work well in practice. The silhouette score
computes the compactness of a cluster, where higher is better, with a perfect score of 1. While compact
clusters are good, compactness doesn’t allow for complex shapes. Here is an example comparing the
outcome of k-means, agglomerative clustering, and DBSCAN on the two-moons dataset using the
silhouette score:
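A sketch of such a comparison; the two_moons parameters and the choice to scale the data first are assumptions.

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

for name, model in [("k-means", KMeans(n_clusters=2, random_state=0)),
                    ("agglomerative", AgglomerativeClustering(n_clusters=2)),
                    ("DBSCAN", DBSCAN())]:
    labels = model.fit_predict(X_scaled)
    print(name, "silhouette:", round(silhouette_score(X_scaled, labels), 2))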
As observed, k-means achieves the highest silhouette score, even though we may prefer the results
produced by DBSCAN. A more effective approach for evaluating clusters is to use robustness-based
clustering metrics. These metrics involve running an algorithm after adding noise to the data or using
different parameter settings and comparing the results. The idea is that if many algorithm configurations
and variations in the data produce the same outcome, the result is likely reliable. Unfortunately, this
approach is not available in scikit-learn as of this writing.
Even with a very robust clustering or a high silhouette score, we still cannot determine if the clustering
has any semantic meaning or if it reflects an aspect of the data we care about. Returning to the face
image example, we might want to identify groups of similar faces, such as distinguishing between men
and women, young and old, or people with beards versus those without. Suppose we cluster the data
into two groups, and all algorithms agree on which points should be grouped together. However, we still
cannot be certain that the clusters correspond to the concepts we are interested in. The clusters could
have been formed based on factors like side views versus front views, photos taken at night versus during
the day, or pictures captured with different types of phones (iPhones versus Androids). The only way to
determine whether the clustering aligns with our interests is through manual analysis of the clusters.
Comparing algorithms on the faces dataset
Let's apply the k-means, DBSCAN, and agglomerative
clustering algorithms to the Labeled Faces in the Wild dataset, and see if any of them find interesting
structure. We will use the eigenface representation of the data, as produced by PCA(whiten=True), with
100 components:
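A sketch of preparing this representation; the LFW download parameters (min_faces_per_person, resize) are assumptions, and fetching the dataset requires an internet connection.

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)   # downloads the dataset on first use
X_people = people.data / 255.0                                    # scale pixel values to [0, 1]

pca = PCA(n_components=100, whiten=True, random_state=0)
X_pca = pca.fit_transform(X_people)                               # 100-component eigenface representation
print("PCA-transformed shape:", X_pca.shape)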
We saw earlier that this is a more semantic representation of the face images than the raw pixels. It will
also make computation faster. A good exercise would be for you to run the following experiments on the
original data, without PCA, and see if you find similar clusters.
Analyzing the faces dataset with DBSCAN. We will start by applying DBSCAN, which we just discussed:
We see that all the returned labels are –1, so all of the data was labeled as “noise” by DBSCAN. There are
two things we can change to help this: we can make eps higher, to expand the neighborhood of each
point, and set min_samples lower, to consider smaller groups of points as clusters. Let’s try changing
min_samples first:
Even when considering groups of three points, everything is labeled as noise. So, we need to increase
eps:
Using a much larger eps of 15, we get only a single cluster and noise points. We can use this result to find
out what the “noise” looks like compared to the rest of the data. To understand better what’s happening,
let’s look at how many points are noise, and how many points are inside the cluster:
Comparing these images to the random sample of face images, we can guess why they were labeled as
noise: the fifth image in the first row shows a person drinking from a glass, there are images of people
wearing hats, and in the last image there’s a hand in front of the person’s face. The other images contain
odd angles or crops that are too close or too wide. This kind of analysis—trying to find “the odd one
out”—is called outlier detection. If this was a real application, we might try to do a better job of cropping
images, to get more homogeneous data. There is little we can do about people in photos sometimes
wearing hats, drinking, or holding something in front of their faces, but it’s good to know that these are
issues in the data that any algorithm we might apply needs to handle. If we want to find more interesting
clusters than just one large one, we need to set eps smaller, somewhere between 15 and 0.5 (the default).
Let’s have a look at what different values of eps result in:
For low settings of eps, all points are labeled as noise. For eps=7, we get many noise points and many
smaller clusters. For eps=9 we still get many noise points, but we get one big cluster and some smaller
clusters. Starting from eps=11, we get only one large cluster and noise. What is interesting to note is that
there is never more than one large cluster. At most, there is one large cluster containing most of the
points, and some smaller clusters. This indicates that there are not two or three different kinds of face
images in the data that are very distinct, but rather that all images are more or less equally similar to (or
dissimilar from) the rest. The results for eps=7 look most interesting, with many small clusters. We can
investigate this clustering in more detail by visualizing all of the points in each of the 13 small clusters:
Some of the clusters correspond to people with very distinct faces (within this dataset), such as Sharon
or Koizumi. Within each cluster, the orientation of the face is also quite fixed, as well as the facial
expression. Some of the clusters contain faces of multiple people, but they share a similar orientation
and expression. This concludes our analysis of the DBSCAN algorithm applied to the faces dataset. As you
can see, we are doing a manual analysis here, different from the much more automatic search approach
we could use for supervised learning based on the R² score or accuracy. Let's move on to applying k-means
and agglomerative clustering. Analyzing the faces dataset with k-means. We saw that it was not possible
to create more than one big cluster using DBSCAN. Agglomerative clustering and k-means are much more
likely to create clusters of even size, but we do need to set a target number of clusters. We could set the
number of clusters to the known number of people in the dataset, though it is very unlikely that an
unsupervised clustering algorithm will recover them. Instead, we can start with a low number of clusters,
like 10, which might allow us to analyze each of the clusters:
As you can see, k-means clustering partitioned the data into relatively similarly sized clusters from 64 to
386. This is quite different from the result of DBSCAN. We can further analyze the outcome of k-means
by visualizing the cluster centers. As we clustered in the representation produced by PCA, we need to
rotate the cluster centers back into the original space to visualize them, using pca.inverse_transform:
The cluster centers found by k-means are very smooth versions of faces. This is not very surprising, given
that each center is an average of 64 to 386 face images. Working with a reduced PCA representation adds
to the smoothness of the images (compared to the faces reconstructed using 100 PCA dimensions). The
clustering seems to pick up on different orientations of the face, different expressions (the third cluster
center seems to show a smiling face), and the presence of shirt collars (see the second-to-last cluster
center). For a more detailed view, we show for each cluster center the five most typical images in the
cluster (the images assigned to the cluster that are closest to the cluster center) and the five most atypical
images in the cluster (the images assigned to the cluster that are furthest from the cluster center):
Analyzing the faces dataset with agglomerative clustering. Now, let’s look at the results of agglomerative
clustering:
Agglomerative clustering also produces relatively equally sized clusters, with cluster sizes between 26
and 623. These are more uneven than those produced by k-means, but much more even than the ones
produced by DBSCAN. We can compute the ARI to measure whether the two partitions of the data given
by agglomerative clustering and k-means are similar:
An ARI of only 0.13 means that the two clusterings labels_agg and labels_km have little in common. This
is not very surprising, given the fact that points further away from the cluster centers seem to have little
in common for k-means. Next, we might want to plot the dendrogram. We’ll limit the depth of the tree
in the plot, as branching down to the individual 2,063 data points would result in an unreadably dense
plot:
Creating 10 clusters, we cut across the tree at the very top, where there are 10 vertical lines. In the
dendrogram for the toy data, you could see by the length of the branches that two or three clusters might
capture the data appropriately. For the faces data, there doesn’t seem to be a very natural cutoff point.
There are some branches that represent more distinct groups, but there doesn’t appear to be a particular
number of clusters that is a good fit. This is not surprising, given the results of DBSCAN, which tried to
cluster all points together. Let’s visualize the 10 clusters, as we did for k-means earlier. Note that there is
no notion of cluster center in agglomerative clustering (though we could compute the mean), and we
simply show the first couple of points in each cluster. We show the number of points in each cluster to
the left of the first image:
While some of the clusters seem to have a semantic theme, many of them are too large to be actually
homogeneous. To get more homogeneous clusters, we can run the algorithm again, this time with 40
clusters, and pick out some of the clusters that are particularly interesting:
Summary of Clustering Methods
This section highlighted the fact that clustering is a largely qualitative process, often most useful during
the exploratory phase of data analysis. We explored three clustering algorithms: k-means, DBSCAN, and
agglomerative clustering. Each method offers a way to control the level of granularity in clustering. While
k-means and agglomerative clustering allow you to specify the number of clusters, DBSCAN lets you
define proximity using the eps parameter, which indirectly affects the cluster size. All three algorithms
are capable of handling large, real-world datasets, are relatively easy to understand, and support
clustering into multiple groups.
Each algorithm has its own unique strengths. K-means provides a way to characterize clusters by their
centers and can be seen as a decomposition method, representing each data point by its cluster's center.
DBSCAN has the advantage of detecting "noise points" that do not belong to any cluster and can
116
automatically determine the number of clusters. Unlike the other two methods, DBSCAN can identify
clusters with complex shapes, as demonstrated in the two_moons example. However, DBSCAN can
sometimes create clusters of vastly different sizes, which may be either a benefit or a drawback.
Agglomerative clustering, on the other hand, offers a hierarchical view of possible data partitions, which
can be easily examined using dendrograms.
The second category of machine learning algorithms we will explore is unsupervised learning. In
unsupervised learning, there is no predefined output or supervisor guiding the algorithm. Instead, the
algorithm is given input data and is tasked with independently identifying patterns or extracting insights
from it.
The first plot above shows a synthetic two-class classification dataset with two features. The first feature,
represented along the x-axis, ranges from 10 to 15, while the second feature, shown along the y-axis,
ranges from approximately 1 to 9. The subsequent four plots demonstrate four different methods for
transforming the data to create more standardized ranges. The StandardScaler in scikit-learn
standardizes each feature by adjusting its mean to 0 and its variance to 1, ensuring that all features are
on the same scale. However, this method does not control the specific minimum and maximum values
of the features. The RobustScaler functions similarly to the StandardScaler but uses the median and
quartiles instead of the mean and variance. This makes the RobustScaler less sensitive to outliers (data
points that are significantly different from the rest), which could otherwise affect scaling methods based
on the mean and variance. The MinMaxScaler shifts and scales the data so that all feature values fall
between 0 and 1. In the case of this two-dimensional dataset, the transformation ensures all data points
are within a rectangle bounded by 0 and 1 on both axes. Lastly, the Normalizer scales the data differently:
it adjusts each data point so that the length of its feature vector (its Euclidean norm) equals 1. This
normalization essentially projects the data points onto a unit circle (or sphere in higher dimensions), with
each point scaled by the inverse of its length. This technique is typically used when only the direction (or
angle) of the data matters, rather than its magnitude.
After reviewing the different types of transformations, let's apply them using scikit-learn. We'll use the
cancer dataset from Chapter 2 for this demonstration. Preprocessing steps like scaling are generally
performed before applying a supervised machine learning algorithm. For instance, if we want to use a
kernel SVM (SVC) on the cancer dataset, we might first apply the MinMaxScaler to preprocess the data.
To do this, we would start by loading the dataset and splitting it into training and test sets. This separation
is important because it allows us to assess the performance of the supervised model on unseen data
after completing the preprocessing steps.
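A minimal sketch of those steps might look like the following (the variable names are ours, and random_state is an arbitrary choice):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

scaler = MinMaxScaler()
scaler.fit(X_train)                          # learn min and range from the training set only
X_train_scaled = scaler.transform(X_train)   # every training feature now lies between 0 and 1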
The transformed data has the same shape as the original data—the features are simply shifted and
scaled. You can see that all of the features are now between 0 and 1, as desired. To apply the SVM to the
scaled data, we also need to transform the test set. This is again done by calling the transform method,
this time on X_test:
Maybe somewhat surprisingly, you can see that for the test set, after scaling, the minimum and
maximum are not 0 and 1. Some of the features are even outside the 0–1 range! The explanation is that
the MinMaxScaler (and all the other scalers) always applies exactly the same transformation to the
training and the test set. This means the transform method always subtracts the training set minimum
and divides by the training set range, which might be different from the minimum and range for the test set.
Scaling Training and Test Data the Same Way
It is important to apply exactly the same transformation
to the training set and the test set for the supervised model to work on the test set. The following
example illustrates what would happen if we were to use the minimum and range of the test set instead:
The first panel is an unscaled two-dimensional dataset, with the training set shown as circles and the test
set shown as triangles. The second panel is the same data but scaled using the MinMaxScaler. Here, we
called fit on the training set, and then called transform on the training and test sets. You can see that the
dataset in the second panel looks identical to the first; only the ticks on the axes have changed. Now all
the features are between 0 and 1. You can also see that the minimum and maximum feature values for
the test data (the triangles) are not 0 and 1. The third panel shows what would happen if we scaled the
training set and test set separately. In this case, the minimum and maximum feature values for both the
training and the test set are 0 and 1. But now the dataset looks different. The test points moved incongruously relative to the training set, because they were scaled differently. We changed the arrangement of the
data in an arbitrary way. Clearly this is not what we want to do. As another way to think about this,
imagine your test set is a single point. There is no way to scale a single point correctly, to fulfill the
minimum and maximum requirements of the MinMaxScaler. But the size of your test set should not
change your processing.
The Effect of Preprocessing on Supervised Learning
Now let's go back to the cancer dataset and see the
effect of using the MinMaxScaler on learning the SVC (this is a different way of doing the same scaling
we did in Chapter 2). First, let’s fit the SVC on the original data again for comparison:
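As a sketch of what that comparison might look like (this reuses X_train, X_test and the scaled data from the earlier snippet; C=100 is simply an illustrative setting):
from sklearn.svm import SVC

svm = SVC(C=100)
svm.fit(X_train, y_train)
print("Test accuracy (raw data): {:.2f}".format(svm.score(X_test, y_test)))

# refit on the MinMax-scaled training data and score on the scaled test data
X_test_scaled = scaler.transform(X_test)
svm.fit(X_train_scaled, y_train)
print("Test accuracy (scaled data): {:.2f}".format(svm.score(X_test_scaled, y_test)))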
As we saw before, the effect of scaling the data is quite significant. Even though scaling the data doesn’t
involve any complicated math, it is good practice to use the scaling mechanisms provided by scikit-learn
instead of reimplementing them yourself, as it’s easy to make mistakes even in these simple
computations. You can also easily replace one preprocessing algorithm with another by changing the
class you use, as all of the preprocessing classes have the same interface, consisting of the fit and
transform methods:
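For example, swapping MinMaxScaler for StandardScaler only changes the class that is instantiated; the fit and transform calls stay the same (a sketch, reusing the split from above):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                  # zero mean, unit variance instead of a 0-1 range
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)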
Non-negative matrix factorization is another unsupervised learning algorithm that aims to extract useful
features. It works similarly to PCA and can also be used for dimensionality reduction. As in PCA, we are
trying to write each data point as a weighted sum of some components. But whereas in PCA we wanted
components that were orthogonal and that explained as much variance of the data as possible, in NMF,
we want the components and the coefficients to be non-negative; that is, we want both the components
and the coefficients to be greater than or equal to zero. Consequently, this method can only be applied
to data where each feature is non-negative, as a non-negative sum of non-negative components cannot
become negative.
The process of decomposing data into a non-negative weighted sum is particularly helpful for data that
is created as the addition (or overlay) of several independent sources, such as an audio track of multiple
people speaking, or music with many instruments. In these situations, NMF can identify the original
components that make up the combined data. Overall, NMF leads to more interpretable components
than PCA, as negative components and coefficients can lead to hard-to-interpret cancellation effects. The
eigenfaces, for example, contain both positive and negative parts, and as we mentioned in the
description of PCA, the sign is arbitrary. Before we apply NMF to the face dataset, let’s briefly revisit the
synthetic data.
Applying NMF to synthetic data
In contrast to when using PCA, we need to ensure that our data is non-negative
for NMF to be able to operate on the data. This means where the data lies relative to the origin (0, 0)
matters for NMF. Therefore, you can think of the non-negative components that are extracted as
directions from (0, 0) toward the data. The following example shows the results of NMF on the two-
dimensional toy data:
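Since the original plot is not reproduced here, the following sketch shows the kind of call involved; any non-negative two-dimensional dataset will do, and the extracted components_ are the non-negative directions described above:
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.uniform(low=0, high=10, size=(100, 2))    # toy data; all values are non-negative

nmf = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
W = nmf.fit_transform(X)     # non-negative coefficients, one row per data point
H = nmf.components_          # non-negative components: directions from (0, 0) toward the data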
For NMF with two components, as shown on the left, it is clear that all points in the data can be written
as a positive combination of the two components. If there are enough components to perfectly
reconstruct the data (as many components as there are features), the algorithm will choose directions
that point toward the extremes of the data.
If we only use a single component, NMF creates a component that points toward the mean, as pointing
there best explains the data. You can see that in contrast with PCA, reducing the number of components
not only removes some directions, but creates an entirely different set of components! Components in
NMF are also not ordered in any specific way, so there is no “first non-negative component”: all
components play an equal part.
NMF uses a random initialization, which might lead to different results depending on the random seed.
In relatively simple cases such as the synthetic data with two components, where all the data can be
explained perfectly, the randomness has little effect (though it might change the order or scale of the
components). In more complex situations, there might be more drastic changes.
Applying NMF to face images
Now, let's apply NMF to the Labeled Faces in the Wild dataset we used earlier. The main
parameter of NMF is how many components we want to extract. Usually this is lower than the number
of input features (otherwise, the data could be explained by making each pixel a separate component).
First, let’s inspect how the number of components impacts how well the data can be reconstructed using
NMF:
The quality of the back-transformed data is similar to when using PCA, but slightly worse. This is expected,
as PCA finds the optimum directions in terms of reconstruction. NMF is usually not used for its ability to
reconstruct or encode data, but rather for finding interesting patterns within the data. As a first look into
the data, let’s try extracting only a few components (say, 15).
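A sketch of that step (here X_train is a placeholder for the flattened face-image training data described earlier):
import numpy as np
from sklearn.decomposition import NMF

nmf = NMF(n_components=15, random_state=0, max_iter=1000)
X_train_nmf = nmf.fit_transform(X_train)      # 15 coefficients per face image
components = nmf.components_                  # 15 non-negative "prototype faces"

# indices of the images in which, say, component 3 is strongest
strongest = np.argsort(X_train_nmf[:, 3])[::-1]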
These components are all positive, and so resemble prototypes of faces much more so than the
components shown for PCA. For example, one can clearly see that component 3 shows a face rotated
somewhat to the right, while component 7 shows a face somewhat rotated to the left. Let’s look at the
images for which these components are particularly strong:
As expected, faces that have a high coefficient for component 3 are faces looking to the right, while faces
with a high coefficient for component 7 are looking to the left. As mentioned earlier, extracting patterns
like these works best for data with additive structure, including audio, gene expression, and text data.
Let’s walk through one example on synthetic data to see what this might look like. Let’s say we are
interested in a signal that is a combination of three different sources:
Manifold Learning with t-SNE
While PCA is often a good first approach for transforming your data so that
you might be able to visualize it using a scatter plot, the nature of the method (applying a rotation and
then dropping directions) limits its usefulness, as we saw with the scatter plot of the Labeled Faces in the
Wild dataset. There is a class of algorithms for visualization called manifold learning algorithms that allow for much more complex mappings, and often provide better visualizations. A particularly useful
one is the t-SNE algorithm.
Manifold learning algorithms are mainly aimed at visualization, and so are rarely used to generate more than two new features. Some of them, including t-SNE, compute a new representation of the training data, but don't allow transformations of new data.
This means these algorithms cannot be applied to a test set: rather, they can only transform the data
they were trained for. Manifold learning can be useful for exploratory data analysis, but is rarely used if
the final goal is supervised learning. The idea behind t-SNE is to find a two-dimensional representation
of the data that preserves the distances between points as best as possible. t-SNE starts with a random
two-dimensional representation for each data point, and then tries to make points that are close in the
original feature space closer, and points that are far apart in the original feature space farther apart. t-
SNE puts more emphasis on points that are close by, rather than preserving distances between far-apart
points. In other words, it tries to preserve the information indicating which points are neighbors to each
other. We will apply the t-SNE manifold learning algorithm to the dataset of handwritten digits that is included in scikit-learn (not to be confused with the much larger MNIST dataset). Each data point in this dataset is an 8×8 grayscale image of a handwritten digit between 0 and 9. The figure below shows an example image for each class:
Let’s apply t-SNE to the same dataset, and compare the results. As t-SNE does not support transforming
new data, the TSNE class has no transform method. Instead, we can call the fit_transform method, which
will build the model and immediately return the transformed data:
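A sketch of that call on the digits dataset (TSNE's default settings are usually a reasonable starting point):
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
tsne = TSNE(random_state=42)
# there is no transform(); fit_transform builds the model and returns the embedding
digits_tsne = tsne.fit_transform(digits.data)
print(digits_tsne.shape)    # (1797, 2): one two-dimensional point per image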
The result of t-SNE is quite remarkable. All the classes are quite clearly separated. The ones and nines
are somewhat split up, but most of the classes form a single dense group. Keep in mind that this method
has no knowledge of the class labels: it is completely unsupervised. Still, it can find a representation of
the data in two dimensions that clearly separates the classes, based solely on how close points are in the
original space. The t-SNE algorithm has some tuning parameters, though it often works well with the
default settings. You can try playing with perplexity and early_exaggeration, but the effects are usually
minor.
Clustering
As we described earlier, clustering is the task of partitioning the dataset into groups, called clusters. The
goal is to split up the data in such a way that points within a single cluster are very similar and points in
different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or
predict) a number to each data point, indicating which cluster a particular point belongs to.
k-Means Clustering
k-means clustering is one of the simplest and most commonly used clustering algorithms. It tries to find cluster centers that are representative of certain regions of the data. The algorithm
alternates between two steps: assigning each data point to the closest cluster center, and then setting
each cluster center as the mean of the data points that are assigned to it. The algorithm is finished when
the assignment of instances to clusters no longer changes. The following example illustrates the
algorithm on a synthetic dataset:
Cluster centers are shown as triangles, while data points are shown as circles. Colors indicate cluster
membership. We specified that we are looking for three clusters, so the algorithm was initialized by
declaring three data points randomly as cluster centers (see “Initialization”). Then the iterative algorithm
starts. First, each data point is assigned to the cluster center it is closest to (see “Assign Points (1)”). Next,
the cluster centers are updated to be the mean of the assigned points (see “Recompute Centers (1)”).
Then the process is repeated two more times. After the third iteration, the assignment of points to cluster
centers remained unchanged, so the algorithm stops. Given new data points, k-means will assign each to
the closest cluster center. The next example shows the boundaries of the cluster centers:
Applying k-means with scikit-learn is quite straightforward. Here, we apply it to the synthetic data that
we used for the preceding plots. We instantiate the KMeans class, and set the number of clusters we are
looking for. Then we call the fit method with the data:
During the algorithm, each training data point in X is assigned a cluster label. You can find these labels in
the kmeans.labels_ attribute:
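A minimal sketch of those calls on a synthetic dataset (make_blobs stands in for the data used in the plots):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(random_state=1)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
print(kmeans.labels_)       # cluster membership of each training point
print(kmeans.predict(X))    # the same assignment, obtained via predict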
You can see that clustering is somewhat similar to classification, in that each item gets a label. However,
there is no ground truth, and consequently the labels themselves have no a priori meaning. Let’s go back
to the example of clustering face images that we discussed before. It might be that the cluster 3 found
by the algorithm contains only faces of your friend Bela. You can only know that after you look at the
pictures, though, and the number 3 is arbitrary. The only information the algorithm gives you is that all
faces labeled as 3 are similar. For the clustering we just computed on the two-dimensional toy dataset,
that means that we should not assign any significance to the fact that one group was labeled 0 and
another one was labeled 1. Running the algorithm again might result in a different numbering of clusters
because of the random nature of the initialization. Here is a plot of this data again. The cluster centers
are stored in the cluster_centers_ attribute, and we plot them as triangles:
Even if you know the “right” number of clusters for a given dataset, k-means might not always be able to
recover them. Each cluster is defined solely by its center, which means that each cluster is a convex shape.
As a result of this, k-means can only capture relatively simple shapes. k-means also assumes that all
clusters have the same “diameter” in some sense; it always draws the boundary between clusters to be
exactly in the middle between the cluster centers. That can sometimes lead to surprising results:
One might have expected the dense region in the lower left to be the first cluster, the dense region in the
upper right to be the second, and the less dense region in the center to be the third. Instead, both
cluster 0 and cluster 1 have some points that are far away from all the other points in these clusters that
“reach” toward the center. k-means also assumes that all directions are equally important for each
cluster. The following plot shows a two-dimensional dataset where there are three clearly separated
parts in the data. However, these groups are stretched toward the diagonal. As k-means only considers
the distance to the nearest cluster center, it can’t handle this kind of data:
k-means also performs poorly if the clusters have more complex shapes:
Vector quantization, or seeing k-means as decomposition
Even though k-means is a clustering algorithm,
there are interesting parallels between k-means and the decomposition methods like PCA and NMF that
we discussed earlier. You might remember that PCA tries to find directions of maximum variance in the
data, while NMF tries to find additive components, which often correspond to “extremes” or “parts” of
the data. Both methods try to express the data points as a sum over some components. k-means, on the other hand, tries to represent each data point using a cluster center. You can think of that as each
point being represented using only a single component, which is given by the cluster center. This view of
k-means as a decomposition method, where each point is represented using a single component, is called
vector quantization.
Let’s do a side-by-side comparison of PCA, NMF, and k-means, showing the components extracted, as
well as reconstructions of faces from the test set using 100 components. For k-means, the reconstruction
is the closest cluster center found on the training set:
An interesting aspect of vector quantization using k-means is that we can use many more clusters than
input dimensions to encode our data. Let’s go back to the two_moons data. Using PCA or NMF, there is
nothing much we can do to this data, as it lives in only two dimensions. Reducing it to one dimension
with PCA or NMF would completely destroy the structure of the data. But we can find a more expressive
representation with k-means, by using more cluster centers:
We used 10 cluster centers, which means each point is now assigned a number between 0 and 9. We can
see this as the data being represented using 10 components (that is, we have 10 new features), with all
features being 0, apart from the one that represents the cluster center the point is assigned to. Using
this 10-dimensional representation, it would now be possible to separate the two half-moon shapes using a linear model, which would not have been possible using the original two features. It is also
possible to get an even more expressive representation of the data by using the distances to each of the
cluster centers as features. This can be accomplished using the transform method of kmeans:
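A sketch of that idea on the two_moons data (the parameter values here are illustrative):
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X)
distance_features = kmeans.transform(X)   # distance from each point to each of the 10 centers
print(distance_features.shape)            # (200, 10)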
K-means is a widely used clustering algorithm, favored for its simplicity, ease of implementation, and fast
execution. It scales well to large datasets, and scikit-learn offers an even more scalable version called
MiniBatchKMeans, which is capable of handling very large datasets. However, one limitation of k-means
is that it depends on a random initialization, meaning the results can vary depending on the random
seed. To address this, scikit-learn runs the algorithm 10 times with different random initializations and
selects the best outcome. Other drawbacks include the algorithm's restrictive assumptions about cluster
shapes and the need to specify the number of clusters beforehand, which may not always be known in
real-world scenarios.
Agglomerative clustering encompasses a group of algorithms that follow similar principles: the algorithm
begins by treating each data point as its own cluster, then iteratively merges the two most similar clusters
until a predefined stopping condition is met. In scikit-learn, this stopping condition is set by the desired
number of clusters, and clusters are merged until that number is reached. The similarity between clusters
is measured using different linkage criteria, each defining how to evaluate the proximity between two
clusters.
• Ward: The default option, Ward merges clusters in such a way that the increase in the overall
variance within all clusters is minimized. This typically results in clusters of similar size.
• Average: This method merges clusters with the smallest average distance between all points in
the two clusters.
• Complete: Also known as maximum linkage, this criterion merges clusters based on the smallest
maximum distance between any two points in the clusters.
Ward is suitable for most datasets, and will be used in the examples here. However, if the clusters have
significantly different sizes, such as when one cluster is much larger than the others, average or complete
linkage may produce better results.
The following plot demonstrates the process of agglomerative clustering applied to a two-dimensional
dataset, where the goal is to identify three clusters.
Initially, each point is its own cluster. Then, in each step, the two clusters that are closest are merged. In
the first four steps, two single-point clusters are picked and these are joined into two-point clusters. In
step 5, one of the two-point clusters is extended to a third point, and so on. In step 9, there are only
three clusters remaining. As we specified that we are looking for three clusters, the algorithm then stops. Let's have a look at how agglomerative clustering performs on the simple three-cluster data we used here. Because of the way the algorithm works, agglomerative clustering cannot make predictions for new data points. Therefore, AgglomerativeClustering has no predict method. To build the model and get the cluster memberships on the training set, use the fit_predict method instead.
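A sketch of that call (Ward linkage is the default, and make_blobs again stands in for the three-cluster data):
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)
agg = AgglomerativeClustering(n_clusters=3)
assignment = agg.fit_predict(X)     # there is no predict(); fit_predict returns the labels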
As expected, the algorithm recovers the clustering perfectly. While the scikit-learn implementation of
agglomerative clustering requires you to specify the number of clusters you want the algorithm to find,
agglomerative clustering methods provide some help with choosing the right number, which we will
discuss next.
Agglomerative clustering produces what is known as a hierarchical clustering. The clustering proceeds
iteratively, and every point makes a journey from being a single point cluster to belonging to some final
cluster. Each intermediate step provides a clustering of the data (with a different number of clusters). It
is sometimes helpful to look at all possible clusterings jointly. The next example shows an overlay of all
the possible clusterings shown before, providing some insight into how each cluster breaks up into
smaller clusters:
While this visualization provides a very detailed view of the hierarchical clustering, it relies on the two-
dimensional nature of the data and therefore cannot be used on datasets that have more than two
features. There is, however, another tool to visualize hierarchical clustering, called a dendrogram, that
can handle multidimensional datasets.
Unfortunately, scikit-learn currently does not have the functionality to draw dendrograms. However, you
can generate them easily using SciPy. The SciPy clustering algorithms have a slightly different interface to
the scikit-learn clustering algorithms. SciPy provides a function that takes a data array X and computes a
linkage array, which encodes hierarchical cluster similarities. We can then feed this linkage array into the
scipy dendrogram function to plot the dendrogram:
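A sketch of that workflow using SciPy's ward linkage (a twelve-point blob dataset stands in for the toy data discussed below):
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=0, n_samples=12)
linkage_array = ward(X)       # encodes the cluster distances bridged at each merge
dendrogram(linkage_array)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()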
The dendrogram illustrates the data points as numbered from 0 to 11 at the bottom. A tree structure is
then created with these points, each representing individual clusters, and new parent nodes are added
as pairs of clusters are merged. Starting from the bottom and moving upward, the first merger involves
points 1 and 4. Next, points 6 and 9 form a cluster, and this merging continues. At the top level, there are
two main branches: one includes points 11, 0, 5, 10, 7, 6, and 9, while the other consists of points 1, 4,
3, 2, and 8. These two branches represent the two largest clusters on the left side of the plot.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another powerful clustering
algorithm. Its key advantages are that it does not require specifying the number of clusters beforehand,
it can identify clusters with complex shapes, and it can detect outliers or points that don't belong to any
cluster. Although DBSCAN is slower than both agglomerative clustering and k-means, it can still handle
relatively large datasets.
DBSCAN operates by identifying dense regions in the feature space, where many data points are located
close to each other. These dense regions are considered potential clusters, separated by areas that are
relatively sparse. Points within a dense region are termed core samples (or core points). Two key
parameters control DBSCAN: min_samples and eps. If there are at least min_samples points within a
distance of eps from a given point, it is labeled as a core sample.
The algorithm starts by selecting an arbitrary point and finds all points within a distance of eps. If fewer
than min_samples points are found within this radius, the point is marked as noise, meaning it does not
belong to any cluster. If more than min_samples points are within eps, the point is labeled as a core
sample and assigned to a new cluster. The algorithm then visits all neighboring points within eps. If those
neighbors haven't been assigned a cluster, they are given the same cluster label. If any of the neighbors
are core samples, their neighbors are recursively visited, and the cluster grows until there are no more
core samples within eps. The process repeats with an unvisited point, and the algorithm continues until
all points have been processed.
In the end, there are three kinds of points: core points, points that are within distance eps of core points
(called boundary points), and noise. When the DBSCAN algorithm is run on a particular dataset multiple
times, the clustering of the core points is always the same, and the same points will always be labeled as
noise. However, a boundary point might be neighbor to core samples of more than one cluster. Therefore,
the cluster membership of boundary points depends on the order in which points are visited. Usually
there are only few boundary points, and this slight dependence on the order of points is not important.
Let’s apply DBSCAN on the synthetic dataset we used to demonstrate agglomerative clustering. Like
agglomerative clustering, DBSCAN does not allow predictions on new test data, so we will use the
fit_predict method to perform clustering and return the cluster labels in one step:
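A sketch of that call (the default eps and min_samples are used here, which is exactly what produces the all-noise result discussed next):
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, y = make_blobs(random_state=0, n_samples=12)
dbscan = DBSCAN()                 # defaults: eps=0.5, min_samples=5
clusters = dbscan.fit_predict(X)
print(clusters)                   # all -1: every point is labeled as noise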
As you can see, all data points were assigned the label -1, which stands for noise. This is a consequence
of the default parameter settings for eps and min_samples, which are not tuned for small toy datasets.
The cluster assignments for different values of min_samples and eps are shown below, and visualized:
In this plot, points belonging to clusters are represented by solid markers, while noise points are shown
in white. Core samples are indicated by larger markers, and boundary points are represented by smaller
ones. As eps increases (moving from left to right in the figure), the clusters expand to include more
points, but this may also lead to the merging of distinct clusters into one. Conversely,
increasing min_samples (moving from top to bottom) results in fewer points being classified as core
samples, and more points being labeled as noise.
The eps parameter is typically more influential, as it sets the proximity threshold for points to be
considered as part of the same cluster. If eps is too small, no points may qualify as core samples, and all
points could be labeled as noise. On the other hand, if eps is too large, all points may end up in a single
cluster.
The min_samples parameter mainly determines whether points in less dense areas are treated as outliers or allowed to form their own clusters. Increasing min_samples causes smaller groups of points to be labeled as noise rather than clusters. For instance, when min_samples is set to 3, three clusters are formed: one with four points, one with five points, and one with three points. However, when min_samples is increased to 5, the smaller clusters (with three and four points) are considered noise, leaving only the cluster with five points.
Although DBSCAN does not require specifying the number of clusters directly, adjusting eps affects the
number of clusters identified. Finding the optimal eps value can be easier when the data is scaled using
methods like StandardScaler or MinMaxScaler, as these techniques ensure that all features are on a
similar scale. The outcome of running DBSCAN on the two_moons dataset is shown here, where the
algorithm successfully identifies the two half-circles and separates them using the given settings.
One of the challenges in applying clustering algorithms is that it is very hard to assess how well an
algorithm worked, and to compare outcomes between different algorithms. After talking about the
algorithms behind k-means, agglomerative clustering, and DBSCAN, we will now compare them on some
real-world datasets.
When a ground-truth clustering is available, the result can be assessed quantitatively with metrics such as the adjusted rand index (ARI) or normalized mutual information (NMI). The adjusted rand index provides intuitive results, with a random cluster assignment having a score of 0 and DBSCAN (which recovers the desired clustering perfectly) having a score of 1.
A common mistake when evaluating clustering in this way is to use accuracy_score instead of
adjusted_rand_score, normalized_mutual_info_score, or some other clustering metric. The problem in
using accuracy is that it requires the assigned cluster labels to exactly match the ground truth. However,
the cluster labels themselves are meaningless—the only thing that matters is which points are in the
same cluster:
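A small sketch illustrating the difference, using two clusterings that group the points identically but with swapped label names:
from sklearn.metrics import accuracy_score
from sklearn.metrics.cluster import adjusted_rand_score

clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]
print(accuracy_score(clusters1, clusters2))        # 0.0, even though the grouping is the same
print(adjusted_rand_score(clusters1, clusters2))   # 1.0, because the partitions are identical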
Evaluating clustering without ground truth
Although we have just shown one way to evaluate clustering algorithms, in practice, there is a big problem with using measures like ARI. When applying clustering algorithms, there is usually no ground truth to which to compare the results. If we knew the right
clustering of the data, we could use this information to build a supervised model like a classifier.
Therefore, using metrics like ARI and NMI usually only helps in developing algorithms, not in assessing
success in an application. There are scoring metrics for clustering that don’t require ground truth, like
the silhouette coefficient. However, these often don't work well in practice. The silhouette score
computes the compactness of a cluster, where higher is better, with a perfect score of 1. While compact
clusters are good, compactness doesn’t allow for complex shapes. Here is an example comparing the
outcome of k-means, agglomerative clustering, and DBSCAN on the two-moons dataset using the
silhouette score:
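A sketch of that comparison (the data is scaled first, as suggested earlier; n_clusters=2 is the natural choice for the two half-moons):
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

for model in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
    labels = model.fit_predict(X_scaled)
    print(type(model).__name__, "silhouette: {:.2f}".format(silhouette_score(X_scaled, labels)))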
As observed, k-means achieves the highest silhouette score, even though we may prefer the results
produced by DBSCAN. A more effective approach for evaluating clusters is to use robustness-based
clustering metrics. These metrics involve running an algorithm after adding noise to the data or using
different parameter settings and comparing the results. The idea is that if many algorithm configurations
and variations in the data produce the same outcome, the result is likely reliable. Unfortunately, this
approach is not available in scikit-learn as of this writing.
Even with a very robust clustering or a high silhouette score, we still cannot determine if the clustering
has any semantic meaning or if it reflects an aspect of the data we care about. Returning to the face
image example, we might want to identify groups of similar faces, such as distinguishing between men
and women, young and old, or people with beards versus those without. Suppose we cluster the data
into two groups, and all algorithms agree on which points should be grouped together. However, we still
cannot be certain that the clusters correspond to the concepts we are interested in. The clusters could
have been formed based on factors like side views versus front views, photos taken at night versus during
the day, or pictures captured with different types of phones (iPhones versus Androids). The only way to
determine whether the clustering aligns with our interests is through manual analysis of the clusters.
Comparing algorithms on the faces dataset
Let's apply the k-means, DBSCAN, and agglomerative
clustering algorithms to the Labeled Faces in the Wild dataset, and see if any of them find interesting
structure. We will use the eigenface representation of the data, as produced by PCA(whiten=True), with
100 components:
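One way to build such a representation looks roughly like this (the exact Labeled Faces in the Wild subset used in the text may differ, and the names X_people and X_pca are ours):
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)   # downloads on first use
X_people = people.data / 255.          # scale the grayscale pixel values to the 0-1 range

pca = PCA(n_components=100, whiten=True, random_state=0)
X_pca = pca.fit_transform(X_people)    # the 100-dimensional eigenface representation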
We saw earlier that this is a more semantic representation of the face images than the raw pixels. It will
also make computation faster. A good exercise would be for you to run the following experiments on the
original data, without PCA, and see if you find similar clusters.
Analyzing the faces dataset with DBSCAN. We will start by applying DBSCAN, which we just discussed:
We see that all the returned labels are –1, so all of the data was labeled as “noise” by DBSCAN. There are
two things we can change to help this: we can make eps higher, to expand the neighborhood of each
point, and set min_samples lower, to consider smaller groups of points as clusters. Let’s try changing
min_samples first:
Even when considering groups of three points, everything is labeled as noise. So, we need to increase
eps:
Using a much larger eps of 15, we get only a single cluster and noise points. We can use this result to find
out what the “noise” looks like compared to the rest of the data. To understand better what’s happening,
let’s look at how many points are noise, and how many points are inside the cluster:
Comparing these images to the random sample of face images, we can guess why they were labeled as
noise: the fifth image in the first row shows a person drinking from a glass, there are images of people
wearing hats, and in the last image there’s a hand in front of the person’s face. The other images contain
odd angles or crops that are too close or too wide. This kind of analysis—trying to find “the odd one
out”—is called outlier detection. If this was a real application, we might try to do a better job of cropping
images, to get more homogeneous data. There is little we can do about people in photos sometimes
wearing hats, drinking, or holding something in front of their faces, but it’s good to know that these are
issues in the data that any algorithm we might apply needs to handle. If we want to find more interesting
clusters than just one large one, we need to set eps smaller, somewhere between 15 and 0.5 (the default).
Let’s have a look at what different values of eps result in:
For low settings of eps, all points are labeled as noise. For eps=7, we get many noise points and many
smaller clusters. For eps=9 we still get many noise points, but we get one big cluster and some smaller
clusters. Starting from eps=11, we get only one large cluster and noise. What is interesting to note is that
there is never more than one large cluster. At most, there is one large cluster containing most of the
points, and some smaller clusters. This indicates that there are not two or three different kinds of face
images in the data that are very distinct, but rather that all images are more or less equally similar to (or
dissimilar from) the rest. The results for eps=7 look most interesting, with many small clusters. We can
investigate this clustering in more detail by visualizing all of the points in each of the 13 small clusters:
Some of the clusters correspond to people with very distinct faces (within this dataset), such as Sharon
or Koizumi. Within each cluster, the orientation of the face is also quite fixed, as well as the facial
expression. Some of the clusters contain faces of multiple people, but they share a similar orientation
and expression. This concludes our analysis of the DBSCAN algorithm applied to the faces dataset. As you
can see, we are doing a manual analysis here, different from the much more automatic search approach
we could use for supervised learning based on the R² score or accuracy. Let's move on to applying k-means and agglomerative clustering.
Analyzing the faces dataset with k-means. We saw that it was not possible
to create more than one big cluster using DBSCAN. Agglomerative clustering and k-means are much more
likely to create clusters of even size, but we do need to set a target number of clusters. We could set the
number of clusters to the known number of people in the dataset, though it is very unlikely that an
unsupervised clustering algorithm will recover them. Instead, we can start with a low number of clusters,
like 10, which might allow us to analyze each of the clusters:
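A sketch of that step (X_pca is the 100-dimensional PCA representation built above):
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=10, random_state=0)
labels_km = km.fit_predict(X_pca)
print("Cluster sizes k-means:", np.bincount(labels_km))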
As you can see, k-means clustering partitioned the data into relatively similarly sized clusters from 64 to
386. This is quite different from the result of DBSCAN. We can further analyze the outcome of k-means
by visualizing the cluster centers. As we clustered in the representation produced by PCA, we need to
rotate the cluster centers back into the original space to visualize them, using pca.inverse_transform:
The cluster centers found by k-means are very smooth versions of faces. This is not very surprising, given
that each center is an average of 64 to 386 face images. Working with a reduced PCA representation adds
to the smoothness of the images (compared to the faces reconstructed using 100 PCA dimensions). The
clustering seems to pick up on different orientations of the face, different expressions (the third cluster
center seems to show a smiling face), and the presence of shirt collars (see the second-to-last cluster
center). For a more detailed view, we show for each cluster center the five most typical images in the
cluster (the images assigned to the cluster that are closest to the cluster center) and the five most atypical
images in the cluster (the images assigned to the cluster that are furthest from the cluster center):
Analyzing the faces dataset with agglomerative clustering. Now, let’s look at the results of agglomerative
clustering:
Agglomerative clustering also produces relatively equally sized clusters, with cluster sizes between 26
and 623. These are more uneven than those produced by k-means, but much more even than the ones
produced by DBSCAN. We can compute the ARI to measure whether the two partitions of the data given
by agglomerative clustering and k-means are similar:
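A sketch of that comparison (labels_km comes from the k-means run above):
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.cluster import adjusted_rand_score

agglomerative = AgglomerativeClustering(n_clusters=10)
labels_agg = agglomerative.fit_predict(X_pca)
print("ARI: {:.2f}".format(adjusted_rand_score(labels_agg, labels_km)))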
An ARI of only 0.13 means that the two clusterings labels_agg and labels_km have little in common. This
is not very surprising, given the fact that points further away from the cluster centers seem to have little
in common for k-means. Next, we might want to plot the dendrogram. We’ll limit the depth of the tree
in the plot, as branching down to the individual 2,063 data points would result in an unreadably dense
plot:
Creating 10 clusters, we cut across the tree at the very top, where there are 10 vertical lines. In the
dendrogram for the toy data, you could see by the length of the branches that two or three clusters might
capture the data appropriately. For the faces data, there doesn’t seem to be a very natural cutoff point.
There are some branches that represent more distinct groups, but there doesn’t appear to be a particular
number of clusters that is a good fit. This is not surprising, given the results of DBSCAN, which tried to
cluster all points together. Let’s visualize the 10 clusters, as we did for k-means earlier. Note that there is
no notion of cluster center in agglomerative clustering (though we could compute the mean), and we
simply show the first couple of points in each cluster. We show the number of points in each cluster to
the left of the first image:
While some of the clusters seem to have a semantic theme, many of them are too large to be actually
homogeneous. To get more homogeneous clusters, we can run the algorithm again, this time with 40
clusters, and pick out some of the clusters that are particularly interesting:
Summary of Clustering Methods
This section highlighted the fact that clustering is a largely qualitative process, often most useful during
the exploratory phase of data analysis. We explored three clustering algorithms: k-means, DBSCAN, and
agglomerative clustering. Each method offers a way to control the level of granularity in clustering. While
k-means and agglomerative clustering allow you to specify the number of clusters, DBSCAN lets you
define proximity using the eps parameter, which indirectly affects the cluster size. All three algorithms
are capable of handling large, real-world datasets, are relatively easy to understand, and support
clustering into multiple groups.
Each algorithm has its own unique strengths. K-means provides a way to characterize clusters by their
centers and can be seen as a decomposition method, representing each data point by its cluster's center.
DBSCAN has the advantage of detecting "noise points" that do not belong to any cluster and can
automatically determine the number of clusters. Unlike the other two methods, DBSCAN can identify
clusters with complex shapes, as demonstrated in the two_moons example. However, DBSCAN can
sometimes create clusters of vastly different sizes, which may be either a benefit or a drawback.
Agglomerative clustering, on the other hand, offers a hierarchical view of possible data partitions, which
can be easily examined using dendrograms.
This chapter introduced various unsupervised learning algorithms useful for exploratory data analysis
and preprocessing. Having the right data representation is often key to the success of both supervised
and unsupervised learning, with preprocessing and decomposition methods playing a critical role in data
preparation.
Decomposition, manifold learning, and clustering are essential tools for understanding data, especially
when supervision information is absent. Even in supervised settings, exploratory methods are valuable
for gaining insights into the data's properties. While it can be difficult to quantify the usefulness of
unsupervised algorithms, their application can reveal valuable insights from your data. With these tools,
you are now equipped with the fundamental algorithms that machine learning practitioners use
regularly.
We encourage you to experiment with clustering and decomposition methods on both two-dimensional
toy datasets and real-world datasets available in scikit-learn, such as the digits, iris, and cancer datasets.
Here are some exercises for Chapter 3: Unsupervised Learning & Preprocessing.
Exercises:
1. K-Means Clustering
• Task: Apply the K-Means clustering algorithm on a dataset like the Iris or Digits dataset from
scikit-learn.
• Steps:
1. Load the dataset and visualize the data (e.g., using PCA for dimensionality reduction to 2D
or 3D).
2. Use K-Means clustering to group the data into clusters.
3. Evaluate the performance by comparing the true labels with the predicted clusters (if
available), or use metrics such as silhouette score or Davies-Bouldin index.
4. Vary the number of clusters (k) and observe how the clustering performance changes.
2. Hierarchical Clustering
• Steps:
4. Compare the results to K-Means clustering, and discuss the advantages and disadvantages
of both methods.
• Steps:
2. Apply DBSCAN and experiment with different values of eps (the maximum distance
between points to be considered as neighbors) and min_samples (the minimum number
of points to form a cluster).
3. Visualize the results and identify the number of clusters and noise points.
4. Compare DBSCAN's results with K-Means, noting how DBSCAN handles noise and
irregular-shaped clusters.
• Task: Use Principal Component Analysis (PCA) to reduce the dimensionality of a dataset
(e.g., Wine or Digits dataset).
• Steps:
3. Discuss how much variance is captured by the first few principal components
(use explained_variance_ratio_).
4. Compare the performance of K-Means clustering on the reduced dataset versus the
original high-dimensional dataset.
• Task: Preprocess a dataset by scaling and normalizing features before applying an unsupervised
learning algorithm.
• Steps:
4. Perform clustering (e.g., K-Means) on the original dataset and the preprocessed dataset.
5. Evaluate and compare the clustering results. Discuss the impact of scaling and
normalization on clustering performance.
• Task: Apply t-SNE (t-distributed Stochastic Neighbor Embedding) to visualize high-dimensional
data in two or three dimensions.
• Steps:
3. Visualize the result and color the points according to their true labels.
4. Discuss how t-SNE preserves the local structure of the data and helps visualize complex
high-dimensional relationships.
• Steps:
4. Compare the results with clustering on the original features and discuss how feature
engineering impacted the clustering.
• Steps:
1. Generate a synthetic dataset with outliers (use make_blobs with added noise or another
anomaly-detection dataset).
3. Visualize the results and evaluate the performance by checking how well the algorithm
identifies the anomalies.
4. Discuss the advantages and disadvantages of Isolation Forest compared to other anomaly
detection techniques like One-Class SVM or DBSCAN.
• Task: Use the Elbow method to determine the optimal number of clusters for K-Means.
• Steps:
3. Plot the sum of squared distances (inertia) versus k and identify the "elbow" point.
4. Discuss the results and explain how the elbow method helps in determining the number
of clusters.
• Task: Apply a Gaussian Mixture Model (GMM) to a dataset and compare it with K-Means.
• Steps:
3. Compare the results of GMM with K-Means clustering (e.g., comparing cluster
assignments, silhouette score).
4. Discuss how GMM differs from K-Means in terms of flexibility and the assumptions it
makes about the data.
Conceptual Questions:
1. Clustering Algorithms
• What is the difference between K-Means clustering and DBSCAN? When would you use
one over the other?
• Explain how hierarchical clustering works. What are the advantages and disadvantages of
using this method over K-Means or DBSCAN?
• How does DBSCAN handle noise and outliers differently from K-Means?
2. Dimensionality Reduction
• Why is dimensionality reduction important in unsupervised learning? How does PCA help
in improving the performance of machine learning models?
• What are the limitations of PCA, and how does it compare with t-SNE for visualizing high-
dimensional data?
3. Feature Engineering
• What is feature scaling, and why is it important when applying clustering algorithms?
4. Anomaly Detection
• What is anomaly detection, and how can algorithms like Isolation Forest or One-Class SVM
be used for this purpose?
• How does anomaly detection relate to unsupervised learning, and what types of
applications would benefit from these methods?
5. Clustering Evaluation
• How do you evaluate clustering algorithms like K-Means or DBSCAN when you don't have
ground truth labels?
• Explain the concepts of silhouette score, Davies-Bouldin index, and inertia, and how they
are used to evaluate clustering results.
These exercises and questions will help you practice and deepen your understanding of unsupervised
learning techniques, data preprocessing, and clustering algorithms.
6 Chapter Six: Deep Learning
LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:
6.2 Perceptrons
• The simplest neural network is the perceptron, which approximates a single neuron with n binary
inputs.
• It computes a weighted sum of its inputs and “fires” if that weighted sum is 0 or greater; that is, its output is 1 when dot(weights, x) + bias >= 0 and 0 otherwise.
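A minimal sketch of that rule (Vector and dot are small helpers in the style of the prescribed text, re-declared here so the snippet stands alone):
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def step_function(x: float) -> float:
    return 1.0 if x >= 0 else 0.0

def perceptron_output(weights: Vector, bias: float, x: Vector) -> float:
    """Returns 1 if the perceptron 'fires', 0 if not."""
    return step_function(dot(weights, x) + bias)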
With properly chosen weights, perceptrons can solve several simple problems. For example, we can
create an AND gate (which returns 1 if both its inputs are 1 but returns 0 if one of its inputs is 0) with:
and_weights = [2., 2]
and_bias = -3.
If both inputs are 1, the calculation equals 2 + 2 – 3 = 1, and the output is 1. If only one of the inputs
is 1, the calculation equals 2 + 0 – 3 = –1, and the output is 0. And if both of the inputs are 0, the
calculation equals –3, and the output is 0.
Similarly, we could build an OR gate (which outputs 1 if at least one of its inputs is 1) with:
or_weights = [2., 2]
or_bias = -1.
We could also build a NOT gate (which has one input and converts 1 to 0 and 0 to 1) with:
not_weights = [-2.]
not_bias = 1.
However, there are some problems that simply can’t be solved by a single perceptron. For example,
no matter how hard you try, you cannot use a perceptron to build an XOR gate that outputs 1 if
exactly one of its inputs is 1 and 0 otherwise. This is where we start needing more complicated neural
networks. Of course, you don’t need to approximate a neuron to build a logic gate:
and_gate = min
or_gate = max
xor_gate = lambda x, y: 0 if x == y else 1
6.3 Feed-Forward Neural Networks
The topology of the brain is enormously complicated, so it’s common to approximate it with
an idealized feed-forward neural network that consists of discrete layers of neurons, each
connected to the next. This typically entails an input layer (which receives inputs and feeds
them forward unchanged), one or more “hidden layers” (each of which consists of neurons
that take the outputs of the previous layer, performs some calculation, and passes the result
to the next layer), and an output layer (which produces the final outputs).
Just like in the perceptron, each (noninput) neuron has a weight corresponding to each of its
inputs and a bias. To make our representation simpler, we’ll add the bias to the end of our
weights vector and give each neuron a bias input that always equals 1.
As with the perceptron, for each neuron we'll sum up the products of its inputs and its weights. But here, rather than outputting the step_function applied to that sum, we'll output a smooth approximation of it. Here we'll use the sigmoid function (Figure 4.1):
import math
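# A minimal sketch of the functions described above (Vector and dot are assumed
# to be the helpers used earlier in this chapter):
def sigmoid(t: float) -> float:
    return 1 / (1 + math.exp(-t))

def neuron_output(weights: Vector, inputs: Vector) -> float:
    # weights includes the bias weight, and inputs includes a trailing 1
    return sigmoid(dot(weights, inputs))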
Figure 4.1. The sigmoid function
Given this function, we can represent a neuron simply as a vector of weights whose length is
one more than the number of inputs to that neuron (because of the bias weight). Then we
can represent a neural network as a list of (noninput) layers, where each layer is just a list of
the neurons in that layer.
That is, we will represent a neural network as a list (layers) of lists (neurons) of
vectors (weights).
Given such a representation, using the neural network is quite simple:
"""
Returns the outputs of all layers (not just the last one).
"""
outputs: List[Vector] = []
# Then the input to the next layer is the output of this one
input_vector = output
return outputs
Now it’s easy to build the XOR gate that we couldn’t build with a single perceptron.
We just need to scale the weights up so that the neuron_outputs are either really close
to 0 or really close to 1:
xor_network = [# hidden layer
               [[20., 20, -30],    # 'and' neuron
                [20., 20, -10]],   # 'or' neuron
               # output layer
               [[-60., 60, -30]]]  # '2nd input but not 1st input' neuron
For a given input (which is a two-dimensional vector), the hidden layer produces
a two-dimensional vector consisting of the “and” of the two input values and the
“or” of the two input values.
And the output layer takes a two-dimensional vector and computes “second
element but not first element.” The result is a network that performs “or, but not
and,” which is precisely XOR (Figure 4.2).
Figure 4.2. A neural network for XOR
• The hidden layer is computing features of the input data (in this case “and” and “or”)
and the output layer is combining those features in a way that generates the desired
output.
6.4 Backpropagation
Usually, we don’t build neural networks by hand. This is in part because we use them to solve much
bigger problems—an image recognition problem might involve hundreds or thousands of neurons. And
it’s in part because we usually won’t be able to “reason out” what the neurons should be.
Instead (as usual) we use data to train neural networks. The typical approach is an algorithm called
backpropagation, which uses gradient descent or one of its variants.
Imagine we have a training set that consists of input vectors and corresponding target output vectors.
For example, in our previous xor_network example, the input vector [1, 0] corresponded to the target
output [1]. Imagine that our network has some set of weights. We then adjust the weights using the
following algorithm:
1. Run feed_forward on an input vector to produce the outputs of all the neurons in the
network.
2. We know the target output, so we can compute a loss that’s the sum of the squared errors.
3. Compute the gradient of this loss as a function of the output neuron’s weights.
4. “Propagate” the gradients and errors backward to compute the gradients with respect to the hidden neurons' weights.
5. Take a gradient descent step to update all of the weights.
Typically, we run this algorithm many times for our entire training set until the network converges.
def sqerror_gradients(network: List[List[Vector]],
                      input_vector: Vector,
                      target_vector: Vector) -> List[List[Vector]]:
    """Make a prediction and compute the gradient of the squared error loss
    with respect to the neuron weights."""
    # forward pass
    hidden_outputs, outputs = feed_forward(network, input_vector)
    # output deltas: sigmoid derivative times the prediction error
    output_deltas = [output * (1 - output) * (output - target)
                     for output, target in zip(outputs, target_vector)]
    output_grads = [[output_deltas[i] * h for h in hidden_outputs + [1]]
                    for i, _ in enumerate(network[-1])]
    # hidden deltas: push the output deltas back through the output weights
    hidden_deltas = [h * (1 - h) * dot(output_deltas, [n[i] for n in network[-1]])
                     for i, h in enumerate(hidden_outputs)]
    hidden_grads = [[hidden_deltas[i] * x for x in input_vector + [1]]
                    for i, _ in enumerate(network[0])]
    return [hidden_grads, output_grads]
The math behind the preceding calculations is not terribly difficult, but it involves some tedious calculus
and careful attention to detail.
Armed with the ability to compute gradients, we can now train neural networks. Let’s try to learn the
XOR network we previously designed by hand. We’ll start by generating the training data and initializing
our neural network with random weights:
import random
random.seed(0)   # so the "random" initial weights are reproducible

# training data: the four XOR input/output pairs
xs = [[0., 0], [0., 1], [1., 0], [1., 1]]
ys = [[0.], [1.], [1.], [0.]]

# random initial weights: 2 hidden neurons and 1 output neuron, each with 2 inputs plus a bias
network = [[[random.random() for _ in range(2 + 1)] for _ in range(2)],
           [[random.random() for _ in range(2 + 1)] for _ in range(1)]]
As usual, we can train it using gradient descent. One difference from our previous
examples is that here we have several parameter vectors, each with its own
gradient, which means we’ll have to call gradient_step for each of them.
learning_rate = 1.0
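# A rough sketch of the training loop the text describes (it assumes the
# feed_forward and sqerror_gradients functions above; gradient_step, a standard
# gradient descent helper, is re-declared here for completeness):
def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    # move step_size along the gradient from v (a negative step_size descends)
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

for epoch in range(20000):                    # the number of passes is arbitrary
    for x, y in zip(xs, ys):
        gradients = sqerror_gradients(network, x, y)
        # take a gradient step for each neuron in each layer
        network = [[gradient_step(neuron, grad, -learning_rate)
                    for neuron, grad in zip(layer, layer_grad)]
                   for layer, layer_grad in zip(network, gradients)]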
After training, the resulting network has weights that look roughly like:
[ # hidden layer
  [[7, 7, -3],     # computes OR
   [5, 5, -8]],    # computes AND
  # output layer
  [[11, -12, -5]]  # computes "first but not second"
]
6.5 Tensors
Deep neural networks (DNNs) incorporate multiple hidden layers for solving complex tasks. To build them in a general way, we will work with tensors, which for our purposes are just n-dimensional arrays of numbers. Ideally we would define the type recursively, as "a Tensor is either a float or a list of Tensors".
However, Python won’t let you define recursive types like that. And even if it did, that definition would still not be right, as it would allow for bad “tensors” like:

[[1.0, 2.0],
 [3.0]]

whose rows have different sizes, which makes it not an n-dimensional array. So we’ll simply cheat and define:

Tensor = list

And we’ll write a helper function to find a tensor’s shape:
Because tensors can have any number of dimensions, we’ll typically need to work with them
recursively. We’ll do one thing in the one-dimensional case and recurse in the higher-
dimensional case:
def is_1d(tensor: Tensor) -> bool:
    """A tensor is 1-dimensional if its first element is not a list."""
    return not isinstance(tensor[0], list)

def tensor_sum(tensor: Tensor) -> float:
    """Sums up all the values in the tensor."""
    if is_1d(tensor):
        return sum(tensor)  # just a list of floats, use Python sum
    else:
        return sum(tensor_sum(tensor_i)     # Call tensor_sum on each row
                   for tensor_i in tensor)  # and sum up those results.
If you’re not used to thinking recursively, you should ponder this until it makes sense, because
we’ll use the same logic throughout this chapter. However, we’ll create a couple of helper
functions so that we don’t have to rewrite this logic everywhere. The first applies a function
elementwise to a single tensor:
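A recursive sketch, following the same is_1d pattern as tensor_sum:

from typing import Callable

def tensor_apply(f: Callable[[float], float], tensor: Tensor) -> Tensor:
    """Applies f elementwise"""
    if is_1d(tensor):
        return [f(x) for x in tensor]
    else:
        return [tensor_apply(f, tensor_i) for tensor_i in tensor]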
assert tensor_apply(lambda x: x + 1, [1, 2, 3]) == [2, 3, 4]
assert tensor_apply(lambda x: 2 * x, [[1, 2], [3, 4]]) == [[2, 4], [6, 8]]
We can use this to write a function that creates a zero tensor with the same shape as a given
tensor:
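For example, reusing tensor_apply:

def zeros_like(tensor: Tensor) -> Tensor:
    return tensor_apply(lambda _: 0.0, tensor)

assert zeros_like([1, 2, 3]) == [0, 0, 0]
assert zeros_like([[1, 2], [3, 4]]) == [[0, 0], [0, 0]]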
We’ll also need to apply a function to corresponding elements from two tensors (which had
better be the exact same shape, although we won’t check that):
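A sketch of that function, again recursing on the structure of the first tensor:

from typing import Callable

def tensor_combine(f: Callable[[float, float], float],
                   t1: Tensor,
                   t2: Tensor) -> Tensor:
    """Applies f to corresponding elements of t1 and t2"""
    if is_1d(t1):
        return [f(x, y) for x, y in zip(t1, t2)]
    else:
        return [tensor_combine(f, t1_i, t2_i)
                for t1_i, t2_i in zip(t1, t2)]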
import operator

assert tensor_combine(operator.add, [1, 2, 3], [4, 5, 6]) == [5, 7, 9]
assert tensor_combine(operator.mul, [1, 2, 3], [4, 5, 6]) == [4, 10, 18]
We’d like to think of neural networks as sequences of layers, so let’s come up with a way to
combine multiple layers into one. The resulting neural network is itself a layer, and it
implements the Layer methods in the obvious ways:
from typing import List

class Sequential(Layer):
    """
    A layer consisting of a sequence of other layers.
    It's up to you to make sure that the output of each
    layer makes sense as the input to the next layer.
    """
    def __init__(self, layers: List[Layer]) -> None:
        self.layers = layers

    def forward(self, input):
        """Just forward the input through the layers in order."""
        for layer in self.layers:
            input = layer.forward(input)
        return input
xor_net = Sequential([
    Linear(input_dim=2, output_dim=2),
    Sigmoid(),
    Linear(input_dim=2, output_dim=1),
    Sigmoid()
])
Here we’ll want to experiment with different loss functions, so (as usual) we’ll introduce a new
Loss abstraction that encapsulates both the loss computation and the gradient computation:
class Loss:
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        """How good are our predictions? (Larger numbers are worse.)"""
        raise NotImplementedError

    def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
        """How does the loss change as the predictions change?"""
        raise NotImplementedError
We’ve already worked many times with the loss that’s the sum of the squared errors, so we
should have an easy time implementing that. The only trick is that we’ll need to use
tensor_combine:
class SSE(Loss):
    """Loss function that computes the sum of the squared errors."""
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        # Compute the tensor of squared differences
        squared_errors = tensor_combine(
            lambda predicted, actual: (predicted - actual) ** 2,
            predicted,
            actual)

        # And just add them up
        return tensor_sum(squared_errors)

    def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
        return tensor_combine(
            lambda predicted, actual: 2 * (predicted - actual),
            predicted,
            actual)
Previously we updated parameters by calling gradient_step by hand, but that won’t quite work for us here, for a couple of reasons. The first is that our neural nets will have many parameters, and we’ll need to update all of them. The second is that we’d like to be able to use more clever variants of gradient descent, and we don’t want to have to rewrite them each time. Accordingly, we’ll introduce a (you guessed it) Optimizer abstraction, of which gradient descent will be a specific instance:
class Optimizer:
    """
    An optimizer updates the weights of a layer (in place) using
    information known by either the layer or the optimizer (or by both).
    """
    def step(self, layer: Layer) -> None:
        raise NotImplementedError
After that it’s easy to implement gradient descent, again using tensor_combine:
class GradientDescent(Optimizer):
    def __init__(self, learning_rate: float = 0.1) -> None:
        self.lr = learning_rate

    def step(self, layer: Layer) -> None:
        # Assumes the Layer interface exposes params() and grads().
        for param, grad in zip(layer.params(), layer.grads()):
            # Update param using a gradient step
            param[:] = tensor_combine(
                lambda param, grad: param - grad * self.lr,
                param,
                grad)
The only thing that’s maybe surprising is the “slice assignment,” which is a reflection of the
fact that reassigning a list doesn’t change its original value. That is, if you just did param =
tensor_combine(. . .), you would be redefining the local variable param, but you would not
be affecting the original parameter tensor stored in the layer. If you assign to the slice [:],
however, it actually changes the values inside the list.
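A tiny illustration of the difference:

a = [1, 2, 3]
b = a
b = [4, 5, 6]     # rebinds the name b; a is unchanged
assert a == [1, 2, 3]

c = a
c[:] = [7, 8, 9]  # slice assignment mutates the list in place
assert a == [7, 8, 9]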
If you are somewhat inexperienced in Python, this behavior may be surprising, so meditate
on it and try examples yourself until it makes sense. To demonstrate the value of this
abstraction, let’s implement another optimizer that uses momentum. The idea is that we
don’t want to overreact to each new gradient, and so we maintain a running average of the
gradients we’ve seen, updating it with each new gradient and taking a step in the direction
of the average:
class Momentum(Optimizer):
    def __init__(self,
                 learning_rate: float,
                 momentum: float = 0.9) -> None:
        self.lr = learning_rate
        self.mo = momentum
        self.updates: List[Tensor] = []  # running average
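    # A sketch of the corresponding step method; like GradientDescent, it
    # assumes the Layer interface exposes params() and grads(), and it uses
    # the zeros_like helper sketched earlier.
    def step(self, layer: Layer) -> None:
        # If we have no previous updates, start with all zeros.
        if not self.updates:
            self.updates = [zeros_like(grad) for grad in layer.grads()]

        for update, param, grad in zip(self.updates,
                                       layer.params(),
                                       layer.grads()):
            # Apply momentum: a running average of the gradients.
            update[:] = tensor_combine(
                lambda u, g: self.mo * u + (1 - self.mo) * g,
                update,
                grad)

            # Then take a gradient step in the direction of the average.
            param[:] = tensor_combine(
                lambda p, u: p - self.lr * u,
                param,
                update)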
Because we used an Optimizer abstraction, we can easily switch between our different optimizers.
Let’s see how easy it is to use our new framework to train a network that can compute XOR.
We start by re-creating the training data:
# training data
xs = [[0., 0], [0., 1], [1., 0], [1., 1]]
ys = [[0.], [1.], [1.], [0.]]
and then we define the network, although now we can leave off the last sigmoid layer:
random.seed(0)
net = Sequential([
    Linear(input_dim=2, output_dim=2),
    Sigmoid(),
    Linear(input_dim=2, output_dim=1)
])
We can now write a simple training loop, except that now we can use the abstractions of
Optimizer and Loss. This allows us to easily try different ones:
import tqdm

optimizer = GradientDescent(learning_rate=0.1)
loss = SSE()

with tqdm.trange(3000) as t:
    for epoch in t:
        epoch_loss = 0.0
        for x, y in zip(xs, ys):
            predicted = net.forward(x)
            epoch_loss += loss.loss(predicted, y)
            gradient = loss.gradient(predicted, y)
            net.backward(gradient)
            optimizer.step(net)
        t.set_description(f"xor loss {epoch_loss:.3f}")
This should train quickly, and you should see the loss go down. And now we can inspect
the weights:
for param in net.params():
    print(param)
So hidden1 activates if neither input is 1. hidden2 activates if both inputs are 1. And output
activates if neither hidden output is 1—that is, if it’s not the case that neither input is 1 and
it’s also not the case that both inputs are 1. Indeed, this is exactly the logic of XOR.
The sigmoid function has fallen out of favor for a couple of reasons. One reason is that
sigmoid(0) equals 1/2, which means that a neuron whose inputs sum to 0 has a positive
output. Another is that its gradient is very close to 0 for very large and very small inputs,
which means that its gradients can get “saturated”, and its weights can get stuck.
import math

def tanh(x: float) -> float:
    em2x = math.exp(-2 * x)
    return (1 - em2x) / (1 + em2x)
class Tanh(Layer):
    def forward(self, input: Tensor) -> Tensor:
        # Save tanh output to use in backward pass.
        self.tanh = tensor_apply(tanh, input)
        return self.tanh
class Relu(Layer):
    def forward(self, input: Tensor) -> Tensor:
        # Save the input to use in the backward pass.
        self.input = input
        return tensor_apply(lambda x: max(x, 0), input)
A neural network can output a vector that is entirely 0s, or one that is entirely 1s. But when we’re doing classification problems, we’d like to output a 1 for the correct class and a 0 for all the incorrect classes. Generally, our predictions will not be so perfect, but we’d at least like to predict an actual probability distribution over the classes. For example, if we have two classes and our model outputs [0, 0], it’s hard to make much sense of that: it doesn’t think the output belongs in either class.
But if our model outputs [0.4, 0.6], we can interpret it as a prediction that there’s a probability
of 0.4 that our input belongs to the first class and 0.6 that our input belongs to the second
class. In order to accomplish this, we typically forgo the final Sigmoid layer and instead use
the softmax function, which converts a vector of real numbers to a vector of probabilities.
We compute exp(x) for each number in the vector, which results in a vector of positive
numbers. After that, we just divide each of those positive numbers by the sum, which gives
us a bunch of positive numbers that add up to 1—that is, a vector of probabilities. If we ever
end up trying to compute, say, exp(1000) we will get a Python error, so before taking the exp
we subtract off the largest value. This turns out to result in the same probabilities; it’s just
safer to compute in Python:
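A sketch of such a softmax for our list-based tensors, subtracting the largest value before exponentiating as described above:

import math

def softmax(tensor: Tensor) -> Tensor:
    """Softmax along the last dimension"""
    if is_1d(tensor):
        # Subtract the largest value for numerical stability.
        largest = max(tensor)
        exps = [math.exp(x - largest) for x in tensor]

        sum_of_exps = sum(exps)              # This is the total "weight".
        return [exp_i / sum_of_exps          # Probability is the fraction
                for exp_i in exps]           # of the total weight.
    else:
        return [softmax(tensor_i) for tensor_i in tensor]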
Once our network produces probabilities, we often use a different loss function called
cross-entropy (or sometimes “negative log likelihood”). If our network outputs are probabilities, the cross-
entropy loss represents the negative log likelihood of the observed data, which means that minimizing
that loss is the same as maximizing the log likelihood (and hence the likelihood) of the training data.
Typically, we won’t include the softmax function as part of the neural network itself. This is because it
turns out that if softmax is part of your loss function but not part of the network itself, the gradients of
the loss with respect to the network outputs are very easy to compute.
class SoftmaxCrossEntropy(Loss):
    """
    This is the negative log likelihood of the observed values, given the
    neural net model. So if we choose weights to minimize it, our model
    will be maximizing the likelihood of the observed data.
    """
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        # Apply softmax to get probabilities
        probabilities = softmax(predicted)

        # This will be log p_i for the actual class i and 0 for the other
        # classes. We add a tiny amount to p to avoid taking log(0).
        likelihoods = tensor_combine(lambda p, act: math.log(p + 1e-30) * act,
                                     probabilities,
                                     actual)

        # And then we just sum up the negatives.
        return -tensor_sum(likelihoods)

    def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
        probabilities = softmax(predicted)
        # For a one-hot actual vector, the gradient is just p - actual.
        return tensor_combine(lambda p, actual: p - actual,
                              probabilities,
                              actual)
6.10 Dropout
Like most machine learning models, neural networks are prone to overfitting to their training
data. Regularization can be used to penalize large weights, which helps prevent
overfitting. A common way of regularizing neural networks is dropout. At training time,
we randomly turn off each neuron (that is, replace its output with 0) with some fixed
probability. This means that the network can’t learn to depend on any individual neuron,
which seems to help with overfitting.
At evaluation time, we don’t want to dropout any neurons, so a Dropout layer will need to
know whether it’s training or not. In addition, at training time a Dropout layer only passes on
some random fraction of its input. To make its output comparable during evaluation, we’ll
scale down the outputs (uniformly) using that same fraction:
class Dropout(Layer):
    def __init__(self, p: float) -> None:
        self.p = p
        self.train = True
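    # A sketch of the forward pass described above: at training time, zero out
    # each value with probability p via a random 0/1 mask; at evaluation time,
    # scale everything down by (1 - p) instead. Assumes random and operator are
    # imported and that tensor_apply / tensor_combine are defined as above.
    def forward(self, input: Tensor) -> Tensor:
        if self.train:
            # Create a mask of 0s and 1s shaped like the input.
            self.mask = tensor_apply(
                lambda _: 0 if random.random() < self.p else 1,
                input)
            # Multiply by the mask to drop out the corresponding values.
            return tensor_combine(operator.mul, input, self.mask)
        else:
            # During evaluation just scale down the outputs uniformly.
            return tensor_apply(lambda x: x * (1 - self.p), input)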
    def backward(self, gradient: Tensor) -> Tensor:
        if self.train:
            # Only propagate gradients where the mask was 1.
            return tensor_combine(operator.mul, gradient, self.mask)
        else:
            raise RuntimeError("don't call backward when not in train mode")
We’ll use this to help prevent our deep learning models from overfitting.
Example: MNIST
MNIST is a dataset of handwritten digits that everyone uses to learn deep learning. It is
available in a somewhat tricky binary format, so we’ll install the mnist library to work with it.
(Yes, this part is technically not “from scratch.”)
import mnist
# This will download the data; change this to where you want it.
# (Yes, it's a 0-argument function, that's what the library expects.)
# (Yes, I'm assigning a lambda to a variable, like I said never to do.)
mnist.temporary_dir = lambda: '/tmp'
# Each of these functions first downloads the data and returns a numpy array.
# We call .tolist() because our "tensors" are just lists.
train_images = mnist.train_images().tolist()
train_labels = mnist.train_labels().tolist()
Let’s plot the first 100 training images to see what they look like:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(10, 10)

for i in range(10):
    for j in range(10):
        # Plot each image in black and white and hide the axes.
        ax[i][j].imshow(train_images[10 * i + j], cmap='Greys')
        ax[i][j].xaxis.set_visible(False)
        ax[i][j].yaxis.set_visible(False)

plt.show()
Figure: MNIST images
test_images = mnist.test_images().tolist()
test_labels = mnist.test_labels().tolist()
Each image is 28 × 28 pixels, but our linear layers can only deal with one-dimensional inputs,
so we’ll just flatten them (and also divide by 256 to get them between 0 and 1). In addition,
our neural net will train better if our inputs are 0 on average, so we’ll subtract out the average
value:
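One way to do that with our list-based tensors; it uses the tensor_sum and shape helpers from earlier, and the constants 60000 and 28 × 28 are the size and resolution of the MNIST training set:

# Compute the average pixel value over the training images.
avg = tensor_sum(train_images) / 60000 / 28 / 28

# Recenter, rescale, and flatten each 28 x 28 image into a 784-vector.
train_images = [[(pixel - avg) / 256 for row in image for pixel in row]
                for image in train_images]
test_images = [[(pixel - avg) / 256 for row in image for pixel in row]
               for image in test_images]

assert shape(train_images) == [60000, 784]
assert shape(test_images) == [10000, 784]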
We also want to one-hot-encode the targets, since we have 10 outputs. First let’s write a
one_hot_encode function:
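A minimal version, followed by applying it to the labels:

def one_hot_encode(i: int, num_labels: int = 10) -> List[float]:
    return [1.0 if j == i else 0.0 for j in range(num_labels)]

assert one_hot_encode(3) == [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

train_labels = [one_hot_encode(label) for label in train_labels]
test_labels = [one_hot_encode(label) for label in test_labels]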
One of the strengths of our abstractions is that we can use the same training/evaluation loop
with a variety of models. Let’s write that first. We’ll pass it our model, the data, a loss function,
and (if we’re training) an optimizer. It will make a pass through our data, track performance,
and (if we passed in an optimizer) update our parameters:
import tqdm

def loop(model: Layer,
         images: List[Tensor],
         labels: List[Tensor],
         loss: Loss,
         optimizer: Optimizer = None) -> None:
    correct = 0
    total_loss = 0.0

    with tqdm.trange(len(images)) as t:
        for i in t:
            predicted = model.forward(images[i])            # Predict.
            if argmax(predicted) == argmax(labels[i]):      # Check for
                correct += 1                                # correctness.
            total_loss += loss.loss(predicted, labels[i])   # Compute loss.
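            # If an optimizer was passed in, we are training: backpropagate
            # the loss gradient and take an optimizer step. (A sketch based on
            # the Loss/Optimizer abstractions above; argmax is assumed to
            # return the index of the largest value in a vector.)
            if optimizer is not None:
                gradient = loss.gradient(predicted, labels[i])
                model.backward(gradient)
                optimizer.step(model)

            # Show running metrics in the progress bar.
            avg_loss = total_loss / (i + 1)
            acc = correct / (i + 1)
            t.set_description(f"mnist loss: {avg_loss:.3f} acc: {acc:.3f}")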
As a baseline, we can use our deep learning library to train a (multiclass) logistic regression
model, which is just a single linear layer followed by a softmax. This model (in essence) just
looks for 10 linear functions such that if the input represents, say, a 5, then the 5th linear
function produces the largest output. One pass through our 60,000 training examples should
be enough to learn the model:
random.seed(0)
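The baseline model itself can be built from the abstractions above; a minimal sketch (the Momentum hyperparameters are illustrative):

# Logistic regression: one linear layer from 784 pixels to 10 class scores;
# softmax is applied inside the SoftmaxCrossEntropy loss.
model = Linear(784, 10)

loss = SoftmaxCrossEntropy()
optimizer = Momentum(learning_rate=0.01, momentum=0.99)

# One pass through the training data, then evaluate (no optimizer).
loop(model, train_images, train_labels, loss, optimizer)
loop(model, test_images, test_labels, loss)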
This gets about 89% accuracy. Let’s see if we can do better with a deep neural network.
We’ll use two hidden layers, the first with 30 neurons and the second with 10 neurons, and
we’ll use our Tanh activation:
random.seed(0)

# Name the dropout layers so we can switch them between train and evaluate.
dropout1 = Dropout(0.1)
dropout2 = Dropout(0.1)

model = Sequential([
    Linear(784, 30),  # Hidden layer 1: size 30
    dropout1,
    Tanh(),
    Linear(30, 10),   # Hidden layer 2: size 10
    dropout2,
    Tanh(),
    Linear(10, 10)    # Output layer: size 10
])
optimizer = Momentum(learning_rate=0.01, momentum=0.99)
loss = SoftmaxCrossEntropy()
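Training and evaluation then reuse the same loop; the only extra step is toggling the dropout layers' train flag (a sketch, with an illustrative number of training passes):

# Enable dropout while training.
dropout1.train = dropout2.train = True
for _ in range(3):
    loop(model, train_images, train_labels, loss, optimizer)

# Disable dropout when evaluating on the test set.
dropout1.train = dropout2.train = False
loop(model, test_images, test_labels, loss)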
Our deep model gets better than 92% accuracy on the test set, which is a nice improvement
from the simple logistic model. The MNIST website describes a variety of models that
outperform these. Many of them could be implemented using the machinery we’ve
developed so far but would take an extremely long time to train in our lists-as-tensors
framework. Some of the best models involve convolutional layers, which are important but
unfortunately quite out of scope for an introductory book on data science.
These models take a long time to train, so it would be nice if we could save them so that we
don’t have to train them every time. Luckily, we can use the json module to easily serialize
model weights to a file.
For saving, we can iterate over the model's params() to collect the weights, stick them in a list, and use
json.dump to save that list to a file:
import json
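A minimal save helper along those lines; it assumes params() yields the weight tensors, which are plain lists and therefore JSON-serializable:

def save_weights(model: Layer, filename: str) -> None:
    weights = list(model.params())
    with open(filename, 'w') as f:
        json.dump(weights, f)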
Loading the weights back is only a little more work. We just use json.load to get the list of
weights back from the file and slice assignment to set the weights of our model.
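A matching load helper, again a sketch:

def load_weights(model: Layer, filename: str) -> None:
    with open(filename) as f:
        weights = json.load(f)

    # Slice assignment mutates the model's parameter tensors in place.
    for param, weight in zip(model.params(), weights):
        param[:] = weight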
6. What is the difference between a feedforward neural network and a recurrent neural network?
10. Using an example, describe epoch, batch, iteration, vanishing and exploding gradients in deep
learning.
7 Chapter Seven: Large Language Models
LEARNING OUTCOMES
After reading this section of the guide, the learner should be able to:
The evolution of LLMs has led to breakthroughs in tasks such as machine translation, code generation,
conversational AI, and document summarization. The development of models like OpenAI’s GPT
(Generative Pre-trained Transformer), Google’s BERT (Bidirectional Encoder Representations from
Transformers), and Meta’s LLaMA (Large Language Model Meta AI) has significantly transformed how
machines interact with human language.
The field of large language models has evolved rapidly through milestones such as:
✓ Scaling Laws (2021–Present) – Models like GPT-4 and Claude-2 demonstrated that increasing data
and compute leads to emergent capabilities.
The self-attention mechanism allows the model to weigh the importance of different words in a sentence
when generating representations. This mechanism is mathematically defined as:
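\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
\]

where Q, K, and V are the query, key, and value matrices and d_k is the dimensionality of the keys; this is the standard scaled dot-product attention formulation.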
7.2.2 Multi-Head Attention
Multi-head attention enhances the model's ability to capture different contextual relationships by
applying multiple attention heads in parallel.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int) -> None:
        super(MultiHeadAttention, self).__init__()
        # embed_dim and num_heads are illustrative constructor parameters.
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, x):
        # Self-attention: queries, keys, and values all come from x.
        attn_output, _ = self.attention(x, x, x)
        return attn_output
This PyTorch implementation demonstrates a basic multi-head attention mechanism.
1. Pre-training – The model learns general linguistic patterns using massive datasets through self-
supervised learning.
2. Fine-tuning – The model is refined on specific datasets for targeted applications such as legal
document analysis or medical diagnosis.
3. Instruction Tuning – LLMs like GPT-4 are fine-tuned using human-annotated instruction-
following data to improve their ability to follow complex prompts.
from transformers import pipeline

# Generate text (the model choice and prompt here are illustrative)
generator = pipeline("text-generation", model="gpt2")
text = generator("The future of artificial intelligence is", max_length=50)
print(text)
Output
[{'generated_text': 'The future of artificial intelligence is in its infancy, but a growing number of
researchers are trying to bring those improvements to the Internet.\n\nMany projects are using AI
systems to solve complex algorithmic problems of the past, like computer vision and neural networks'}]
This script demonstrates how to generate text using an LLM with minimal setup.
Fine-tuning a model like BERT for sentiment classification can be achieved using the Hugging Face Trainer
API:
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Load dataset
dataset = load_dataset("imdb")

# Load a pretrained tokenizer and model for binary sentiment classification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Training configuration
training_args = TrainingArguments(output_dir="./results", num_train_epochs=3,
                                  per_device_train_batch_size=8)

# Train with the Trainer API
trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized_datasets["train"])
trainer.train()
Prompt engineering is the practice of crafting input prompts to optimize model output.
Example
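As an illustration (these prompts are hypothetical), a vague prompt can be improved by specifying the role, format, and constraints of the desired output:

# Two illustrative prompts for the text-generation pipeline used earlier.
vague_prompt = "Tell me about machine learning."

engineered_prompt = (
    "You are a university tutor. Explain supervised machine learning to a "
    "first-year IT student in exactly three bullet points, each under 20 "
    "words, with one concrete example per point."
)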
RAG enhances LLM performance by integrating external knowledge retrieval, reducing hallucination
risks.
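A minimal sketch of the idea; the retrieve function and generator here are hypothetical components, for example a vector-store search and the text-generation pipeline shown above:

def answer_with_rag(question: str, retrieve, generator) -> str:
    # Fetch supporting passages from an external knowledge store.
    passages = retrieve(question, top_k=3)
    context = "\n".join(passages)

    # Ground the model's answer in the retrieved evidence.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generator(prompt, max_length=200)[0]["generated_text"]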
✓ BLEU and ROUGE – Compare generated text with reference outputs for translation and
summarization tasks.
✓ GLUE and SuperGLUE – NLP benchmarks for model performance across multiple tasks.
# For example, a summarization pipeline whose outputs can be scored with ROUGE
eval_pipeline = pipeline("summarization")
✓ Bias and Fairness – Models may perpetuate societal biases present in training data.
Mitigation Strategies:
7.8 Exercises
3. Use the Hugging Face transformers library to fine-tune a model for question answering on the
SQuAD dataset.
Exercise 3: Ethical Considerations Debate
4. Write an essay analyzing bias in LLMs and propose solutions for responsible AI deployment.
7.9 Summary
This chapter explored the principles, architecture, training, and implementation of Large Language
Models. We covered transformers, fine-tuning techniques, prompt engineering, retrieval-augmented
generation, and ethical considerations. Understanding LLMs is essential for leveraging their capabilities
responsibly in real-world applications.
8 References
Grus, J. (2019) Data Science from Scratch: First Principles with Python. 2nd edn. O’Reilly Media. ISBN:
9781492041139.
Mueller, A.C. and Guido, S. (2017) Introduction to Machine Learning with Python. United Kingdom:
O’Reilly Media. ISBN: 9781449369415.
Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X. and Gao, J. (2024) 'Large
Language Models: A Survey', arXiv preprint, arXiv:2402.06196. Available at:
https://arxiv.org/abs/2402.06196 (Accessed: 2 March 2025).
Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B. and Xiong, D. (2023)
'Evaluating Large Language Models: A Comprehensive Survey', arXiv preprint, arXiv:2310.19736.
Available at: https://arxiv.org/abs/2310.19736 (Accessed: 2 March 2025).
Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang,
C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J.Y. and Wen, J.R. (2023) 'A Survey
of Large Language Models', arXiv preprint, arXiv:2303.18223. Available at:
https://arxiv.org/abs/2303.18223 (Accessed: 2 March 2025).
Moradi, M., Yan, K., Colwell, D., Samwald, M. and Asgari, R. (2024) 'Exploring the Landscape of Large
Language Models: Foundations, Techniques, and Challenges', arXiv preprint, arXiv:2404.11973. Available
at: https://arxiv.org/abs/2404.11973 (Accessed: 2 March 2025).
The IT qualification at Richfield College stands as a beacon of academic innovation and professional
readiness. It equips students with the skills and credentials necessary for thriving in the IT industry. By
combining foundational knowledge, practical expertise, and global recognition, the program not only
prepares students for immediate employment but also sets them on a trajectory for long-term career
success.