STUDENT GUIDE

FACULTY OF INFORMATION TECHNOLOGY

MACHINE LEARNING 700

Registered with the Department of Higher Education as a Private Higher Education Institution under the Higher Education
Act, [Link] Certificate No. 2000/HE07/008

FACULTY OF INFORMATION TECHNOLOGY

LEARNER GUIDE

MODULE: MACHINE LEARNING 700

PREPARED ON BEHALF OF

RICHFIELD GRADUATE INSTITUTE OF TECHNOLOGY (PTY) LTD

RICHFIELD GRADUATE INSTITUTE OF TECHNOLOGY (PTY) LTD

Registration Number: 2000/000757/07

All rights reserved; no part of this publication may be reproduced in any form or by any means, including photocopying
machines, without the written permission of the Institution.

Grus, J. (2019). Data Science from Scratch: First Principles with Python (2nd ed.). O'Reilly Media. ISBN: 9781492041139.

Müller, A. C., & Guido, S. (2017). Introduction to Machine Learning with Python. O'Reilly Media. ISBN: 9781449369415.

1 Chapter 1: Introduction to Machine Learning .................................................................................... 6

1.1 Overview of Machine Learning ................................................................................................... 6

1.2 Modeling ..................................................................................................................................... 6

1.3 Types of Machine Learning ......................................................................................................... 7

1.4 Overfitting and Underfitting........................................................................................................ 9

1.5 Correctness ............................................................................................................................... 11

1.6 The Bias-Variance Tradeoff ....................................................................................................... 12

1.7 Feature Engineering .................................................................................................................. 13

1.8 Real-World Applications of Machine Learning ......................................................................... 14

1.9 Introduction to Python ML Libraries ......................................................................................... 17

1.10 Revision Questions .................................................................................................................... 19

2 Chapter Two: Data Preprocessing and Feature Engineering ............................................................ 20

2.1 Understanding Datasets ............................................................................................................ 20

2.2 Data Cleaning ............................................................................................................................ 22

2.3 Feature Scaling and Normalization ........................................................................................... 26

2.4 Principal Component Analysis (PCA) ......................................................................................... 27

2.5 Applications of PCA ................................................................................................................... 33

2.6 Summary of Key Concepts ........................................................................................................ 43

2.7 Review Questions ...................................................................................................................... 43

3 Chapter 3: Supervised Learning – Regression Algorithms ................................................................ 45

3.1 What is Supervised Learning? ................................................................................................... 45

3.2 Regression ................................................................................................................................. 46

3.3 Linear Regression ...................................................................................................................... 46

3.4 Polynomial Regression .............................................................................................................. 49

3.5 Decision Tree Regression .......................................................................................................... 51

3.6 Hands-on Exercise: House Price Prediction .............................................................................. 60

4 Chapter Four: Supervised Learning: Classification Algorithms ......................................... 62

4.1 What is Classification? .............................................................................................................. 62

4.2 Logistic Regression .................................................................................................................... 63

4.3 k-Nearest Neighbors (k-NN) ...................................................................................................... 65

4.4 Support Vector Machines (SVM) ............................................................................................... 71

5 Chapter 5: Unsupervised Learning – Clustering Algorithms ............................................................. 82

5.1 Introduction .............................................................................................................................. 82

5.2 Clustering .................................................................................................................................. 82

5.3 k-Means Clustering ................................................................................................................... 83

5.4 Hierarchical Clustering .............................................................................................................. 85

5.5 Density-Based Clustering (DBSCAN) ......................................................................................... 87

5.6 Clustering Evaluation Metrics ................................................................................................... 88

6 Chapter Six: Deep Learning ............................................................................................................. 166

6.1 Neural Networks ..................................................................................................................... 166

6.2 Perceptrons ............................................................................................................................. 167

6.3 Feed-Forward Neural Networks .............................................................................................. 169

6.4 Backpropagation ..................................................................................................................... 173

6.5 Tensors .................................................................................................................................... 176

6.6 Neural Networks as a Sequence of Layers .............................................................................. 179

6.7 Loss and Optimization ............................................................................................................. 181

6.8 Activation Functions................................................................................................................ 186

6.9 Softmaxes and Cross-Entropy ................................................................................................. 187

6.10 Dropout ................................................................................................................................... 189

6.11 Review Questions .................................................................................................................... 197

7 Chapter Seven: Large Language Models ......................................................................................... 198

7.1 Introduction to Large Language Models ................................................................................. 198

7.2 The Transformer Architecture ................................................................................................. 199

7.3 Training Large Language Models ............................................................................................. 201

7.4 Implementing LLMs................................................................................................................. 201

7.5 Advanced Techniques in LLMs ................................................................................................ 203

7.6 Evaluating LLM Performance .................................................................................................. 203

7.7 Ethical Considerations and Bias in LLMs ................................................................................. 205

7.8 Exercises .................................................................................................................................. 205

7.9 Summary ................................................................................................................................. 206

The Information Technology (IT) qualification at Richfield College is a dynamic and future-focused
program designed to equip students with advanced technical, analytical, and problem-solving skills. At
the core of the qualification is a commitment to academic excellence, industry alignment, and innovation,
fostering graduates who are proficient in addressing modern technological challenges. This qualification
strategically integrates theoretical knowledge with practical applications, preparing students for various
roles in the IT sector. The IT program is structured to address the growing complexity of the evolving
technological landscape.

The Higher Certificate in Information Technology (HCIT) program is a foundational stepping-stone for
students who wish to pursue further studies or enter the workforce. Graduates of this program are well-
prepared to articulate to the Diploma in IT (DIT) or the Bachelor of Science in IT (BSc IT) qualifications,
providing a seamless transition for those seeking to deepen their knowledge and skills in specialized IT
areas. Additionally, the program equips students with the essential competencies for entry-level IT roles
such as IT Support Technicians, Junior Web/System Developers, IT Administrators, etc.

The Diploma in Information Technology (DIT) is a comprehensive and practical program designed to build
a strong foundation in IT principles while equipping students with the hands-on skills required to meet
industry demands. Focused on both theoretical knowledge and applied learning, this qualification
prepares students for intermediate-level roles in IT and serves as a stepping-stone for further academic
progression or specialization. Graduates of this program are well-prepared to articulate to the Bachelor
of Science in IT (BSc IT) qualification. The curriculum covers programming, networking, database
management, system analysis etc., ensuring graduates possess the competencies to solve real-world IT
challenges effectively.

The Bachelor of Science in IT (BSc IT) program is structured to address the growing complexity of the
evolving technological landscape. Through carefully curated modules, students gain a deep
understanding of software development, database management, cloud computing, cybersecurity, IT
management, artificial intelligence, machine learning, networking etc. Graduates of this program are
well-prepared to articulate to the Bachelor of Science Honours in IT qualification. The curriculum is
designed to bridge the gap between academic learning and real-world applications, thus fostering innovation and an entrepreneurial mindset. Students are encouraged to participate in research and

The focus on emerging technologies within the IT qualification highlights the institution's commitment to academic excellence and industry relevance. By integrating advanced concepts such as Artificial Intelligence, Machine Learning, Data Science, Cloud Computing, Big Data, IoT, etc., the program equips students with the skill-set needed to navigate and innovate in a rapidly evolving technological landscape. The program's alignment with industry courses from globally recognized tech giants such as Oracle, AWS, IBM, etc. ensures that graduates possess the credentials to validate their expertise in emerging and disruptive technologies.

Machine Learning 700 provides an in-depth exploration of key machine learning concepts, techniques,
and applications. The module covers supervised, unsupervised, and reinforcement learning, with hands-
on experience using Python libraries such as NumPy, Pandas, and Scikit-Learn. Students learn data
preprocessing, feature engineering, and core machine learning algorithms, including regression and
classification techniques. The course also delves into model evaluation, optimization, and
hyperparameter tuning. Additionally, it introduces deep learning concepts, covering neural networks and
frameworks like TensorFlow and PyTorch, along with large language models (LLMs) and their applications
in natural language processing. By the end of the module, students will be equipped to build, assess, and
optimize advanced AI-driven models.

1 Chapter 1: Introduction to Machine Learning

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

• Understand the concept of machine learning (ML)


• Learn different types of ML: Supervised, Unsupervised, Reinforcement Learning
• Explore real-world applications of ML
• Get hands-on experience with Python ML libraries

1.1 Overview of Machine Learning

Machine learning (ML) is a subfield of artificial intelligence (AI) that focuses on developing models
that learn from data to make predictions without explicit programming. It is widely applied in
industries such as healthcare, finance, marketing, and autonomous systems. Machine learning
models are trained to recognize patterns and make predictions by minimizing error and optimizing
performance. While ML is often considered the heart of data science, it is just one component of a
broader workflow that includes data collection, cleaning, and transformation.

A model in machine learning is a mathematical or probabilistic representation of relationships between variables. For instance, a business model can predict revenue based on inputs such as the number of users and advertising revenue per user.

1.2 Modeling

Before delving into machine learning algorithms, it is crucial to understand the concept of modeling.
A model is a mathematical framework that defines a relationship between different variables. For
example, in finance, a model may predict future revenue based on advertising spend, while in
medicine, a model may estimate a patient's risk of developing a disease. Machine learning models are different from these traditional models because they learn from data rather than being explicitly
programmed. These models can improve their predictions as more data becomes available.

Examples of Models in Different Contexts

• Business Models: A formula that predicts profit based on revenue and expenses.

• Recipe Models: A structured relationship between the number of servings and ingredient
quantities.

• Poker Models: Probabilistic estimations of a player's chances of winning based on revealed cards.

1.3 Types of Machine Learning

There exists a myriad of ML algorithms, and the choice of algorithm depends on the type of output required from the system. ML algorithms are commonly stratified into three categories: supervised, unsupervised, and reinforcement learning.

1.3.1 Supervised Learning

In supervised learning, the model is trained using labeled data, meaning that each input is associated with an output. The goal is to learn a mapping function from input to output. Supervised learning works with a set of labelled examples (input-output pairs) called the training dataset. Every observation in the training dataset must have an input object and an output object. The supervised ML algorithm uses this training set to predict or classify new observations during the testing phase; observations that were not included in the training dataset are treated as unseen instances.

Figure 1.2: Supervised learning

Examples include:

✓ Classification: Predicting whether an email is spam or not.

✓ Regression: Predicting house prices based on location and size.

Classification can be either binary (distinguishing between two classes) or multiclass (distinguishing
between more than two classes). For instance, binary classification involves tasks like identifying spam versus non-spam emails, where the question is "Is this email spam?" The iris classification is a
multiclass problem, as is predicting a website’s language based on its text.

Regression tasks, on the other hand, involve predicting a continuous value, such as a real number.
An example of regression is predicting a person’s annual income based on factors like education, age,
and location. The predicted value is a continuous number that can vary within a range. Another
example is forecasting the yield of a corn farm based on previous yields and weather conditions. A
simple way to distinguish between classification and regression is to consider whether the possible
outcomes are continuous. In regression, the output is continuous (e.g., annual income), while in
classification, the outputs are discrete (e.g., identifying the language of a website).

1.3.2 Unsupervised Learning

In unsupervised learning, the model identifies patterns in data without predefined labels. The
opposite of supervised learning is unsupervised learning. With unsupervised learning, the ML
algorithm detects hidden patterns in a given dataset. With unsupervised learning there is no wrong
or right answer; it's just a case of running the ML algorithm and seeing what patterns and outcomes
occur. Unsupervised learning can be thought of as a learner without a teacher because there is no
need for the training dataset. The unsupervised learning algorithm will stratify features based on
similarity and dissimilarity of hidden patterns (see Figure 1.3).

It is commonly used for:

✓ Clustering: Grouping customers based on purchasing behavior.

✓ Anomaly Detection: Identifying fraudulent transactions.

1.3.3 Reinforcement Learning

Reinforcement learning (RL) involves training agents to make decisions by interacting with an
environment and receiving feedback in the form of rewards or penalties. Examples include:

• Game Playing: AI models that play games like chess or Go.

• Robotics: Self-learning robots that adapt to new environments.

1.4 Overfitting and Underfitting

A major challenge in machine learning is ensuring that a model generalizes well to unseen data. Two
common issues are:

✓ Overfitting: The model learns noise instead of the underlying pattern, performing well on training
data but poorly on new data.

✓ Underfitting: The model is too simple to capture patterns in the data, leading to poor
performance even on the training set.

Example: Overfitting vs. Underfitting

Consider a dataset of student exam scores based on study hours. A model that is too simple (e.g., a
straight line) underfits the data, while a highly complex polynomial model that fits every data point
overfits.

Fig 1.1 Overfitting vs. Underfitting

Fig 1.1 visually illustrates the concepts of underfitting and overfitting, two common pitfalls in machine
learning model training. The code generates sample data that follows a sine wave pattern with some
added random noise, mimicking a realistic dataset. It then attempts to fit two different types of
models to this data. First, it fits a simple linear regression model, which assumes a straight-line
relationship.

This model demonstrates underfitting, as the straight line is too simple to capture the true, non-
linear pattern of the data. In the resulting graph, this is represented by a green line that poorly
follows the blue data points. The model's simplicity prevents it from learning the fundamental
relationships within the data. Next, the code fits a highly complex polynomial regression model with
a degree of 10. This model attempts to fit a curve that passes through, or extremely close to, every single data point, thus representing overfitting. In the image, this is represented by the red line,
which is overly complex and traces every minor fluctuation of the training data. While it fits the
training data almost perfectly, it does so by learning the noise within the dataset, making it unlikely
to predict new, unseen data points accurately.

Finally, the figure suggests that if an appropriate degree had been chosen, i.e. a model simpler than a degree-10 polynomial but more flexible than a straight line, the model would have generalized better and fit the data more faithfully. The blue dots in the image represent the data that the models are fit on. The three concepts here are underfitting, when the model is too simple; overfitting, when it is too complex; and generalization, when it is neither. These examples show the critical importance of finding the right level of model complexity to achieve a balance that enables generalization to new data. This visualization highlights how a model should strike a balance between complexity and generalization.
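The figure described above can be reproduced with code along the following lines. This is only a sketch; the exact data, random seed, and plotting details used for Fig 1.1 are not shown in the guide:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Sample data: a sine wave with added random noise
np.random.seed(0)
X = np.sort(np.random.rand(30, 1) * 6, axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.2, 30)

# Underfitting: a simple straight-line model
linear = LinearRegression().fit(X, y)

# Overfitting: a degree-10 polynomial model
poly10 = make_pipeline(PolynomialFeatures(degree=10), LinearRegression()).fit(X, y)

# Plot the data and both fitted models
X_plot = np.linspace(0, 6, 200).reshape(-1, 1)
plt.scatter(X.ravel(), y, color="blue", label="Data")
plt.plot(X_plot.ravel(), linear.predict(X_plot), color="green", label="Linear (underfits)")
plt.plot(X_plot.ravel(), poly10.predict(X_plot), color="red", label="Degree 10 (overfits)")
plt.legend()
plt.show()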

1.5 Correctness

Evaluating the correctness of a machine learning model requires more than simply measuring its
accuracy. Accuracy, defined as the proportion of correct predictions over the total number of
predictions, is often an inadequate measure, especially when dealing with imbalanced datasets. For
example, a test that predicts leukemia based on a child's name might achieve over 98% accuracy,
but such a model would be meaningless because it does not use relevant medical features.

To better assess correctness, machine learning models are evaluated using a confusion matrix, which
categorizes predictions into four groups:

✓ True Positive (TP): The model correctly predicts a positive case (e.g., detecting spam
correctly).

✓ False Positive (FP): The model incorrectly predicts a positive case when it is actually negative
(Type I error).

✓ False Negative (FN): The model incorrectly predicts a negative case when it is actually positive
(Type II error).

✓ True Negative (TN): The model correctly predicts a negative case.


A model's performance is measured using multiple metrics:

✓ Precision: The proportion of correctly predicted positive cases among all predicted positive
cases.

✓ Recall: The proportion of actual positive cases that were correctly predicted.

✓ F1 Score: The harmonic mean of precision and recall, balancing both measures.

The tradeoff between precision and recall is crucial when designing machine learning models. For
example, in medical diagnosis, a model with high recall (capturing most positive cases) may be
preferred to minimize false negatives, whereas in spam detection, a model with high precision
(minimizing false positives) might be more important.
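As an illustration (not part of the guide's examples), the sketch below computes these quantities with scikit-learn on a small set of made-up labels:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall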

1.6 The Bias-Variance Tradeoff

The bias-variance tradeoff is a fundamental concept that helps explain the balance between model
complexity and generalization. It describes two types of errors that a machine learning model can
make.

Bias

The error introduced when a model is too simplistic to capture the underlying structure of the data.
High-bias models (e.g., linear regression with very few features) tend to underfit the data, leading to
systematically incorrect predictions.

Variance

The error due to the model being too sensitive to small fluctuations in the training data. High-
variance models (e.g., deep neural networks with too many parameters) tend to overfit, capturing
noise rather than meaningful patterns.

Impact of Bias and Variance on Model Performance

If a model has high bias, it performs poorly even on the training data, indicating underfitting. One
way to address this is by adding more features or using a more complex model. Conversely, if a model
has high variance, it performs well on training data but poorly on unseen data, indicating overfitting.

To reduce variance, we can:

• Simplify the model (e.g., reduce polynomial degree).

• Increase the amount of training data.

• Use regularization techniques such as L1/L2 penalties to constrain the model complexity.

A well-balanced machine learning model must optimize both bias and variance to achieve low total error. This
balance is crucial when designing models for real-world applications, where both prediction accuracy
and generalization are important.
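As a rough illustration of the variance-reduction techniques listed above (this sketch is not from the guide; the synthetic data and alpha value are arbitrary), an L2-regularized (ridge) polynomial can be compared with an unregularized one using cross-validation:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic data: a noisy sine wave
np.random.seed(1)
X = np.random.rand(40, 1) * 5
y = np.sin(X).ravel() + np.random.normal(0, 0.3, 40)

# High-variance model: degree-10 polynomial with no regularization
plain = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())

# Same polynomial with an L2 penalty constraining the coefficients
ridge = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0))

print("Unregularized CV score:", cross_val_score(plain, X, y, cv=5).mean())
print("Ridge (L2) CV score:   ", cross_val_score(ridge, X, y, cv=5).mean())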

1.7 Feature Engineering


Feature engineering is a fundamental step in the machine learning process, as raw data is often not
in a suitable format for direct analysis.

Feature extraction

Feature extraction involves transforming raw data into numerical representations that machine
learning models can process effectively. This transformation is particularly important when working
with unstructured data such as text or images. For instance, in text data, words and phrases can be
converted into numerical vectors using techniques such as word embeddings, which capture
semantic relationships between words. Similarly, in image data, meaningful features such as pixel
intensities, edges, or texture patterns can be extracted to facilitate classification tasks. Feature
extraction ensures that raw data is transformed into a structured format that enhances the learning
process of machine learning algorithms.

Feature selection

Feature selection focuses on identifying the most relevant features from a given dataset to improve
model performance and generalizability. The presence of too many irrelevant or redundant features
can lead to overfitting, where the model learns noise rather than meaningful patterns. To mitigate
this issue, various feature selection techniques are employed. Filter methods rely on statistical tests,
such as correlation coefficients or mutual information, to assess the relevance of each feature
independently of the learning algorithm. In contrast, wrapper methods evaluate subsets of features based on their impact on model performance, often using iterative techniques such as recursive
feature elimination or forward selection. By selecting only the most informative features, feature
selection enhances model efficiency, reduces computational complexity, and improves
interpretability.

These two processes are essential components of the machine learning pipeline, as they enable
models to learn effectively from data while avoiding unnecessary complexity.
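The sketch below illustrates one filter method and one wrapper method using scikit-learn; the dataset and the choice of keeping 10 features are arbitrary examples rather than recommendations:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features with the highest mutual information
X_filtered = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# Wrapper method: recursive feature elimination driven by a logistic regression model
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapped = rfe.fit_transform(X, y)

print(X.shape, X_filtered.shape, X_wrapped.shape)  # (569, 30) reduced to (569, 10)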

1.8 Real-World Applications of Machine Learning


Machine learning (ML) has become a transformative technology across various industries, enabling
automation, improving decision-making, and enhancing efficiency. From healthcare to finance,
transportation, and entertainment, ML algorithms analyse vast amounts of data to uncover patterns,
make predictions, and drive innovation. Below are some of the most impactful real-world applications
of machine learning.

Healthcare and Medical Diagnosis

Machine learning has significantly improved healthcare by enabling early disease detection,
personalized treatment plans, and predictive analytics. ML models analyze medical images, such as
X-rays and MRIs, to detect diseases like cancer with high accuracy. In genomics, ML is used for drug
discovery and precision medicine, tailoring treatments to individual patients based on genetic
profiles. Additionally, wearable health devices use ML to monitor vital signs and detect abnormalities
in real-time, assisting in preventive care.

Finance and Banking

The financial sector leverages machine learning for fraud detection, algorithmic trading, credit
scoring, and customer service automation. ML models identify fraudulent transactions by detecting
anomalies in user behavior. In investment banking, ML-driven algorithms analyze market trends and
execute high-frequency trades, optimizing returns. Credit scoring models assess borrowers' risk levels
by analyzing historical data and predicting default probabilities. Additionally, chatbots and virtual
assistants powered by ML enhance customer interactions in banking services.

E-Commerce and Retail

Machine learning enhances customer experiences in e-commerce through personalized recommendations, dynamic pricing, and demand forecasting. Recommendation engines, used by
platforms like Amazon and Netflix, analyze user preferences and browsing history to suggest relevant
products or content. ML algorithms also optimize inventory management by predicting product
demand based on seasonal trends, customer behavior, and external factors. Moreover, AI-driven
chatbots provide real-time assistance, improving customer service efficiency.

Transportation and Autonomous Vehicles

The transportation industry benefits from ML in traffic management, predictive maintenance, and
autonomous vehicles. Ride-hailing services like Uber use ML to optimize route selection, reduce wait
times, and set dynamic fares. Autonomous vehicles, such as those developed by Tesla and Waymo,
rely on deep learning models to interpret sensor data, recognize obstacles, and make driving
decisions in real-time. ML also aids airlines in flight scheduling, fuel optimization, and predictive
aircraft maintenance.

Manufacturing and Industry 4.0

Machine learning plays a crucial role in predictive maintenance, quality control, and supply chain
optimization in manufacturing. ML models analyze sensor data to detect early signs of equipment
failure, preventing costly downtime. In quality control, computer vision-based ML systems inspect
products for defects with higher accuracy than human inspectors. Additionally, ML optimizes supply
chain logistics by predicting demand fluctuations and improving inventory management.

Cybersecurity and Threat Detection

In cybersecurity, ML enhances threat detection, malware analysis, and fraud prevention. ML algorithms analyze network traffic patterns to identify potential cyber threats, such as phishing
attacks or unauthorized access. Intrusion detection systems use anomaly detection techniques to flag
unusual activities that may indicate security breaches. Additionally, ML models help identify spam
emails and malicious software by recognizing patterns in data.

Natural Language Processing (NLP) and Chatbots

Natural Language Processing, a subset of ML, powers applications such as chatbots, virtual assistants,
and sentiment analysis. Voice assistants like Siri, Alexa, and Google Assistant leverage ML to
understand and process spoken commands. Businesses use AI-powered chatbots to handle customer
queries, automate responses, and enhance user engagement. Sentiment analysis tools analyze social
media posts and customer reviews to gauge public opinion and brand perception.

Education and Personalized Learning

In the education sector, ML enables personalized learning experiences, automated grading, and
student performance prediction. Adaptive learning platforms tailor educational content based on
students’ learning styles and progress. ML models analyze student engagement and performance
data to identify areas requiring improvement. Additionally, plagiarism detection tools leverage ML to
identify academic misconduct in research papers and assignments.

Agriculture and Precision Farming

Machine learning revolutionizes agriculture by optimizing crop yields, detecting plant diseases, and
automating irrigation systems. ML-powered drones and satellite imagery help farmers monitor soil
health, assess crop conditions, and detect pest infestations. Smart irrigation systems use ML to
analyze weather data and soil moisture levels, ensuring efficient water usage. Additionally, predictive
analytics helps farmers determine the best planting and harvesting times.

Entertainment and Content Creation

The entertainment industry benefits from ML in content recommendation, video editing, and music
generation. Streaming platforms like Netflix and Spotify use ML algorithms to suggest personalized
content based on user preferences. AI-powered tools assist in video editing by automating scene
transitions, color correction, and special effects. Additionally, generative models, such as OpenAI’s
GPT and DeepMind’s WaveNet, create human-like text and music compositions.

Machine learning continues to revolutionize industries by enabling automation, enhancing decision-
making, and improving efficiency. As advancements in ML and artificial intelligence (AI) continue,
their applications will further expand, driving innovation across diverse fields. Whether in healthcare,
finance, cybersecurity, or entertainment, ML remains a powerful tool shaping the future of
technology and society.

1.9 Introduction to Python ML Libraries

NumPy

NumPy is one of the fundamental packages for scientific computing in Python. It contains
functionality for multidimensional arrays, high-level mathematical functions such as linear algebra
operations and the Fourier transform, and pseudorandom number generators.

Pandas

Pandas introduces additional data structures for managing datasets in Python. Its primary data structure is the DataFrame, which is conceptually similar to a table in a relational database or a spreadsheet, but with far more functionality and better performance. If you need to clean, slice, group, and manipulate datasets in Python, pandas is an indispensable tool.

Scikit-learn

Scikit-learn is arguably the most popular Python library for machine learning. It includes implementations of the models covered in this guide and many others. In real-world applications, you would rarely build a decision tree or an optimization algorithm from scratch; instead, you would rely on scikit-learn to handle the heavy lifting. Its documentation is filled with examples showcasing its capabilities and providing a deeper understanding of machine learning.

Familiarize yourself with essential Python libraries for ML.

The goal of this exercise is to familiarize yourself with fundamental Python libraries used in machine learning, such as NumPy, Pandas, Scikit-learn, and Matplotlib. First, install these libraries using pip install numpy pandas scikit-learn matplotlib. Next, load and explore a sample dataset using Pandas by reading the famous Iris dataset from a URL and displaying basic statistical summaries with describe(). Then, train a simple machine learning model using Scikit-learn's Logistic Regression.
The dataset is split into training and testing sets with train_test_split(), and the model is trained using
the fit() function. Finally, the accuracy of the model is evaluated on the test set using the score()
method. This exercise provides a hands-on introduction to essential steps in machine learning: data
loading, exploration, model training, and evaluation.

Step 1: Install Required Libraries

pip install numpy pandas scikit-learn matplotlib

Step 2: Load and Explore Data

import pandas as pd

# Load sample dataset (the Iris data; this URL assumes seaborn's public copy of the dataset)
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

# Display basic statistics
print(df.describe())

Step 3: Train a Simple Model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Prepare data
X = df.drop(columns=['species'])
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model (max_iter increased to ensure convergence)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate accuracy
accuracy = model.score(X_test, y_test)

print(f"Model Accuracy: {accuracy:.2f}")

1.10 Revision Questions

1. What is machine learning, and how does it differ from traditional programming?

2. Explain the differences between supervised, unsupervised, and reinforcement learning. Provide
examples of each.

3. What is overfitting, and how can it be avoided?

4. Define the bias-variance tradeoff and explain why it is important in machine learning.

5. What are feature extraction and feature selection? Why are they important?

2 Chapter Two: Data Preprocessing and Feature Engineering

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

• Understand the differences between structured and unstructured data.


• Learn essential data cleaning techniques, including handling missing
values, outliers, and duplicates.
• Explore feature scaling and normalization and understand why they are
important.
• Understand the concepts of feature selection and dimensionality
reduction.
• Implement data preprocessing techniques using Pandas and Scikit-Learn.

2.1 Understanding Datasets

Before applying machine learning (ML) algorithms, it is crucial to understand the nature of the data
being used. The success of an ML model depends on the quality, structure, and type of data available.
Datasets can generally be classified into two broad categories: structured and unstructured data.
Handling these different types of data requires specific techniques and tools to ensure accurate
analysis and meaningful predictions.

Structured Data

Structured data refers to information that is highly organized and stored in a well-defined format,
such as tables in relational databases, spreadsheets, or structured text files (CSV, JSON, XML). In
structured datasets, each row represents an observation (data instance), and each column
corresponds to a specific attribute (feature). The structured nature of this data makes it easy to
process using database management systems (DBMS), SQL queries, and data manipulation tools such
as Pandas in Python. Fig 2.1 below is an example of a structured database.

Fig 2.1 Example of a structured dataset used for customer purchase analysis.

Unstructured Data

Unstructured data refers to information that does not follow a predefined format or schema. It
includes a variety of data types such as text, images, audio, and video, which cannot be easily stored
in tabular form. Handling unstructured data requires advanced machine learning techniques such as
Natural Language Processing (NLP) for textual data and Convolutional Neural Networks (CNNs) for
image analysis. Fig 2.2 below shows the different data types (unstructured data).

Fig 2.2 Representation of unstructured data

Understanding the structure of data is essential before applying machine learning models. Structured
data is well-organized and easily processed using traditional algorithms, while unstructured data
requires advanced techniques such as deep learning and NLP for meaningful analysis. By selecting
the appropriate preprocessing methods, machine learning practitioners can extract valuable insights
from both structured and unstructured datasets, leading to more accurate and efficient models.

2.2 Data Cleaning

In real-world applications, datasets are rarely perfect; they often contain missing values,
inconsistencies, errors, or duplicate records. Data cleaning is a critical preprocessing step in machine
learning that ensures the dataset is accurate, reliable, and suitable for model training. Poor-quality
data can lead to inaccurate predictions and unreliable models, making effective data cleaning
essential for improving model performance and generalizability.

Data cleaning involves several key tasks, including handling missing values, identifying and removing
outliers, and eliminating duplicate records. These steps help maintain data integrity and ensure that
machine learning algorithms can extract meaningful patterns from the dataset.

2.2.1 Handling Missing Values

Missing values occur when certain observations lack data for specific attributes. These gaps in the
dataset can arise due to human error, sensor failures, incomplete data collection, or data corruption.
If not addressed properly, missing values can introduce bias, reduce model accuracy, and lead to
incorrect conclusions.

There are three types of missing data in datasets:

1. Missing Completely at Random (MCAR): The missing values occur independently of any
factor, making them unpredictable.

2. Missing at Random (MAR): The missing values depend on observed data but not on the
missing data itself.

3. Missing Not at Random (MNAR): The missing values are dependent on the missing data itself,
making imputation more complex.

Strategies for Handling Missing Values

Handling missing values is an essential step in data preprocessing, as unaddressed missing data can
lead to misleading results and poor model performance. The choice of method depends on the extent
of missing data, the nature of the dataset, and the machine learning task. While simple approaches
like mean imputation work well for numerical data, more sophisticated techniques, such as
predictive imputation, are necessary when dealing with complex missing data patterns.

Effective data cleaning ensures that machine learning models can learn from high-quality, well-
structured data, leading to more reliable and accurate predictions.

1. Removing Missing Values

If only a small percentage of data points contain missing values, removing these rows can
prevent contamination of the dataset. This approach is suitable when missing values occur
randomly and their removal does not introduce bias. However, removing data can lead to
loss of valuable information if the dataset is already small. The code below demonstrates
how to remove the rows with missing values.

df_cleaned = df.dropna()  # Removes rows with missing values

2. Filling (Imputing) Missing Values


Instead of discarding data, missing values can be filled using statistical methods. Imputation
prevents data loss while ensuring consistency across the dataset.

✓ Mean imputation (suitable for numerical data that follows a normal distribution).

✓ Median imputation (best for skewed data).

✓ Mode imputation (used for categorical data).

The code below demonstrates how to fill in missing data using mean and mode.

# Fill missing values with the mean

df['Age'].fillna(df['Age'].mean(), inplace=True)

# Fill categorical missing values with mode

df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)

Example: Filling Missing Values Using Mean full code

import pandas as pd

from sklearn.impute import SimpleImputer

# Sample data

data = {'Age': [25, 30, None, 45, 50], 'Salary': [50000, None, 60000, 80000, 90000]}

df = pd.DataFrame(data)

# Impute missing values with mean

imputer = SimpleImputer(strategy="mean")

df.iloc[:, :] = imputer.fit_transform(df)

print(df)

3. Predicting Missing Values Using Machine Learning

In cases where missing values are extensive, machine learning models can predict them using
existing features. k-Nearest Neighbors (k-NN) or regression models can estimate missing
values by analyzing patterns in the available data. The code below uses k-NN Imputer to fill
in the missing values.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)

df_filled = imputer.fit_transform(df)

2.2.2 Handling Outliers

Outliers are data points that are significantly different from the rest of the dataset. They can occur
due to measurement errors or genuine extreme values. Outlier detection is crucial for improving
model accuracy and robustness. Common techniques for handling outliers:

✓ Remove outliers: If an outlier is a result of a data entry error, it should be removed.

✓ Transform data: Use logarithmic transformations or robust scaling.

✓ Cap extreme values: Replace extreme values with a predefined threshold (e.g., 5th and 95th
percentiles).

Example: Removing Outliers Using the Percentile Method

import numpy as np

# Generate sample data with an outlier

data = np.array([10, 12, 13, 15, 1000])  # 1000 is an outlier

# Remove outlier using percentile method

filtered_data = data[data < np.percentile(data, 95)]

print(filtered_data)
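The capping strategy mentioned above can be sketched in the same way, reusing the data array from the previous example:

# Cap extreme values at the 5th and 95th percentiles (winsorization)
lower, upper = np.percentile(data, [5, 95])
capped_data = np.clip(data, lower, upper)
print(capped_data)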

2.2.3 Removing Duplicates

Duplicate records can cause bias in the dataset, leading to misleading model predictions. Removing
duplicates ensures that each observation is unique. Cleaning redundant data prevents models from
being skewed by repeated observations. The drop_duplicates() method removes duplicate rows
from the dataset in-place, modifying the original DataFrame.

df.drop_duplicates(inplace=True)

2.3 Feature Scaling and Normalization

Feature scaling ensures that all variables contribute equally to model training. If one feature has
values in the range of thousands and another in decimal points, models like Support Vector Machines
(SVM) and k-Nearest Neighbors (k-NN) may become biased. Feature scaling improves the
performance of distance-based algorithms.

Min-Max Scaling and Standardization

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample dataset

data = [[100, 200], [300, 400], [500, 600]]

scaler = MinMaxScaler()

scaled_data = scaler.fit_transform(data)

print("Min-Max Scaled Data:\n", scaled_data)

standardizer = StandardScaler()

standardized_data = standardizer.fit_transform(data)

print("Standardized Data:\n", standardized_data)

2.4 Principal Component Analysis (PCA)

2.4.1 Dimensionality Reduction

In machine learning and data analysis, dimensionality reduction is a crucial preprocessing step used to
reduce the number of input features in a dataset while retaining as much relevant information as
possible. High-dimensional data can pose several challenges, collectively referred to as the curse of
dimensionality. These challenges include:

✓ Increased computational cost – As the number of features grows, so does the complexity of the
model, requiring more processing power and memory.
✓ Overfitting risk – A model with too many features may learn noise instead of meaningful patterns,
leading to poor generalization on new data.
✓ Reduced interpretability – When a dataset has a large number of features, understanding the
relationships between variables becomes increasingly difficult.

Dimensionality reduction techniques address these problems by simplifying the dataset while preserving
its core structure and information. Among these techniques, Principal Component Analysis (PCA) is one
of the most widely used methods.

2.4.2 What is Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is a statistical technique that transforms a high-dimensional dataset
into a lower-dimensional space while maintaining as much variance (information) as possible. Rather
than simply removing features, PCA creates new features that are linear combinations of the original
ones. These new features, called principal components, capture the directions of maximum variance in
the dataset. The primary objectives of PCA are:

1. Identifying the directions of maximum variance in the dataset.


2. Creating new axes called principal components, which are linear combinations of the original
features.
3. Selecting the top components that capture most of the variance while discarding less significant
ones.

In the example below, Principal Component Analysis (PCA) is applied to a dataset containing two highly
correlated features to demonstrate how PCA identifies the direction of maximum variance and
transforms the dataset into a new coordinate system. The dataset consists of Feature X, which is
randomly generated, and Feature Y, which is a linear function of Feature X with added noise, creating a
strong correlation between the two features. The first scatter plot (blue) represents the original dataset,
where the elongated cluster of points suggests redundancy in the data, as both features contain nearly
identical information. To extract the most meaningful insights, the dataset is standardized to ensure all
features contribute equally, followed by the application of PCA, which finds the principal components—
new axes that maximize variance. The second scatter plot (red) illustrates the transformed dataset, where
Principal Component 1 (PC1) captures the majority of the variance, while Principal Component 2 (PC2)
contributes only minimally (0.59%). The explained variance ratio confirms that PC1 captures 99.4% of the
variance, making PC2 largely insignificant. This transformation enables dimensionality reduction while
preserving essential patterns in the data, leading to more efficient machine learning models with reduced
computational complexity and improved interpretability.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Generate a synthetic dataset with two highly correlated features
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # Generate random values for X
Y = X * 2 + np.random.normal(0, 1, (100, 1))  # Y is a linear function of X with noise

# Create a DataFrame
df = pd.DataFrame(np.hstack((X, Y)), columns=["Feature X", "Feature Y"])

# Standardizing the data (important for PCA)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply PCA
pca = PCA(n_components=2)  # Keep both components for visualization
df_pca = pca.fit_transform(df_scaled)

# Scatter plot of original features
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(df["Feature X"], df["Feature Y"], color="blue", alpha=0.5)
plt.title("Before PCA: Original Features")
plt.xlabel("Feature X")
plt.ylabel("Feature Y")
plt.grid(True)

# Scatter plot along principal components
plt.subplot(1, 2, 2)
plt.scatter(df_pca[:, 0], df_pca[:, 1], color="red", alpha=0.5)
plt.title("After PCA: Transformed Features")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)

plt.show()

# Display explained variance ratio
explained_variance = pca.explained_variance_ratio_
print(explained_variance)

2.4.3 Choosing the Right Number of Principal Components

A key challenge in PCA is determining how many principal components to retain. The explained variance
ratio helps in making this decision by indicating the proportion of total variance captured by each
component.

Explained Variance Ratio

The sum of all eigenvalues represents the total variance in the dataset. Each principal component’s
eigenvalue represents the fraction of total variance it explains. The cumulative explained variance helps
determine how many components are necessary to retain meaningful information.

Selecting the Optimal Number of Components

1. Scree Plot Method

✓ A scree plot displays the explained variance ratio for each principal component.

✓ The "elbow point" in the plot indicates the optimal number of components.

2. Variance Threshold Approach

✓ Retain enough components to explain at least 90-95% of the variance (a code sketch of this approach is given after this list).

✓ Example: If the first three components explain 92% of the variance, the rest can be
discarded.

3. Cross-Validation

✓ Test different numbers of components and evaluate model performance to determine the
best balance between accuracy and complexity.
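A minimal sketch of the variance-threshold approach (using the Breast Cancer dataset purely as an example) could look like this:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with all components so we can inspect the variance each one explains
pca = PCA().fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative >= 0.95) + 1
print("Components needed to retain 95% of the variance:", n_components)

# A scree plot is simply pca.explained_variance_ratio_ plotted against the component index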

Fig 2.4 illustrates the effect of PCA on a synthetic two-dimensional dataset. The first plot (top left) shows
the original data points, colored to distinguish between them. The algorithm proceeds by first finding the
direction of maximum variance, labeled “Component 1.” This is the direction (or vector) in the data that
contains most of the information, or in other words, the direction along which the features are most
correlated with each other. Then, the algorithm finds the direction that contains the most information
while being orthogonal (at a right angle) to the first direction. In two dimensions, there is only one possible orientation that is at a right angle, but in higher-dimensional spaces there would be (infinitely)
many orthogonal directions. Although the two components are drawn as arrows, it doesn’t really matter
where the head and the tail are; we could have drawn the first component from the center up to the
top left instead of down to the bottom right. The directions found using this process are called principal
components, as they are the main directions of variance in the data. In general, there are as many
principal components as original features.

Fig 2.4 The effect of PCA on a synthetic two-dimensional dataset

The second plot (top right) shows the same dataset but with a rotation applied, such that the first
principal component aligns with the x-axis and the second principal component aligns with the y-axis.
Before this rotation, the data had the mean subtracted to center it around zero. After the rotation, the
two axes become uncorrelated, meaning the correlation matrix of the data in this new configuration has
zeros off the diagonal. PCA can be used for dimensionality reduction by retaining only a subset of the
principal components. In this case, we could keep just the first principal component, as shown in the
third panel (bottom left), which reduces the data from two dimensions to one. It’s important to note that
instead of keeping just one of the original features, we’ve identified and preserved the most important
direction (from top left to bottom right in the first panel) — the first principal component. Finally, by
reversing the rotation and adding the mean back to the data, we get the last panel. The points are now
back in the original feature space, but only the information from the first principal component is
preserved. This transformation is often used to remove noise or to visualize the retained information
through the principal components.

2.5 Applications of PCA


Applying PCA to the Cancer Dataset for Visualization

One of the most common applications of PCA is to visualize high-dimensional datasets. As discussed in
Chapter 1, visualizing data with more than two features can be challenging. For the Iris dataset, we could
use a pair plot to show combinations of two features for a partial view of the data. However, for the Breast Cancer dataset, a pair plot becomes unmanageable because it has 30 features, resulting in 435 scatter plots (30 × 29 / 2). Analyzing all of these plots in detail would be impossible.
A simpler visualization method is to create histograms for each feature, comparing the two classes:
benign and malignant cancers.

Before applying PCA, we scale the data using the StandardScaler to ensure each feature has unit
variance. Applying the PCA transformation is as straightforward as any other preprocessing step. We start
by creating a PCA object, fitting it to the data to identify the principal components, and then applying the
rotation and dimensionality reduction by calling the transform method. By default, PCA rotates (and
shifts) the data without reducing its dimensionality. To reduce the dimensionality, we must specify how
many components to retain when initializing the PCA object:
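A minimal sketch of this step on the Breast Cancer dataset (the variable names are illustrative, not prescribed by the guide):

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

# Keep only the first two principal components
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Original shape:", X_scaled.shape)  # (569, 30)
print("Reduced shape: ", X_pca.shape)     # (569, 2)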

Eigenfaces for feature extraction

Another application of PCA that we mentioned earlier is feature extraction. The idea behind feature
extraction is that it is possible to find a representation of your data that is better suited to analysis than the raw representation you were given. A great example of an application where feature extraction is
helpful is with images. Images are made up of pixels, usually stored as red, green, and blue (RGB)
intensities. Objects in images are usually made up of thousands of pixels, and only together are they
meaningful. We will give a very simple application of feature extraction on images using PCA, by working
with face images from the Labeled Faces in the Wild dataset. This dataset contains face images of
celebrities downloaded from the Internet, and it includes faces of politicians, singers, actors, and athletes
from the early 2000s. We use grayscale versions of these images, and scale them down for faster
processing.

There are 3,023 images, each 87×65 pixels large, belonging to 62 different people:

This is where PCA comes in. Computing distances in the original pixel space is quite a bad way to measure similarity between faces. When using a pixel representation to compare two images, we compare the grayscale value of each individual pixel to the value of the pixel in the corresponding position in the other image. This representation is quite different from how humans would interpret the image of a face, and it is hard to capture the facial features using this raw representation. For example, using pixel distances means that shifting a face by one pixel to the right corresponds to a drastic change, with a completely different representation. We hope that using distances along principal components can improve our accuracy. Here, we enable the whitening option of PCA, which rescales the principal components to have the same scale. This is the same as using StandardScaler after the transformation. Going back to the synthetic data in Fig 2.4, whitening corresponds to not only rotating the data, but also rescaling it so that the center panel becomes a circle instead of an ellipse.

We fit the PCA object to the training data and extract the first 100 principal components. Then we transform the training and test data:

The new data has 100 features, the first 100 principal components. Now, we can use the new
representation to classify our images using a one-nearest-neighbors classifier:
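
As an illustration, here is a hedged sketch of the whitened PCA transformation followed by one-nearest-neighbor classification; the dataset-loading parameters (min_faces_per_person=20, resize=0.7), the rescaling by 255, and the train/test split are assumptions, so exact accuracies may differ from the figures quoted below:

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the Labeled Faces in the Wild data (grayscale, scaled down)
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
X_people, y_people = people.data / 255.0, people.target

X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)

# Whitened PCA: rotate the data and rescale the components to the same scale
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# One-nearest-neighbor classification in the reduced space
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)
print("Test accuracy: {:.2f}".format(knn.score(X_test_pca, y_test)))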

Our accuracy improved quite significantly, from 26.6% to 35.7%, confirming our intuition that the
principal components might provide a better representation of the data.

For image data, we can also easily visualize the principal components that are found. Remember that
components correspond to directions in the input space. The input space here is 87×65-pixel grayscale images, so directions within this space are also 87×65-pixel grayscale images.

Let’s look at the first couple of principal components:
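
A short sketch of how such a visualization might be produced, assuming the pca object and people dataset from the previous sketch:

import matplotlib.pyplot as plt

# Each principal component is a direction in pixel space, so it can be
# displayed as an image with the same shape as the input faces
fig, axes = plt.subplots(3, 5, figsize=(15, 12),
                         subplot_kw={'xticks': (), 'yticks': ()})
for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(people.images[0].shape), cmap='viridis')
    ax.set_title("Component {}".format(i + 1))
plt.show()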

Although we cannot fully comprehend all the details captured by these components, we can make
educated guesses about what aspects of the face images they represent. For example, the first
component seems to primarily capture the contrast between the face and its background, while the
second component appears to reflect lighting differences between the left and right sides of the face,
and so on. While this representation is somewhat more interpretable than the raw pixel values, it still
differs significantly from how a human would perceive a face. Since PCA is based on pixel values, factors
such as the alignment of facial features (like the eyes, chin, and nose) and the lighting conditions strongly
influence how similar two images are in terms of their pixel representations. However, alignment and
lighting are not typically the primary factors humans use to judge facial similarity. People tend to focus
on attributes like age, gender, facial expression, and hairstyle, which are difficult to derive from pixel
intensity alone. It's important to remember that algorithms often interpret data—especially visual data
like images—in ways that differ from human perception.

Let’s come back to the specific case of PCA, though. We introduced the PCA transformation as rotating the data and then dropping the components with low variance. Another useful interpretation is to try to find some numbers (the new feature values after the PCA rotation) so that we can express a test point as a weighted sum of the principal components:

x_new ≈ x0 * component_0 + x1 * component_1 + ...

In this context, x0, x1, and so on represent the coefficients of the principal components for a given data
point, essentially providing the image's representation in the transformed space. Another way to gain
insight into the workings of a PCA model is by examining the reconstructions of the original data using
only a subset of the components. As demonstrated in figure above, after removing the second
component and reaching the third panel, we reversed the rotation and reintroduced the mean to obtain
new points in the original space with the second component excluded, as shown in the final panel. A
similar transformation can be applied to face images by reducing the data to a smaller number of
principal components, then reversing the transformation back into the original space. This return to the
original feature space can be accomplished using the inverse_transform method. In this case, we visualize
the reconstruction of several faces using 10, 50, 100, 500, or 2,000 components.
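
A possible sketch of this reconstruction, assuming the face data and train/test split from the earlier sketches (the 2,000-component case is omitted here for brevity):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

n_faces = 3  # reconstruct the first few test faces
fig, axes = plt.subplots(n_faces, 4, figsize=(12, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})
for col, n_components in enumerate([10, 50, 100, 500]):
    pca = PCA(n_components=n_components, whiten=True, random_state=0).fit(X_train)
    # Project into the reduced space, then map back to the original pixel space
    reconstructed = pca.inverse_transform(pca.transform(X_test[:n_faces]))
    for row in range(n_faces):
        axes[row, col].imshow(reconstructed[row].reshape(people.images[0].shape))
        axes[row, col].set_title("{} components".format(n_components))
plt.show()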

You can see that when we use only the first 10 principal components, only the essence of the picture,
like the face orientation and lighting, is captured. By using more and more principal components, more
and more details in the image are preserved. This corresponds to extending the sum to include more and
more terms. Using as many components as there are pixels would mean that we would not discard any
information after the rotation, and we would reconstruct the image perfectly. We can also try to use PCA
to visualize all the faces in the dataset in a scatter plot using the first two principal components, with
classes given by who is shown in the image, similarly to what we did for the cancer dataset.

2.6 Summary of Key Concepts
• Structured Data: Organized, tabular data (databases, spreadsheets)

• Unstructured Data: Raw data such as images, text, and audio

• Data Cleaning: Handling missing values, outliers, and duplicates

• Feature Scaling: Ensures features have comparable ranges

• PCA: Reduces dimensionality while preserving variance

2.7 Review Questions


1. What is the difference between structured and unstructured data?
2. Why is handling missing values important in machine learning?
3. What are outliers, and how do they affect machine learning models?
4. Why is feature scaling necessary? Which algorithms require it?
5. Explain the difference between feature selection and dimensionality reduction.

6. How does PCA work, and when should you use it?
7. How do you decide how many principal components to retain?

Practical Questions

1. Load the Iris dataset, preprocess it, and apply PCA to reduce the dimensionality.

2. Apply PCA on the Breast Cancer dataset and check if model accuracy improves after reducing
dimensions.
3. Generate synthetic high-dimensional data using NumPy and visualize it after PCA transformation.

Real-World Scenario-Based Questions

1. A hospital has thousands of patient records with various medical measurements. They want to predict
heart disease risk while reducing redundant features. Would you recommend PCA? Why or why not?

2. An AI engineer is working on an image classification model for high-resolution images. Due to


computational limitations, they want to reduce image features. How can PCA help?

3. A financial analyst is using 100+ stock market indicators to predict prices. Some indicators are
correlated. How might PCA improve the model?

3 Chapter 3: Supervised Learning – Regression Algorithms

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

• Understand supervised learning and the role of regression in predicting continuous values.
• Deeply explore common regression algorithms: Linear Regression, Polynomial Regression, and
Decision Tree Regression.
• Break down key mathematical concepts underlying these models.
• Implement a practical house price prediction model using code, visualizations, and interpret
the results.

3.1 What is Supervised Learning?

Supervised learning is a fundamental branch of machine learning where an algorithm learns from
labeled data. Each data point consists of input features (X) and a corresponding output (Y). The goal
of the algorithm is to learn a mapping function f: X → Y that accurately predicts outputs for unseen
data. The training process involves adjusting model parameters to minimize the error between
predicted and actual values.

Key Characteristics of Supervised Learning:

✓ Labeled Data: Each training example consists of known input-output pairs.

✓ Learning Function: The model approximates the relationship between inputs and outputs
using a mathematical function.

✓ Error Minimization: Training involves optimizing model parameters to minimize the


difference between predictions and actual outcomes.

This framework is the foundation for many real-world applications, such as predicting house prices from features like square footage, location, and age of the property, forecasting stock trends, and projecting sales revenue. Recent studies (Zhang & Lee, 2021) highlight that the performance of supervised learning models depends significantly on the quality of the labeled data and the appropriate selection of the hypothesis function.

3.2 Regression

Regression is a supervised learning technique specifically used for predicting continuous numerical
outcomes. In regression, the model learns a mapping from a set of features to a continuous output.

Real-World Examples:

✓ House Prices: Predicting the selling price of a home based on features such as square footage,
number of bedrooms, and location.

✓ Stock Prices: Forecasting future stock prices using historical price trends and financial indicators.

✓ Sales Revenue: Estimating future sales based on advertising expenditure and market trends.

Common Regression Algorithms include Linear, Polynomial and Decision Tree Regression.

3.3 Linear Regression

Linear Regression assumes a linear relationship between the independent variable X and the dependent variable Y. It is represented by the equation:

Y = mX + b

Where:

• m is the slope, indicating how much Y changes for a unit change in X.

• b is the intercept, representing the predicted value of Y when X = 0.

Linear regression operates under several key assumptions, including linearity, independence,
homoscedasticity (constant variance of errors), and normally distributed errors (Nguyen & Chen,

2019). The model parameters, typically represented as 𝑚 (slope) and b (intercept), are estimated
using the least squares method, which minimizes the sum of the squared differences between actual
and predicted values. One of the primary advantages of linear regression is its interpretability; each
coefficient directly represents the change in the output variable for a one-unit change in the
corresponding input variable, making it a valuable tool for understanding relationships within data.

Example: Predicting House Prices Using Linear Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Training data: Square Footage vs. House Price ($1000s)
X = np.array([1200, 1500, 1800, 2100, 2500]).reshape(-1, 1)
y = np.array([220, 270, 320, 370, 420])

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Predicting the price of a 2000 sqft house
new_house = np.array([[2000]])
predicted_price = model.predict(new_house)
print(f"Predicted price for a 2000 sqft house: ${predicted_price[0]:.2f}K")

# Visualizing the regression line
plt.scatter(X, y, color="blue", label="Data points")
plt.plot(X, model.predict(X), color="red", label="Regression line")
plt.xlabel("Square Footage")
plt.ylabel("Price ($1000s)")
plt.title("Linear Regression: House Price Prediction")
plt.legend()
plt.show()

3.4 Polynomial Regression

When the relationship between X and Y is non-linear, Polynomial Regression can be used to model the data. It fits a polynomial equation of the form:

Y = b0 + b1*X + b2*X^2 + ... + bn*X^n

Polynomial regression offers greater flexibility by capturing complex relationships through the
inclusion of higher-degree terms. However, this flexibility comes with the risk of overfitting, as
higher-degree polynomials may fit the training data exceptionally well but fail to generalize to unseen
data. To achieve an effective model, careful selection of the polynomial degree is essential, balancing
bias and variance to avoid both overfitting and underfitting.

Example: House Price Prediction Using Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Create a polynomial regression model of degree 2
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

# Predict the house price for 2000 sqft using polynomial regression
predicted_price_poly = poly_model.predict(new_house)
print(f"Predicted price (Polynomial Regression): ${predicted_price_poly[0]:.2f}K")

# Visualizing the polynomial regression curve
plt.scatter(X, y, color="blue", label="Data points")
plt.plot(X, poly_model.predict(X), color="green", label="Polynomial curve")
plt.xlabel("Square Footage")
plt.ylabel("Price ($1000s)")
plt.title("Polynomial Regression: House Price Prediction")
plt.legend()
plt.show()

3.5 Decision Tree Regression

Decision Tree Regression is a non-parametric model that splits the data into branches based on feature values. Each internal node applies a decision rule, and each terminal node (leaf) produces a prediction, typically the average of the training outcomes that fall into that node.

Decision trees structure their model by recursively partitioning the feature space, where each node
selects the feature and threshold that best minimizes prediction error. One of their key advantages
is interpretability, as the decision-making process at each split is transparent and can be easily
visualized. However, decision trees are prone to overfitting, especially if they are too deep or not
properly pruned. Recent advancements in machine learning have led to the development of
ensemble methods such as Random Forests and Gradient Boosting, which build upon decision trees
to enhance performance and reduce variance.

Decision tree regressors split data based on decision rules that are easy to follow. The splitting
criterion (often the mean squared error) is computed at each node to decide the best split. While
decision trees are intuitive and interpretable, they are also sensitive to small changes in the data.
Ensemble methods are commonly used to mitigate this instability.

Example: Predicting House Prices Using Decision Trees

from sklearn.tree import DecisionTreeRegressor

# Train a Decision Tree Regressor
tree_model = DecisionTreeRegressor()
tree_model.fit(X, y)

# Predicting using the decision tree model
predicted_price_tree = tree_model.predict(new_house)
print(f"Predicted price (Decision Tree Regression): ${predicted_price_tree[0]:.2f}K")

3.5.1 Decision Trees

Decision trees are popular for both classification and regression tasks. They work by learning a
sequence of if/else questions that guide the decision-making process. These questions are like those
asked in a game of 20 Questions. For example, if you need to distinguish between bears, hawks,
penguins, and dolphins, you might begin by asking if the animal has feathers, which would narrow
down the options to two possibilities. If the answer is "yes," you could further ask if the animal can
fly, helping you differentiate between hawks and penguins. If the answer is "no" (indicating the
animal does not have feathers), you’d still need another question to distinguish between dolphins
and bears, such as whether the animal has fins.

This decision-making process can be visually represented.

In this example, each node in the decision tree represents either a question or a terminal (leaf) node,
which holds the final answer. The edges connect the answers to the next question that needs to be
asked based on the current answer.

In machine learning terms, we have developed a model to classify four different animal types (hawks,
penguins, dolphins, and bears) using three features: "has feathers," "can fly," and "has fins." Rather
than manually creating these models, we can employ supervised learning to learn them directly from
data.

Building Decision Trees

Now, let's go through the process of constructing a decision tree for a 2D classification dataset. The
dataset, named two_moons, consists of two half-moon shapes, with each class containing 75 data
points.

Building a decision tree involves figuring out the sequence of if/else questions that will lead to the
correct classification in the most efficient manner. In machine learning, these questions are referred
to as tests (not to be confused with the test set, which is used to assess the model's ability to
generalize). While data typically is not structured with simple binary yes/no features like in the
animal example, it's usually represented with continuous features. The tests for continuous data are
framed as "Is feature i greater than value a?"

Controlling the Complexity of Decision Trees

When constructing a decision tree, if it grows until all leaves are pure (i.e., until every leaf node
perfectly classifies the training data), the resulting model tends to be too complex and is at risk of
overfitting. A pure leaf is one that contains data points of only a single class, so a tree whose leaves are all pure is 100% accurate on the training set: every training point ends up in a leaf whose majority class matches its own label. This overfitting is visible on the
left side of the figure, where regions of class 1 are located within areas that should belong to class 0.
On the far right, a narrow band predicted as class 0 around a point from class 0 appears unnatural,
influenced by outlier points far from the main cluster in that class.

To avoid overfitting, two main strategies are commonly employed: halting the tree-building process
early (pre-pruning) or building the tree fully and then pruning away nodes that contribute little value
(post-pruning). Pre-pruning methods involve limiting the tree’s maximum depth, restricting the
number of leaves, or enforcing a minimum number of data points required in a node before it can
be split further.

In scikit-learn decision trees are implemented using the DecisionTreeRegressor and
DecisionTreeClassifier classes. It’s worth noting that scikit-learn supports only pre-pruning and not
post-pruning.

Let us explore the impact of pre-pruning in more detail using the Breast Cancer dataset. First, we
import the dataset and split it into training and test sets. Then, we create a model using the default
configuration, where the tree grows until all leaves are pure. We also fix
the random_state parameter to ensure consistent results for tie-breaking during tree construction.
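
The code itself is not reproduced in this guide, so the following is a minimal sketch of both the unpruned tree and the pre-pruned variant (max_depth=4) discussed below, assuming the scikit-learn Breast Cancer dataset and an illustrative train/test split:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Unpruned tree: grows until all leaves are pure
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Training accuracy:", tree.score(X_train, y_train))
print("Test accuracy:", tree.score(X_test, y_test))

# Pre-pruned tree: limit the depth to four questions
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
pruned_tree.fit(X_train, y_train)
print("Pruned test accuracy:", pruned_tree.score(X_test, y_test))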

As expected, the accuracy on the training set is 100% because the leaves are pure, and the tree has
grown deep enough to memorize all the labels in the training data. However, the accuracy on the
test set is slightly lower than the approximately 95% accuracy observed with the linear models we
previously discussed.

If the depth of a decision tree is not restricted, it can become excessively deep and complex. As a
result, unpruned trees are more prone to overfitting and may struggle to generalize effectively to
new data. To mitigate this, we can apply pre-pruning to prevent the tree from perfectly fitting the
training data. One method is to halt the tree's growth once it reaches a certain depth. For example,
by setting max_depth=4, the tree can only ask up to four questions. Limiting the tree’s depth helps
reduce overfitting, which results in a decrease in training set accuracy but an improvement in
accuracy on the test set.

Analyzing decision trees

We can visualize the tree using the export_graphviz function from the tree module. This writes a
file in the .dot file format, which is a text file format for storing graphs. We set an option to color

the nodes to reflect the majority class in each node and pass the class and features names so the
tree can be properly labeled:
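
A hedged sketch of this export step, reusing the pruned tree and dataset from the previous sketch (the file name and the commented rendering step are illustrative):

from sklearn.tree import export_graphviz

# Write the fitted tree to a .dot file, coloring nodes by majority class
export_graphviz(pruned_tree, out_file="tree.dot",
                class_names=["malignant", "benign"],
                feature_names=cancer.feature_names,
                impurity=False, filled=True)

# The .dot file can then be rendered, for example with the graphviz package:
# import graphviz
# with open("tree.dot") as f:
#     graphviz.Source(f.read())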

The visualization of the decision tree offers a clear view of how the algorithm makes predictions and
serves as an excellent example of a machine learning model that can be easily explained to non-
experts. However, even with a tree of depth four, as shown here, the visualization can still become
somewhat intricate. Deeper trees, especially those with depths of 10 or more, are even more
challenging to interpret. One effective method for understanding the tree is to focus on the path that
most of the data follows. The n_samples displayed in each node represents the number of samples
in that node, while the value indicates the number of samples in each class. By tracing the branches
to the right, where "worst radius > 16.795", we reach a node with 8 benign and 134 malignant samples. Further splits break down the 8 benign samples, and out of the 142 samples that initially followed the right path, 132 end up in the far-right leaf. Following the left path at the root, where "worst radius <= 16.795", we find 25 malignant and 259 benign samples. Most of the

benign samples end up in the second leaf from the right, with only a few remaining in the other
leaves.

Feature Importance in Trees

Rather than attempting to interpret the entire tree, which can be overwhelming, we can use valuable
metrics to summarize the tree’s behavior. One of the most widely used metrics is feature
importance, which indicates how significant each feature is in the tree’s decision-making process.
This value ranges from 0 to 1 for each feature, with 0 meaning the feature is not used at all and 1
indicating the feature perfectly predicts the target. The sum of all feature importances will always
equal 1.
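
A small sketch of inspecting feature importances, again assuming the fitted pruned_tree and the cancer dataset from the earlier sketches:

import numpy as np
import matplotlib.pyplot as plt

# Importance of each feature in the fitted tree (values sum to 1)
print("Feature importances:", pruned_tree.feature_importances_)

# Simple bar chart of the importances
n_features = cancer.data.shape[1]
plt.barh(np.arange(n_features), pruned_tree.feature_importances_)
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.show()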

In this case, we can see that the feature used in the first split, "worst radius," is by far the most
important feature. This aligns with our previous observation during the tree analysis, where the first
level effectively separates the two classes.

However, if a feature has a low feature_importance, it doesn’t necessarily mean the feature is
insignificant. It simply suggests that the tree didn't prioritize that feature, possibly because another
feature provides similar information.

Unlike coefficients in linear models, feature importances are always positive and do not indicate
which class a feature predicts. The feature importances reveal that "worst radius" is important, but
they do not specify whether a higher radius indicates a benign or malignant sample. In fact, the
relationship between features and classes may not be so straightforward, as shown in the following
example.

Although our focus here has been on decision trees for classification, the same principles apply to
decision trees used for regression, as seen in the DecisionTreeRegressor. The process of using and
analyzing regression trees is very similar to that of classification trees. However, one key difference
with tree-based models for regression is important to note: the DecisionTreeRegressor, like other
tree-based regression models, cannot extrapolate. This means it is unable to make predictions
beyond the range of the training data.

To illustrate this, let’s examine a dataset containing historical computer memory (RAM) prices. The
figure below shows this dataset, with the date on the x-axis and the price per megabyte of RAM for
each year on the y-axis.

Note the logarithmic scale on the y-axis. When plotted on a logarithmic scale, the relationship
appears fairly linear, making it relatively easy to predict, though with some fluctuations.

We will use historical data up to the year 2000 to forecast future RAM prices, treating the date as
the sole feature. Two simple models will be compared: a DecisionTreeRegressor and a
LinearRegression. The prices are rescaled using a logarithmic transformation to make the
relationship more linear. While this transformation doesn't affect the DecisionTreeRegressor, it has
a significant impact on the LinearRegression model. After training both models and making
predictions, we apply the exponential function to reverse the logarithmic transformation. We will
visualize predictions on the entire dataset, but for a proper evaluation, only the test dataset should
be considered.
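
A rough sketch of this comparison is shown below; the file ram_price.csv with numeric columns date and price is a hypothetical stand-in for the historical RAM price data, so the loading step will depend on where the data actually lives:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical file with columns "date" (year) and "price" (price per megabyte)
ram_prices = pd.read_csv("ram_price.csv")

train = ram_prices[ram_prices.date < 2000]   # use data before 2000 for training
test = ram_prices[ram_prices.date >= 2000]

X_train = train.date.to_numpy().reshape(-1, 1)
y_train = np.log(train.price)                # log transform makes the trend roughly linear

tree = DecisionTreeRegressor().fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Predict over the whole date range and undo the log transformation
X_all = ram_prices.date.to_numpy().reshape(-1, 1)
price_tree = np.exp(tree.predict(X_all))
price_linear = np.exp(linear.predict(X_all))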

The contrast between the two models is quite clear. The linear model fits the data with a line, as
expected, and provides a solid forecast for the test data (years after 2000), though it overlooks
some of the finer details in both the training and test data. In contrast, the tree model perfectly
predicts the training data since it was allowed to grow without complexity restrictions and
essentially memorized the dataset. However, when the model encounters data beyond the training
range, it simply predicts the last known value. This limitation arises because the tree model cannot
extrapolate or make predictions outside the scope of the training data. This issue is common to all
tree-based models.

Strengths, Weaknesses, and Parameters

As discussed earlier, the parameters that control model complexity in decision trees are the pre-pruning parameters that stop the building of the tree before it is fully developed. Usually, picking one of the pre-pruning strategies (setting either max_depth, max_leaf_nodes, or min_samples_leaf) is sufficient to prevent overfitting in decision trees.

Decision trees offer two significant advantages over many of the algorithms we've covered so far:
the resulting model is easy to visualize and interpret, especially for smaller trees, and they are
completely unaffected by the scaling of the data. Since each feature is processed independently,
and the splits do not depend on the scaling of the data, decision trees do not require preprocessing
steps such as normalization or standardization. This makes them particularly effective when dealing
with features on different scales or a combination of binary and continuous features.

However, the main drawback of decision trees is that they are prone to overfitting, even with pre-
pruning, which leads to poor generalization performance. As a result, ensemble methods are often
preferred over single decision trees in most applications.

Ensemble methods combine multiple machine learning models to create more robust models.
While there are several ensemble techniques, two have proven particularly effective for both
classification and regression tasks across various datasets, using decision trees as their core
components: random forests and gradient boosted decision trees.

3.6 Hands-on Exercise: House Price Prediction

Task

Build a regression model to predict house prices using a real dataset.

Steps:

1. Load a real-world housing dataset (e.g., California Housing Prices dataset).

2. Preprocess the data (handle missing values, scale features).

3. Train different regression models (Linear, Polynomial, Decision Tree).

4. Evaluate models using R² Score, Mean Absolute Error (MAE).

5. Visualize predictions vs. actual values.

Conceptual Questions

1. What is supervised learning, and how does it differ from unsupervised learning?

2. Explain the purpose of regression analysis in supervised learning.

3. What assumptions does linear regression make about the data?

4. How does polynomial regression address non-linear relationships, and what are its limitations?

5. What is the bias-variance tradeoff, and how can it impact model performance?

6. How do decision tree regressors differ from linear models in terms of interpretability and
overfitting?

Coding-Based Questions

1. Implement a linear regression model using a dataset of your choice and interpret the model
coefficients.

2. Compare the performance of linear, polynomial, and decision tree regression on the same
dataset.

3. Visualize the decision boundaries or regression curves for each model and explain any differences
you observe.

Real-World Scenario-Based Questions

1. A real estate company wants to predict house prices in a new market. What factors would you
consider when selecting a regression model?

2. How might you mitigate overfitting in a decision tree regression model?

3. Given a dataset with several hundred features, discuss how you apply dimensionality reduction
techniques before performing regression analysis.

4 Chapter Four: Supervised learning - Classification Algorithms

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

• Understand the fundamental concepts of classification in machine learning.


• Explore common classification algorithms: Logistic Regression, k-Nearest Neighbors
(k-NN), Support Vector Machines (SVM), Decision Trees, Random Forest, and Naïve
Bayes.
• Break down the mathematics behind classification models with real-world
applications.
• Implement classification models in Python and analyze their performance.
• Evaluate classification models using accuracy, precision, recall, F1-score, and ROC
curves.

4.1 Classification
Classification is a type of supervised learning where the goal is to predict a categorical outcome based
on input features. Unlike regression, which predicts continuous values, classification models categorize
inputs into discrete labels. According to Zhang & Wang (2021), classification algorithms are widely used
in automated decision-making, fraud detection, and medical diagnostics, proving their significance in
real-world applications.

Mathematically, classification aims to learn a function

f: X → Y

where X is the feature set and Y is a categorical label. The decision boundary of a classification
model separates different categories in the feature space.

Common Classification Algorithms

4.2 Logistic Regression


Logistic Regression is a linear classification algorithm that predicts the probability of a data point
belonging to a particular class. It is commonly used for binary classification problems (e.g., fraud
detection, disease diagnosis). According to Brown et al. (2020), Logistic Regression remains a strong
baseline method for classification tasks, especially in medical and financial applications where
interpretability is crucial.

The logistic (sigmoid) function is given by:

σ(z) = 1 / (1 + e^(-z)), where z = wX + b is the linear combination of the input features.

Spam Detection Example

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 2)  # Two features
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # Labels: 1 if sum > 1, else 0

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions and accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

Output:

Model Accuracy: 0.85

The model achieves 85% accuracy in classifying the data into two classes.

4.3 k-Nearest Neighbors (k-NN)
The k-Nearest Neighbors (k-NN) algorithm classifies a data point based on the majority class of its k
closest neighbors in the feature space. k-NN performs well when data is well-distributed but struggles
with high-dimensional datasets due to the curse of dimensionality.

Key Properties:

• Lazy learning: No training phase; all computation happens during prediction.

• Distance metric: Typically uses Euclidean distance to measure closeness between points.

Mathematical Formula (Euclidean Distance):

d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )

The k-NN algorithm is one of the simplest machine learning methods. Building the model simply involves storing the training dataset. When making a prediction for a new data point, the algorithm finds the closest data points in the training set, known as its “nearest neighbors.” In its most basic form, the k-NN algorithm considers only one neighbor: the closest training point to the data point we are predicting for. The prediction output is then the same as the output of this nearest neighbor.

Here, we introduced three new data points, represented by stars. For each of these points, we identified
the closest point in the training set. The prediction made by the one-nearest-neighbor algorithm is the
label of that closest point, indicated by the color of the cross. Instead of considering just the nearest
neighbor, we can also factor in a chosen number, k, of neighbors. This is where the term "k-nearest
neighbors" originates. When looking at multiple neighbors, we use a voting process to determine the
label. For each test point, we count how many neighbors belong to class 0 and how many belong to class
1, then assign the label of the majority class—the most frequent class among the k-nearest neighbors.
The example below illustrates this using the three closest neighbors:

Once again, the prediction is represented by the color of the cross. You can observe that the prediction
for the new data point at the top left differs from the one made using only a single neighbor. Although
this example is for a binary classification problem, the same approach can be used for datasets with
multiple classes. In such cases, we count how many neighbors belong to each class and predict the most
frequent class. Now, let’s explore how we can implement the k-nearest neighbors algorithm using scikit-
learn. First, we divide our data into training and test sets so we can assess the model's ability to

generalize.

Classifying Iris Flowers

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train k-NN model (k=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"k-NN Model Accuracy: {accuracy:.2f}")

Output:

k-NN Model Accuracy: 0.97

k-NN achieves 97% accuracy in classifying iris flower species.

Imagine a hobbyist botanist who wants to identify the species of iris flowers she has found. She has
recorded measurements for each iris, including the length and width of both the petals and the sepals,
all in centimeters. Additionally, she has measurements for some irises that have already been identified
by an expert botanist as belonging to the species setosa, versicolor, or virginica. For these, she is certain
of the species. Let's assume these are the only species the botanist will encounter in the wild. Our objective is to develop a machine learning model that can learn from the known iris measurements so that it can predict the species of new, unlabeled irises. We can now begin building the machine learning
model. We will use a k-nearest neighbors (KNN) classifier, which is simple to understand. The process of
building this model mainly involves storing the training set. To predict the label for a new data point, the
algorithm identifies the training point that is closest to the new one and assigns its label to the new point.

The "k" in k-nearest neighbors means that, rather than just using the closest neighbor, we can take into
account a set number of closest neighbors (for example, three or five). The prediction is then based on

the majority class of these neighbors. We will discuss this in more detail, but for now, we will consider
just one neighbor. In scikit-learn, machine learning models are implemented in classes known as
Estimator classes. The K-nearest neighbors classification algorithm is provided by the
KNeighborsClassifier class in the neighbors module. To use the model, we need to create an instance of
this class and configure its parameters. The most crucial parameter for KNeighborsClassifier is the
number of neighbors, which we will set to 1.
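
A minimal sketch of this instantiation and fitting step, assuming the iris training split created in the earlier example:

from sklearn.neighbors import KNeighborsClassifier

# Create the estimator with the number of neighbors set to 1
knn = KNeighborsClassifier(n_neighbors=1)

# fit builds the model from the training data
# (for KNeighborsClassifier this simply stores the training set)
knn.fit(X_train, y_train)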

The knn object encapsulates the algorithm that will be used to build the model from the training data,
as well the algorithm to make predictions on new data points. It will also hold the information that the
algorithm has extracted from the training data. In the case of KNeighborsClassifier, it will just store the
training set.

The fit method returns the KNN object itself and modifies it in place, providing a string representation of
the classifier. This representation displays the parameters used to create the model. Most of these
parameters are set to their default values, but it also includes n_neighbors=1, which is the value we
specified. While scikit-learn models have many parameters, the majority are related to optimization or
specialized use cases, so you don’t need to worry about the other parameters shown in the output.
Printing a scikit-learn model can produce lengthy strings, but don’t be discouraged by this. We will go
over all the key parameters. From here on, we won’t display the output of fit since it doesn’t provide any
new information.

Note that we made the measurements of this single flower into a row in a two-dimensional NumPy array,
as scikit-learn always expects two-dimensional arrays for the data.

Making Predictions

We can now make predictions using this model on new data for which we might not know the correct
labels. Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal
length of 1 cm, and a petal width of 0.2 cm. What species of iris would this be? We can put this data into
a NumPy array; its shape is the number of samples (1) by the number of features (4):
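
A short sketch of this prediction step, assuming the iris dataset and the knn object from the previous sketches:

import numpy as np

# A single new flower as a 1 x 4 two-dimensional array (1 sample, 4 features)
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape:", X_new.shape)

prediction = knn.predict(X_new)
print("Prediction:", prediction)
print("Predicted species:", iris.target_names[prediction])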

For this model, the accuracy of the test set is around 0.97, meaning the model correctly predicted 97%
of the irises in the test set. Under certain mathematical assumptions, this suggests the model will be
accurate 97% of the time when applied to new irises. In our hobby botanist scenario, this high accuracy
indicates that the model is likely reliable enough for practical use. In upcoming chapters, we will explore
ways to improve the model's performance and discuss the challenges of fine-tuning it.

The task involved three species: setosa, versicolor, and virginica, making it a multiclass classification
problem. In classification, the categories or species are called classes, and each iris has a species label.
The Iris dataset consists of two NumPy arrays: one for the features (denoted as X in scikit-learn) and one
for the correct labels (denoted as y). X is a two-dimensional array where each row corresponds to a data
point, and each column represents a feature, while y is a one-dimensional array containing the class
labels for each sample. We split the dataset into a training set to build the model and a test set to evaluate
how well the model generalizes to new data. We selected the k-nearest neighbors algorithm for
classification, which makes predictions based on the closest neighbors in the training set. We
implemented this using the KNeighborsClassifier class, setting the parameters accordingly. After fitting
the model with the fit method and passing in the training data and labels, we evaluated its performance
with the score method, which returned an accuracy of 97%. This indicates the model made correct
predictions 97% of the time on the test set, giving us confidence that it will predict new iris
measurements with similar accuracy.

4.4 Support Vector Machines (SVM)


SVM finds the optimal hyperplane that maximizes the margin between different classes. It uses support
vectors, the data points closest to the decision boundary, to define the margin. SVMs are particularly
powerful in high-dimensional spaces and are widely used in text classification and image recognition.

For linearly separable data, the hyperplane is given by:

wX+b=0

For non-linearly separable data, SVM uses the kernel trick (e.g., RBF kernel) to transform data into a
higher-dimensional space.

Python Implementation: Classifying Digits

from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load digits dataset
digits = datasets.load_digits()
X, y = digits.data, digits.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train SVM model
svm_model = SVC(kernel='rbf')
svm_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Model Accuracy: {accuracy:.2f}")

Output:

SVM Model Accuracy: 0.98

SVM achieves 98% accuracy on digit classification.

We can apply both Logistic Regression and LinearSVC models to the forge dataset and visualize the
decision boundary.

In this figure, the first feature of the forge dataset is plotted on the x-axis, and the second feature is
plotted on the y-axis, as in the previous example. The decision boundaries for both LinearSVC and
LogisticRegression are shown as straight lines, dividing the area classified as class 1 (above the line) from
the area classified as class 0 (below the line). This means any new data point above the black line will be
classified as class 1 by the respective classifier, while any point below the black line will be classified as
class 0.

Both models produce similar decision boundaries, though they misclassify two points. By default, both
models use L2 regularization, similar to how Ridge regression applies regularization.

For both LogisticRegression and LinearSVC, the trade-off between model complexity and regularization
strength is controlled by the parameter C. Higher values of C result in less regularization, meaning the
models try to fit the training data as accurately as possible. In contrast, lower values of C encourage the
models to find coefficient vectors (w) that are closer to zero, thus imposing more regularization.

An interesting characteristic of the C parameter is how it influences the model's focus. Lower values of C
lead the algorithms to focus on adjusting to the majority of data points, while higher values of C make
the models prioritize correctly classifying each individual data point.

On the left side, a very small value of C results in a model with strong regularization. Most of the class 0
points are at the top, and most of the class 1 points are at the bottom. The highly regularized model
chooses a nearly horizontal decision boundary, misclassifying two points. In the middle plot, with a
slightly higher C value, the model places more focus on the two misclassified samples, causing the
decision boundary to tilt. On the far right, with a very high C value, the model tilts the decision boundary
significantly, correctly classifying all class 0 points. However, one class 1 point remains misclassified, as
it's not possible to separate all points in this dataset using a straight line. This model, which attempts to
correctly classify every point, may not capture the overall structure of the data well, indicating it is likely
overfitting.

Similar to regression models, linear models for classification can seem restrictive in low-dimensional
spaces, as they only allow for decision boundaries that are straight lines or planes. However, in higher-
dimensional spaces, linear classification models become much more powerful, and the risk of overfitting
increases as more features are added.

Now, let's dive deeper into analyzing LogisticRegression using the Breast Cancer dataset.

The default value of C=1 provides quite good performance, with 95% accuracy on both the training and
the test set. But as training and test set performance are very close, it is likely that we are underfitting.
Let’s try to increase C to fit a more flexible model:

Using C=100 results in higher training set accuracy, and also a slightly increased test set accuracy,
confirming our intuition that a more complex model should perform better. We can also investigate what
happens if we use an even more regularized model than the default of C=1, by setting C=0.01:
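
The fitted models are not listed in this guide, so here is a hedged sketch that collects the three settings of C discussed above (C=0.01, C=1, C=100) on the Breast Cancer dataset; the split and the max_iter value are illustrative choices made to keep the solver from warning about convergence:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

for C in [0.01, 1, 100]:
    logreg = LogisticRegression(C=C, max_iter=10000)
    logreg.fit(X_train, y_train)
    print("C={:<6} train accuracy: {:.3f}  test accuracy: {:.3f}".format(
        C, logreg.score(X_train, y_train), logreg.score(X_test, y_test)))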

As anticipated, when moving further left on the scale in the figure, starting from an underfit model, both
the training and test set accuracies decrease compared to the default parameters.
Finally, let's examine the coefficients learned by the models using the three different settings of the
regularization parameter C:

Since LogisticRegression uses L2 regularization by default, the resulting plot is similar to the one produced
by Ridge regression. As the regularization strength increases, the coefficients are gradually reduced,
though they never reach zero. Upon closer examination of the plot, an intriguing pattern appears with
the third coefficient, related to "mean perimeter." When C=100 and C=1, the coefficient is negative, but
at C=0.001, it becomes positive, with a magnitude even larger than at C=1. Interpreting a model like this
might lead one to assume that a feature's coefficient directly corresponds to the class it’s associated with.
For example, one might think that a high "texture error" feature is linked to a "malignant" sample.
However, the sign change in the "mean perimeter" coefficient indicates that, depending on the model
used, a high "mean perimeter" could suggest either "benign" or "malignant." This example underscores
the importance of being careful when interpreting the coefficients of linear models.

If we desire a more interpretable model, using L1 regularization might help, as it limits the model to using
only a few features. Here is the coefficient plot and classification accuracies for L1 regularization (a minimal sketch follows below):
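
A minimal sketch of L1-regularized logistic regression for a few values of C; the liblinear solver and the specific C values are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# The liblinear solver supports the L1 penalty; L1 drives many coefficients to exactly zero
for C in [0.001, 1, 100]:
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear")
    lr_l1.fit(X_train, y_train)
    print("C={:<6} nonzero coefficients: {}  test accuracy: {:.3f}".format(
        C, int((lr_l1.coef_ != 0).sum()), lr_l1.score(X_test, y_test)))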

Many linear classification models are designed for binary classification and don't naturally extend to
multiclass problems, with logistic regression being a notable exception. To adapt a binary classifier for
multiclass classification, the one-vs.-rest strategy is commonly used. In this approach, a separate binary
model is trained for each class to distinguish that class from all others, leading to as many binary models
as there are classes. To make a prediction, all binary classifiers are applied to a test point, and the
classifier with the highest score for its class "wins," assigning that class label to the test point.

With one classifier per class, there is a separate coefficient vector (w) and intercept (b) for each class.
The class with the highest classification score, determined by the formula:

w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b

is chosen as the predicted label. While the mathematical details of multiclass logistic regression differ
slightly from the one-vs.-rest approach, both methods result in a separate coefficient vector and intercept
for each class, with the same prediction method applied. Let’s now apply the one-vs.-rest approach to a

simple three-class classification dataset, where each class consists of data sampled from a Gaussian
distribution.

Now, we train a LinearSVC classifier on the dataset:
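
A short sketch of this step, assuming a synthetic three-class blob dataset generated with make_blobs (the random_state is an arbitrary choice):

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Three-class dataset: each class is sampled from a Gaussian blob (two features)
X, y = make_blobs(random_state=42)

linear_svm = LinearSVC().fit(X, y)
print("Coefficient shape:", linear_svm.coef_.shape)      # (3, 2): one row per class
print("Intercept shape:", linear_svm.intercept_.shape)   # (3,): one intercept per class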

We see that the shape of the coef_ is (3, 2), meaning that each row of coef_ contains the coefficient
vector for one of the three classes and each column holds the coefficient value for a specific feature
(there are two in this dataset). The intercept_ is now a one-dimensional array, storing the intercepts for
each class. Let’s visualize the lines given by the three binary classifiers:

You can see that all the points belonging to class 0 in the training data are above the line corresponding
to class 0, which means they are on the “class 0” side of this binary classifier. The points in class 0 are
above the line corresponding to class 2, which means they are classified as “rest” by the binary classifier
for class 2. The points belonging to class 0 are to the left of the line corresponding to class 1, which means
the binary classifier for class 1 also classifies them as “rest.” Therefore, any point in this area will be
classified as class 0 by the final classifier (the result of the classification confidence formula for classifier
0 is greater than zero, while it is smaller than zero for the other two classes). But what about the triangle
in the middle of the plot? All three binary classifiers classify points there as “rest.” Which class would a
point be assigned to? The answer is the one with the highest value for the classification formula: the
class of the closest line.

4.4.1 Strengths, Weaknesses, and Parameters

The main parameter in linear models is the regularization parameter, which is called alpha in regression
models and C in LinearSVC and LogisticRegression. Higher values of alpha or smaller values of C result in
simpler models. Tuning these parameters is especially important for regression models, and it’s common
to search for them on a logarithmic scale. Additionally, you must decide between using L1 or L2
regularization. If you believe that only a few features are truly significant, L1 regularization is preferable.
Otherwise, L2 regularization is usually the default. L1 regularization is also helpful for interpretability, as
it selects only a few important features, making it easier to explain how those features affect the
predictions.

Linear models are fast to train and predict, and they scale well to large datasets. They also handle sparse
data well. For datasets with hundreds of thousands or millions of samples, the solver='sag' option in
LogisticRegression and Ridge can be more efficient. Other scalable options
include SGDClassifier and SGDRegressor, which offer even more efficient versions of the linear models
discussed.

One advantage of linear models is that they are easy to understand, as they rely on simple formulas for
both regression and classification. However, the interpretation of the coefficients can sometimes be

unclear, especially when features are highly correlated, which can make the coefficients difficult to
interpret.

Linear models tend to perform well when the number of features significantly exceeds the number of
samples. They are often used with very large datasets because training more complex models may not
be practical. However, for lower-dimensional datasets, other models might provide better generalization
performance.

4.4.2 Evaluation Metrics for Classification

Key Metrics:

✓ Accuracy: Overall correctness of predictions.

✓ Precision: TP / (TP + FP) – Focuses on false positives.

✓ Recall: TP / (TP + FN) – Focuses on false negatives.

✓ F1-score: Harmonic mean of precision and recall.

✓ ROC Curve: Evaluates the tradeoff between sensitivity and specificity (a sketch of computing these metrics follows below).
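
The sketch below shows one way these metrics might be computed with scikit-learn, using the Breast Cancer dataset and a logistic regression model purely as an illustrative setup:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_scores = model.predict_proba(X_test)[:, 1]   # probabilities used for the ROC curve / AUC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_scores))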

4.5 Review Questions for Chapter 4

1 What is the difference between classification and regression?


2 Explain why Logistic Regression is suitable for binary classification.
3 How does k-NN determine the class of a new data point?
4 What is the advantage of SVM over Logistic Regression?

5 Chapter 5: Unsupervised Learning – Clustering Algorithms

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

• Understand the fundamentals of unsupervised learning and how


clustering differs from supervised learning.
• Learn about k-Means, Hierarchical Clustering, and DBSCAN.
• Explore real-world applications of clustering (e.g., customer
segmentation, anomaly detection).
• Implement clustering algorithms using Python and visualize results.
• Evaluate clustering performance using internal metrics.

5.1 Introduction
Unsupervised learning is a category of machine learning where models analyze and learn patterns from
unlabeled data. Unlike supervised learning, where models require labeled outputs (e.g., spam vs. non-
spam emails), unsupervised learning identifies hidden structures in data without explicit supervision.
Unsupervised learning is crucial in fields like finance, marketing, and healthcare, where vast amounts of
unstructured data must be analyzed without human annotation.

5.2 Clustering
Clustering is an unsupervised learning technique that groups similar data points into clusters based on
their feature similarities. It is widely used for:

Customer segmentation – Grouping customers based on purchasing behavior.

Anomaly detection – Identifying fraudulent transactions.

Image segmentation – Dividing images into meaningful regions.

Document categorization – Organizing text documents into topics.

Mathematically, clustering can be defined as a partitioning problem: Given a dataset X with n


observations, the goal is to assign each observation to one of k clusters such that intra-cluster similarity
is maximized while inter-cluster similarity is minimized.

5.3 k-Means Clustering


k-Means is a centroid-based clustering algorithm that divides data into k clusters, minimizing the
distance between data points and their assigned cluster center. k-Means is one of the most commonly
used clustering techniques for customer segmentation in retail and e-commerce.

How k-Means Works:

1 Choose the number of clusters k.

2 Randomly initialize k centroids.
3 Assign each data point to the nearest centroid.
4 Update centroids based on the mean of assigned points.
5 Repeat steps 3 and 4 until convergence.

Mathematical Representation

k-Means minimizes the within-cluster sum of squared distances: J = Σ_k Σ_{x ∈ C_k} ||x − μ_k||², where μ_k is the centroid (mean) of cluster C_k.

Customer Segmentation Using k-Means

Cluster customers based on purchasing behavior.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Generate synthetic customer data (Age, Annual Income, Spending Score)
np.random.seed(42)
data = np.random.rand(200, 3) * [50, 100000, 100]  # Simulated Age, Income, Score
df = pd.DataFrame(data, columns=['Age', 'Annual_Income', 'Spending_Score'])

# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply k-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(df_scaled)

# Visualize clusters
plt.scatter(df['Annual_Income'], df['Spending_Score'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Annual Income ($)')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation using k-Means')
plt.show()

Output:

5.4 Hierarchical Clustering


Unlike k-Means, Hierarchical Clustering creates a tree-like structure (dendrogram) representing nested
clusters. Hierarchical clustering is effective for datasets with an unknown number of clusters, unlike k-Means, which requires pre-defining k.

Types of Hierarchical Clustering

Agglomerative (Bottom-Up) – Each data point starts as its own cluster, and clusters merge iteratively.

Divisive (Top-Down) – The entire dataset starts as a single cluster and splits iteratively.

Hierarchical Clustering on Customer Data

from scipy.cluster.hierarchy import dendrogram, linkage
import seaborn as sns

# Compute linkage matrix
linkage_matrix = linkage(df_scaled, method='ward')

# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix, labels=df.index, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.show()

Expected Output

5.5 Density-Based Clustering (DBSCAN)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies dense clusters and
separates noise points. DBSCAN is widely used for anomaly detection and geospatial analysis.

Advantages of DBSCAN:

No need to specify k (the number of clusters).

Works well with arbitrary-shaped clusters.

Can detect outliers (noise points).

DBSCAN Clustering

from sklearn.cluster import DBSCAN

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
df['DBSCAN_Cluster'] = dbscan.fit_predict(df_scaled)

# Visualize clusters
plt.scatter(df['Annual_Income'], df['Spending_Score'], c=df['DBSCAN_Cluster'], cmap='viridis')
plt.xlabel('Annual Income ($)')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation using DBSCAN')
plt.show()

Output:

• A scatter plot showing clusters with noise points (outliers).

5.6 Clustering Evaluation Metrics


Since clustering is unsupervised, traditional accuracy metrics do not apply. Instead, we use:

✓ Silhouette Score – Measures how well clusters are separated.

✓ Davies-Bouldin Index – Lower values indicate better clustering.

✓ Elbow Method (for k-Means) – Determines the optimal number of clusters.

Task:
1 Load a real-world dataset (e.g., Mall Customer Segmentation dataset on Kaggle).
2 Apply k-Means, Hierarchical Clustering, and DBSCAN.
3 Compare the clusters using visualizations and silhouette scores (a short sketch follows below).
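
A hedged sketch of such a comparison on synthetic data; the blob dataset, cluster counts, and DBSCAN parameters are placeholder choices, so swap in the real customer data for the exercise:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a customer dataset
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

models = {
    "k-Means": KMeans(n_clusters=4, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    # Silhouette score needs at least two distinct labels
    if len(set(labels)) > 1:
        print("{}: silhouette score = {:.3f}".format(name, silhouette_score(X_scaled, labels)))
    else:
        print("{}: fewer than two clusters found".format(name))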

As we described earlier, clustering is the task of partitioning the dataset into groups, called clusters. The
goal is to split up the data in such a way that points within a single cluster are very similar and points in
different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or
predict) a number to each data point, indicating which cluster a particular point belongs to.

k-Means Clustering

k-means clustering is one of the simplest and most commonly used clustering
algorithms. It tries to find cluster centers that are representative of certain regions of the data. The
algorithm alternates between two steps: assigning each data point to the closest cluster center, and then
setting each cluster center as the mean of the data points that are assigned to it. The algorithm is finished
when the assignment of instances to clusters no longer changes. The following example illustrates the
algorithm on a synthetic dataset:

Cluster centers are shown as triangles, while data points are shown as circles. Colors indicate cluster
membership. We specified that we are looking for three clusters, so the algorithm was initialized by
declaring three data points randomly as cluster centers (see “Initialization”). Then the iterative algorithm
starts. First, each data point is assigned to the cluster center it is closest to (see “Assign Points (1)”). Next,
the cluster centers are updated to be the mean of the assigned points (see “Recompute Centers (1)”).
Then the process is repeated two more times. After the third iteration, the assignment of points to cluster
centers remained unchanged, so the algorithm stops. Given new data points, k-means will assign each to
the closest cluster center. The next example shows the boundaries of the cluster centers:

Applying k-means with scikit-learn is quite straightforward. Here, we apply it to the synthetic data that
we used for the preceding plots. We instantiate the KMeans class, and set the number of clusters we are
looking for. Then we call the fit method with the data:
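
The corresponding code is not reproduced in the source text here, but a minimal sketch could look as follows; make_blobs and the variable names X and kmeans are assumptions carried through the next snippets.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic two-dimensional data and fit k-means with three clusters
X, y = make_blobs(random_state=1)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)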

During the algorithm, each training data point in X is assigned a cluster label. You can find these labels in
the kmeans.labels_ attribute:
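
Continuing the sketch above:

# Cluster membership of each training point (values 0, 1, or 2)
print("Cluster memberships:\n{}".format(kmeans.labels_))
# predict on the same data returns the same assignments
print(kmeans.predict(X))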

You can see that clustering is somewhat similar to classification, in that each item gets a label. However,
there is no ground truth, and consequently the labels themselves have no a priori meaning. Let’s go back
to the example of clustering face images that we discussed before. It might be that the cluster 3 found
by the algorithm contains only faces of your friend Bela. You can only know that after you look at the
pictures, though, and the number 3 is arbitrary. The only information the algorithm gives you is that all
faces labeled as 3 are similar. For the clustering we just computed on the two-dimensional toy dataset,
that means that we should not assign any significance to the fact that one group was labeled 0 and
another one was labeled 1. Running the algorithm again might result in a different numbering of clusters
because of the random nature of the initialization. Here is a plot of this data again. The cluster centers
are stored in the cluster_centers_ attribute, and we plot them as triangles:
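
A minimal plotting sketch, again assuming the X and kmeans objects from the snippet above:

import matplotlib.pyplot as plt

# Points colored by cluster assignment, cluster centers drawn as triangles
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='^', c='red', s=100)
plt.show()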

We can also use more or fewer cluster centers by changing the n_clusters parameter.

5.6.1 Failure cases of k-means

Even if you know the “right” number of clusters for a given dataset, k-means might not always be able to
recover them. Each cluster is defined solely by its center, which means that each cluster is a convex shape.
As a result of this, k-means can only capture relatively simple shapes. k-means also assumes that all
clusters have the same “diameter” in some sense; it always draws the boundary between clusters to be
exactly in the middle between the cluster centers. That can sometimes lead to surprising results:
One might have expected the dense region in the lower left to be the first cluster, the dense region in the
upper right to be the second, and the less dense region in the center to be the third. Instead, both
cluster 0 and cluster 1 have some points that are far away from all the other points in these clusters that
“reach” toward the center. k-means also assumes that all directions are equally important for each
cluster. The following plot shows a two-dimensional dataset where there are three clearly separated
parts in the data. However, these groups are stretched toward the diagonal. As k-means only considers
the distance to the nearest cluster center, it can’t handle this kind of data:

k-means also performs poorly if the clusters have more complex shapes:

Vector quantization, or seeing k-means as decomposition

Even though k-means is a clustering algorithm, there are interesting parallels between k-means and the decomposition methods like PCA and NMF that we discussed earlier. You might remember that PCA tries to find directions of maximum variance in the data, while NMF tries to find additive components, which often correspond to “extremes” or “parts” of the data. Both methods tried to express the data points as a sum over some components. k-means, on the other hand, tries to represent each data point using a cluster center. You can think of that as each point being represented using only a single component, which is given by the cluster center. This view of k-means as a decomposition method, where each point is represented using a single component, is called vector quantization.

Let’s do a side-by-side comparison of PCA, NMF, and k-means, showing the components extracted, as
well as reconstructions of faces from the test set using 100 components. For k-means, the reconstruction
is the closest cluster center found on the training set:

An interesting aspect of vector quantization using k-means is that we can use many more clusters than
input dimensions to encode our data. Let’s go back to the two_moons data. Using PCA or NMF, there is
nothing much we can do to this data, as it lives in only two dimensions. Reducing it to one dimension
with PCA or NMF would completely destroy the structure of the data. But we can find a more expressive
representation with k-means, by using more cluster centers:

We used 10 cluster centers, which means each point is now assigned a number between 0 and 9. We can
see this as the data being represented using 10 components (that is, we have 10 new features), with all
features being 0, apart from the one that represents the cluster center the point is assigned to. Using
this 10-dimensional representation, it would now be possible to separate the two half-moon shapes
using a linear model, which would not have been possible using the original two features. It is also
possible to get an even more expressive representation of the data by using the distances to each of the
cluster centers as features. This can be accomplished using the transform method of kmeans:
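
A short sketch of this idea, assuming the two_moons data is generated with scikit-learn's make_moons helper:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X)

# Each point is described by its distance to all 10 cluster centers
distance_features = kmeans.transform(X)
print("Distance feature shape: {}".format(distance_features.shape))   # (200, 10)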

K-means is a widely used clustering algorithm, favored for its simplicity, ease of implementation, and fast
execution. It scales well to large datasets, and scikit-learn offers an even more scalable version called
MiniBatchKMeans, which is capable of handling very large datasets. However, one limitation of k-means

is that it depends on a random initialization, meaning the results can vary depending on the random
seed. To address this, scikit-learn runs the algorithm 10 times with different random initializations and
selects the best outcome. Other drawbacks include the algorithm's restrictive assumptions about cluster
shapes and the need to specify the number of clusters beforehand, which may not always be known in
real-world scenarios.

Agglomerative clustering encompasses a group of algorithms that follow similar principles: the algorithm
begins by treating each data point as its own cluster, then iteratively merges the two most similar clusters
until a predefined stopping condition is met. In scikit-learn, this stopping condition is set by the desired
number of clusters, and clusters are merged until that number is reached. The similarity between clusters
is measured using different linkage criteria, each defining how to evaluate the proximity between two
clusters.

Scikit-learn offers three linkage methods:

• Ward: The default option, Ward merges clusters in such a way that the increase in the overall
variance within all clusters is minimized. This typically results in clusters of similar size.

• Average: This method merges clusters with the smallest average distance between all points in
the two clusters.

• Complete: Also known as maximum linkage, this criterion merges clusters based on the smallest
maximum distance between any two points in the clusters.

Ward is suitable for most datasets, and will be used in the examples here. However, if the clusters have
significantly different sizes, such as when one cluster is much larger than the others, average or complete
linkage may produce better results.

The following plot demonstrates the process of agglomerative clustering applied to a two-dimensional
dataset, where the goal is to identify three clusters.

Initially, each point is its own cluster. Then, in each step, the two clusters that are closest are merged. In
the first four steps, two single-point clusters are picked and these are joined into two-point clusters. In
step 5, one of the two-point clusters is extended to a third point, and so on. In step 9, there are only
three clusters remaining. As we specified that we are looking for three clusters, the algorithm then
stops. Let’s have a look at how agglomerative clustering performs on the simple three-cluster data we
used here. Because of the way the algorithm works, agglomerative clustering cannot make predictions
for new data points. Therefore, AgglomerativeClustering has no predict method. To build the model and
get the cluster memberships on the training set, use the fit_predict method instead.
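
For example, a minimal sketch on synthetic blob data (not the exact code behind the figures in the source text):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=1)
agg = AgglomerativeClustering(n_clusters=3)
assignment = agg.fit_predict(X)        # cluster label for every training point
print(assignment[:10])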

As expected, the algorithm recovers the clustering perfectly. While the scikit-learn implementation of
agglomerative clustering requires you to specify the number of clusters you want the algorithm to find,
agglomerative clustering methods provide some help with choosing the right number, which we will
discuss next.

Hierarchical clustering and dendrograms

Agglomerative clustering produces what is known as a hierarchical clustering. The clustering proceeds
iteratively, and every point makes a journey from being a single point cluster to belonging to some final
cluster. Each intermediate step provides a clustering of the data (with a different number of clusters). It
is sometimes helpful to look at all possible clusterings jointly. The next example shows an overlay of all
the possible clusterings shown before, providing some insight into how each cluster breaks up into
smaller clusters:

While this visualization provides a very detailed view of the hierarchical clustering, it relies on the two-
dimensional nature of the data and therefore cannot be used on datasets that have more than two
features. There is, however, another tool to visualize hierarchical clustering, called a dendrogram, that
can handle multidimensional datasets.

Unfortunately, scikit-learn currently does not have the functionality to draw dendrograms. However, you
can generate them easily using SciPy. The SciPy clustering algorithms have a slightly different interface to
the scikit-learn clustering algorithms. SciPy provides a function that takes a data array X and computes a
linkage array, which encodes hierarchical cluster similarities. We can then feed this linkage array into the
scipy dendrogram function to plot the dendrogram:
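
A minimal sketch using SciPy's ward and dendrogram functions on a small synthetic dataset:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=0, n_samples=12)
linkage_array = ward(X)        # linkage array encoding the hierarchy of merges
dendrogram(linkage_array)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()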

The dendrogram illustrates the data points as numbered from 0 to 11 at the bottom. A tree structure is
then created with these points, each representing individual clusters, and new parent nodes are added
as pairs of clusters are merged. Starting from the bottom and moving upward, the first merger involves
points 1 and 4. Next, points 6 and 9 form a cluster, and this merging continues. At the top level, there are
two main branches: one includes points 11, 0, 5, 10, 7, 6, and 9, while the other consists of points 1, 4,
3, 2, and 8. These two branches represent the two largest clusters on the left side of the plot.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another powerful clustering
algorithm. Its key advantages are that it does not require specifying the number of clusters beforehand,
it can identify clusters with complex shapes, and it can detect outliers or points that don't belong to any
cluster. Although DBSCAN is slower than both agglomerative clustering and k-means, it can still handle
relatively large datasets.

DBSCAN operates by identifying dense regions in the feature space, where many data points are located
close to each other. These dense regions are considered potential clusters, separated by areas that are
relatively sparse. Points within a dense region are termed core samples (or core points). Two key

parameters control DBSCAN: min_samples and eps. If there are at least min_samples points within a
distance of eps from a given point, it is labeled as a core sample.

The algorithm starts by selecting an arbitrary point and finds all points within a distance of eps. If fewer
than min_samples points are found within this radius, the point is marked as noise, meaning it does not
belong to any cluster. If more than min_samples points are within eps, the point is labeled as a core
sample and assigned to a new cluster. The algorithm then visits all neighboring points within eps. If those
neighbors haven't been assigned a cluster, they are given the same cluster label. If any of the neighbors
are core samples, their neighbors are recursively visited, and the cluster grows until there are no more
core samples within eps. The process repeats with an unvisited point, and the algorithm continues until
all points have been processed.

In the end, there are three kinds of points: core points, points that are within distance eps of core points
(called boundary points), and noise. When the DBSCAN algorithm is run on a particular dataset multiple
times, the clustering of the core points is always the same, and the same points will always be labeled as
noise. However, a boundary point might be neighbor to core samples of more than one cluster. Therefore,
the cluster membership of boundary points depends on the order in which points are visited. Usually
there are only a few boundary points, and this slight dependence on the order of points is not important.
Let’s apply DBSCAN on the synthetic dataset we used to demonstrate agglomerative clustering. Like
agglomerative clustering, DBSCAN does not allow predictions on new test data, so we will use the
fit_predict method to perform clustering and return the cluster labels in one step:
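
A minimal sketch; the 12-point blob dataset is an illustrative assumption chosen to match the behaviour described next:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=0, n_samples=12)
dbscan = DBSCAN()                    # default eps=0.5, min_samples=5
clusters = dbscan.fit_predict(X)
print("Cluster memberships: {}".format(clusters))   # every point labelled -1 (noise)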

As you can see, all data points were assigned the label -1, which stands for noise. This is a consequence
of the default parameter settings for eps and min_samples, which are not tuned for small toy datasets.
The cluster assignments for different values of min_samples and eps are shown below, and visualized:

In this plot, points belonging to clusters are represented by solid markers, while noise points are shown
in white. Core samples are indicated by larger markers, and boundary points are represented by smaller
ones. As eps increases (moving from left to right in the figure), the clusters expand to include more
points, but this may also lead to the merging of distinct clusters into one. Conversely,
increasing min_samples (moving from top to bottom) results in fewer points being classified as core
samples, and more points being labeled as noise.

The eps parameter is typically more influential, as it sets the proximity threshold for points to be
considered as part of the same cluster. If eps is too small, no points may qualify as core samples, and all
points could be labeled as noise. On the other hand, if eps is too large, all points may end up in a single
cluster.

The min_samples parameter mainly impacts whether points in less dense areas are treated as outliers or
included in their own clusters. Lowering min_samples causes smaller groups to be labeled as noise. For
instance, when min_samples is set to 3, three clusters are formed: one with four points, one with five
points, and one with three points. However, when min_samples is increased to 5, the smaller clusters
(with three and four points) are considered noise, leaving only the cluster with five points.

Although DBSCAN does not require specifying the number of clusters directly, adjusting eps affects the
number of clusters identified. Finding the optimal eps value can be easier when the data is scaled using
methods like StandardScaler or MinMaxScaler, as these techniques ensure that all features are on a
similar scale. The outcome of running DBSCAN on the two_moons dataset is shown here, where the
algorithm successfully identifies the two half-circles and separates them using the given settings.

Comparing and Evaluating Clustering Algorithms

One of the challenges in applying clustering algorithms is that it is very hard to assess how well an
algorithm worked, and to compare outcomes between different algorithms. After talking about the
algorithms behind k-means, agglomerative clustering, and DBSCAN, we will now compare them on some
real-world datasets.

Evaluating clustering with ground truth


There are metrics that can be used to assess the outcome of a clustering algorithm relative to a ground
truth clustering, the most important ones being the adjusted rand index (ARI) and normalized mutual
information (NMI), which both provide a quantitative measure between 0 and 1. Here, we compare the
k-means, agglomerative clustering, and DBSCAN algorithms using ARI. We also include what it looks like
when we randomly assign points to two clusters for comparison:
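
A sketch of such a comparison on the scaled two_moons data, using adjusted_rand_score from sklearn.metrics (the exact code from the source text is not reproduced here):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Random assignment to two clusters as a baseline
random_clusters = np.random.RandomState(seed=0).randint(low=0, high=2, size=len(X))
print("Random assignment ARI: {:.2f}".format(adjusted_rand_score(y, random_clusters)))

for name, model in [("KMeans", KMeans(n_clusters=2)),
                    ("Agglomerative", AgglomerativeClustering(n_clusters=2)),
                    ("DBSCAN", DBSCAN())]:
    clusters = model.fit_predict(X_scaled)
    print("{} ARI: {:.2f}".format(name, adjusted_rand_score(y, clusters)))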

The adjusted rand index provides intuitive results, with a random cluster assignment having a score of 0
and DBSCAN (which recovers the desired clustering perfectly) having a score of 1.

A common mistake when evaluating clustering in this way is to use accuracy_score instead of
adjusted_rand_score, normalized_mutual_info_score, or some other clustering metric. The problem in
using accuracy is that it requires the assigned cluster labels to exactly match the ground truth. However,
the cluster labels themselves are meaningless—the only thing that matters is which points are in the
same cluster:
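
A small illustration of the problem with hypothetical label vectors: the two clusterings group the points identically, yet the accuracy is 0 because the labels are permuted.

from sklearn.metrics import accuracy_score, adjusted_rand_score

clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]   # same grouping, labels permuted

print("Accuracy: {:.2f}".format(accuracy_score(clusters1, clusters2)))   # 0.00
print("ARI: {:.2f}".format(adjusted_rand_score(clusters1, clusters2)))   # 1.00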

Evaluating clustering without ground truth

Although we have just shown one way to evaluate clustering algorithms, in practice, there is a big problem with using measures like ARI. When applying clustering algorithms, there is usually no ground truth to which to compare the results. If we knew the right clustering of the data, we could use this information to build a supervised model like a classifier. Therefore, using metrics like ARI and NMI usually only helps in developing algorithms, not in assessing success in an application. There are scoring metrics for clustering that don’t require ground truth, like the silhouette coefficient. However, these often don’t work well in practice. The silhouette score computes the compactness of a cluster, where higher is better, with a perfect score of 1. While compact clusters are good, compactness doesn’t allow for complex shapes. Here is an example comparing the outcome of k-means, agglomerative clustering, and DBSCAN on the two-moons dataset using the silhouette score:

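A self-contained sketch of the comparison described above:

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

for name, model in [("KMeans", KMeans(n_clusters=2)),
                    ("Agglomerative", AgglomerativeClustering(n_clusters=2)),
                    ("DBSCAN", DBSCAN())]:
    clusters = model.fit_predict(X_scaled)
    print("{} silhouette: {:.2f}".format(name, silhouette_score(X_scaled, clusters)))
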
As observed, k-means achieves the highest silhouette score, even though we may prefer the results
produced by DBSCAN. A more effective approach for evaluating clusters is to use robustness-based
clustering metrics. These metrics involve running an algorithm after adding noise to the data or using
different parameter settings and comparing the results. The idea is that if many algorithm configurations
and variations in the data produce the same outcome, the result is likely reliable. Unfortunately, this
approach is not available in scikit-learn as of this writing.

Even with a very robust clustering or a high silhouette score, we still cannot determine if the clustering
has any semantic meaning or if it reflects an aspect of the data we care about. Returning to the face
image example, we might want to identify groups of similar faces, such as distinguishing between men
and women, young and old, or people with beards versus those without. Suppose we cluster the data
into two groups, and all algorithms agree on which points should be grouped together. However, we still
cannot be certain that the clusters correspond to the concepts we are interested in. The clusters could
have been formed based on factors like side views versus front views, photos taken at night versus during
the day, or pictures captured with different types of phones (iPhones versus Androids). The only way to
determine whether the clustering aligns with our interests is through manual analysis of the clusters.

Comparing algorithms on the faces dataset

Let’s apply the k-means, DBSCAN, and agglomerative clustering algorithms to the Labeled Faces in the Wild dataset, and see if any of them find interesting structure. We will use the eigenface representation of the data, as produced by PCA(whiten=True), with 100 components:
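
A simplified sketch of this preprocessing step; it assumes the Labeled Faces in the Wild data can be downloaded by scikit-learn and skips the per-person subsampling that is sometimes applied to this dataset.

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
X_people = people.data / 255.0          # scale pixel values to the 0-1 range
pca = PCA(n_components=100, whiten=True, random_state=0)
X_pca = pca.fit_transform(X_people)     # 100-dimensional eigenface representation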

We saw earlier that this is a more semantic representation of the face images than the raw pixels. It will
also make computation faster. A good exercise would be for you to run the following experiments on the
original data, without PCA, and see if you find similar clusters.

Analyzing the faces dataset with DBSCAN. We will start by applying DBSCAN, which we just discussed:

We see that all the returned labels are –1, so all of the data was labeled as “noise” by DBSCAN. There are
two things we can change to help this: we can make eps higher, to expand the neighborhood of each
point, and set min_samples lower, to consider smaller groups of points as clusters. Let’s try changing
min_samples first:

Even when considering groups of three points, everything is labeled as noise. So, we need to increase
eps:

Using a much larger eps of 15, we get only a single cluster and noise points. We can use this result to find
out what the “noise” looks like compared to the rest of the data. To understand better what’s happening,
let’s look at how many points are noise, and how many points are inside the cluster:

Comparing these images to the random sample of face images, we can guess why they were labeled as
noise: the fifth image in the first row shows a person drinking from a glass, there are images of people
wearing hats, and in the last image there’s a hand in front of the person’s face. The other images contain

odd angles or crops that are too close or too wide. This kind of analysis—trying to find “the odd one
out”—is called outlier detection. If this was a real application, we might try to do a better job of cropping
images, to get more homogeneous data. There is little we can do about people in photos sometimes
wearing hats, drinking, or holding something in front of their faces, but it’s good to know that these are
issues in the data that any algorithm we might apply needs to handle. If we want to find more interesting
clusters than just one large one, we need to set eps smaller, somewhere between 15 and 0.5 (the default).
Let’s have a look at what different values of eps result in:

For low settings of eps, all points are labeled as noise. For eps=7, we get many noise points and many
smaller clusters. For eps=9 we still get many noise points, but we get one big cluster and some smaller
clusters. Starting from eps=11, we get only one large cluster and noise. What is interesting to note is that
there is never more than one large cluster. At most, there is one large cluster containing most of the
points, and some smaller clusters. This indicates that there are not two or three different kinds of face
images in the data that are very distinct, but rather that all images are more or less equally similar to (or
dissimilar from) the rest. The results for eps=7 look most interesting, with many small clusters. We can
investigate this clustering in more detail by visualizing all of the points in each of the 13 small clusters:

Some of the clusters correspond to people with very distinct faces (within this dataset), such as Sharon
or Koizumi. Within each cluster, the orientation of the face is also quite fixed, as well as the facial
expression. Some of the clusters contain faces of multiple people, but they share a similar orientation
and expression. This concludes our analysis of the DBSCAN algorithm applied to the faces dataset. As you
can see, we are doing a manual analysis here, different from the much more automatic search approach
we could use for supervised learning based on the R² score or accuracy. Let’s move on to applying k-means
and agglomerative clustering.

Analyzing the faces dataset with k-means. We saw that it was not possible to
to create more than one big cluster using DBSCAN. Agglomerative clustering and k-means are much more
likely to create clusters of even size, but we do need to set a target number of clusters. We could set the
number of clusters to the known number of people in the dataset, though it is very unlikely that an

unsupervised clustering algorithm will recover them. Instead, we can start with a low number of clusters,
like 10, which might allow us to analyze each of the clusters:

As you can see, k-means clustering partitioned the data into relatively similarly sized clusters from 64 to
386. This is quite different from the result of DBSCAN. We can further analyze the outcome of k-means
by visualizing the cluster centers. As we clustered in the representation produced by PCA, we need to
rotate the cluster centers back into the original space to visualize them, using pca.inverse_transform:

The cluster centers found by k-means are very smooth versions of faces. This is not very surprising, given
that each center is an average of 64 to 386 face images. Working with a reduced PCA representation adds
to the smoothness of the images (compared to the faces reconstructed using 100 PCA dimensions). The
clustering seems to pick up on different orientations of the face, different expressions (the third cluster
center seems to show a smiling face), and the presence of shirt collars (see the second-to-last cluster
center). For a more detailed view, we show for each cluster center the five most typical images in the
cluster (the images assigned to the cluster that are closest to the cluster center) and the five most atypical
images in the cluster (the images assigned to the cluster that are furthest from the cluster center):

Analyzing the faces dataset with agglomerative clustering. Now, let’s look at the results of agglomerative
clustering:

Agglomerative clustering also produces relatively equally sized clusters, with cluster sizes between 26
and 623. These are more uneven than those produced by k-means, but much more even than the ones
produced by DBSCAN. We can compute the ARI to measure whether the two partitions of the data given
by agglomerative clustering and k-means are similar:

An ARI of only 0.13 means that the two clusterings labels_agg and labels_km have little in common. This
is not very surprising, given the fact that points further away from the cluster centers seem to have little
in common for k-means. Next, we might want to plot the dendrogram. We’ll limit the depth of the tree
in the plot, as branching down to the individual 2,063 data points would result in an unreadably dense
plot:

Creating 10 clusters, we cut across the tree at the very top, where there are 10 vertical lines. In the
dendrogram for the toy data, you could see by the length of the branches that two or three clusters might
capture the data appropriately. For the faces data, there doesn’t seem to be a very natural cutoff point.
There are some branches that represent more distinct groups, but there doesn’t appear to be a particular
number of clusters that is a good fit. This is not surprising, given the results of DBSCAN, which tried to
cluster all points together. Let’s visualize the 10 clusters, as we did for k-means earlier. Note that there is
no notion of cluster center in agglomerative clustering (though we could compute the mean), and we
simply show the first couple of points in each cluster. We show the number of points in each cluster to
the left of the first image:

While some of the clusters seem to have a semantic theme, many of them are too large to be actually
homogeneous. To get more homogeneous clusters, we can run the algorithm again, this time with 40
clusters, and pick out some of the clusters that are particularly interesting:

Summary of Clustering Methods

This section highlighted the fact that clustering is a largely qualitative process, often most useful during
the exploratory phase of data analysis. We explored three clustering algorithms: k-means, DBSCAN, and
agglomerative clustering. Each method offers a way to control the level of granularity in clustering. While
k-means and agglomerative clustering allow you to specify the number of clusters, DBSCAN lets you
define proximity using the eps parameter, which indirectly affects the cluster size. All three algorithms
are capable of handling large, real-world datasets, are relatively easy to understand, and support
clustering into multiple groups.

Each algorithm has its own unique strengths. K-means provides a way to characterize clusters by their
centers and can be seen as a decomposition method, representing each data point by its cluster's center.
DBSCAN has the advantage of detecting "noise points" that do not belong to any cluster and can

automatically determine the number of clusters. Unlike the other two methods, DBSCAN can identify
clusters with complex shapes, as demonstrated in the two_moons example. However, DBSCAN can
sometimes create clusters of vastly different sizes, which may be either a benefit or a drawback.
Agglomerative clustering, on the other hand, offers a hierarchical view of possible data partitions, which
can be easily examined using dendrograms.

The second category of machine learning algorithms we will explore is unsupervised learning. In
unsupervised learning, there is no predefined output or supervisor guiding the algorithm. Instead, the
algorithm is given input data and is tasked with independently identifying patterns or extracting insights
from it.

Different Types of Preprocessing

The first plot above shows a synthetic two-class classification dataset with two features. The first feature,
represented along the x-axis, ranges from 10 to 15, while the second feature, shown along the y-axis,
ranges from approximately 1 to 9. The subsequent four plots demonstrate four different methods for
transforming the data to create more standardized ranges. The StandardScaler in scikit-learn
standardizes each feature by adjusting its mean to 0 and its variance to 1, ensuring that all features are
on the same scale. However, this method does not control the specific minimum and maximum values
of the features. The RobustScaler functions similarly to the StandardScaler but uses the median and
quartiles instead of the mean and variance. This makes the RobustScaler less sensitive to outliers (data
points that are significantly different from the rest), which could otherwise affect scaling methods based
on the mean and variance. The MinMaxScaler shifts and scales the data so that all feature values fall

between 0 and 1. In the case of this two-dimensional dataset, the transformation ensures all data points
are within a rectangle bounded by 0 and 1 on both axes. Lastly, the Normalizer scales the data differently:
it adjusts each data point so that the length of its feature vector (its Euclidean norm) equals 1. This
normalization essentially projects the data points onto a unit circle (or sphere in higher dimensions), with
each point scaled by the inverse of its length. This technique is typically used when only the direction (or
angle) of the data matters, rather than its magnitude.

Applying Data Transformations

After reviewing the different types of transformations, let's apply them using scikit-learn. We'll use the
cancer dataset from Chapter 2 for this demonstration. Preprocessing steps like scaling are generally
performed before applying a supervised machine learning algorithm. For instance, if we want to use a
kernel SVM (SVC) on the cancer dataset, we might first apply the MinMaxScaler to preprocess the data.
To do this, we would start by loading the dataset and splitting it into training and test sets. This separation
is important because it allows us to assess the performance of the supervised model on unseen data
after completing the preprocessing steps.
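
A minimal sketch of these steps, using load_breast_cancer (scikit-learn's built-in copy of the cancer dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

scaler = MinMaxScaler()
scaler.fit(X_train)                        # learn min and range from the training set only
X_train_scaled = scaler.transform(X_train)

print("per-feature minimum after scaling:\n{}".format(X_train_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n{}".format(X_train_scaled.max(axis=0)))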

The transformed data has the same shape as the original data—the features are simply shifted and
scaled. You can see that all of the features are now between 0 and 1, as desired. To apply the SVM to the
scaled data, we also need to transform the test set. This is again done by calling the transform method,
this time on X_test:
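
Continuing the sketch above:

# Apply the transformation learned on the training set to the test set
X_test_scaled = scaler.transform(X_test)
print("per-feature minimum after scaling:\n{}".format(X_test_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n{}".format(X_test_scaled.max(axis=0)))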


Maybe somewhat surprisingly, you can see that for the test set, after scaling, the minimum and maximum are not 0 and 1. Some of the features are even outside the 0–1 range! The explanation is that the MinMaxScaler (and all the other scalers) always applies exactly the same transformation to the training and the test set. This means the transform method always subtracts the training set minimum and divides by the training set range, which might be different from the minimum and range for the test set.

Scaling Training and Test Data the Same Way

It is important to apply exactly the same transformation to the training set and the test set for the supervised model to work on the test set. The following example illustrates what would happen if we were to use the minimum and range of the test set instead:

The first panel is an unscaled two-dimensional dataset, with the training set shown as circles and the test
set shown as triangles. The second panel is the same data but scaled using the MinMaxScaler. Here, we

called fit on the training set, and then called transform on the training and test sets. You can see that the
dataset in the second panel looks identical to the first; only the ticks on the axes have changed. Now all
the features are between 0 and 1. You can also see that the minimum and maximum feature values for
the test data (the triangles) are not 0 and 1. The third panel shows what would happen if we scaled the
training set and test set separately. In this case, the minimum and maximum feature values for both the
training and the test set are 0 and 1. But now the dataset looks different. The test points moved
incongruously relative to the training set, as they were scaled differently. We changed the arrangement of the
data in an arbitrary way. Clearly this is not what we want to do. As another way to think about this,
imagine your test set is a single point. There is no way to scale a single point correctly, to fulfill the
minimum and maximum requirements of the MinMaxScaler. But the size of your test set should not
change your processing.

The Effect of Preprocessing on Supervised Learning

Now let’s go back to the cancer dataset and see the effect of using the MinMaxScaler on learning the SVC (this is a different way of doing the same scaling we did in Chapter 2). First, let’s fit the SVC on the original data again for comparison:
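
A sketch of this comparison, reusing the split and scaled arrays from the preprocessing sketch above; the C=100 setting is an illustrative choice.

from sklearn.svm import SVC

svm = SVC(C=100)
svm.fit(X_train, y_train)
print("Test accuracy (raw data): {:.2f}".format(svm.score(X_test, y_test)))

svm.fit(X_train_scaled, y_train)
print("Test accuracy (scaled data): {:.2f}".format(svm.score(X_test_scaled, y_test)))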

As we saw before, the effect of scaling the data is quite significant. Even though scaling the data doesn’t
involve any complicated math, it is good practice to use the scaling mechanisms provided by scikit-learn
instead of reimplementing them yourself, as it’s easy to make mistakes even in these simple
computations. You can also easily replace one preprocessing algorithm with another by changing the

class you use, as all of the preprocessing classes have the same interface, consisting of the fit and
transform methods:
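
For example, switching from MinMaxScaler to StandardScaler only requires changing the class that is instantiated (a sketch reusing X_train and X_test from above):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)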

Non-Negative Matrix Factorization (NMF)

Non-negative matrix factorization is another unsupervised learning algorithm that aims to extract useful
features. It works similarly to PCA and can also be used for dimensionality reduction. As in PCA, we are
trying to write each data point as a weighted sum of some components. But whereas in PCA we wanted
components that were orthogonal and that explained as much variance of the data as possible, in NMF,
we want the components and the coefficients to be non-negative; that is, we want both the components
and the coefficients to be greater than or equal to zero. Consequently, this method can only be applied
to data where each feature is non-negative, as a non-negative sum of non-negative components cannot
become negative.

The process of decomposing data into a non-negative weighted sum is particularly helpful for data that
is created as the addition (or overlay) of several independent sources, such as an audio track of multiple
people speaking, or music with many instruments. In these situations, NMF can identify the original
components that make up the combined data. Overall, NMF leads to more interpretable components
than PCA, as negative components and coefficients can lead to hard-to-interpret cancellation effects. The
eigenfaces, for example, contain both positive and negative parts, and as we mentioned in the
description of PCA, the sign is arbitrary. Before we apply NMF to the face dataset, let’s briefly revisit the
synthetic data.

Applying NMF to synthetic data

In contrast to when using PCA, we need to ensure that our data is positive for NMF to be able to operate on the data. This means where the data lies relative to the origin (0, 0) matters for NMF. Therefore, you can think of the non-negative components that are extracted as directions from (0, 0) toward the data. The following example shows the results of NMF on the two-dimensional toy data:

For NMF with two components, as shown on the left, it is clear that all points in the data can be written
as a positive combination of the two components. If there are enough components to perfectly
reconstruct the data (as many components as there are features), the algorithm will choose directions
that point toward the extremes of the data.

If we only use a single component, NMF creates a component that points toward the mean, as pointing
there best explains the data. You can see that in contrast with PCA, reducing the number of components
not only removes some directions, but creates an entirely different set of components! Components in
NMF are also not ordered in any specific way, so there is no “first non-negative component”: all
components play an equal part.

NMF uses a random initialization, which might lead to different results depending on the random seed. In relatively simple cases such as the synthetic data with two components, where all the data can be explained perfectly, the randomness has little effect (though it might change the order or scale of the components). In more complex situations, there might be more drastic changes.

Applying NMF to face images

Now, let’s apply NMF to the Labeled Faces in the Wild dataset we used earlier. The main parameter of NMF is how many components we want to extract. Usually this is lower than the number of input features (otherwise, the data could be explained by making each pixel a separate component).

First, let’s inspect how the number of components impacts how well the data can be reconstructed using
NMF:

The quality of the back-transformed data is similar to when using PCA, but slightly worse. This is expected,
as PCA finds the optimum directions in terms of reconstruction. NMF is usually not used for its ability to
reconstruct or encode data, but rather for finding interesting patterns within the data. As a first look into
the data, let’s try extracting only a few components (say, 15).

These components are all positive, and so resemble prototypes of faces much more so than the
components shown for PCA. For example, one can clearly see that component 3 shows a face rotated
somewhat to the right, while component 7 shows a face somewhat rotated to the left. Let’s look at the
images for which these components are particularly strong:

As expected, faces that have a high coefficient for component 3 are faces looking to the right, while faces
with a high coefficient for component 7 are looking to the left. As mentioned earlier, extracting patterns
like these works best for data with additive structure, including audio, gene expression, and text data.
Let’s walk through one example on synthetic data to see what this might look like. Let’s say we are
interested in a signal that is a combination of three different sources:
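
The original example is not reproduced here, but a small self-contained sketch of the same idea follows; the three non-negative source signals and the random mixing matrix are illustrative assumptions.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
time = np.linspace(0, 8, 2000)

# Three non-negative source signals
s1 = np.abs(np.sin(2 * time))
s2 = np.abs(np.sign(np.sin(3 * time)))
s3 = np.abs(np.cos(5 * time))
S = np.column_stack([s1, s2, s3])        # shape (2000, 3)

# Mix the sources into 100 observed measurement channels
A = rng.uniform(size=(3, 100))
X = S.dot(A)                             # shape (2000, 100), still non-negative

# NMF tries to recover the three underlying components
nmf = NMF(n_components=3, random_state=42, max_iter=500)
S_recovered = nmf.fit_transform(X)
print("Shape of recovered activations: {}".format(S_recovered.shape))   # (2000, 3)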

Manifold Learning with t-SNE

While PCA is often a good first approach for transforming your data so that you might be able to visualize it using a scatter plot, the nature of the method (applying a rotation and then dropping directions) limits its usefulness, as we saw with the scatter plot of the Labeled Faces in the Wild dataset. There is a class of algorithms for visualization called manifold learning algorithms that allow for much more complex mappings, and often provide better visualizations. A particularly useful one is the t-SNE algorithm.


Manifold learning algorithms are mainly aimed at visualization, and so are rarely used to generate more
than two new features. Some of them, including t-SNE, compute a new representation of the training
data, but don’t allow transformations of new data. This means these algorithms cannot be applied to a
test set: rather, they can only transform the data they were trained for. Manifold learning can be useful
for exploratory data analysis, but is rarely used if
the final goal is supervised learning. The idea behind t-SNE is to find a two-dimensional representation
of the data that preserves the distances between points as best as possible. t-SNE starts with a random
two-dimensional representation for each data point, and then tries to make points that are close in the
original feature space closer, and points that are far apart in the original feature space farther apart. t-
SNE puts more emphasis on points that are close by, rather than preserving distances between far-apart
points. In other words, it tries to preserve the information indicating which points are neighbors to each
other. We will apply the t-SNE manifold learning algorithm on a dataset of handwritten digits that is
included in scikit-learn (not to be confused with the much larger MNIST dataset). Each data point in this
dataset is an 8×8 grayscale image of a handwritten digit between 0 and 9. The figure below shows an
example image for each class:
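
A minimal sketch that loads the digits dataset and displays one example image per class:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
fig, axes = plt.subplots(2, 5, figsize=(10, 5),
                         subplot_kw={'xticks': (), 'yticks': ()})
for ax, img in zip(axes.ravel(), digits.images):
    ax.imshow(img)
plt.show()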

Let’s apply t-SNE to the same dataset, and compare the results. As t-SNE does not support transforming
new data, the TSNE class has no transform method. Instead, we can call the fit_transform method, which
will build the model and immediately return the transformed data:
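
A minimal sketch, continuing with the digits object loaded above (random_state=42 is an arbitrary choice):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne = TSNE(random_state=42)
digits_tsne = tsne.fit_transform(digits.data)    # two-dimensional embedding

plt.scatter(digits_tsne[:, 0], digits_tsne[:, 1], c=digits.target, cmap='tab10', s=10)
plt.xlabel("t-SNE feature 0")
plt.ylabel("t-SNE feature 1")
plt.show()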

The result of t-SNE is quite remarkable. All the classes are quite clearly separated. The ones and nines
are somewhat split up, but most of the classes form a single dense group. Keep in mind that this method
has no knowledge of the class labels: it is completely unsupervised. Still, it can find a representation of
the data in two dimensions that clearly separates the classes, based solely on how close points are in the
original space. The t-SNE algorithm has some tuning parameters, though it often works well with the
default settings. You can try playing with perplexity and early_exaggeration, but the effects are usually
minor.

Clustering

As we described earlier, clustering is the task of partitioning the dataset into groups, called clusters. The
goal is to split up the data in such a way that points within a single cluster are very similar and points in
different clusters are different. Similarly to clas- sification algorithms, clustering algorithms assign (or
predict) a number to each data point, indicating which cluster a particular point belongs to.

131
k-Means Clustering k-means clustering is one of the simplest and most commonly used clustering algo-
rithms. It tries to find cluster centers that are representative of certain regions of the data. The algorithm
alternates between two steps: assigning each data point to the closest cluster center, and then setting
each cluster center as the mean of the data points that are assigned to it. The algorithm is finished when
the assignment of instances to clusters no longer changes. The following example illustrates the
algorithm on a synthetic dataset:

Cluster centers are shown as triangles, while data points are shown as circles. Colors indicate cluster
membership. We specified that we are looking for three clusters, so the algorithm was initialized by
declaring three data points randomly as cluster centers (see “Initialization”). Then the iterative algorithm
starts. First, each data point is assigned to the cluster center it is closest to (see “Assign Points (1)”). Next,
the cluster centers are updated to be the mean of the assigned points (see “Recompute Centers (1)”).
Then the process is repeated two more times. After the third iteration, the assignment of points to cluster
centers remained unchanged, so the algorithm stops. Given new data points, k-means will assign each to
the closest cluster center. The next example shows the boundaries of the cluster centers:

132
Applying k-means with scikit-learn is quite straightforward. Here, we apply it to the synthetic data that
we used for the preceding plots. We instantiate the KMeans class, and set the number of clusters we are
looking for. Then we call the fit method with the data:

During the algorithm, each training data point in X is assigned a cluster label. You can find these labels in
the kmeans.labels_ attribute:

You can see that clustering is somewhat similar to classification, in that each item gets a label. However,
there is no ground truth, and consequently the labels themselves have no a priori meaning. Let’s go back
to the example of clustering face images that we discussed before. It might be that the cluster 3 found
by the algorithm contains only faces of your friend Bela. You can only know that after you look at the
pictures, though, and the number 3 is arbitrary. The only information the algorithm gives you is that all
faces labeled as 3 are similar. For the clustering we just computed on the two-dimensional toy dataset,
that means that we should not assign any significance to the fact that one group was labeled 0 and

133
another one was labeled 1. Running the algorithm again might result in a differ- ent numbering of clusters
because of the random nature of the initialization. Here is a plot of this data again. The cluster centers
are stored in the cluster_centers_ attribute, and we plot them as triangles:

We can also use more or fewer cluster centers

Failure cases of k-means

Even if you know the “right” number of clusters for a given dataset, k-means might not always be able to
recover them. Each cluster is defined solely by its center, which means that each cluster is a convex shape.

134
As a result of this, k-means can only cap- ture relatively simple shapes. k-means also assumes that all
clusters have the same “diameter” in some sense; it always draws the boundary between clusters to be
exactly in the middle between the cluster centers. That can sometimes lead to surprising results:

One might have expected the dense region in the lower left to be the first cluster, the dense region in the
upper right to be the second, and the less dense region in the cen- ter to be the third. Instead, both
cluster 0 and cluster 1 have some points that are far away from all the other points in these clusters that
“reach” toward the center. k-means also assumes that all directions are equally important for each
cluster. The following plot shows a two-dimensional dataset where there are three clearly separated
parts in the data. However, these groups are stretched toward the diagonal. As k-means only considers
the distance to the nearest cluster center, it can’t handle this kind of data:

135
k-means also performs poorly if the clusters have more complex shapes:

136
Vector quantization, or seeing k-means as decomposition Even though k-means is a clustering algorithm,
there are interesting parallels between k-means and the decomposition methods like PCA and NMF that
we discussed ear- lier. You might remember that PCA tries to find directions of maximum variance in the
data, while NMF tries to find additive components, which often correspond to “extremes” or “parts” of
the data. Both methods tried to express the data points as a sum over some components. k-means, on
the other hand, tries to rep- resent each data point using a cluster center. You can think of that as each
point being represented using only a single component, which is given by the cluster center. This view of
k-means as a decomposition method, where each point is represented using a single component, is called
vector quantization.

Let’s do a side-by-side comparison of PCA, NMF, and k-means, showing the components extracted, as
well as reconstructions of faces from the test set using 100 components. For k-means, the reconstruction
is the closest cluster center found on the training set:

137
An interesting aspect of vector quantization using k-means is that we can use many more clusters than
input dimensions to encode our data. Let’s go back to the two_moons data. Using PCA or NMF, there is
nothing much we can do to this data, as it lives in only two dimensions. Reducing it to one dimension
with PCA or NMF would completely destroy the structure of the data. But we can find a more expressive
representation with k-means, by using more cluster centers:

138
We used 10 cluster centers, which means each point is now assigned a number between 0 and 9. We can
see this as the data being represented using 10 components (that is, we have 10 new features), with all
features being 0, apart from the one that represents the cluster center the point is assigned to. Using
this 10-dimensional repre- sentation, it would now be possible to separate the two half-moon shapes
using a lin- ear model, which would not have been possible using the original two features. It is also
possible to get an even more expressive representation of the data by using the distances to each of the
cluster centers as features. This can be accomplished using the transform method of kmeans:

K-means is a widely used clustering algorithm, favored for its simplicity, ease of implementation, and fast
execution. It scales well to large datasets, and scikit-learn offers an even more scalable version called
MiniBatchKMeans, which is capable of handling very large datasets. However, one limitation of k-means

139
is that it depends on a random initialization, meaning the results can vary depending on the random
seed. To address this, scikit-learn runs the algorithm 10 times with different random initializations and
selects the best outcome. Other drawbacks include the algorithm's restrictive assumptions about cluster
shapes and the need to specify the number of clusters beforehand, which may not always be known in
real-world scenarios.

Agglomerative clustering encompasses a group of algorithms that follow similar principles: the algorithm
begins by treating each data point as its own cluster, then iteratively merges the two most similar clusters
until a predefined stopping condition is met. In scikit-learn, this stopping condition is set by the desired
number of clusters, and clusters are merged until that number is reached. The similarity between clusters
is measured using different linkage criteria, each defining how to evaluate the proximity between two
clusters.

Scikit-learn offers three linkage methods:

• Ward: The default option, Ward merges clusters in such a way that the increase in the overall
variance within all clusters is minimized. This typically results in clusters of similar size.

• Average: This method merges clusters with the smallest average distance between all points in
the two clusters.

• Complete: Also known as maximum linkage, this criterion merges clusters based on the smallest
maximum distance between any two points in the clusters.

Ward is suitable for most datasets, and will be used in the examples here. However, if the clusters have
significantly different sizes, such as when one cluster is much larger than the others, average or complete
linkage may produce better results.

The following plot demonstrates the process of agglomerative clustering applied to a two-dimensional
dataset, where the goal is to identify three clusters.

140
Initially, each point is its own cluster. Then, in each step, the two clusters that are closest are merged. In
the first four steps, two single-point clusters are picked and these are joined into two-point clusters. In
step 5, one of the two-point clusters is extended to a third point, and so on. In step 9, there are only
three clusters remain- ing. As we specified that we are looking for three clusters, the algorithm then
stops. Let’s have a look at how agglomerative clustering performs on the simple threecluster data we
used here. Because of the way the algorithm works, agglomerative clustering cannot make predictions
for new data points. Therefore, Agglomerative Clustering has no predict method. To build the model and
get the cluster member- ships on the training set, use the fit_predict method instead.

141
As expected, the algorithm recovers the clustering perfectly. While the scikit-learn implementation of
agglomerative clustering requires you to specify the number of clusters you want the algorithm to find,
agglomerative clustering methods provide some help with choosing the right number, which we will
discuss next.

Hierarchical clustering and dendrograms

Agglomerative clustering produces what is known as a hierarchical clustering. The clustering proceeds
iteratively, and every point makes a journey from being a single point cluster to belonging to some final
cluster. Each intermediate step provides a clustering of the data (with a different number of clusters). It
is sometimes helpful to look at all possible clusterings jointly. The next example shows an overlay of all
the possible clusterings shown before, providing some insight into how each cluster breaks up into
smaller clusters:

While this visualization provides a very detailed view of the hierarchical clustering, it relies on the two-
dimensional nature of the data and therefore cannot be used on datasets that have more than two
features. There is, however, another tool to visualize hierarchical clustering, called a dendrogram, that
can handle multidimensional datasets.

Unfortunately, scikit-learn currently does not have the functionality to draw dendrograms. However, you
can generate them easily using SciPy. The SciPy clustering algorithms have a slightly different interface to
the scikit-learn clustering algorithms. SciPy provides a function that takes a data array X and computes a
linkage array, which encodes hierarchical cluster similarities. We can then feed this linkage array into the
scipy dendrogram function to plot the dendrogram:

142
The dendrogram illustrates the data points as numbered from 0 to 11 at the bottom. A tree structure is
then created with these points, each representing individual clusters, and new parent nodes are added
as pairs of clusters are merged. Starting from the bottom and moving upward, the first merger involves
points 1 and 4. Next, points 6 and 9 form a cluster, and this merging continues. At the top level, there are
two main branches: one includes points 11, 0, 5, 10, 7, 6, and 9, while the other consists of points 1, 4,
3, 2, and 8. These two branches represent the two largest clusters on the left side of the plot.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another powerful clustering
algorithm. Its key advantages are that it does not require specifying the number of clusters beforehand,
it can identify clusters with complex shapes, and it can detect outliers or points that don't belong to any
cluster. Although DBSCAN is slower than both agglomerative clustering and k-means, it can still handle
relatively large datasets.

DBSCAN operates by identifying dense regions in the feature space, where many data points are located
close to each other. These dense regions are considered potential clusters, separated by areas that are
relatively sparse. Points within a dense region are termed core samples (or core points). Two key

parameters control DBSCAN: min_samples and eps. If there are at least min_samples points within a
distance of eps from a given point, it is labeled as a core sample.

The algorithm starts by selecting an arbitrary point and finds all points within a distance of eps. If fewer
than min_samples points are found within this radius, the point is marked as noise, meaning it does not
belong to any cluster. If more than min_samples points are within eps, the point is labeled as a core
sample and assigned to a new cluster. The algorithm then visits all neighboring points within eps. If those
neighbors haven't been assigned a cluster, they are given the same cluster label. If any of the neighbors
are core samples, their neighbors are recursively visited, and the cluster grows until there are no more
core samples within eps. The process repeats with an unvisited point, and the algorithm continues until
all points have been processed.

In the end, there are three kinds of points: core points, points that are within distance eps of core points
(called boundary points), and noise. When the DBSCAN algorithm is run on a particular dataset multiple
times, the clustering of the core points is always the same, and the same points will always be labeled as
noise. However, a boundary point might be neighbor to core samples of more than one cluster. Therefore,
the cluster membership of boundary points depends on the order in which points are visited. Usually
there are only a few boundary points, and this slight dependence on the order of points is not important.
Let’s apply DBSCAN on the synthetic dataset we used to demonstrate agglomerative clustering. Like
agglomerative clustering, DBSCAN does not allow predictions on new test data, so we will use the
fit_predict method to perform clustering and return the cluster labels in one step:
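
A minimal sketch of this step (the small make_blobs dataset is an illustrative choice, not the original listing):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=0, n_samples=12)
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)
print("Cluster memberships:\n{}".format(clusters))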

As you can see, all data points were assigned the label -1, which stands for noise. This is a consequence
of the default parameter settings for eps and min_samples, which are not tuned for small toy datasets.
The cluster assignments for different values of min_samples and eps are shown below, and visualized:

In this plot, points belonging to clusters are represented by solid markers, while noise points are shown
in white. Core samples are indicated by larger markers, and boundary points are represented by smaller
ones. As eps increases (moving from left to right in the figure), the clusters expand to include more
points, but this may also lead to the merging of distinct clusters into one. Conversely,
increasing min_samples (moving from top to bottom) results in fewer points being classified as core
samples, and more points being labeled as noise.

The eps parameter is typically more influential, as it sets the proximity threshold for points to be
considered as part of the same cluster. If eps is too small, no points may qualify as core samples, and all
points could be labeled as noise. On the other hand, if eps is too large, all points may end up in a single
cluster.

The min_samples parameter mainly impacts whether points in less dense areas are treated as outliers or
included in their own clusters. Lowering min_samples causes smaller groups to be labeled as noise. For
instance, when min_samples is set to 3, three clusters are formed: one with four points, one with five
points, and one with three points. However, when min_samples is increased to 5, the smaller clusters
(with three and four points) are considered noise, leaving only the cluster with five points.

Although DBSCAN does not require specifying the number of clusters directly, adjusting eps affects the
number of clusters identified. Finding the optimal eps value can be easier when the data is scaled using
methods like StandardScaler or MinMaxScaler, as these techniques ensure that all features are on a
similar scale. The outcome of running DBSCAN on the two_moons dataset is shown here, where the
algorithm successfully identifies the two half-circles and separates them using the given settings.
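
A sketch of how this might look (the noise level and sample size are illustrative choices):

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

clusters = DBSCAN().fit_predict(X_scaled)      # the default eps=0.5 works on the rescaled data
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.show()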

Comparing and Evaluating Clustering Algorithms

One of the challenges in applying clustering algorithms is that it is very hard to assess how well an
algorithm worked, and to compare outcomes between different algorithms. After talking about the
algorithms behind k-means, agglomerative clustering, and DBSCAN, we will now compare them on some
real-world datasets.

Evaluating clustering with ground truth


There are metrics that can be used to assess the outcome of a clustering algorithm relative to a ground
truth clustering, the most important ones being the adjusted rand index (ARI) and normalized mutual
information (NMI), which both provide a quantitative measure between 0 and 1. Here, we compare the
k-means, agglomerative clustering, and DBSCAN algorithms using ARI. We also include what it looks like
when we randomly assign points to two clusters for comparison:
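
The comparison might be sketched as follows, reusing the scaled two_moons data from above (the random seed and sample size are illustrative):

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# random assignment of points to two clusters, as a baseline
random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))
print("Random assignment ARI: {:.2f}".format(adjusted_rand_score(y, random_clusters)))

for model in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
    clusters = model.fit_predict(X_scaled)
    print("{} ARI: {:.2f}".format(type(model).__name__,
                                  adjusted_rand_score(y, clusters)))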

The adjusted rand index provides intuitive results, with a random cluster assignment having a score of 0
and DBSCAN (which recovers the desired clustering perfectly) having a score of 1.

A common mistake when evaluating clustering in this way is to use accuracy_score instead of
adjusted_rand_score, normalized_mutual_info_score, or some other clustering metric. The problem in
using accuracy is that it requires the assigned cluster labels to exactly match the ground truth. However,
the cluster labels themselves are meaningless—the only thing that matters is which points are in the
same cluster:
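
A small sketch makes the point: the two labelings below group the points identically, yet the accuracy is 0 because the label names differ, while the ARI is 1.

from sklearn.metrics import accuracy_score
from sklearn.metrics.cluster import adjusted_rand_score

# These two labelings are the same clustering with the label names swapped.
clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]

print("Accuracy: {:.2f}".format(accuracy_score(clusters1, clusters2)))   # 0.00
print("ARI: {:.2f}".format(adjusted_rand_score(clusters1, clusters2)))   # 1.00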

Evaluating clustering without ground truth

Although we have just shown one way to evaluate clustering algorithms, in practice, there is a big
problem with using measures like ARI. When applying clustering algorithms, there is usually no ground
truth to which to compare the results. If we knew the right clustering of the data, we could use this
information to build a supervised model like a classifier. Therefore, using metrics like ARI and NMI
usually only helps in developing algorithms, not in assessing success in an application. There are scoring
metrics for clustering that don't require ground truth, like the silhouette coefficient. However, these
often don't work well in practice. The silhouette score
computes the compactness of a cluster, where higher is better, with a perfect score of 1. While compact
clusters are good, compactness doesn’t allow for complex shapes. Here is an example comparing the
outcome of k-means, agglomerative clustering, and DBSCAN on the two-moons dataset using the
silhouette score:
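
A sketch of such a comparison, reusing the scaled two_moons data (the exact scores depend on the random seed):

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics.cluster import silhouette_score
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))
print("Random: {:.2f}".format(silhouette_score(X_scaled, random_clusters)))

for model in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
    clusters = model.fit_predict(X_scaled)
    print("{}: {:.2f}".format(type(model).__name__,
                              silhouette_score(X_scaled, clusters)))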

As observed, k-means achieves the highest silhouette score, even though we may prefer the results
produced by DBSCAN. A more effective approach for evaluating clusters is to use robustness-based
clustering metrics. These metrics involve running an algorithm after adding noise to the data or using
different parameter settings and comparing the results. The idea is that if many algorithm configurations
and variations in the data produce the same outcome, the result is likely reliable. Unfortunately, this
approach is not available in scikit-learn as of this writing.

Even with a very robust clustering or a high silhouette score, we still cannot determine if the clustering
has any semantic meaning or if it reflects an aspect of the data we care about. Returning to the face
image example, we might want to identify groups of similar faces, such as distinguishing between men
and women, young and old, or people with beards versus those without. Suppose we cluster the data
into two groups, and all algorithms agree on which points should be grouped together. However, we still
cannot be certain that the clusters correspond to the concepts we are interested in. The clusters could
have been formed based on factors like side views versus front views, photos taken at night versus during
the day, or pictures captured with different types of phones (iPhones versus Androids). The only way to
determine whether the clustering aligns with our interests is through manual analysis of the clusters.

Comparing algorithms on the faces dataset

Let's apply the k-means, DBSCAN, and agglomerative clustering algorithms to the Labeled Faces in the
Wild dataset, and see if any of them find interesting structure. We will use the eigenface representation
of the data, as produced by PCA(whiten=True), with 100 components:
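
A sketch of how this representation might be built (the subsampling to at most 50 images per person mirrors the earlier faces examples and is an assumption here; the Labeled Faces in the Wild data is downloaded on first use):

import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
image_shape = people.images[0].shape

# keep at most 50 images per person to reduce class imbalance
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask] / 255.0   # scale pixel values to [0, 1]

pca = PCA(n_components=100, whiten=True, random_state=0)
X_pca = pca.fit_transform(X_people)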

We saw earlier that this is a more semantic representation of the face images than the raw pixels. It will
also make computation faster. A good exercise would be for you to run the following experiments on the
original data, without PCA, and see if you find similar clusters.

Analyzing the faces dataset with DBSCAN

We will start by applying DBSCAN, which we just discussed:

We see that all the returned labels are –1, so all of the data was labeled as “noise” by DBSCAN. There are
two things we can change to help this: we can make eps higher, to expand the neighborhood of each
point, and set min_samples lower, to consider smaller groups of points as clusters. Let’s try changing
min_samples first:
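
Continuing the sketch above (X_pca is the eigenface representation built earlier):

import numpy as np
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(min_samples=3)
labels = dbscan.fit_predict(X_pca)
print("Unique labels: {}".format(np.unique(labels)))   # still only -1, i.e. noise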

Even when considering groups of three points, everything is labeled as noise. So, we need to increase
eps:
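
Again as a sketch, now raising eps as well:

dbscan = DBSCAN(min_samples=3, eps=15)
labels = dbscan.fit_predict(X_pca)
print("Unique labels: {}".format(np.unique(labels)))   # one cluster plus noise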

Using a much larger eps of 15, we get only a single cluster and noise points. We can use this result to find
out what the “noise” looks like compared to the rest of the data. To understand better what’s happening,
let’s look at how many points are noise, and how many points are inside the cluster:

Comparing these images to the random sample of face images, we can guess why they were labeled as
noise: the fifth image in the first row shows a person drinking from a glass, there are images of people
wearing hats, and in the last image there’s a hand in front of the person’s face. The other images contain

odd angles or crops that are too close or too wide. This kind of analysis—trying to find “the odd one
out”—is called outlier detection. If this was a real application, we might try to do a better job of cropping
images, to get more homogeneous data. There is little we can do about people in photos sometimes
wearing hats, drinking, or holding something in front of their faces, but it’s good to know that these are
issues in the data that any algorithm we might apply needs to handle. If we want to find more interesting
clusters than just one large one, we need to set eps smaller, somewhere between 15 and 0.5 (the default).
Let’s have a look at what different values of eps result in:

For low settings of eps, all points are labeled as noise. For eps=7, we get many noise points and many
smaller clusters. For eps=9 we still get many noise points, but we get one big cluster and some smaller
clusters. Starting from eps=11, we get only one large cluster and noise. What is interesting to note is that
there is never more than one large cluster. At most, there is one large cluster containing most of the
points, and some smaller clusters. This indicates that there are not two or three different kinds of face
images in the data that are very distinct, but rather that all images are more or less equally similar to (or
dissimilar from) the rest. The results for eps=7 look most interesting, with many small clusters. We can
investigate this clustering in more detail by visualizing all of the points in each of the 13 small clusters:

Some of the clusters correspond to people with very distinct faces (within this dataset), such as Sharon
or Koizumi. Within each cluster, the orientation of the face is also quite fixed, as well as the facial
expression. Some of the clusters contain faces of multiple people, but they share a similar orientation
and expression. This concludes our analysis of the DBSCAN algorithm applied to the faces dataset. As you
can see, we are doing a manual analysis here, different from the much more automatic search approach
we could use for supervised learning based on the R² score or accuracy. Let's move on to applying k-means
and agglomerative clustering.

Analyzing the faces dataset with k-means

We saw that it was not possible to create more than one big cluster using DBSCAN. Agglomerative
clustering and k-means are much more likely to create clusters of even size, but we do need to set a target
number of clusters. We could set the number of clusters to the known number of people in the dataset,
though it is very unlikely that an unsupervised clustering algorithm will recover them. Instead, we can
start with a low number of clusters, like 10, which might allow us to analyze each of the clusters:
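
A sketch of this step, again on the eigenface representation X_pca:

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=10, random_state=0)
labels_km = km.fit_predict(X_pca)
print("Cluster sizes k-means: {}".format(np.bincount(labels_km)))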

As you can see, k-means clustering partitioned the data into relatively similarly sized clusters from 64 to
386. This is quite different from the result of DBSCAN. We can further analyze the outcome of k-means
by visualizing the cluster centers. As we clustered in the representation produced by PCA, we need to
rotate the cluster centers back into the original space to visualize them, using pca.inverse_transform:
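
A sketch of that visualization, assuming the pca object, image_shape, and km from the sketches above:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(12, 4),
                         subplot_kw={'xticks': (), 'yticks': ()})
for center, ax in zip(km.cluster_centers_, axes.ravel()):
    # rotate each 100-dimensional center back to pixel space and show it
    ax.imshow(pca.inverse_transform(center.reshape(1, -1)).reshape(image_shape),
              vmin=0, vmax=1)
plt.show()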

The cluster centers found by k-means are very smooth versions of faces. This is not very surprising, given
that each center is an average of 64 to 386 face images. Working with a reduced PCA representation adds
to the smoothness of the images (compared to the faces reconstructed using 100 PCA dimensions). The
clustering seems to pick up on different orientations of the face, different expressions (the third cluster
center seems to show a smiling face), and the presence of shirt collars (see the second-to-last cluster
center). For a more detailed view, we show for each cluster center the five most typical images in the
cluster (the images assigned to the cluster that are closest to the cluster center) and the five most atypical
images in the cluster (the images assigned to the cluster that are furthest from the cluster center):

Analyzing the faces dataset with agglomerative clustering

Now, let's look at the results of agglomerative clustering:
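
A sketch of this step on the eigenface representation:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

agglomerative = AgglomerativeClustering(n_clusters=10)
labels_agg = agglomerative.fit_predict(X_pca)
print("Cluster sizes agglomerative: {}".format(np.bincount(labels_agg)))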

Agglomerative clustering also produces relatively equally sized clusters, with cluster sizes between 26
and 623. These are more uneven than those produced by k-means, but much more even than the ones
produced by DBSCAN. We can compute the ARI to measure whether the two partitions of the data given
by agglomerative clustering and k-means are similar:
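
A one-line sketch of that comparison, assuming labels_agg and labels_km from above:

from sklearn.metrics.cluster import adjusted_rand_score
print("ARI: {:.2f}".format(adjusted_rand_score(labels_agg, labels_km)))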

An ARI of only 0.13 means that the two clusterings labels_agg and labels_km have little in common. This
is not very surprising, given the fact that points further away from the cluster centers seem to have little
in common for k-means. Next, we might want to plot the dendrogram. We’ll limit the depth of the tree
in the plot, as branching down to the individual 2,063 data points would result in an unreadably dense
plot:
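
A sketch of a depth-limited dendrogram for the faces data, using SciPy as before (the truncation depth of 7 is an illustrative choice):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward

linkage_array = ward(X_pca)
plt.figure(figsize=(20, 5))
dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()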

Creating 10 clusters, we cut across the tree at the very top, where there are 10 vertical lines. In the
dendrogram for the toy data, you could see by the length of the branches that two or three clusters might
capture the data appropriately. For the faces data, there doesn’t seem to be a very natural cutoff point.
There are some branches that represent more distinct groups, but there doesn’t appear to be a particular
number of clusters that is a good fit. This is not surprising, given the results of DBSCAN, which tried to
cluster all points together. Let’s visualize the 10 clusters, as we did for k-means earlier. Note that there is
no notion of cluster center in agglomerative clustering (though we could compute the mean), and we
simply show the first couple of points in each cluster. We show the number of points in each cluster to
the left of the first image:

While some of the clusters seem to have a semantic theme, many of them are too large to be actually
homogeneous. To get more homogeneous clusters, we can run the algorithm again, this time with 40
clusters, and pick out some of the clusters that are particularly interesting:

Summary of Clustering Methods

This section highlighted the fact that clustering is a largely qualitative process, often most useful during
the exploratory phase of data analysis. We explored three clustering algorithms: k-means, DBSCAN, and
agglomerative clustering. Each method offers a way to control the level of granularity in clustering. While
k-means and agglomerative clustering allow you to specify the number of clusters, DBSCAN lets you
define proximity using the eps parameter, which indirectly affects the cluster size. All three algorithms
are capable of handling large, real-world datasets, are relatively easy to understand, and support
clustering into multiple groups.

Each algorithm has its own unique strengths. K-means provides a way to characterize clusters by their
centers and can be seen as a decomposition method, representing each data point by its cluster's center.
DBSCAN has the advantage of detecting "noise points" that do not belong to any cluster and can

automatically determine the number of clusters. Unlike the other two methods, DBSCAN can identify
clusters with complex shapes, as demonstrated in the two_moons example. However, DBSCAN can
sometimes create clusters of vastly different sizes, which may be either a benefit or a drawback.
Agglomerative clustering, on the other hand, offers a hierarchical view of possible data partitions, which
can be easily examined using dendrograms.

Summary and Outlook

This chapter introduced various unsupervised learning algorithms useful for exploratory data analysis
and preprocessing. Having the right data representation is often key to the success of both supervised
and unsupervised learning, with preprocessing and decomposition methods playing a critical role in data
preparation.

Decomposition, manifold learning, and clustering are essential tools for understanding data, especially
when supervision information is absent. Even in supervised settings, exploratory methods are valuable
for gaining insights into the data's properties. While it can be difficult to quantify the usefulness of
unsupervised algorithms, their application can reveal valuable insights from your data. With these tools,
you are now equipped with the fundamental algorithms that machine learning practitioners use
regularly.

We encourage you to experiment with clustering and decomposition methods on both two-dimensional
toy datasets and real-world datasets available in scikit-learn, such as the digits, iris, and cancer datasets.

Here are some exercises for Chapter 3: Unsupervised Learning & Preprocessing.

Exercises:

1. Exploring Clustering Algorithms (K-Means)

• Task: Apply the K-Means clustering algorithm on a dataset like the Iris or Digits dataset from
scikit-learn.

• Steps:

1. Load the dataset and visualize the data (e.g., using PCA for dimensionality reduction to 2D
or 3D).

2. Use K-Means clustering to group the data into clusters.

3. Evaluate the performance by comparing the true labels with the predicted clusters (if
available), or use metrics such as silhouette score or Davies-Bouldin index.

4. Vary the number of clusters (k) and observe how the clustering performance changes.

2. Hierarchical Clustering

• Task: Implement hierarchical clustering using the Iris dataset.

• Steps:

1. Apply Agglomerative Clustering to group the data.

2. Visualize the hierarchical structure using a dendrogram.

3. Determine the optimal number of clusters by cutting the dendrogram at a certain threshold.

4. Compare the results to K-Means clustering, and discuss the advantages and disadvantages
of both methods.

3. DBSCAN for Clustering

• Task: Apply the DBSCAN algorithm to a dataset with noise (e.g., make_blobs or two_moons dataset).

• Steps:

1. Generate a dataset with clusters and noise.

2. Apply DBSCAN and experiment with different values of eps (the maximum distance
between points to be considered as neighbors) and min_samples (the minimum number
of points to form a cluster).

3. Visualize the results and identify the number of clusters and noise points.

4. Compare DBSCAN's results with K-Means, noting how DBSCAN handles noise and
irregular-shaped clusters.

4. Dimensionality Reduction with PCA

• Task: Use Principal Component Analysis (PCA) to reduce the dimensionality of a dataset
(e.g., Wine or Digits dataset).

• Steps:

1. Apply PCA to reduce the dataset to two or three components.

2. Visualize the data in the reduced space (2D or 3D plot).

3. Discuss how much variance is captured by the first few principal components
(use explained_variance_ratio_).

4. Compare the performance of K-Means clustering on the reduced dataset versus the
original high-dimensional dataset.

5. Data Preprocessing: Scaling and Normalization

• Task: Preprocess a dataset by scaling and normalizing features before applying an unsupervised
learning algorithm.

• Steps:

1. Load a dataset (e.g., Breast Cancer, Diabetes dataset, or California Housing).

2. Apply scaling (e.g., StandardScaler or MinMaxScaler).

3. Apply normalization (e.g., using Normalizer or RobustScaler).

4. Perform clustering (e.g., K-Means) on the original dataset and the preprocessed dataset.

5. Evaluate and compare the clustering results. Discuss the impact of scaling and
normalization on clustering performance.

6. Using t-SNE for Visualization

• Task: Apply t-SNE (t-distributed Stochastic Neighbor Embedding) to visualize high-dimensional
data in two or three dimensions.

• Steps:

1. Load a high-dimensional dataset (e.g., Digits or Iris dataset).

2. Apply t-SNE to reduce the dimensionality to two dimensions.

3. Visualize the result and color the points according to their true labels.

4. Discuss how t-SNE preserves the local structure of the data and helps visualize complex
high-dimensional relationships.

7. Clustering with Feature Engineering

• Task: Apply feature engineering techniques to improve clustering results.

• Steps:

1. Load the Iris dataset or any other dataset.

2. Create new features (e.g., interaction terms, polynomial features, or domain-specific features).

3. Apply a clustering algorithm (e.g., K-Means) to the newly engineered features.

4. Compare the results with clustering on the original features and discuss how feature
engineering impacted the clustering.

8. Anomaly Detection Using Isolation Forest

• Task: Use the Isolation Forest algorithm for anomaly detection.

• Steps:

1. Generate a synthetic dataset with outliers (use make_blobs with added noise or another
anomaly-detection dataset).

2. Apply Isolation Forest to detect outliers.

3. Visualize the results and evaluate the performance by checking how well the algorithm
identifies the anomalies.

4. Discuss the advantages and disadvantages of Isolation Forest compared to other anomaly
detection techniques like One-Class SVM or DBSCAN.

9. K-Means Clustering with Elbow Method

• Task: Use the Elbow method to determine the optimal number of clusters for K-Means.

• Steps:

1. Choose a dataset (e.g., Iris or Wine).

2. Apply K-Means clustering for a range of k values (e.g., from 1 to 10).

3. Plot the sum of squared distances (inertia) versus k and identify the "elbow" point.

4. Discuss the results and explain how the elbow method helps in determining the number
of clusters.

10. Gaussian Mixture Model (GMM) for Clustering

• Task: Apply a Gaussian Mixture Model (GMM) to a dataset and compare it with K-Means.

• Steps:

1. Use a dataset with known clusters (e.g., Iris or Digits).

2. Apply GMM to perform soft clustering.

3. Compare the results of GMM with K-Means clustering (e.g., comparing cluster
assignments, silhouette score).

4. Discuss how GMM differs from K-Means in terms of flexibility and the assumptions it
makes about the data.

Conceptual Questions:

1. Clustering Algorithms

• What is the difference between K-Means clustering and DBSCAN? When would you use
one over the other?

• Explain how hierarchical clustering works. What are the advantages and disadvantages of
using this method over K-Means or DBSCAN?

• How does DBSCAN handle noise and outliers differently from K-Means?

2. Dimensionality Reduction

• Why is dimensionality reduction important in unsupervised learning? How does PCA help
in improving the performance of machine learning models?

• What are the limitations of PCA, and how does it compare with t-SNE for visualizing high-
dimensional data?

3. Feature Engineering

• What is feature scaling, and why is it important when applying clustering algorithms?

• Discuss the role of feature selection and dimensionality reduction techniques in unsupervised learning.

4. Anomaly Detection

• What is anomaly detection, and how can algorithms like Isolation Forest or One-Class SVM
be used for this purpose?

• How does anomaly detection relate to unsupervised learning, and what types of
applications would benefit from these methods?

5. Evaluation Metrics for Unsupervised Learning

• How do you evaluate clustering algorithms like K-Means or DBSCAN when you don't have
ground truth labels?

• Explain the concepts of silhouette score, Davies-Bouldin index, and inertia, and how they
are used to evaluate clustering results.

These exercises and questions will help you practice and deepen your understanding of unsupervised
learning techniques, data preprocessing, and clustering algorithms.

6 Chapter Six: Deep Learning

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

• Explain the structure and functioning of artificial neural networks, including neurons, layers, and activation functions.
• Implement basic neural networks, such as perceptrons, using Python.
• Describe the concept of deep neural networks and their applications in solving
complex problems like image recognition and handwriting detection.
• Explain the purpose of activation functions in neural networks.
• Explain the backpropagation algorithm and its role in training neural networks.

6.1 Neural Networks


An artificial neural network (or neural network for short) is a predictive model motivated by the way the
brain operates. Think of the brain as a collection of neurons wired together. Each neuron looks at the
outputs of the other neurons that feed into it, does a calculation, and then either fires (if the calculation
exceeds some threshold) or does not (if it does not). Accordingly, artificial neural networks consist of
artificial neurons, which perform similar calculations over their inputs. Neural networks can solve a wide
variety of problems like handwriting recognition and face detection, and they are used heavily in deep
learning, one of the trendiest subfields of data science. However, most neural networks are “black
boxes”—inspecting their details doesn’t give you much understanding of how they are solving a problem.
And large neural networks can be difficult to train. For most problems you will encounter as a budding
data scientist, they’re probably not the right choice. Someday, when you are trying to build an artificial
intelligence to bring about the Singularity, they very well might be.

6.2 Perceptrons
• The simplest neural network is the perceptron, which approximates a single neuron with n binary
inputs.
• It computes a weighted sum of its inputs and “fires” if that weighted sum is 0 or greater:

from scratch.linear_algebra import Vector, dot

def step_function(x: float) -> float:

return 1.0 if x >= 0 else 0.0

def perceptron_output(weights: Vector, bias: float, x: Vector) -> float:

"""Returns 1 if the perceptron 'fires', 0 if not"""


calculation = dot(weights, x) + bias
return step_function(calculation)
The perceptron is simply distinguishing between the half-spaces separated by the hyperplane of
points x for which:

dot(weights, x) + bias == 0

With properly chosen weights, perceptrons can solve several simple problems. For example, we can
create an AND gate (which returns 1 if both its inputs are 1 but returns 0 if one of its inputs is 0) with:

and_weights = [2., 2]
and_bias = -3.

assert perceptron_output(and_weights, and_bias, [1, 1]) == 1


assert perceptron_output(and_weights, and_bias, [0, 1]) == 0
assert perceptron_output(and_weights, and_bias, [1, 0]) == 0
assert perceptron_output(and_weights, and_bias, [0, 0]) == 0

If both inputs are 1, the calculation equals 2 + 2 – 3 = 1, and the output is 1. If only one of the inputs
is 1, the calculation equals 2 + 0 – 3 = –1, and the output is 0. And if both of the inputs are 0, the
calculation equals –3, and the output is 0.

Using similar reasoning, we could build an OR gate with:

or_weights = [2., 2]
or_bias = -1.

assert perceptron_output(or_weights, or_bias, [1, 1]) == 1


assert perceptron_output(or_weights, or_bias, [0, 1]) == 1
assert perceptron_output(or_weights, or_bias, [1, 0]) == 1
assert perceptron_output(or_weights, or_bias, [0, 0]) == 0

We could also build a NOT gate (which has one input and converts 1 to 0 and 0 to 1) with:

not_weights = [-2.]
not_bias = 1.

assert perceptron_output(not_weights, not_bias, [0]) == 1


assert perceptron_output(not_weights, not_bias, [1]) == 0

However, there are some problems that simply can’t be solved by a single perceptron. For example,
no matter how hard you try, you cannot use a perceptron to build an XOR gate that outputs 1 if
exactly one of its inputs is 1 and 0 otherwise. This is where we start needing more complicated neural
networks. Of course, you don’t need to approximate a neuron to build a logic gate:

and_gate = min
or_gate = max
xor_gate = lambda x, y: 0 if x == y else 1

6.3 Feed-Forward Neural Networks

The topology of the brain is enormously complicated, so it’s common to approximate it with
an idealized feed-forward neural network that consists of discrete layers of neurons, each
connected to the next. This typically entails an input layer (which receives inputs and feeds
them forward unchanged), one or more “hidden layers” (each of which consists of neurons
that take the outputs of the previous layer, performs some calculation, and passes the result
to the next layer), and an output layer (which produces the final outputs).

Just like in the perceptron, each (noninput) neuron has a weight corresponding to each of its
inputs and a bias. To make our representation simpler, we’ll add the bias to the end of our
weights vector and give each neuron a bias input that always equals 1.

As with the perceptron, for each neuron we’ll sum up the products of its inputs and its
weights. But here, rather than outputting the step_function applied to that product, we’ll
output a smooth approximation of it. Here we’ll use the sigmoid function (Figure 4.1):

import math

def sigmoid(t: float) -> float:
    return 1 / (1 + math.exp(-t))

Figure 4.1. The sigmoid function

• Why do we use the sigmoid instead of the simpler step_function?


• In order to train a neural network, we need to use calculus, and in order to use calculus,
we need smooth functions. step_function isn't even continuous, and sigmoid is a good
smooth approximation of it.

We then calculate the output as:

def neuron_output(weights: Vector, inputs: Vector) -> float:


# weights includes the bias term, inputs includes a 1
return sigmoid(dot(weights, inputs))

Given this function, we can represent a neuron simply as a vector of weights whose length is
one more than the number of inputs to that neuron (because of the bias weight). Then we
can represent a neural network as a list of (noninput) layers, where each layer is just a list of
the neurons in that layer.

That is, we will represent a neural network as a list (layers) of lists (neurons) of
vectors (weights).

Given such a representation, using the neural network is quite simple:

from typing import List

def feed_forward(neural_network: List[List[Vector]],

input_vector: Vector) -> List[Vector]:

"""

Feeds the input vector through the neural network.

Returns the outputs of all layers (not just the last one).
"""

outputs: List[Vector] = []

for layer in neural_network:

input_with_bias = input_vector + [1] # Add a constant.

output = [neuron_output(neuron, input_with_bias) # Compute the output

for neuron in layer] # for each neuron.

outputs.append(output) # Add to results.

# Then the input to the next layer is the output of this one

input_vector = output

return outputs

Now it’s easy to build the XOR gate that we couldn’t build with a single perceptron.
We just need to scale the weights up so that the neuron_outputs are either really close
to 0 or really close to 1:

xor_network = [# hidden layer
               [[20., 20, -30],    # 'and' neuron
                [20., 20, -10]],   # 'or' neuron
               # output layer
               [[-60., 60, -30]]]  # '2nd input but not 1st input' neuron

# feed_forward returns the outputs of all layers, so the [-1] gets the
# final output, and the [0] gets the value out of the resulting vector
assert 0.000 < feed_forward(xor_network, [0, 0])[-1][0] < 0.001
assert 0.999 < feed_forward(xor_network, [1, 0])[-1][0] < 1.000
assert 0.999 < feed_forward(xor_network, [0, 1])[-1][0] < 1.000
assert 0.000 < feed_forward(xor_network, [1, 1])[-1][0] < 0.001

For a given input (which is a two-dimensional vector), the hidden layer produces
a two-dimensional vector consisting of the “and” of the two input values and the
“or” of the two input values.

And the output layer takes a two-dimensional vector and computes “second
element but not first element.” The result is a network that performs “or, but not
and,” which is precisely XOR (Figure 4.2).

Figure 4.2. A neural network for XOR

• The hidden layer is computing features of the input data (in this case “and” and “or”)
and the output layer is combining those features in a way that generates the desired
output.

6.4 Backpropagation
Usually, we don’t build neural networks by hand. This is in part because we use them to solve much
bigger problems—an image recognition problem might involve hundreds or thousands of neurons. And
it’s in part because we usually won’t be able to “reason out” what the neurons should be.

Instead (as usual) we use data to train neural networks. The typical approach is an algorithm called
backpropagation, which uses gradient descent or one of its variants.

Imagine we have a training set that consists of input vectors and corresponding target output vectors.
For example, in our previous xor_network example, the input vector [1, 0] corresponded to the target
output [1]. Imagine that our network has some set of weights. We then adjust the weights using the
following algorithm:

1. Run feed_forward on an input vector to produce the outputs of all the neurons in the

network.

2. We know the target output, so we can compute a loss that’s the sum of the squared errors.

3. Compute the gradient of this loss as a function of the output neuron’s weights.

4. “Propagate” the gradients and errors backward to compute the gradients with respect to

the hidden neurons’ weights.

5. Take a gradient descent step.

Typically, we run this algorithm many times for our entire training set until the network converges.

To start with, let’s write the function to compute the gradients:

def sqerror_gradients(network: List[List[Vector]],

input_vector: Vector,
target_vector: Vector) -> List[List[Vector]]:
"""
Given a neural network, an input vector, and a target vector,
make a prediction and compute the gradient of the squared
error loss with respect to the neuron weights.
"""
# forward pass
hidden_outputs, outputs = feed_forward(network, input_vector)

# gradients with respect to output neuron pre-activation outputs


output_deltas = [output * (1 - output) * (output - target)
for output, target in zip(outputs, target_vector)]

# gradients with respect to output neuron weights


output_grads = [[output_deltas[i] * hidden_output
for hidden_output in hidden_outputs + [1]]

for i, output_neuron in enumerate(network[-1])]

# gradients with respect to hidden neuron pre-activation outputs


hidden_deltas = [hidden_output * (1 - hidden_output) *
dot(output_deltas, [n[i] for n in network[-1]])
for i, hidden_output in enumerate(hidden_outputs)]

# gradients with respect to hidden neuron weights


hidden_grads = [[hidden_deltas[i] * input for input in input_vector + [1]]
for i, hidden_neuron in enumerate(network[0])]

return [hidden_grads, output_grads]

The math behind the preceding calculations is not terribly difficult, but it involves some tedious calculus
and careful attention to detail.
Armed with the ability to compute gradients, we can now train neural networks. Let’s try to learn the
XOR network we previously designed by hand. We’ll start by generating the training data and initializing
our neural network with random weights:

import random
random.seed(0)   # so that we get reproducible results

# training data
xs = [[0., 0], [0., 1], [1., 0], [1., 1]]
ys = [[0.], [1.], [1.], [0.]]

# start with random weights
network = [  # hidden layer: 2 inputs -> 2 outputs
    [[random.random() for _ in range(2 + 1)],   # 1st hidden neuron
     [random.random() for _ in range(2 + 1)]],  # 2nd hidden neuron
    # output layer: 2 inputs -> 1 output
    [[random.random() for _ in range(2 + 1)]]   # 1st output neuron
]

As usual, we can train it using gradient descent. One difference from our previous
examples is that here we have several parameter vectors, each with its own
gradient, which means we’ll have to call gradient_step for each of them.

from scratch.gradient_descent import gradient_step
import tqdm

learning_rate = 1.0

for epoch in tqdm.trange(20000, desc="neural net for xor"):
    for x, y in zip(xs, ys):
        gradients = sqerror_gradients(network, x, y)

        # Take a gradient step for each neuron in each layer
        network = [[gradient_step(neuron, grad, -learning_rate)
                    for neuron, grad in zip(layer, layer_grad)]
                   for layer, layer_grad in zip(network, gradients)]

# check that it learned XOR


assert feed_forward(network, [0, 0])[-1][0] < 0.01
assert feed_forward(network, [0, 1])[-1][0] > 0.99
assert feed_forward(network, [1, 0])[-1][0] > 0.99
assert feed_forward(network, [1, 1])[-1][0] < 0.01

For me the resulting network has weights that look like:

[   # hidden layer
    [[7, 7, -3],     # computes OR
     [5, 5, -8]],    # computes AND
    # output layer
    [[11, -12, -5]]  # computes "first but not second"
]

which is conceptually quite similar to our previous bespoke network.

6.5 Tensors
Deep neural networks (DNNs) incorporate multiple hidden layers for solving complex tasks. Key
elements include:

Layer types: Input, hidden, and output layers.

Applications: Image and speech recognition.

To build such networks from scratch, we first need a representation for n-dimensional arrays, which we
will call tensors. Ideally, we would define the type recursively:

# A Tensor is either a float, or a List of Tensors
Tensor = Union[float, List[Tensor]]

However, Python won't let you define recursive types like that, so we will simply cheat and use:

Tensor = list

Even with this definition, nothing stops us from creating bad "tensors" like:

[[1.0, 2.0],
 [3.0]]

whose rows have different sizes, which makes it not an n-dimensional array. And we'll write a helper
function to find a tensor’s shape:

from typing import List

def shape(tensor: Tensor) -> List[int]:
    sizes: List[int] = []
    while isinstance(tensor, list):
        sizes.append(len(tensor))
        tensor = tensor[0]
    return sizes

assert shape([1, 2, 3]) == [3]
assert shape([[1, 2], [3, 4], [5, 6]]) == [3, 2]

Because tensors can have any number of dimensions, we’ll typically need to work with them
recursively. We’ll do one thing in the one-dimensional case and recurse in the higher-
dimensional case:

def is_1d(tensor: Tensor) -> bool:
    """
    If tensor[0] is a list, it's a higher-order tensor.
    Otherwise, tensor is 1-dimensional (that is, a vector).
    """
    return not isinstance(tensor[0], list)

assert is_1d([1, 2, 3])
assert not is_1d([[1, 2], [3, 4]])

which we can use to write a recursive tensor_sum function:

def tensor_sum(tensor: Tensor) -> float:


"""Sums up all the values in the tensor"""

if is_1d(tensor):
return sum(tensor) # just a list of floats, use Python sum
else:
return sum(tensor_sum(tensor_i) # Call tensor_sum on each row
for tensor_i in tensor) # and sum up those results.

assert tensor_sum([1, 2, 3]) == 6


assert tensor_sum([[1, 2], [3, 4]]) == 10

If you’re not used to thinking recursively, you should ponder this until it makes sense, because
we’ll use the same logic throughout this chapter. However, we’ll create a couple of helper
functions so that we don’t have to rewrite this logic everywhere. The first applies a function
elementwise to a single tensor:

from typing import Callable

def tensor_apply(f: Callable[[float], float], tensor: Tensor) -> Tensor:


"""Applies f elementwise"""
if is_1d(tensor):
return [f(x) for x in tensor]
else:
return [tensor_apply(f, tensor_i) for tensor_i in tensor]

assert tensor_apply(lambda x: x + 1, [1, 2, 3]) == [2, 3, 4]
assert tensor_apply(lambda x: 2 * x, [[1, 2], [3, 4]]) == [[2, 4], [6, 8]]

We can use this to write a function that creates a zero tensor with the same shape as a given
tensor:

def zeros_like(tensor: Tensor) -> Tensor:


return tensor_apply(lambda _: 0.0, tensor)

assert zeros_like([1, 2, 3]) == [0, 0, 0]


assert zeros_like([[1, 2], [3, 4]]) == [[0, 0], [0, 0]]

We’ll also need to apply a function to corresponding elements from two tensors (which had
better be the exact same shape, although we won’t check that):

def tensor_combine(f: Callable[[float, float], float],


t1: Tensor,
t2: Tensor) -> Tensor:
"""Applies f to corresponding elements of t1 and t2"""
if is_1d(t1):
return [f(x, y) for x, y in zip(t1, t2)]
else:
return [tensor_combine(f, t1_i, t2_i)
for t1_i, t2_i in zip(t1, t2)]

import operator
assert tensor_combine(operator.add, [1, 2, 3], [4, 5, 6]) == [5, 7, 9]
assert tensor_combine(operator.mul, [1, 2, 3], [4, 5, 6]) == [4, 10, 18]

6.6 Neural Networks as a Sequence of Layers

We’d like to think of neural networks as sequences of layers, so let’s come up with a way to
combine multiple layers into one. The resulting neural network is itself a layer, and it
implements the Layer methods in the obvious ways:

from typing import Iterable, List

class Sequential(Layer):
    """
    A layer consisting of a sequence of other layers.
    It's up to you to make sure that the output of each
    layer makes sense as the input to the next layer.
    """
    def __init__(self, layers: List[Layer]) -> None:
        self.layers = layers

    def forward(self, input):
        """Just forward the input through the layers in order."""
        for layer in self.layers:
            input = layer.forward(input)
        return input

    def backward(self, gradient):
        """Just backpropagate the gradient through the layers in reverse."""
        for layer in reversed(self.layers):
            gradient = layer.backward(gradient)
        return gradient

    def params(self) -> Iterable[Tensor]:
        """Just return the params from each layer."""
        return (param for layer in self.layers for param in layer.params())

    def grads(self) -> Iterable[Tensor]:
        """Just return the grads from each layer."""
        return (grad for layer in self.layers for grad in layer.grads())

So we could represent the neural network we used for XOR as:

xor_net = Sequential([
Linear(input_dim=2, output_dim=2),
Sigmoid(),
Linear(input_dim=2, output_dim=1),
Sigmoid()
])
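
The Sequential class and the xor_net above rely on a Layer base class and on Linear and Sigmoid layers that this guide does not reproduce. The following is only a minimal sketch of what they might look like, consistent with the forward/backward/params/grads interface used in this section; it reuses Tensor, tensor_apply, tensor_combine, sigmoid, and dot from earlier, and uses a deliberately simple random initialization.

import random
from typing import Iterable

class Layer:
    """A layer knows how to push inputs forward and gradients backward."""
    def forward(self, input):
        raise NotImplementedError

    def backward(self, gradient):
        raise NotImplementedError

    def params(self) -> Iterable[Tensor]:
        return ()   # layers without parameters return nothing

    def grads(self) -> Iterable[Tensor]:
        return ()

class Sigmoid(Layer):
    def forward(self, input: Tensor) -> Tensor:
        self.sigmoids = tensor_apply(sigmoid, input)   # save outputs for the backward pass
        return self.sigmoids

    def backward(self, gradient: Tensor) -> Tensor:
        return tensor_combine(lambda sig, grad: sig * (1 - sig) * grad,
                              self.sigmoids, gradient)

class Linear(Layer):
    def __init__(self, input_dim: int, output_dim: int) -> None:
        self.input_dim = input_dim
        self.output_dim = output_dim
        # one weight vector (plus a bias) per output neuron, randomly initialized
        self.w = [[random.random() for _ in range(input_dim)]
                  for _ in range(output_dim)]
        self.b = [random.random() for _ in range(output_dim)]

    def forward(self, input: Tensor) -> Tensor:
        self.input = input   # save the input for the backward pass
        return [dot(input, self.w[o]) + self.b[o] for o in range(self.output_dim)]

    def backward(self, gradient: Tensor) -> Tensor:
        # The gradient of each bias is the corresponding upstream gradient,
        # and the gradient of w[o][i] is input[i] * gradient[o].
        self.b_grad = gradient
        self.w_grad = [[self.input[i] * gradient[o] for i in range(self.input_dim)]
                       for o in range(self.output_dim)]
        # Gradient with respect to the input, to pass further back.
        return [sum(self.w[o][i] * gradient[o] for o in range(self.output_dim))
                for i in range(self.input_dim)]

    def params(self) -> Iterable[Tensor]:
        return [self.w, self.b]

    def grads(self) -> Iterable[Tensor]:
        return [self.w_grad, self.b_grad]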

6.7 Loss and Optimization

Here we’ll want to experiment with different loss functions, so (as usual) we’ll introduce a new
Loss abstraction that encapsulates both the loss computation and the gradient computation:

class Loss:
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        """How good are our predictions? (Larger numbers are worse.)"""
        raise NotImplementedError

    def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
        """How does the loss change as the predictions change?"""
        raise NotImplementedError

We’ve already worked many times with the loss that’s the sum of the squared errors, so we
should have an easy time implementing that. The only trick is that we’ll need to use
tensor_combine:

class SSE(Loss):
    """Loss function that computes the sum of the squared errors."""
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        # Compute the tensor of squared differences
        squared_errors = tensor_combine(
            lambda predicted, actual: (predicted - actual) ** 2,
            predicted,
            actual)

        # And just add them up
        return tensor_sum(squared_errors)

    def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
        return tensor_combine(
            lambda predicted, actual: 2 * (predicted - actual),
            predicted,
            actual)
The last piece to figure out is gradient descent. Until now, we have taken a single gradient descent step with something like:

theta = gradient_step(theta, grad, -learning_rate)

Here that won’t quite work for us, for a couple reasons. The first is that our neural nets will
have many parameters, and we’ll need to update all of them. The second is that we’d like to
be able to use more clever variants of gradient descent, and we don’t want to have to rewrite
them each time. Accordingly, we’ll introduce a (you guessed it) Optimizerabstraction, of which
gradient descent will be a specific instance:

class Optimizer:
"""
An optimizer updates the weights of a layer (in place) using
information known by either the layer or the optimizer (or by both).
"""
def step(self, layer: Layer) -> None:
raise NotImplementedError

After that it’s easy to implement gradient descent, again using tensor_combine:

class GradientDescent(Optimizer):
    def __init__(self, learning_rate: float = 0.1) -> None:
        self.lr = learning_rate

    def step(self, layer: Layer) -> None:
        for param, grad in zip(layer.params(), layer.grads()):
            # Update param using a gradient step
            param[:] = tensor_combine(
                lambda param, grad: param - grad * self.lr,
                param,
                grad)

The only thing that’s maybe surprising is the “slice assignment,” which is a reflection of the
fact that reassigning a list doesn't change its original value. That is, if you just did param =
tensor_combine(...), you would be redefining the local variable param, but you would not
be affecting the original parameter tensor stored in the layer. If you assign to the slice [:],
however, it actually changes the values inside the list.

Here’s a simple example to demonstrate:

tensor = [[1, 2], [3, 4]]

for row in tensor:
    row = [0, 0]
assert tensor == [[1, 2], [3, 4]], "assignment doesn't update a list"

for row in tensor:
    row[:] = [0, 0]
assert tensor == [[0, 0], [0, 0]], "but slice assignment does"

If you are somewhat inexperienced in Python, this behavior may be surprising, so meditate
on it and try examples yourself until it makes sense. To demonstrate the value of this
abstraction, let’s implement another optimizer that uses momentum. The idea is that we
don’t want to overreact to each new gradient, and so we maintain a running average of the
gradients we’ve seen, updating it with each new gradient and taking a step in the direction
of the average:

class Momentum(Optimizer):
    def __init__(self,
                 learning_rate: float,
                 momentum: float = 0.9) -> None:
        self.lr = learning_rate
        self.mo = momentum
        self.updates: List[Tensor] = []   # running average of the gradients

    def step(self, layer: Layer) -> None:
        # If we have no previous updates, start with all zeros
        if not self.updates:
            self.updates = [zeros_like(grad) for grad in layer.grads()]

        for update, param, grad in zip(self.updates,
                                       layer.params(),
                                       layer.grads()):
            # Apply momentum
            update[:] = tensor_combine(
                lambda u, g: self.mo * u + (1 - self.mo) * g,
                update,
                grad)

            # Then take a gradient step
            param[:] = tensor_combine(
                lambda p, u: p - self.lr * u,
                param,
                update)

Because we used an Optimizer abstraction, we can easily switch between our different optimizers.

Example: XOR Revisited

Let’s see how easy it is to use our new framework to train a network that can compute XOR.
We start by re-creating the training data:

# training data
xs = [[0., 0], [0., 1], [1., 0], [1., 1]]
ys = [[0.], [1.], [1.], [0.]]

and then we define the network, although now we can leave off the last sigmoid layer:

random.seed(0)

net = Sequential([
Linear(input_dim=2, output_dim=2),
Sigmoid(),
Linear(input_dim=2, output_dim=1)
])

We can now write a simple training loop, except that now we can use the abstractions of
Optimizer and Loss. This allows us to easily try different ones:

import tqdm

optimizer = GradientDescent(learning_rate=0.1)
loss = SSE()

with tqdm.trange(3000) as t:
    for epoch in t:
        epoch_loss = 0.0

        for x, y in zip(xs, ys):
            predicted = net.forward(x)
            epoch_loss += loss.loss(predicted, y)
            gradient = loss.gradient(predicted, y)
            net.backward(gradient)

            optimizer.step(net)

        t.set_description(f"xor loss {epoch_loss:.3f}")

This should train quickly, and you should see the loss go down. And now we can inspect
the weights:

for param in net.params():
    print(param)

For the network I find roughly:

hidden1 = -2.6 * x1 + -2.7 * x2 + 0.2 # NOR


hidden2 = 2.1 * x1 + 2.1 * x2 - 3.4 # AND
output = -3.1 * h1 + -2.6 * h2 + 1.8 # NOR

So hidden1 activates if neither input is 1. hidden2 activates if both inputs are 1. And output
activates if neither hidden output is 1—that is, if it’s not the case that neither input is 1 and
it’s also not the case that both inputs are 1. Indeed, this is exactly the logic of XOR.

6.8 Activation Functions

The sigmoid function has fallen out of favor for a couple of reasons. One reason is that
sigmoid(0) equals 1/2, which means that a neuron whose inputs sum to 0 has a positive
output. Another is that its gradient is very close to 0 for very large and very small inputs,
which means that its gradients can get “saturated”, and its weights can get stuck.

One popular replacement is tanh (“hyperbolic tangent”), which is a different sigmoid-shaped


function that ranges from –1 to 1 and outputs 0 if its input is 0. The derivative of tanh(x) is
just 1 - tanh(x) ** 2, which makes the layer easy to write:

import math

def tanh(x: float) -> float:
    # If x is very large or very small, tanh is (essentially) 1 or -1.
    # We check for this because, e.g., math.exp(1000) raises an error.
    if x < -100:
        return -1
    elif x > 100:
        return 1

    em2x = math.exp(-2 * x)
    return (1 - em2x) / (1 + em2x)

class Tanh(Layer):
    def forward(self, input: Tensor) -> Tensor:
        # Save tanh output to use in backward pass.
        self.tanh = tensor_apply(tanh, input)
        return self.tanh

    def backward(self, gradient: Tensor) -> Tensor:
        return tensor_combine(
            lambda tanh, grad: (1 - tanh ** 2) * grad,
            self.tanh,
            gradient)
In larger networks another popular replacement is Relu, which is 0 for negative inputs and the identity
for positive inputs:

class Relu(Layer):
    def forward(self, input: Tensor) -> Tensor:
        self.input = input   # save the input for the backward pass
        return tensor_apply(lambda x: max(x, 0), input)

    def backward(self, gradient: Tensor) -> Tensor:
        return tensor_combine(lambda x, grad: grad if x > 0 else 0,
                              self.input, gradient)

6.9 Softmaxes and Cross-Entropy

A neural network can output a vector that is entirely 0s, or one that is entirely 1s. Yet when
we're doing classification problems, we'd like to output a 1 for the
correct class and a 0 for all the incorrect classes. Generally, our predictions will not be so
perfect, but we’d at least like to predict an actual probability distribution over the classes. For
example, if we have two classes, and our model outputs [0, 0], it’s hard to make much sense
of that. It doesn’t think the output belongs in either class.

But if our model outputs [0.4, 0.6], we can interpret it as a prediction that there’s a probability
of 0.4 that our input belongs to the first class and 0.6 that our input belongs to the second
class. In order to accomplish this, we typically forgo the final Sigmoid layer and instead use
the softmax function, which converts a vector of real numbers to a vector of probabilities.
We compute exp(x) for each number in the vector, which results in a vector of positive
numbers. After that, we just divide each of those positive numbers by the sum, which gives
us a bunch of positive numbers that add up to 1—that is, a vector of probabilities. If we ever
end up trying to compute, say, exp(1000) we will get a Python error, so before taking the exp
we subtract off the largest value. This turns out to result in the same probabilities; it’s just
safer to compute in Python:

def softmax(tensor: Tensor) -> Tensor:
    """Softmax along the last dimension"""
    if is_1d(tensor):
        # Subtract largest value for numerical stability.
        largest = max(tensor)
        exps = [math.exp(x - largest) for x in tensor]

        sum_of_exps = sum(exps)        # This is the total "weight."
        return [exp_i / sum_of_exps    # Probability is the fraction
                for exp_i in exps]     # of the total weight.
    else:
        return [softmax(tensor_i) for tensor_i in tensor]

Once our network produces probabilities, we often use a different loss function called
cross-entropy (or sometimes “negative log likelihood”). If our network outputs are probabilities, the cross-
entropy loss represents the negative log likelihood of the observed data, which means that minimizing
that loss is the same as maximizing the log likelihood (and hence the likelihood) of the training data.
Typically, we won’t include the softmax function as part of the neural network itself. This is because it

turns out that if softmax is part of your loss function but not part of the network itself, the gradients of
the loss with respect to the network outputs are very easy to compute.

class SoftmaxCrossEntropy(Loss):
    """
    This is the negative-log-likelihood of the observed values, given the
    neural net model. So if we choose weights to minimize it, our model
    will be maximizing the likelihood of the observed data.
    """
    def loss(self, predicted: Tensor, actual: Tensor) -> float:
        # Apply softmax to get probabilities
        probabilities = softmax(predicted)

        # This will be log p_i for the actual class i and 0 for the other
        # classes. We add a tiny amount to p to avoid taking log(0).
        likelihoods = tensor_combine(lambda p, act: math.log(p + 1e-30) * act,
                                     probabilities,
                                     actual)

        # And then we just sum up the negatives.
        return -tensor_sum(likelihoods)

    def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
        probabilities = softmax(predicted)

        # Isn't this a pleasant equation?
        return tensor_combine(lambda p, actual: p - actual,
                              probabilities, actual)

6.10 Dropout

Like most machine learning models, neural networks are prone to overfitting to their training

data. Regularization can be used to penalize large weights, which helps prevent overfitting. A
common way of regularizing neural networks is dropout. At training time,
we randomly turn off each neuron (that is, replace its output with 0) with some fixed
probability. This means that the network can’t learn to depend on any individual neuron,
which seems to help with overfitting.

At evaluation time, we don’t want to dropout any neurons, so a Dropout layer will need to
know whether it’s training or not. In addition, at training time a Dropout layer only passes on
some random fraction of its input. To make its output comparable during evaluation, we’ll
scale down the outputs (uniformly) using that same fraction:

class Dropout(Layer):
    def __init__(self, p: float) -> None:
        self.p = p
        self.train = True

    def forward(self, input: Tensor) -> Tensor:
        if self.train:
            # Create a mask of 0s and 1s shaped like the input
            # using the specified probability.
            self.mask = tensor_apply(
                lambda _: 0 if random.random() < self.p else 1,
                input)
            # Multiply by the mask to dropout inputs.
            return tensor_combine(operator.mul, input, self.mask)
        else:
            # During evaluation just scale down the outputs uniformly.
            return tensor_apply(lambda x: x * (1 - self.p), input)

    def backward(self, gradient: Tensor) -> Tensor:
        if self.train:
            # Only propagate the gradients where mask == 1.
            return tensor_combine(operator.mul, gradient, self.mask)
        else:
            raise RuntimeError("don't call backward when not in train mode")

We’ll use this to help prevent our deep learning models from overfitting.

Example: MNIST
MNIST is a dataset of handwritten digits that everyone uses to learn deep learning. It is
available in a somewhat tricky binary format, so we’ll install the mnist library to work with it.
(Yes, this part is technically not “from scratch.”)

python -m pip install mnist

And then we can load the data:

import mnist

# This will download the data; change this to where you want it.
# (Yes, it's a 0-argument function, that's what the library expects.)
# (Yes, I'm assigning a lambda to a variable, like I said never to do.)
mnist.temporary_dir = lambda: '/tmp'

# Each of these functions first downloads the data and returns a numpy array.
# We call .tolist() because our "tensors" are just lists.
train_images = mnist.train_images().tolist()
train_labels = mnist.train_labels().tolist()

assert shape(train_images) == [60000, 28, 28]


assert shape(train_labels) == [60000]

Let’s plot the first 100 training images to see what they look like:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(10, 10)

for i in range(10):
    for j in range(10):
        # Plot each image in black and white and hide the axes.
        ax[i][j].imshow(train_images[10 * i + j], cmap='Greys')
        ax[i][j].xaxis.set_visible(False)
        ax[i][j].yaxis.set_visible(False)

plt.show()

MNIST images

We also need to load the test images:

test_images = mnist.test_images().tolist()
test_labels = mnist.test_labels().tolist()

assert shape(test_images) == [10000, 28, 28]


assert shape(test_labels) == [10000]

Each image is 28 × 28 pixels, but our linear layers can only deal with one-dimensional inputs, so we'll just flatten them (and also divide by 256 to get them between 0 and 1). In addition, our neural net will train better if our inputs are 0 on average, so we'll subtract out the average value:

# Compute the average pixel value


avg = tensor_sum(train_images) / 60000 / 28 / 28

# Recenter, rescale, and flatten


train_images = [[(pixel - avg) / 256 for row in image for pixel in row]

for image in train_images]


test_images = [[(pixel - avg) / 256 for row in image for pixel in row]
for image in test_images]

assert shape(train_images) == [60000, 784], "images should be flattened"


assert shape(test_images) == [10000, 784], "images should be flattened"

# After centering, average pixel should be very close to 0


assert -0.0001 < tensor_sum(train_images) < 0.0001

We also want to one-hot-encode the targets, since we have 10 outputs. First let’s write a
one_hot_encode function:

def one_hot_encode(i: int, num_labels: int = 10) -> List[float]:


return [1.0 if j == i else 0.0 for j in range(num_labels)]

assert one_hot_encode(3) == [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]


assert one_hot_encode(2, num_labels=5) == [0, 0, 1, 0, 0]

and then apply it to our data:

train_labels = [one_hot_encode(label) for label in train_labels]


test_labels = [one_hot_encode(label) for label in test_labels]

assert shape(train_labels) == [60000, 10]


assert shape(test_labels) == [10000, 10]

One of the strengths of our abstractions is that we can use the same training/evaluation loop with a variety of models. Let's write that first. We'll pass it our model, the data, a loss function, and (if we're training) an optimizer. It will make a pass through our data, track performance, and (if we passed in an optimizer) update our parameters:

import tqdm

def loop(model: Layer,
         images: List[Tensor],
         labels: List[Tensor],
         loss: Loss,
         optimizer: Optimizer = None) -> None:
    correct = 0         # Track number of correct predictions.
    total_loss = 0.0    # Track total loss.

    with tqdm.trange(len(images)) as t:
        for i in t:
            predicted = model.forward(images[i])            # Predict.
            if argmax(predicted) == argmax(labels[i]):      # Check for
                correct += 1                                # correctness.
            total_loss += loss.loss(predicted, labels[i])   # Compute loss.

            # If we're training, backpropagate gradient and update weights.
            if optimizer is not None:
                gradient = loss.gradient(predicted, labels[i])
                model.backward(gradient)
                optimizer.step(model)

            # And update our metrics in the progress bar.
            avg_loss = total_loss / (i + 1)
            acc = correct / (i + 1)
            t.set_description(f"mnist loss: {avg_loss:.3f} acc: {acc:.3f}")

As a baseline, we can use our deep learning library to train a (multiclass) logistic regression
model, which is just a single linear layer followed by a softmax. This model (in essence) just
looks for 10 linear functions such that if the input represents, say, a 5, then the 5th linear
function produces the largest output. One pass through our 60,000 training examples should
be enough to learn the model:

random.seed(0)

# Logistic regression is just a linear layer followed by softmax


model = Linear(784, 10)
loss = SoftmaxCrossEntropy()

# This optimizer seems to work


optimizer = Momentum(learning_rate=0.01, momentum=0.99)

# Train on the training data


loop(model, train_images, train_labels, loss, optimizer)

# Test on the test data (no optimizer means just evaluate)


loop(model, test_images, test_labels, loss)

This gets about 89% accuracy. Let's see if we can do better with a deep neural network. We'll use two hidden layers, the first with 30 neurons and the second with 10. And we'll use our Tanh activation:

random.seed(0)

# Name them so we can turn train on and off
dropout1 = Dropout(0.1)
dropout2 = Dropout(0.1)

model = Sequential([
    Linear(784, 30),   # Hidden layer 1: size 30
    dropout1,
    Tanh(),
    Linear(30, 10),    # Hidden layer 2: size 10
    dropout2,
    Tanh(),
    Linear(10, 10)     # Output layer: size 10
])

And we can just use the same training loop!

optimizer = Momentum(learning_rate=0.01, momentum=0.99)
loss = SoftmaxCrossEntropy()

# Enable dropout and train (takes > 20 minutes on my laptop!)
dropout1.train = dropout2.train = True
loop(model, train_images, train_labels, loss, optimizer)

# Disable dropout and evaluate
dropout1.train = dropout2.train = False
loop(model, test_images, test_labels, loss)

Our deep model gets better than 92% accuracy on the test set, a nice improvement over the simple logistic model. The MNIST website describes a variety of models that outperform these. Many of them could be implemented with the machinery we've developed so far, but they would take an extremely long time to train in our lists-as-tensors framework. Some of the best models involve convolutional layers, which are important but unfortunately beyond the scope of this introductory guide.

Saving and Loading Models

These models take a long time to train, so it would be nice if we could save them so that we
don’t have to train them every time. Luckily, we can use the json module to easily serialize
model weights to a file.

For saving, we can use model.params() to collect the weights, stick them in a list, and use json.dump to save that list to a file:

import json

def save_weights(model: Layer, filename: str) -> None:
    weights = list(model.params())
    with open(filename, 'w') as f:
        json.dump(weights, f)

Loading the weights back is only a little more work: we use json.load to get the list of weights back from the file, and slice assignment to set the weights of our model, as in the sketch below.
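A minimal load_weights sketch along these lines (it reuses the shape helper and the model.params() iterator from our library, plus json imported above) might be:

def load_weights(model: Layer, filename: str) -> None:
    with open(filename) as f:
        weights = json.load(f)

    # Check that the saved weights match the shapes of the model's parameters.
    assert all(shape(param) == shape(weight)
               for param, weight in zip(model.params(), weights))

    # Then load using slice assignment, updating each parameter tensor in place.
    for param, weight in zip(model.params(), weights):
        param[:] = weight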

6.11 Review Questions


1. What are the applications of deep learning?

2. Describe backpropagation in neural networks in detail.

3. What is the role of activation functions in a neural network?

4. What is gradient descent?

5. Describe the layers in a neural network.

6. What is the difference between a feedforward neural network and a recurrent neural network?

7. What is a multilayer perceptron (MLP)?

8. What are the differences between the Softmax and ReLU functions?

9. Using an example, describe epoch, batch, iteration, and vanishing and exploding gradients in deep learning.

7 Chapter Seven: Large Language Models

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

• Explain the architecture and fundamental principles of Large Language Models


(LLMs).
• Understand the pre-training and fine-tuning processes of LLMs.
• Implement LLMs using modern frameworks such as TensorFlow and PyTorch.
• Explore advanced techniques like transfer learning, prompt engineering, and
retrieval-augmented generation (RAG).
• Evaluate LLM performance using standard NLP benchmarks.
• Analyze ethical considerations and mitigation strategies in deploying LLMs.

7.1 Introduction to Large Language Models


Large Language Models (LLMs) are a class of deep learning models designed to process, understand, and
generate human-like text. These models, built using neural network architectures like transformers,
leverage massive datasets and billions of parameters to achieve state-of-the-art results in natural
language processing (NLP).

The evolution of LLMs has led to breakthroughs in tasks such as machine translation, code generation, conversational AI, and document summarization. The development of models like OpenAI’s GPT (Generative Pre-trained Transformer), Google’s BERT (Bidirectional Encoder Representations from Transformers), and Meta’s LLaMA (Large Language Model Meta AI) has significantly transformed how machines interact with human language.

7.1.1 Evolution of LLMs

The field of large language models has rapidly evolved through the following milestones:

✓ Word Embeddings (2013–2017) – Introduction of word vector representations like Word2Vec,


GloVe, and FastText.

✓ Contextualized Representations (2018–2019) – BERT introduced bidirectional context


understanding, improving NLP tasks.

✓ Autoregressive Models (2019–2021) – GPT series pioneered large-scale unsupervised pre-


training and fine-tuning.

✓ Scaling Laws (2021–Present) – Models like GPT-4 and Claude-2 demonstrated that increasing data
and compute leads to emergent capabilities.

7.2 The Transformer Architecture


LLMs rely on the Transformer architecture, a deep learning model that replaces recurrent neural networks (RNNs) with self-attention mechanisms. Introduced by Vaswani et al. (2017), the Transformer marked a paradigm shift in processing sequential data: unlike traditional RNNs, which process tokens one at a time, Transformers use self-attention to process entire sequences in parallel, improving efficiency and capturing long-range dependencies more effectively. The key components of Transformers include:

7.2.1 Self-Attention Mechanism

The self-attention mechanism allows the model to weigh the importance of different words in a sentence when generating representations. It is defined mathematically as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q, K, and V are the query, key, and value matrices derived from the input, and d_k is the dimensionality of the keys.
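A minimal sketch of this equation in PyTorch might look as follows (the tensor sizes in the toy example are illustrative placeholders, not taken from any particular model):

import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v have shape (batch, seq_len, d_k); scores compare every query
    # against every key.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # each query's weights sum to 1
    return weights @ v                        # weighted sum of the values

# Toy example: batch of 1, sequence of 4 tokens, key dimension 8.
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([1, 4, 8])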
7.2.2 Multi-Head Attention

Multi-head attention enhances the model's ability to capture different contextual relationships by
applying multiple attention heads in parallel.

import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, x):
        # Self-attention: the same sequence supplies queries, keys, and values.
        attn_output, _ = self.attention(x, x, x)
        return attn_output

This PyTorch implementation demonstrates a basic multi-head attention mechanism.
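As a quick, illustrative check (the sizes here are arbitrary), note that nn.MultiheadAttention by default expects inputs shaped (seq_len, batch, embed_dim):

mha = MultiHeadAttention(embed_dim=16, num_heads=4)
x = torch.randn(10, 2, 16)   # (seq_len=10, batch=2, embed_dim=16)
print(mha(x).shape)          # torch.Size([10, 2, 16])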

7.3 Training Large Language Models


Training LLMs requires vast computational resources. The training process consists of:

1. Pre-training – The model learns general linguistic patterns from massive datasets through self-supervised learning, typically by predicting the next token (a minimal sketch follows this list).

2. Fine-tuning – The model is refined on specific datasets for targeted applications such as legal
document analysis or medical diagnosis.

3. Instruction Tuning – LLMs like GPT-4 are fine-tuned using human-annotated instruction-
following data to improve their ability to follow complex prompts.
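To make the pre-training step above concrete, here is a minimal, illustrative sketch of the next-token prediction objective in PyTorch. The embedding layer stands in for a full Transformer stack, and the vocabulary size, embedding size, and token IDs are toy placeholders:

import torch
from torch import nn

vocab_size, embed_dim = 100, 32
embedding = nn.Embedding(vocab_size, embed_dim)    # stand-in for a full Transformer
lm_head = nn.Linear(embed_dim, vocab_size)         # maps hidden states to vocabulary scores

token_ids = torch.randint(0, vocab_size, (1, 16))  # a toy "sentence" of 16 token IDs
hidden = embedding(token_ids)
logits = lm_head(hidden)

# Shift by one position so that position t is trained to predict token t + 1,
# then average the cross-entropy over all positions.
loss = nn.functional.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    token_ids[:, 1:].reshape(-1))
loss.backward()   # gradients for embedding and lm_head, as in real pre-training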

7.4 Implementing LLMs


Hugging Face’s transformers library simplifies working with pre-trained LLMs. Below is an example of using GPT-2, a freely available GPT-style model, for text generation:

from transformers import pipeline

# Load GPT-2 model for text generation

generator = pipeline("text-generation", model="gpt2")

# Generate text

text = generator("The future of artificial intelligence is", max_length=50)

print(text)

Output

[{'generated_text': 'The future of artificial intelligence is in its infancy, but a growing number of
researchers are trying to bring those improvements to the Internet.\n\nMany projects are using AI
systems to solve complex algorithmic problems of the past, like computer vision and neural networks'}]

This script demonstrates how to generate text using an LLM with minimal setup.

7.4.1 Fine-Tuning BERT for Sentiment Analysis

Fine-tuning a model like BERT for sentiment classification can be achieved using the Hugging Face Trainer
API:

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments


from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Load tokenizer and model


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenization function
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Training configuration
training_args = TrainingArguments(output_dir="./results", num_train_epochs=3,
per_device_train_batch_size=8)

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_datasets["train"])

trainer.train()

7.5 Advanced Techniques in LLMs

7.5.1 Prompt Engineering

Prompt engineering is the practice of crafting input prompts to optimize model output.

Example (reusing the generator pipeline from Section 7.4; full_article_text is a placeholder for your own document):

prompt = "Summarize the following article in three sentences:\n\n" + full_article_text

summary = generator(prompt, max_length=100)

7.5.2 Retrieval-Augmented Generation (RAG)

RAG enhances LLM performance by integrating external knowledge retrieval, reducing hallucination
risks.
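A toy sketch of this idea is shown below: retrieve the passage most similar to the question, then ask the generator to answer with that passage as context. It assumes scikit-learn is installed and reuses the generator pipeline from Section 7.4; the documents and question are made-up placeholders. A production RAG system would use dense embeddings and a vector database rather than TF-IDF.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The Transformer architecture was introduced in 2017.",
    "MNIST is a dataset of handwritten digits.",
    "Dropout randomly disables neurons during training.",
]
question = "When was the Transformer architecture introduced?"

# Embed the documents and the question in the same TF-IDF space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([question])

# Retrieve the document most similar to the question.
best_doc = documents[cosine_similarity(query_vector, doc_vectors).argmax()]

# Ground the model's answer in the retrieved passage.
prompt = f"Context: {best_doc}\nQuestion: {question}\nAnswer:"
answer = generator(prompt, max_length=60)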

7.6 Evaluating LLM Performance


Evaluating LLMs involves benchmarking using standard NLP metrics:

✓ Perplexity (PPL) – Measures the model's uncertainty in predicting the next token; lower is better (a small computation sketch follows this list).

✓ BLEU and ROUGE – Compare generated text with reference outputs for translation and
summarization tasks.

✓ GLUE and SuperGLUE – NLP benchmarks for model performance across multiple tasks.
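As an example of the first metric, a rough perplexity computation for GPT-2 on a single sentence might look like the sketch below (it assumes the transformers and torch packages are installed; the sentence is arbitrary):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The future of artificial intelligence is promising."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average next-token
    # cross-entropy loss; perplexity is its exponential.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(torch.exp(outputs.loss).item())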

The Hugging Face pipeline API can be used to produce outputs for such comparisons:

from transformers import pipeline

# Load a summarization pipeline whose outputs we want to evaluate
eval_pipeline = pipeline("summarization")

# Generate a summary; in practice it would then be scored against a
# human-written reference using a metric such as ROUGE
generated_summary = eval_pipeline("A long paragraph of text...", max_length=50)
print(generated_summary)
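To score such an output against a human-written reference, we could compute ROUGE with the Hugging Face evaluate package. The sketch below assumes the evaluate and rouge_score packages are installed; both texts are placeholders:

import evaluate

rouge = evaluate.load("rouge")

# Compare the generated summary against a reference summary.
scores = rouge.compute(
    predictions=["the generated summary text"],
    references=["the human-written reference summary"])
print(scores)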

7.7 Ethical Considerations and Bias in LLMs


Despite their capabilities, LLMs present ethical challenges:

✓ Bias and Fairness – Models may perpetuate societal biases present in training data.

✓ Misinformation – LLMs can generate plausible but false information.

✓ Data Privacy – Sensitive user data may be unintentionally memorized.

✓ Environmental Impact – Training LLMs consumes significant energy.

Mitigation Strategies:

✓ Fairness-aware training (e.g., using debiasing algorithms).

✓ Human-in-the-loop oversight for model outputs.

✓ Implementing explainability techniques to understand model decisions.

7.8 Exercises

Exercise 1: Implement a Transformer Encoder

Write a function in PyTorch to implement a Transformer encoder layer.

Exercise 2: Fine-Tune an LLM for Question Answering

Use the Hugging Face transformers library to fine-tune a model for question answering on the SQuAD dataset.

Exercise 3: Ethical Considerations Debate

Write an essay analyzing bias in LLMs and propose solutions for responsible AI deployment.

7.9 Summary
This chapter explored the principles, architecture, training, and implementation of Large Language
Models. We covered transformers, fine-tuning techniques, prompt engineering, retrieval-augmented
generation, and ethical considerations. Understanding LLMs is essential for leveraging their capabilities
responsibly in real-world applications.

8 References
Grus, J. (2019) Data Science from Scratch: First Principles with Python. 2nd edn. O’Reilly Media. ISBN:
9781492041139.

Mueller, A.C. and Guido, S. (2017) Introduction to Machine Learning with Python. United Kingdom:
O’Reilly Media. ISBN: 9781449369415.

Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X. and Gao, J. (2024) 'Large Language Models: A Survey', arXiv preprint, arXiv:2402.06196. Available at: https://arxiv.org/abs/2402.06196 (Accessed: 2 March 2025).

Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B. and Xiong, D. (2023) 'Evaluating Large Language Models: A Comprehensive Survey', arXiv preprint, arXiv:2310.19736. Available at: https://arxiv.org/abs/2310.19736 (Accessed: 2 March 2025).

Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J.Y. and Wen, J.R. (2023) 'A Survey of Large Language Models', arXiv preprint, arXiv:2303.18223. Available at: https://arxiv.org/abs/2303.18223 (Accessed: 2 March 2025).

Moradi, M., Yan, K., Colwell, D., Samwald, M. and Asgari, R. (2024) 'Exploring the Landscape of Large Language Models: Foundations, Techniques, and Challenges', arXiv preprint, arXiv:2404.11973. Available at: https://arxiv.org/abs/2404.11973 (Accessed: 2 March 2025).

The IT qualification at Richfield College stands as a beacon of academic innovation and professional
readiness. It equips students with the skills and credentials necessary for thriving in the IT industry. By
combining foundational knowledge, practical expertise, and global recognition, the program not only
prepares students for immediate employment but also sets them on a trajectory for long-term career
success.
