Reading Notes
COMPUTER APPLICATIONS
DSC-MAJOR PAPER
SEMESTER-V
MACHINE LEARNING
All files are for limited circulation, for students to read only. There is no intention to
interfere with or alter the content of this reading reference material for any kind of
personal or professional use. The University and the faculty do not claim authorship
of any of the readings, website links, or videos. The PDFs and videos provided have
been taken from open educational resources which are freely available.
Guidelines of [Link]. (H) Computer Science VI Semester/ B.A. Programme V
Semester/ Generic Elective VII Semester (NEP UGCF 2022)
MACHINE LEARNING
DSC17/DSC-A5/GE7c
2. Unit 2: Preprocessing (6 lectures)
Feature scaling, feature selection methods, dimensionality reduction (Principal Component Analysis).
Readings: Chapter 6 (6.1.1, 6.1.2) [2]; Chapter 10 (10.2) [2]
3. Unit 3: Regression (12 lectures)
Linear regression with one variable, linear regression with multiple variables, gradient descent, over-fitting, regularization. Regression evaluation metrics.
Readings: Chapter 3 (3.1, 3.2) [2]; Chapter 6 (6.2.1) [2]
4. Unit 4: Classification (15 lectures)
Decision trees: Chapter 3 (3.1, 3.2, 3.3, 3.4) [1]
Naive Bayes classifier: Chapter 6 (6.1, 6.2, 6.7, 6.9) [1]
Logistic regression: Chapter 4 (4.3.1, 4.3.2, 4.3.3, 4.3.4) [2]
k-nearest neighbor classifier: Chapter 8 (8.1, 8.2) [2]
Perceptron, multilayer perceptron, neural networks: Chapter 10 (10.1, 10.2, 10.7) [2]
Support Vector Machine (SVM): Chapter 9 (9.1, 9.2, 9.3, 9.4) [2]
Classification evaluation metrics: Chapter 5 (5.1) [3]; Chapter 19 (19.7)
5. Unit 5: Clustering (7 lectures)
Approaches for clustering, distance metrics, K-means clustering, hierarchical clustering.
Readings: Chapter 12 (12.4.1, 12.4.2) [2]
Essential/recommended readings
Additional References
1. Flach, P., Machine Learning: The Art and Science of Algorithms that Make Sense
of Data, Cambridge University Press, 2015.
2. Bishop, C. M., Pattern Recognition and Machine Learning,
New York: Springer-Verlag, 2016.
3. Sebastian Raschka, Python Machine Learning, Packt Publishing Ltd, 2019.
Practicals
For practical labs in Machine Learning, students may use software like MATLAB/
Octave/ Python/ R. Utilize publicly available datasets from online repositories like
[Link] and [Link]
● Split datasets into training and test sets and evaluate the decision models
● Perform k-cross-validation on datasets for evaluation
6. Logistic regression
8. K-NN classifier
9. Decision tree classification
We live in the age of big data. Earlier, only large companies collected and stored data in big
computer centers. But with personal computers, the internet, and smartphones, everyone has
become both a producer and a consumer of data. Every online purchase, movie rental,
website visit, blog post, or even movement tracked by GPS creates data.
People want personalized products and services, like a supermarket recommending the right
products, or a streaming service suggesting the next movie. To do this, computers need to
detect patterns in data.
Here’s where machine learning (ML) comes in. Instead of manually writing algorithms, we let
computers learn from data. For spam detection, thousands of labeled examples (spam or not
spam) can be used to train a model that learns the underlying patterns.
The idea is not to find a perfect explanation, but a useful approximation. If patterns from past
data remain relevant, the model can make reliable predictions about the future.
When ML methods are applied to huge datasets, it’s called data mining—similar to extracting
valuable minerals from tons of raw material. For example, banks use ML for fraud detection
and credit scoring, manufacturers use it for optimization, doctors for diagnosis, telecom
companies for network improvement, and scientists for analyzing vast datasets in physics,
astronomy, and biology.
But machine learning is more than just handling databases—it’s also a core part of artificial
intelligence (AI). An intelligent system must be able to adapt to changes without programmers
manually coding every possible scenario.
For example, in face recognition, humans can easily recognize friends despite changes in
lighting, hairstyle, or angle, but cannot explain the exact steps. A computer, however, can learn
patterns in face images (like symmetry, position of eyes, nose, mouth) and recognize people
by matching those patterns. This falls under pattern recognition.
ML heavily uses statistics (to infer from samples) and computer science (to design efficient
algorithms that can train on huge datasets and make fast predictions). In some cases, the
speed and efficiency of learning and prediction are just as important as their accuracy.
Machine Learning is used in many real-world applications. Let’s go through some important
ones.
In retail, one common application is market basket analysis. This means finding patterns in
products that customers often buy together.
For example:
• If people who buy beer also tend to buy chips, then when a customer buys beer, the
system can recommend chips.
• Example: P(chips | beer) = 0.7 → 70% of people who buy beer also buy chips.
• More advanced systems also consider customer attributes (age, gender, location, etc.)
to make more personalized recommendations.
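A probability such as P(chips | beer) can be estimated directly from transaction counts; a minimal sketch, where the baskets below are made-up illustration data:

```python
# Estimating P(chips | beer) from a toy list of transactions.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "milk"},
    {"beer", "bread"},
    {"chips", "soda"},
    {"beer", "chips"},
]

beer_baskets = [t for t in transactions if "beer" in t]
both = [t for t in beer_baskets if "chips" in t]

# P(chips | beer) = #(beer and chips) / #(beer)
p_chips_given_beer = len(both) / len(beer_baskets)
print(p_chips_given_beer)  # 3 of the 4 beer baskets also contain chips
```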
A bank can classify loan applicants with a simple learned rule, for example:
IF income > θ₁ AND savings > θ₂ THEN Low-Risk ELSE High-Risk
This rule is called a discriminant function—it separates data points into different classes.
The main goal: Prediction. Once trained on past data, the model can predict the risk of new
loan applicants.
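Such a discriminant rule is straightforward to express in code. In this sketch the thresholds THETA1 and THETA2 are hypothetical values chosen for illustration, not ones learned from real data:

```python
# A toy discriminant function for credit scoring.
THETA1 = 30_000   # hypothetical income threshold
THETA2 = 10_000   # hypothetical savings threshold

def credit_risk(income, savings):
    """IF income > θ1 AND savings > θ2 THEN Low-Risk ELSE High-Risk."""
    if income > THETA1 and savings > THETA2:
        return "Low-Risk"
    return "High-Risk"

print(credit_risk(50_000, 20_000))  # Low-Risk
print(credit_risk(20_000, 5_000))   # High-Risk
```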
Machine learning is heavily used in pattern recognition, where the goal is to identify objects,
signals, or structures.
• Optical Character Recognition (OCR): Recognizing characters (like zip codes on mail or
amounts on checks) from printed or handwritten text. Since handwriting varies, ML
learns from examples rather than fixed rules.
• Face Recognition: Input is an image, and the task is to assign it to a person’s identity.
Challenges include lighting, pose, or obstructions (glasses, beard).
• Medical Diagnosis: Input is patient data (age, symptoms, test results), and output is a
disease class. Missing data is common, so models must handle uncertainty.
• Speech Recognition: Converts spoken words into text. Inputs are acoustic signals, and
outputs are words. Accents, tone, and speed make this task complex. ML models often
use language models to improve accuracy.
Applications also extend to natural language processing (NLP)—such as spam filtering, text
summarization, sentiment analysis on social media, and machine translation.
1.2.4 Biometrics
Examples include fingerprint, face, iris, signature, and voice recognition.
Machine learning is used in both individual recognizers (e.g., fingerprint scanner) and in
multimodal systems that combine several inputs for more accurate and secure decisions.
Learning rules from data is not only about prediction—it also gives insight.
• Example: If a bank learns which customers are low-risk, it gains knowledge about the
characteristics of safe borrowers. This helps in targeted marketing.
Regression
In supervised learning, regression is used when the output is a continuous value. For example,
a dataset of used cars where mileage is taken as the input, and price is the output. A simple
linear model is fitted:
y = w₁x + w₀
Here, w1 (slope) and w0 (intercept) are parameters that the algorithm optimizes to minimize
the error between predicted and actual prices.
If a straight line is too restrictive, we can use higher-order models such as:
y = w₂x² + w₁x + w₀
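The straight-line model above can be fit by hand with the ordinary least-squares formulas; a small pure-Python sketch, where the mileage/price pairs are made-up illustration data:

```python
# Fitting y = w1*x + w0 by ordinary least squares, written out by hand.
# Toy data: hypothetical mileage (x) vs. price (y) for five used cars.
xs = [1, 2, 3, 4, 5]
ys = [9.1, 8.0, 7.2, 5.9, 5.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# w1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,  w0 = ȳ - w1·x̄
w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
w0 = mean_y - w1 * mean_x

def predict(x):
    return w1 * x + w0

print(round(w1, 3), round(w0, 3))  # negative slope: price drops with mileage
```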
Regression is also widely used in optimization tasks. For example, when designing a coffee-
roasting machine, inputs like temperature, roasting time, and bean type affect the taste. By
experimenting with different settings and recording consumer feedback, we can build a
regression model to predict coffee quality. The machine can then iteratively adjust its settings
toward the best configuration. This approach is called response surface design.
Another application is in robotics, such as autonomous driving. The car’s sensors (camera,
GPS, lidar, etc.) provide inputs, and regression predicts outputs like the steering angle needed
to stay on the road without hitting obstacles.
Ranking
Sometimes, instead of predicting exact values, we want to learn relative preferences. This is
called ranking. For example, in a movie recommendation system, we want to order films based
on how much a user is likely to enjoy them. Using attributes like genre, actors, and past user
ratings, a model can learn a ranking function that helps suggest new movies.
Unsupervised Learning
Unlike supervised learning, unsupervised learning has no labeled outputs. The goal is to
discover hidden patterns and structures in the data. This is also known as density estimation.
Document Clustering: News articles can be grouped into categories such as politics, sports, or
fashion by comparing the words they contain. A predefined lexicon (word list) is used to
represent documents as vectors.
Clustering is also critical in bioinformatics. DNA and proteins are represented as sequences.
By clustering recurring subsequences (called motifs), scientists can discover structural or
functional elements in proteins.
Reinforcement Learning
In reinforcement learning (RL), the goal is not to predict values or labels but to learn a policy—
a sequence of actions that maximizes long-term reward.
Game Playing: In chess, a single move is not important by itself. What matters is whether it
contributes to a winning strategy. RL algorithms learn such strategies by trial and error.
Robotics: A robot navigating a room must learn the correct sequence of moves to reach a
target while avoiding obstacles.
Challenges: RL becomes harder when the agent has incomplete or uncertain information. For
example, a robot with only a camera may know there is a wall nearby but not its exact location.
Multi-agent RL is another challenge, such as a team of robots coordinating to play soccer.
Feature Scaling
Introduction
Feature scaling is a data preprocessing technique used in machine learning to bring all the
features (independent variables) of a dataset onto a comparable scale.
• If features are measured in different units (e.g., age in years, income in lakhs, height in
cm), variables with larger values can dominate the model.
1. Equal Importance of Features: Prevents features with large values from dominating.
2. Faster Convergence: Gradient descent converges faster when features are scaled.
3. Better Distance Calculations: Algorithms like K-Means, KNN, and SVM rely on distance
metrics (Euclidean, Manhattan) that are sensitive to scale.
5. Handles Units of Measurement: Brings different units (kg, cm, INR) to a common scale.
1. Min–Max Normalization
• Formula: x′ = (x − x_min) / (x_max − x_min), which maps values into [0, 1].
• Example: If salary ranges from ₹10,000 to ₹1,00,000, then salary = ₹50,000 →
normalized value = (50,000 − 10,000) / (1,00,000 − 10,000) ≈ 0.44.
• Use case: Neural networks, deep learning, where values must be bounded.
2. Standardization (Z-score Normalization)
• Formula: z = (x − μ) / σ, where μ is the feature mean and σ its standard deviation.
3. Robust Scaling
• Uses median and interquartile range (IQR) instead of mean and standard deviation.
• Formula: x′ = (x − median) / IQR, which makes the scaling robust to outliers.
Scaling is required for distance-based and gradient-based algorithms. For example, with
salary and age as features, salary dominates without scaling; with scaling, both features
contribute equally.
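Both scalers above can be written out by hand; a small sketch on a toy salary column, where the ₹50,000 entry reproduces the 0.44 from the Min–Max example:

```python
# Min-max and z-score scaling by hand on a toy salary column.
salaries = [10_000, 30_000, 50_000, 70_000, 100_000]

lo, hi = min(salaries), max(salaries)
minmax = [(s - lo) / (hi - lo) for s in salaries]   # values land in [0, 1]

mean = sum(salaries) / len(salaries)
std = (sum((s - mean) ** 2 for s in salaries) / len(salaries)) ** 0.5
zscores = [(s - mean) / std for s in salaries]       # mean 0, std 1

print(round(minmax[2], 2))  # ₹50,000 maps to ~0.44, as in the example above
```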
Practical Considerations
• Different scaling for different models: Try both Min–Max and Standardization.
• Feature Selection → choose a subset of the existing features.
• Feature Extraction → create new features by transforming existing ones (e.g., PCA).
1. Filter Methods
• Rank features with statistical measures (e.g., correlation, χ², information gain),
independently of any model.
2. Wrapper Methods
• Forward selection: start with no features → add features one by one → keep the one
that improves performance most.
• Backward elimination: start with all features → remove the least useful one step by step.
• Recursive feature elimination: trains a model (e.g., SVM, Logistic Regression), ranks
features by importance, removes the weakest features iteratively.
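Forward selection as described above can be sketched in a few lines. Here the scoring function is assumed to be the R² of a least-squares fit, and the data is synthetic (y depends only on columns 0 and 2):

```python
import numpy as np

# Greedy forward selection using R² of an OLS fit as the (assumed) score.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

def r2(cols):
    """R² of a least-squares fit of y on the chosen columns (plus intercept)."""
    A = np.column_stack([X[:, cols], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

selected, remaining = [], [0, 1, 2]
while remaining:
    # Add the single feature that improves the score the most
    best = max(remaining, key=lambda c: r2(selected + [c]))
    if r2(selected + [best]) - (r2(selected) if selected else 0.0) < 0.01:
        break  # stop when the improvement is negligible
    selected.append(best)
    remaining.remove(best)

print(selected)  # the informative columns 0 and 2; the noise column 1 is left out
```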
3. Embedded Methods
• Decision Trees, Random Forests, and Gradient Boosted Trees assign feature
importance scores.
Advantages: Efficient, balances bias and variance, works well with high-dimensional data.
Disadvantages: Algorithm-dependent, less interpretable in complex models.
Comparison of Methods
Method   | Techniques                 | Speed  | Accuracy | Best for
Filter   | Correlation, χ², Info Gain | Fast   | Moderate | High dimensions (text, genomics)
Embedded | LASSO, Decision Trees, RF  | Medium | High     | Large datasets, regularized models
In real-world datasets, we often deal with high-dimensional data (many features). High
dimensionality creates several issues: higher computational cost, a greater risk of
overfitting, and difficulty in visualizing the data (the "curse of dimensionality").
Dimensionality Reduction is the process of reducing the number of input variables while
preserving as much information as possible.
• Feature Selection → keep a subset of the original features.
• Feature Extraction → create new features by transforming original ones (PCA, LDA).
Principal Component Analysis (PCA) is the most widely used feature extraction technique.
What is PCA?
PCA is commonly used for data preprocessing before applying machine learning
algorithms. It helps to remove redundancy, improve computational efficiency, and make
data easier to visualize and analyze, especially when dealing with high-dimensional data.
PCA uses linear algebra to transform data into new features called principal components. It
finds these by calculating eigenvectors (directions) and eigenvalues (importance) from the
covariance matrix. PCA selects the top components with the highest eigenvalues and projects
the data onto them to simplify the dataset.
Note: It prioritizes the directions where the data varies the most because more variation =
more useful information.
Imagine you’re looking at a messy cloud of data points like stars in the sky and want to simplify
it. PCA helps you find the "most important angles" to view this cloud so you don’t miss the big
patterns. Here’s how it works step by step:
Step 1: Standardize the Data
Different features may have different units and scales, like salary vs. age. To compare them
fairly, PCA first standardizes the data by making each feature have:
• A mean of 0
• A standard deviation of 1
Step 2: Calculate Covariance Matrix
Next, PCA calculates the covariance matrix to see how features relate to each other, i.e.,
whether they increase or decrease together. The covariance between two features x₁ and x₂ is:
cov(x₁, x₂) = (1 / (n − 1)) Σᵢ (x₁ᵢ − x̄₁)(x₂ᵢ − x̄₂)
Step 3: Find the Principal Components
PCA identifies new axes where the data spreads out the most:
• 1st Principal Component (PC1): The direction of maximum variance (most spread).
• 2nd Principal Component (PC2): The next best direction, perpendicular to PC1 and so
on.
These directions come from the eigenvectors of the covariance matrix and their importance
is measured by eigenvalues. For a square matrix A an eigenvector X (a non-zero vector) and
its corresponding eigenvalue λ satisfy:
AX =λX
This means that multiplying A by X only scales X by the factor λ; the direction of X does
not change.
Step 4: Select the Top Components
After calculating the eigenvalues and eigenvectors, PCA ranks them by the amount of
information they capture. We then:
1. Select the top k components that capture most of the variance, like 95%.
This means we reduce the number of features (dimensions) while keeping the important
patterns in the data.
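The four steps above can be sketched directly with NumPy. The two correlated features below are synthetic, built so that most of the variance lies along a single direction:

```python
import numpy as np

# PCA from scratch: standardize, covariance, eigendecomposition, project.
rng = np.random.default_rng(1)
t = rng.normal(size=100)
X = np.column_stack([t + 0.1 * rng.normal(size=100),
                     2 * t + 0.1 * rng.normal(size=100)])

# Step 1: standardize each feature to mean 0, std 1
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors/eigenvalues, sorted by decreasing eigenvalue
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 4: keep the top component and project onto it (2-D data -> 1-D)
explained = vals / vals.sum()
X_1d = Z @ vecs[:, :1]

print(round(explained[0], 3))  # PC1 captures nearly all the variance here
```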
Consider a dataset with two features, "Radius" and "Area", plotted on the original (black)
axes. PCA identifies two new directions, PC₁ and PC₂, which are the principal components.
• These new axes are rotated versions of the original ones. PC₁ captures the maximum
variance in the data meaning it holds the most information while PC₂ captures the
remaining variance and is perpendicular to PC₁.
• The spread of data is much wider along PC₁ than along PC₂. This is why PC₁ is chosen
for dimensionality reduction. By projecting the data points (blue crosses) onto PC₁ we
effectively transform the 2D data into 1D and retain most of the important structure
and patterns.
Instead of comparing students on two subjects (English and History), we now have one
number (PC1 score) that summarizes their performance. This is simpler for analysis, ranking,
or visualization. For example, you could use PC1 scores to rank students or group them into
performance categories.
Hence PCA uses a linear transformation that is based on preserving the most variance in the
data using the least number of dimensions. It involves the following steps:
We import the necessary libraries like pandas, numpy, scikit-learn, seaborn and matplotlib to
visualize results.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
We make a small dataset with three features (Height, Weight, Age) and a Gender label.
data = {
    'Height': [170, 165, 180, 175, 160, 172, 168, 177, 162, 158],
    'Weight': [65, 59, 75, 68, 55, 70, 62, 74, 58, 54],
    'Age': [30, 25, 35, 28, 22, 32, 27, 33, 24, 21],
    'Gender': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # 1 = Male, 0 = Female
}
df = pd.DataFrame(data)
print(df)
Output:
Dataset
Since the features have different scales (Height vs. Age), we standardize the data. This
makes all features have mean = 0 and standard deviation = 1 so that no feature dominates
just because of its units.
X = df.drop('Gender', axis=1)
y = df['Gender']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # scale the three features, not the label
• We reduce the data from 3 features to 2 new features called principal components.
These components capture most of the original information but in fewer dimensions.
• We split the data into 70% training and 30% testing sets.
• We train a logistic regression model on the reduced training data and predict gender
labels on the test set.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42)  # random_state fixed for reproducibility
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
The confusion matrix compares actual vs predicted labels. This makes it easy to see where
predictions were correct or wrong.
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
Output:
2. Noise Reduction: Eliminates components with low variance, enhancing data clarity.
3. Data Compression: Represents data with fewer components, reducing storage needs and
speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing which ones deviate
significantly in the reduced space.
3. Information Loss: Reducing dimensions may lose some important information if too
few components are kept.
4. Assumption of Linearity: Works best when relationships between variables are linear
and may struggle with non-linear data.
6. Risk of Overfitting: Using too many components or working with a small dataset might
lead to models that don't generalize well.
Objectives of PCA
X_new = X ⋅ W, where W is the matrix whose columns are the selected top-k eigenvectors.
• PCA finds a new axis (principal component) that represents “overall body size.”
• Instead of using both Height & Weight, we can reduce to 1 dimension (PC1) with
minimal loss of information.
Applications of PCA
Simple Linear Regression is the most basic statistical method used for prediction. It is applied
when we want to study the relationship between a single predictor (independent variable, X)
and a quantitative response (dependent variable, Y).
The key assumption is that the relationship between X and Y is approximately linear.
3.1.2 Assessing Accuracy of Coefficient Estimates
So far, we fit a line assuming the relationship between X and Y is exactly linear.
But in reality, data points never fall exactly on a line, so the model includes an error
term ϵ:
Y = β₀ + β₁X + ϵ
Why do we need ϵ? It captures measurement error and the influence on Y of everything
other than X.
Key Assumption:
We usually assume ϵ is independent of X and has mean zero.
A common measure of the fit of the estimated line is the mean squared error:
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Where,
yᵢ = actual value,
ŷᵢ = predicted value.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Dataset
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)  # Feature
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12])                # Target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Evaluation Metrics:")
print("MSE =", mse, "RMSE =", rmse, "MAE =", mae, "R² =", r2)

# OLS summary (statsmodels) for coefficient estimates and p-values
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())

# Plot results
plt.scatter(X_test, y_test, label="Actual")
plt.plot(X_test, y_pred, label="Predicted")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
OUTPUT
MSE = 0.80
RMSE = 0.89
MAE = 0.74
R² Score = 0.91
Slope (β₁ for X) will appear next in the summary with a very small p-value (strong
evidence that X significantly predicts y).
OLS stands for Ordinary Least Squares. It’s the most common method used to estimate the
coefficients (β₀, β₁, …) in linear regression.
Why OLS is important
OLS relies on some assumptions. If these assumptions fail, OLS can give biased, inefficient, or
misleading results.
Linearity assumption: OLS assumes the relationship between X and y is linear. If the true
relationship is nonlinear, OLS won’t capture it well.
Outliers: OLS minimizes squared errors, so it is very sensitive to outliers (a single extreme
point can distort the regression line).
Alternatives to OLS
1. Ridge Regression
• Adds an L2 penalty that shrinks coefficients toward zero; helps when predictors are
correlated.
2. Lasso Regression
• Like Ridge, but also performs feature selection (sets some coefficients to zero).
• Useful when you have many predictors and want to identify the important ones.
3. Elastic Net
4. Robust Regression
Simple linear regression models the response using a single predictor. In practice, we often
have multiple predictors. For example, in the Advertising dataset, sales were modeled using
TV advertising. But we also have data on radio and newspaper advertising, and we want to
know how these media are related to sales.
A naive approach is to fit separate simple regressions for each predictor (e.g., sales vs. radio,
sales vs. newspaper). For instance:
• A $1,000 increase in radio advertising is associated with ~203 units higher sales.
• A $1,000 increase in newspaper advertising is associated with ~55 units higher sales.
Linear Regression
These results show that both radio and newspaper advertising are positively associated with
sales, but radio has a stronger effect.
To analyze the effects of TV, radio, and newspaper advertising simultaneously, we use the
multiple regression model
Multiple Regression Coefficient Estimates
When TV, radio, and newspaper budgets are used together to predict sales, we obtain the
following insights:
This shows that simple and multiple regression results can differ substantially.
• In simple regression, the slope of newspaper spending measures its association with
sales ignoring TV and radio.
• In multiple regression, the newspaper coefficient measures the effect of newspaper
advertising while holding TV and radio fixed.
This phenomenon is very common: when predictors are correlated, simple regression can give
misleading results.
Example (Analogy):
• Suppose we regress shark attacks on ice cream sales at a beach. We would see a
positive relationship.
• But the true driver is temperature: hot weather → more people at the beach → more
ice cream sales and more shark attacks.
• Once we include temperature in a multiple regression, the effect of ice cream
disappears — correctly showing no direct relationship.
• The Hat Matrix “projects” the observed values y onto the fitted regression line/plane.
• The coefficients β are the "weights" that determine the regression line, while H is the
geometric operator that uses those weights to produce predictions.
Hypotheses
We test whether all regression coefficients (except the intercept) are zero:
Suppose you fit a multiple linear regression model with 2 predictors (k=2) and a sample size
of 20 (n=20). After fitting the model, you have:
• df_reg = k = 2
• df_res = n − k − 1 = 20 − 2 − 1 = 17
A p-value is a statistical measure that helps you decide whether your observed results are
likely under a given null hypothesis. In other words, it tells you how probable your data would
be if the null hypothesis—typically that there is no effect or no difference—were true.
• Low p-value (≤ 0.05): Strong evidence against the null hypothesis; you reject the null
and accept the alternative hypothesis.
• High p-value (> 0.05): Weak evidence against the null hypothesis; you fail to reject the
null hypothesis.
Example
Suppose you test whether a new drug lowers blood pressure compared to a placebo. After
gathering your data and performing a statistical test (say, a t-test), you get a p-value of 0.03.
• Decision: Because 0.03 < 0.05, you have statistically significant evidence to reject the
null hypothesis and conclude that the drug likely has an effect
What is a t-test?
Key points:
• Works when the data is approximately normally distributed and the population
variance is unknown.
Example of a t-test
Suppose you want to know if a new teaching method affects student test scores.
• Group 1: 10 students using the traditional method; scores: 85, 90, 88, 75, 95, 80, 70,
85, 78, 92
• Group 2: 10 students using the new method; scores: 88, 91, 94, 79, 99, 87, 82, 90, 85,
95
1. Null hypothesis (H₀): There is no difference in the mean test scores between the two
groups.
A t-test determines if the difference between two group means is statistically significant or
could have happened by random chance.
df = n − 1
3. Find the p-value by comparing the calculated t-statistic to the t-distribution with df
degrees of freedom. You use a t-distribution table or statistical software:
• For a two-tailed test, the p-value = 2 × area in the tail beyond the absolute value of the
t-statistic.
• For a one-tailed test, the p-value = area in the tail beyond the calculated t-statistic.
The p-value represents the probability of observing a t-statistic as extreme as (or more
extreme than) the one calculated if the null hypothesis is true.
Example
Say your sample mean x̄ = 8, hypothesized mean μ = 9, sample standard deviation s = 2, and
sample size n = 31.
Calculate:
t = (x̄ − μ) / (s / √n) = (8 − 9) / (2 / √31) ≈ −2.78, with df = n − 1 = 30
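A quick check of this arithmetic in code, using the numbers given above:

```python
# One-sample t-statistic for x̄ = 8, μ = 9, s = 2, n = 31.
x_bar, mu, s, n = 8, 9, 2, 31

t = (x_bar - mu) / (s / n ** 0.5)
df = n - 1

print(round(t, 2), df)  # t ≈ -2.78 with 30 degrees of freedom
```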
DECIDING ON IMPORTANT VARIABLES (VARIABLE SELECTION)
After we confirm (via the F-test) that some predictors matter, the next step is figuring out
which predictors are truly important.
Disadvantage: Greedy — may include a variable early that later becomes redundant.
3. Model Fit
Once we choose a subset of predictors, we evaluate how well the model fits the data. Two
common measures are:
2. R² (Coefficient of Determination):
o Fraction of variance in the response explained by the model.
o Formula: R² = 1 − RSS/TSS, where RSS = Σ(yᵢ − ŷᵢ)² and TSS = Σ(yᵢ − ȳ)².
4. Predictions
Once the model is fit, we can use it to predict outcomes. But predictions carry three kinds of
uncertainty:
Both intervals are centered at the same predicted value (11,256), but PI is wider to reflect
the extra uncertainty.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = {
    # (dataset values omitted in the source)
}
df = pd.DataFrame(data)
X = df[["x1", "x2"]]
y = df["y"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Intercept:", model.intercept_)
print("\nEvaluation Metrics:")
print("RMSE =", rmse, "R² =", r2)
OUTPUT
Intercept: 1.36
Evaluation Metrics:
R² Score: 0.98
GRADIENT DESCENT
How it works
Repeat until convergence (when changes are very small or cost stops decreasing).
Batch Gradient Descent – uses the entire dataset to compute the gradient in each step (stable
but slow).
Stochastic Gradient Descent (SGD) – updates parameters after every single training example
(fast, but noisy).
Mini-batch Gradient Descent – uses small batches of data (trade-off between speed and
stability, most commonly used).
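Batch gradient descent as described above can be sketched for simple linear regression. The toy data below is generated from y = 2x + 1, and the learning rate and iteration count are arbitrary illustration choices:

```python
# Batch gradient descent for y ≈ w1*x + w0, minimizing mean squared error.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # generated by y = 2x + 1,
                                  # so the optimum is w1 = 2, w0 = 1
w1, w0 = 0.0, 0.0
lr = 0.02                         # learning rate
n = len(xs)

for _ in range(5000):             # repeat until (approximate) convergence
    preds = [w1 * x + w0 for x in xs]
    errs = [p - y for p, y in zip(preds, ys)]
    # Gradients of MSE with respect to w1 and w0, over the full batch
    g1 = 2 / n * sum(e * x for e, x in zip(errs, xs))
    g0 = 2 / n * sum(errs)
    w1 -= lr * g1
    w0 -= lr * g0

print(round(w1, 2), round(w0, 2))  # converges close to (2.0, 1.0)
```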
Overfitting
Overfitting happens when a model learns the training data too well, including noise and
random fluctuations, instead of learning the underlying pattern.
Symptoms of Overfitting
Causes
Solutions
REGULARIZATION
Types of Regularization
Once a regression model is trained, we need metrics to evaluate how well it performs.
Common Metrics
Decision tree learning is a widely used approach for approximating discrete-valued target
functions, where the outcome of the learning process is represented in the form of a decision
tree. These trees can also be converted into sets of if-then rules, which improves readability
and interpretability. Among inductive inference algorithms, decision tree methods are
considered highly popular due to their efficiency and applicability across diverse domains.
They have been successfully applied in areas such as medical diagnosis and credit risk
assessment for loan applicants.
Decision trees classify examples by guiding them from the root node of the tree to a leaf node,
which provides the final classification. Each internal node represents a test on a particular
attribute, and each branch corresponds to one of the possible values of that attribute. An
example is classified by starting at the root, applying the test specified at that node, and
moving down the branch that matches the attribute value of the instance. This process
continues until a leaf node is reached, where the classification is assigned.
Figure 3.1 illustrates a decision tree for the concept PlayTennis. It determines whether a
Saturday morning is suitable for playing tennis. For instance, the case
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
follows the leftmost branch of the tree and is classified as a negative example (PlayTennis =
No).
The tree can equivalently be written as a disjunction of conjunctions of attribute tests:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Although different variations of decision tree algorithms exist, the method is particularly
effective for problems that share the following characteristics:
Discrete-valued target functions: Decision trees commonly assign Boolean outcomes such as
yes or no, but the method can be extended to handle multiple classes. Although less common,
extensions also exist for continuous-valued outputs.
Disjunctive concepts: Decision trees naturally represent target concepts that require
disjunctive expressions.
Noisy data: The method is robust to errors in both class labels and attribute values.
Incomplete data: Decision trees can be applied even when some attribute values are missing.
Techniques for handling such situations allow the model to make use of partially observed
examples.
Due to these properties, decision tree learning has been applied to many practical problems,
such as classifying patients by disease, diagnosing equipment malfunctions, and predicting
whether loan applicants are likely to default. These types of tasks, which involve assigning
examples to one of several discrete categories, are referred to as classification problems.
The ID3 algorithm builds a decision tree step by step. It starts by asking: Which attribute
should be tested at the root of the tree? To answer this, each attribute in the dataset is tested
using a statistical method to measure how well it separates the training examples. The
attribute with the best classification ability is chosen as the root.
After that, branches are created for each possible value of this attribute. The training examples
are then distributed to these branches based on their attribute values. The same process is
repeated at every branch: the best attribute is selected, and further branches are created. This
continues until the decision tree is complete.
The search is called greedy because once an attribute is chosen, the algorithm does not go
back to reconsider earlier choices.
The most important step in ID3 is deciding which attribute to test at each node. The best
attribute is the one that separates the data most effectively. To measure this, the algorithm
uses a statistical concept called information gain.
• Entropy is a measure from information theory that shows how mixed (or pure) a collection
of examples is. For a Boolean classification, with p⊕ the proportion of positive and p⊖ the
proportion of negative examples:
Entropy(S) = −p⊕ log₂ p⊕ − p⊖ log₂ p⊖
Entropy can also be seen as the average number of bits required to encode the classification
of a random example.
Information Gain – Reduction in Entropy
Information gain tells how much an attribute helps in classifying the data. It is the expected
reduction in entropy after splitting the dataset according to an attribute.
Mathematically:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)
The first term is the entropy of the whole dataset. The second term is the weighted average
entropy after the split. The difference gives the information gain.
The attribute with the highest information gain is selected at each step.
Example: Attribute "Wind"
Suppose the dataset has 14 examples [9+, 5-]. The attribute Wind can take values {Weak,
Strong}.
• For Wind = Weak: 6 positive, 2 negative → [6+, 2-]
• For Wind = Strong: 3 positive, 3 negative → [3+, 3-]
After calculating entropy for each branch and taking weighted average, the information gain
for Wind is:
Gain(S,Wind)=0.048
Example: Attribute "Humidity"
The same dataset can be split by Humidity.
• For Humidity = High: [3+, 4-]
• For Humidity = Normal: [6+, 1-]
Information gain for Humidity is:
Gain(S,Humidity)=0.151
Thus, Humidity is a better classifier than Wind because it provides higher information gain.
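These Gain values can be verified in a few lines; the counts come from the Wind and Humidity examples above:

```python
from math import log2

# Entropy of a collection with pos positive and neg negative examples
def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

# Expected reduction in entropy: parent counts vs. weighted children
def gain(parent, splits):
    total = sum(p + n for p, n in splits)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(*parent) - weighted

# S = [9+, 5-]; Wind splits into Weak [6+, 2-] and Strong [3+, 3-]
g_wind = gain((9, 5), [(6, 2), (3, 3)])
# Humidity splits into High [3+, 4-] and Normal [6+, 1-]
g_humidity = gain((9, 5), [(3, 4), (6, 1)])

print(g_wind)      # ≈ 0.048, matching Gain(S, Wind) above
print(g_humidity)  # close to the quoted 0.151 (≈ 0.1518 before rounding)
```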
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split and fit (criterion='entropy' mirrors the
# information-gain discussion above)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                            target_names=iris.target_names))
Advantages
Disadvantages
• Can be non-robust: small changes in data may lead to very different trees.
What is Overfitting?
Overfitting happens when a decision tree learns the training data too well, including its noise,
outliers, or random fluctuations. As a result, it performs very well on training data but poorly
on new, unseen data (test data). For example,
• Suppose you're building a decision tree to classify whether an email is spam or not.
• If the tree is too deep and complex, it might learn that “email sent on Tuesday at 3:27
PM” means spam just because it happened in the training data. But that is not a
generalizable pattern—it's just noise.
What is Pruning?
Pruning means cutting back the size of a decision tree by removing parts that don’t help
improve performance on new data. The idea is to simplify the tree so it doesn’t overfit. There
are two types of pruning:
• Pre-pruning (early stopping): Stop growing the tree early based on a condition (like max
depth, min samples per leaf).
• Post-pruning (cost complexity pruning): Let the tree grow fully, then cut back branches
that don’t help.
Cost complexity pruning is a popular post-pruning method used in algorithms like CART
(Classification and Regression Trees). It adds a penalty for large trees, balancing fit to the
training data against tree size:
Cost(T) = Error(T) + α × ∣Terminal Nodes of T∣
Where:
• T = the tree
• α = penalty parameter (controls trade-off between accuracy and complexity)
• ∣Terminal Nodes∣ = number of leaf nodes (model complexity)
Just like Lasso Regression shrinks coefficients by adding a penalty, cost complexity pruning
shrinks the tree by penalizing unnecessary splits.
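In scikit-learn, this idea is exposed through the `ccp_alpha` parameter of decision trees; a brief sketch (the alpha value below is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a full tree, then inspect the effective alphas along its pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# A larger ccp_alpha means a stronger penalty, hence a smaller tree
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)
print("Full tree leaves:  ", full_tree.get_n_leaves())
print("Pruned tree leaves:", pruned.get_n_leaves())
```

Sweeping over the alphas in `path.ccp_alphas` and picking the value with the best cross-validated score is the usual way to choose the penalty.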
Ensemble methods are techniques that combine predictions from multiple models to make a
final, better prediction than any single model on its own. The idea is similar to the saying
that many opinions combined are better than one. A single model may:
• Overfit or underfit
• Be sensitive to noise
• Make biased decisions
Combining several models helps by:
• Improving accuracy
• Reducing variance
• Reducing bias
• Handling noisy data better
Bagging means training many models on different random subsets of the data and then
combining their predictions. The goal is to reduce variance (i.e., make the model more stable
and less sensitive to noise in the data).
How it works:
1. Take multiple random samples of your dataset with replacement (bootstrap samples).
2. Train a separate model (usually the same type, like decision trees) on each sample.
3. Combine the outputs:
o For classification: use majority vote
o For regression: use average
Example:
Imagine you're asking 5 friends to guess how many jelly beans are in a jar.
• Each friend sees a different random sample of the information (not the full picture).
• They each make a guess (model prediction).
• You take the average (or majority vote) of all guesses for your final answer.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Bagging: many trees trained on bootstrap samples
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(),
      n_estimators=50, random_state=42).fit(X_train, y_train)
# Predict
y_pred = bagging_clf.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
3. Random Forest: Random Forest is a bagging method using decision trees, with an extra
twist: each tree uses a random subset of features when splitting. The goal is to improve
accuracy and prevent overfitting (like regular bagging, but better for trees).
When using bagging methods like Random Forest, we train multiple models on random
subsets of the training data (these are called bootstrap samples).
Each of these samples leaves out some data points — and we can use these left-out
observations as a test set to estimate the model's performance. That’s called the Out-of-Bag
(OOB) Error Estimation.
Key Concepts
1. Bootstrap Sampling:
• For each tree in a bagging model (like Random Forest), we draw a random sample with
replacement from the training set.
• This means some data points are repeated, and others are left out.
• On average, about 1/3 of the original data is not selected in each bootstrap sample.
• These unselected data points are called OOB observations for that tree.
Analogy
Imagine you're training 100 students (trees) to classify emails as spam or not.
• Each student is trained on different emails (random samples).
• Some emails are never seen by a student.
• Later, you ask each student to classify the emails they didn't train on.
• You collect the answers from all students who didn’t see a particular email — and
compare it to the actual label.
That gives you a good idea of how well your group performs on unseen data, without needing
a separate test.
BART is a machine learning algorithm that uses a sum of many small decision trees (like
boosting), and applies Bayesian methods to make predictions and quantify uncertainty. So,
you can think of it as:
• Instead of using one big tree, BART uses many small trees.
• These trees add together (just like in boosting) to model the output.
• But instead of training them in a greedy way like XGBoost or Gradient Boosting, BART:
o Uses a Bayesian framework to model the trees.
o Samples from a posterior distribution over possible tree structures and
predictions using MCMC (Markov Chain Monte Carlo).
o Produces probabilistic predictions, not just point estimates.
Y = f(X) + ϵ, where BART models f(X) as a sum of K small regression trees,
f(X) = f₁(X) + f₂(X) + ⋯ + f_K(X).
Bayesian Twist:
Imagine you have a committee of doctors (trees), and each one gives a small opinion (a
number) about a patient's health risk. Instead of one doctor making a big decision, you
average all their opinions.
But instead of just giving a number, each doctor also says how confident they are. The final
decision is not just a prediction — it includes uncertainty.
Bayesian learning methods hold a significant place in the study of machine learning for two
main reasons. First, algorithms based on Bayesian principles, such as the Naive Bayes classifier,
are among the most practical and effective approaches for certain types of learning problems.
For instance, studies conducted by Michie et al. (1994) compared the Naive Bayes classifier
with other learning algorithms, including decision tree and neural network models. The results
showed that Naive Bayes performs competitively in many situations and, in some cases, even
outperforms these alternatives. Because of this reliability, the Naive Bayes classifier is often
considered one of the most effective algorithms, especially in tasks like classifying text
documents such as electronic news articles.
The second reason Bayesian methods are important lies in the perspective they provide for
understanding a wide range of learning algorithms, even those that do not explicitly deal with
probabilities. For example, algorithms like FIND-S and Candidate-Elimination can be analyzed
under Bayesian principles to identify conditions in which they produce the most probable
hypothesis based on training data. Similarly, Bayesian reasoning helps explain design choices
in neural network learning, such as minimizing the sum of squared errors when exploring
possible network structures. It also provides justification for using alternative error functions,
such as cross-entropy, when predicting probabilities. Beyond this, Bayesian analysis sheds
light on the inductive biases of decision tree algorithms that favor shorter trees and relates
closely to the Minimum Description Length (MDL) principle. A basic understanding of Bayesian
methods is therefore crucial for analyzing and characterizing many algorithms in machine
learning.
• Each training example can either increase or decrease the estimated probability that a
hypothesis is correct, allowing for a flexible approach rather than discarding hypotheses
after a single inconsistency.
• Prior knowledge can be incorporated with observed data to calculate the probability of a
hypothesis. This is achieved by assigning prior probabilities to candidate hypotheses and
defining probability distributions for observed data under each hypothesis.
• Hypotheses can provide probabilistic predictions, such as "a patient has a 93% chance of
full recovery."
• New instances can be classified by combining predictions from multiple hypotheses, with
each weighted according to its probability.
• Even in cases where Bayesian methods are computationally complex, they serve as a
standard of optimal decision-making against which other algorithms can be compared.
Despite these advantages, Bayesian methods face practical challenges. One difficulty is the
need for prior knowledge of many probabilities, which are often unavailable. In such cases,
these probabilities are estimated using background knowledge, existing data, or assumptions
about underlying distributions. Another challenge is the high computational cost of
determining the Bayes optimal hypothesis, which grows linearly with the number of candidate
hypotheses. However, in specialized situations, this cost can be reduced significantly.
Formally, Bayes theorem combines the prior probability of a hypothesis, the likelihood of the
observed data given the hypothesis, and the overall probability of the data, to produce the
posterior probability of the hypothesis:
P(h | D) = P(D | h) P(h) / P(D)
• P(h): The prior probability of hypothesis h, before considering the training data. It reflects
any background knowledge about how likely h is to be correct. If no prior information is
available, equal probability can be assigned to all candidate hypotheses.
• P(D): The probability of observing the data D, without considering which hypothesis is
correct.
• P(D | h): The probability of observing the data D assuming hypothesis h holds. This is called
the likelihood.
• P(h | D): The posterior probability of hypothesis h, after observing data D. This value
indicates the updated confidence in h.
Intuitively, the posterior probability increases if the hypothesis was already likely (high prior
probability) or if the hypothesis strongly predicts the observed data (high likelihood). On the
other hand, if the data is very common regardless of the hypothesis, the posterior probability
decreases.
If all hypotheses are assumed to be equally probable a priori, the problem simplifies to finding
the hypothesis that maximizes the likelihood of the data, the maximum likelihood (ML)
hypothesis: h_ML = argmax_h P(D | h).
Bayes theorem is not limited to machine learning; it applies to any set of mutually exclusive
propositions whose probabilities sum to one. However, in the context of learning, the
hypotheses generally represent possible target functions, and the data corresponds to training
examples.
The available information is based on a laboratory test with two possible results: positive (+)
or negative (−). The prior probability of having this cancer in the general population is 0.008
(0.8%). The test is not perfectly accurate:
• If the disease is present, the test returns a correct positive result in 98% of cases.
• If the disease is absent, the test correctly shows negative in 97% of cases.
Now suppose a patient’s test result is positive. To decide which hypothesis is most probable,
the MAP approach is used. Applying Bayes theorem shows that although the posterior
probability of cancer increases significantly compared to its prior probability, the more
probable hypothesis is still that the patient does not have cancer.
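The calculation can be spelled out in a few lines. The 0.8% prior and the 97% correct-negative rate come from the text; the 98% detection rate when the disease is present is the standard figure in this textbook example and is assumed here:

```python
p_cancer = 0.008                 # prior: 0.8% of the population
p_pos_given_cancer = 0.98        # assumed: test detects the disease 98% of the time
p_neg_given_no_cancer = 0.97     # test is correctly negative 97% of the time

p_no_cancer = 1 - p_cancer
p_pos_given_no_cancer = 1 - p_neg_given_no_cancer

# Unnormalized posteriors P(D | h) * P(h) for a positive test result
score_cancer = p_pos_given_cancer * p_cancer
score_no_cancer = p_pos_given_no_cancer * p_no_cancer

# Normalize to obtain P(cancer | +)
posterior_cancer = score_cancer / (score_cancer + score_no_cancer)
print(f"P(cancer | +) = {posterior_cancer:.3f}")
```

Even after a positive test, the posterior probability of cancer is only about 0.21, so the MAP hypothesis remains "no cancer".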
• Hypotheses are not completely accepted or rejected. Instead, their probabilities are
updated as new evidence becomes available.
• The Bayes Optimal Classifier is a method used in machine learning to make the best
possible prediction about a new instance, based on all the available hypotheses and data.
• Often, attention is given to the MAP (Maximum a Posteriori) hypothesis, which is the
single hypothesis with the highest probability after seeing the training data. However,
relying only on this one hypothesis may not always give the most accurate classification.
• For example,
Imagine there are three hypotheses with probabilities 0.4, 0.3, and 0.3. The first one (0.4)
is the MAP hypothesis. If this hypothesis says a new example is positive, but the other
two say it is negative, then overall the chance of it being positive is 0.4, while the chance
of it being negative is 0.6. The MAP hypothesis would say positive, but the Bayes Optimal
Classifier would say negative, because that has the higher probability when all hypotheses
are taken into account.
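The three-hypothesis example can be written out directly (a small sketch using the probabilities from the example above):

```python
# Posterior probabilities of the three hypotheses and their votes on a new example
posteriors = [0.4, 0.3, 0.3]
votes = ["positive", "negative", "negative"]

# MAP: follow only the single most probable hypothesis
map_prediction = votes[posteriors.index(max(posteriors))]

# Bayes optimal: weight every hypothesis's vote by its posterior probability
totals = {}
for p, v in zip(posteriors, votes):
    totals[v] = totals.get(v, 0.0) + p
bayes_optimal = max(totals, key=totals.get)

print(map_prediction)  # positive
print(bayes_optimal)   # negative (combined weight 0.6 beats 0.4)
```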
• In general, the Bayes Optimal Classifier works by combining the predictions of all
hypotheses, giving each hypothesis a weight according to its probability. The classification
chosen is the one with the highest overall probability after this weighted combination.
• This method is guaranteed to perform as well as possible, on average, given the training
data, the hypothesis space, and prior knowledge. In other words, no other classifier using
the same information can consistently do better.
• An interesting point is that the final classification produced by the Bayes Optimal Classifier
does not always match the predictions of any single hypothesis. Instead, it may act as if
there were a new hypothesis created from a combination of the existing ones.
The Naive Bayes Classifier is a method used to predict a class based on data. It is called naive
because it assumes that all features are independent from each other, even if in reality they
may not be.
It uses probabilities to decide the class of a new example. The method looks at:
• the prior probability of each class, and
• the conditional probability of each attribute value given the class.
The final prediction is the class with the highest probability after multiplying these values
together.
In the classic PlayTennis example, computing the product of the prior and the conditional
probabilities of the new day's attribute values for each class gives a larger value for No than
for Yes, so the classifier predicts:
PlayTennis = No
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
      target_names=iris.target_names))
In many situations, the response variable of interest is categorical rather than continuous. For
instance, in the Default dataset, the response variable default takes on two possible
outcomes: Yes or No. Instead of attempting to model the outcome directly, logistic regression
focuses on estimating the probability that the outcome belongs to a particular category.
For the credit default data, the goal is to estimate the probability that a customer will default,
given their balance. This probability can be expressed as:
For convenience, denote this probability by p(balance). Since probabilities must always lie
between 0 and 1, the model should produce values in this range for all possible balances.
Once these probabilities are estimated, they can be converted into classifications. For
example, if p(balance)>0.5, then the customer may be predicted to default. Alternatively, if a
financial institution prefers to be more cautious, it might adopt a lower threshold, such as
p(balance)>0.1, when labeling a customer as likely to default.
A key question is how to model the relationship between the predictor X and the probability
p(X)=Pr(Y=1∣X). One naive approach is to use linear regression, so that p(X) = β₀ + β₁X.
This method, however, produces probabilities that are not always valid. In particular, the
fitted line may predict negative probabilities for small values of X or probabilities greater than
one for large values of X. Such predictions are nonsensical because probabilities must always
remain within the interval [0, 1]. This problem occurs generally whenever a straight line is
fitted to a binary response variable.
To overcome this issue, logistic regression employs a transformation that always returns
probabilities between 0 and 1. The model uses the logistic function, defined as:
p(X) = e^(β₀ + β₁X) / (1 + e^(β₀ + β₁X))
The logistic curve naturally produces an S-shaped relationship between X and the predicted
probability. For very small values of X, the probability approaches zero, but never falls below
it; for very large values, the probability approaches one, but never exceeds it. Thus,
predictions remain meaningful across the entire range of inputs. In practice, logistic
regression is estimated using the method of maximum likelihood, which identifies the
parameters β0 and β1 that make the observed outcomes most probable.
Odds and Log-Odds: the model can be rearranged to highlight the role of odds in logistic
regression, p(X) / [1 − p(X)] = e^(β₀ + β₁X); taking logarithms gives the log-odds (logit),
log( p(X) / [1 − p(X)] ) = β₀ + β₁X.
The ratio p(X)/[1−p(X)] is known as the odds. Odds range from zero to infinity, with values
close to zero indicating a very low probability and large values indicating a very high
probability. For instance, if p(X)=0.2, the odds are 0.2/0.8=0.25, meaning one default is
expected for every four non-defaults. If p(X)=0.9, then the odds are 0.9/0.1=9, corresponding
to nine defaults for every non-default.
The coefficients β0 and β1 in the logistic model are unknown and must be estimated from the
training data. In linear regression, the least squares method is commonly used to estimate
coefficients. Although logistic regression could, in principle, be fitted using a nonlinear least
squares procedure, the method of maximum likelihood is preferred because it provides more
reliable statistical properties.
The intuition behind maximum likelihood in this context is straightforward: the estimates β₀
and β₁ are chosen so that the predicted probabilities align as closely as possible with the observed
outcomes. For individuals who actually defaulted, the fitted probability should be close to
one, while for those who did not default, the fitted probability should be close to zero.
This intuition is formalized through the likelihood function
ℓ(β₀, β₁) = Π_{i: yᵢ=1} p(xᵢ) × Π_{i′: yᵢ′=0} (1 − p(xᵢ′)),
where p(xᵢ) is given by the logistic model. The values β₀ and β₁ are obtained by maximizing
this likelihood function.
Maximum likelihood is a general estimation principle and is widely used across statistical
modeling. In fact, least squares estimation in linear regression can be viewed as a special case
of maximum likelihood.
Table presents the estimated coefficients for the Default dataset when predicting default
from balance. The estimated slope coefficient is β1=0.0055. This means that for every unit
increase in balance, the log-odds of default increase by 0.0055 units.
Logistic Regression on Balance
In this example, the p-value for balance is extremely small, which strongly rejects the null
hypothesis and confirms that default probability is indeed related to balance. The estimated
intercept has less practical interpretive value, serving mainly to ensure that predicted
probabilities are properly centered to match the observed overall default rate in the data.
Once the coefficients are estimated, predicted probabilities can be calculated for any given
input. For example, using the results in Table 4.1, the probability of default for an individual
with a balance of $1,000 is:
p(1000) = e^(−10.6513 + 0.0055×1000) / (1 + e^(−10.6513 + 0.0055×1000)) ≈ 0.00576
This corresponds to a probability of less than 1%. In contrast, the probability for a balance of
$2,000 is much higher:
p(2000) = e^(−10.6513 + 0.0055×2000) / (1 + e^(−10.6513 + 0.0055×2000)) ≈ 0.586
or approximately 58.6%.
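These two predictions can be reproduced directly from the fitted coefficients (using the intercept and slope reported for this model, β₀ = −10.6513 and β₁ = 0.0055):

```python
import math

beta0, beta1 = -10.6513, 0.0055  # coefficients from the fitted model

def p_default(balance):
    # Logistic function: e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
    z = beta0 + beta1 * balance
    return math.exp(z) / (1 + math.exp(z))

print(f"{p_default(1000):.5f}")  # about 0.00576
print(f"{p_default(2000):.3f}")  # about 0.586
```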
The results of fitting such a model are shown in Table. The coefficient for student[Yes] is
positive and statistically significant, suggesting that students are associated with a higher
probability of default compared to non-students.
• For a student: Pr(default = Yes | student = Yes) ≈ 0.0431
• For a non-student: Pr(default = Yes | student = No) ≈ 0.0292
This confirms that students, on average, have a slightly higher default risk compared to non-
students.
The logistic regression framework naturally extends from a single predictor to multiple
predictors. By analogy with the transition from simple to multiple linear regression in Chapter
3, the single-predictor model generalizes to:
log( p(X) / [1 − p(X)] ) = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p
Example: Balance, Income, and Student Status
Table 4.3 reports the results of fitting a logistic regression model to the Default data, using
balance, income (measured in thousands of dollars), and student status (coded as a dummy
variable) as predictors.
• The coefficient for balance is positive and highly significant, confirming that larger
balances are associated with higher probabilities of default.
• The coefficient for income is small and not statistically significant, indicating little
evidence of an association once balance and student status are included.
• Surprisingly, the coefficient for student status is negative, suggesting that students are
less likely to default than non-students, holding balance and income fixed.
This result appears contradictory to Table, where student status was associated with a higher
probability of default when used as the only predictor.
Imagine you are studying the relationship between exercise and the risk of developing heart
disease. Fitting a logistic regression with exercise as the only predictor, you find that people
who exercise more appear to have a higher risk of heart disease.
This seems surprising: isn't exercise supposed to be good for your heart?
You then add smoking to the model and find:
• in this data, smokers tend to exercise more, and
• smoking itself strongly increases the risk of heart disease.
This makes smoking a confounder: it creates a false relationship between exercise and heart
disease when it's not included in the model. Once smoking is accounted for, the model shows
the true relationship; the positive link between exercise and heart disease disappears and
reverses.
This example shows why it's essential to include confounding variables like smoking. If you
don’t, logistic regression may give you a wrong sign or strength for your predictors. Just like
in the earlier student default example, adding the right variable (like balance or smoking)
reveals the true effect of another variable (like student status or exercise) on the outcome.
Making Predictions
With the estimated coefficients from Table 4.3, probabilities can be calculated for specific
individuals. For example, a student with a credit card balance of $1,500 and an income of
$40,000 has an estimated probability of default of about 5.8%, while a non-student with the
same balance and income has an estimated probability of about 10.5%. Thus, conditional on
balance and income, the student is less likely to default than the non-student.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
      target_names=data.target_names))
3.10 INTRODUCTION – INSTANCE-BASED LEARNING
The k-nearest neighbor (k-NN) algorithm is one of the simplest machine learning techniques.
It works by storing all the training examples and predicting new cases by checking which
stored examples are closest to them. Every data point is considered as a position in an n-
dimensional space, where each attribute represents one axis. The closeness between two
points is usually measured using the Euclidean distance formula. For two instances xi and xj,
where aᵣ(x) is the value of the r-th attribute of instance x, the distance is calculated as:
d(xᵢ, xⱼ) = √( Σᵣ₌₁ⁿ ( aᵣ(xᵢ) − aᵣ(xⱼ) )² )
Here, aᵣ(xᵢ) is the value of the r-th attribute (or feature/component) for data point xᵢ.
When a new query point x is given, the algorithm finds the k nearest training examples. If
the target function is categorical (like yes/no or red/blue), the algorithm assigns the most
common class among those k neighbors. If the target is numerical, the algorithm calculates
the average value of the neighbors. For example, if k=1, the class of the nearest neighbor is
directly assigned. If k>1, the predicted class is the one with the most votes among the k
neighbors: f̂(x_q) = argmaxᵥ Σᵢ₌₁ᵏ δ(v, f(xᵢ)), where δ(a, b) = 1 if a = b and 0 otherwise.
A more refined version is distance-weighted k-NN, where closer neighbors are given more
importance than distant ones. A common method is to weight each neighbor using the inverse
square of its distance: wᵢ = 1 / d(x_q, xᵢ)².
For classification, the prediction is then based on the weighted votes of neighbors. For
regression (when the output is a number), the predicted value is the weighted average of the
k nearest neighbors: f̂(x_q) = ( Σᵢ₌₁ᵏ wᵢ f(xᵢ) ) / ( Σᵢ₌₁ᵏ wᵢ ).
This ensures that points closer to the query have a stronger influence on the final prediction.
If the query exactly matches a training point, then the algorithm directly uses that point’s
class.
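The distance-weighted variant can be sketched in plain Python (the helper name and toy data below are illustrative):

```python
import math

def weighted_knn_classify(query, examples, k=3):
    # Distance-weighted k-NN: `examples` is a list of (point, label) pairs.
    # Each of the k nearest neighbors votes with weight 1/d^2;
    # an exact match decides the class outright.
    scored = []
    for point, label in examples:
        d = math.dist(query, point)
        if d == 0:
            return label  # query coincides with a training point
        scored.append((d, label))
    scored.sort(key=lambda t: t[0])

    votes = {}
    for d, label in scored[:k]:
        votes[label] = votes.get(label, 0.0) + 1.0 / d**2
    return max(votes, key=votes.get)

train = [((1, 1), "red"), ((1, 2), "red"), ((6, 6), "blue"), ((7, 7), "blue")]
print(weighted_knn_classify((2, 2), train))  # red
print(weighted_knn_classify((6, 7), train))  # blue
```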
The strength of k-NN is that it is very easy to understand and works well when there is enough
training data. It also handles noisy data better when distance-weighting is used, as it smooths
out the effect of random errors. However, k-NN also faces some challenges. One big issue is
the curse of dimensionality. When there are many attributes, especially irrelevant ones, the
distance measure can become misleading. Two instances that are very similar in the
important attributes may still appear far apart when all irrelevant attributes are considered.
To solve this, attributes can be given different weights or irrelevant ones can be removed.
Another problem is that k-NN can be slow when the dataset is very large, since each new
query requires calculating distances to all training examples. To make it faster, data structures
like kd-trees can be used to organize the examples and quickly find nearest neighbors.
In short, k-NN is a lazy learning algorithm because it does not build a general model
beforehand. Instead, it makes decisions only when a query is given, based on the nearest
stored examples. With distance-weighting and proper feature selection, it becomes a very
effective method for both classification and regression tasks.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
      target_names=iris.target_names))
K-means clustering is a simple and widely used method for partitioning a dataset into K distinct,
non-overlapping groups. The user first specifies the desired number of clusters, K, and the
algorithm then assigns each observation to exactly one of those clusters. The idea behind K-
means is intuitive: a good clustering is one in which observations within the same cluster are as
similar as possible, while being as different as possible from observations in other clusters. This
is achieved by minimizing the within-cluster variation, which measures how much the
observations within a cluster differ from each other.
Mathematically, the goal is to partition the dataset into K clusters so that the total within-cluster
variation, summed across all clusters, is as small as possible. In practice, this variation is
usually measured using the squared Euclidean distance between observations. Thus, K-means
clustering seeks to minimize the sum of squared distances of each observation from the mean
(centroid) of its assigned cluster.
Finding the exact solution to this optimization problem is extremely difficult because the
number of possible ways to split n observations into K clusters grows exponentially with n.
Instead, K-means uses an iterative algorithm that converges to a local optimum. The algorithm
works as follows:
(1) each observation is randomly assigned to one of the K clusters, giving an initial partition;
(2) the cluster centroids (the mean of all observations in the cluster) are calculated;
(3) each observation is reassigned to the cluster whose centroid is closest.
Steps (2) and (3) are repeated until the assignments stop changing. At every step, the objective
function (within-cluster variation) decreases, so the algorithm always improves until it
stabilizes.
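The two-step loop can be sketched with NumPy on toy data (the blob data and K below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated blobs of toy data
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

K = 2
# Step (1): pick K observations as initial centroids
centroids = X[rng.choice(len(X), K, replace=False)]

while True:
    # Step (3): assign each observation to the nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step (2): recompute each centroid as the mean of its cluster
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):  # assignments have stabilized
        break
    centroids = new_centroids
```

On this toy data the loop converges to the two obvious blobs; each pass can only decrease the within-cluster variation.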
In fact, the within-cluster variation of cluster C_k satisfies the identity
(1/|C_k|) Σ_{i,i′ ∈ C_k} Σⱼ₌₁ᵖ (x_ij − x_i′j)² = 2 Σ_{i ∈ C_k} Σⱼ₌₁ᵖ (x_ij − x̄_kj)²
where |C_k| is the number of observations in cluster k and x̄_kj is the mean of feature j in
cluster k. This equation relates two different but equivalent ways of measuring how spread
out (or "scattered") the points in a cluster are.
Because K-means only guarantees a local optimum, the result depends on the initial random
assignments. This means different runs of the algorithm may produce different clusterings. To
address this, it is common to run K-means multiple times with different starting points and then
choose the best solution, i.e., the one with the lowest within-cluster variation.
A practical challenge in applying K-means is deciding on the number of clusters K. The choice
of K is not straightforward, and different values can lead to very different results. This issue,
along with other practical considerations such as initialization strategies and computational
efficiency, is an important part of applying K-means clustering effectively.
One limitation of K-means clustering is that it requires us to pre-specify the number of clusters,
K. Hierarchical clustering is an alternative approach that does not require committing to a
specific value of K in advance. A key advantage of hierarchical clustering is that it produces a
tree-based representation of the observations, called a dendrogram, which provides a visual
summary of the data’s clustering structure.
Clusters can be identified by making a horizontal “cut” across the dendrogram at a chosen
height. All the groups formed below the cut line represent clusters. For example, cutting at a
higher level may result in just two broad clusters, while cutting lower down can reveal more
detailed groupings. This makes hierarchical clustering flexible: a single dendrogram can provide
multiple levels of clustering, allowing analysts to choose the number of clusters based on the
problem at hand.
The hierarchical clustering algorithm itself is straightforward. First, each observation starts as
its own cluster. Then, at each step, the two clusters that are most similar are merged, reducing
the number of clusters by one. This process continues until all observations form a single
cluster. However, to perform this merging, we need a definition of “dissimilarity” between
clusters. This is where linkage methods come in.
Single Linkage: Measures the shortest "bridge" between clusters (minimal intercluster
dissimilarity); can produce long, trailing clusters through a chaining effect.
Complete Linkage: Tends to form compact, spherical clusters → avoids chaining effect. Looks
for the longest bridge.
Average Linkage: A compromise between single and complete; balances compactness and
chaining. Averages all "bridges".
Centroid Linkage: Can result in inversions (clusters being merged that are not closest).
Measures distance between centers.
Ward's Method: Most commonly used for hierarchical clustering of real-valued data; closely
related to k-means clustering. Measures increase in total cluster variance if merged.
The choice of dissimilarity measures is also crucial. Euclidean distance is most used, but in
some cases, correlation-based distance may be more appropriate. For example, in online
shopping data, using Euclidean distance would group shoppers who purchase few items
overall, regardless of their preferences. In contrast, correlation-based distances would cluster
shoppers with similar purchasing patterns, even if some buy more frequently than others. Thus,
correlation-based distance may be better for applications where the goal is to group based on
patterns rather than volume.
Another important consideration is whether to scale variables before computing dissimilarities.
If features are on different scales (e.g., centimeters vs. kilometers) or if some features vary
much more than others (e.g., frequent purchases like socks vs. rare purchases like computers),
they can dominate the clustering process. Scaling the variables to have standard deviation one
gives all features equal importance, which is often desirable. However, whether scaling is
appropriate depends on the context of the analysis.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
X = np.array([
    [1, 2],
    [2, 1],
    [3, 4],
    [5, 8],
    [8, 8],
    [9, 10]
])
k = 2
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
data = np.array([
    [1, 2],
    [1, 4],
    [1, 0],
    [4, 2],
    [4, 4],
    [4, 0]
])
# Build the merge tree using Ward's method
linked = linkage(data, method='ward')
plt.figure(figsize=(8, 5))
dendrogram(linked)
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
10.1 Single Layer Neural Networks
A neural network is a model that takes an input vector consisting of multiple variables and
builds a nonlinear function to predict a given response. Suppose we have an input vector of p
variables:
The goal is to build a function f(X) that predicts the response Y. Earlier, we studied nonlinear
models such as decision trees, boosting, and generalized additive models. What makes neural
networks different from these methods is their specific layered structure, which mimics how
information flows in the brain.
Let us consider a simple feed-forward neural network. Suppose we have p=4 predictors. In
neural network terminology, the four variables make up the input layer; these feed into a
hidden layer of K hidden units h₁(X), …, h_K(X); and the output layer computes the prediction
as a linear function of the hidden-unit activations:
f(X) = β₀ + Σₖ₌₁ᴷ βₖ hₖ(X), where hₖ(X) = g( wₖ₀ + Σⱼ₌₁ᵖ wₖⱼ Xⱼ )
Here:
• f(X) is the final prediction or output the model gives for input X.
• The ∑ (sum) symbol means "add up" for each hidden unit (from 1 to K).
• hₖ(X) is the output from the k-th hidden unit for input X.
• g is the activation function (like turning the hidden unit "on" or "off" depending on
the signal).
• wₖⱼ represents the strength of the connection (weight) between input Xⱼ and
hidden unit k.
1. Each hidden unit takes the inputs (X), combines them with its own weights and adds a
bias.
2. The activation function g is applied to this weighted combination, producing the hidden
unit's output hₖ(X).
3. All these outputs from each hidden unit are weighted by βₖ and added together with the
main bias β₀.
In summary, the network transforms the original inputs into new nonlinear features through the
activation function, and then fits a linear regression on them.
The key component of neural networks is the activation function g(z), which introduces
nonlinearity. Without it, the entire model would collapse into a simple linear regression.
Common activation functions:
• Sigmoid Function: g(z) = 1 / (1 + e^(−z)), which squashes any real input into the range
(0, 1).
• ReLU (Rectified Linear Unit): g(z) = max(0, z), which passes positive inputs through
unchanged and sets negative inputs to zero.
The choice of activation is crucial: sigmoid helps with probabilistic interpretation, while ReLU is
preferred for efficiency and handling complex patterns.
To understand how nonlinear activations capture interactions, let’s consider a simple example
with p = 2 inputs and K = 2 hidden units, using the quadratic activation g(z) = z².
Parameters: β₀ = 0, β₁ = ¼, β₂ = −¼, w₁ = (0, 1, 1), and w₂ = (0, 1, −1).
Now:
h₁(X) = (X₁ + X₂)² and h₂(X) = (X₁ − X₂)².
Final function:
f(X) = ¼(X₁ + X₂)² − ¼(X₁ − X₂)² = X₁X₂.
This example shows that even simple nonlinear functions of linear combinations can produce
interaction effects between input variables.
However, in practice, we do not use quadratic activations, since they always lead to second-
degree polynomials. Functions like sigmoid and ReLU are more powerful because they do not
impose such strict limitations.
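A quick numerical check of this point, assuming the standard parameter choices for this illustration (g(z) = z², β₀ = 0, β₁ = ¼, β₂ = −¼, w₁ = (0, 1, 1), w₂ = (0, 1, −1)):

```python
import numpy as np

def g(z):
    return z ** 2  # quadratic activation, used only for illustration

def f(x1, x2):
    h1 = g(0 + 1 * x1 + 1 * x2)        # h1 = (x1 + x2)^2
    h2 = g(0 + 1 * x1 - 1 * x2)        # h2 = (x1 - x2)^2
    return 0 + 0.25 * h1 - 0.25 * h2   # algebraically equals x1 * x2

for x1, x2 in [(1.0, 2.0), (-3.0, 0.5), (0.0, 7.0)]:
    assert np.isclose(f(x1, x2), x1 * x2)
print("f(X) reproduces the interaction X1 * X2")
```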
The nonlinearity of the activation function is what gives neural networks their power. If g(z) were
simply the identity function (i.e., g(z) = z), then f(X) would reduce to a linear function of
X₁, …, Xₚ, and the whole model would collapse into ordinary linear regression.
The final step is fitting the model to data. This requires estimating the unknown parameters:
the output weights β₀ and β₁, …, βₖ for k = 1, …, K, together with the hidden-layer weights
wₖ₀, wₖ₁, …, wₖₚ.
For regression tasks, the most common choice of loss function is the squared error
∑ᵢ (yᵢ − f(xᵢ))².
Here, the parameters are chosen to minimize the total squared difference between predicted
and actual values.
Modern neural networks often use more than one hidden layer. Although a single hidden layer
with enough units can theoretically approximate almost any function, training such a large
single-layer network is very difficult. Using multiple layers with fewer units makes the learning
task easier. This approach allows the network to learn step by step, building up more abstract
features at each layer.
A well-known example for testing neural networks is the MNIST dataset. It contains images of
handwritten digits from 0 to 9. Each image is 28 by 28 pixels, which makes 784 pixels in total.
Every pixel has a grayscale value between 0 and 255 that shows how dark it is.
To use these images in a neural network, the pixels are arranged into an input vector of length
784. The correct output is the digit label, which is represented using one-hot encoding. For
example, if the digit is “3,” the output vector will have a 1 in the fourth position and 0 in all others.
There are 60,000 training images and 10,000 test images in this dataset.
Digit recognition might look simple for humans because our brains are highly adapted to
recognize patterns. However, for machines this task was not easy, and research on digit
recognition in the late 1980s helped push the
development of modern neural networks.
The network used for MNIST has three main parts: the
input layer, two hidden layers, and the output layer.
● The input layer has 784 units (one for each pixel).
● The first hidden layer has 256 units.
● The second hidden layer has 128 units.
● The output layer has 10 units, one for each digit from 0 to 9.
In total, this network has around 235,000 parameters (weights) that need to be learned. Each
hidden layer applies a transformation to its input, using an activation function such as ReLU or
sigmoid, so that the network can detect more complex patterns as the data flows through the
layers.
Finally, the output layer produces ten values, one for each digit. To convert these values into
probabilities, the softmax function is used. This ensures that all outputs are non-negative and
add up to one. The digit with the highest probability is chosen as the prediction.
To train the network, we need a loss function that measures how well the predicted probabilities
match the true labels. For classification tasks like this, the most common choice is the cross-
entropy loss. This loss punishes the model if it assigns a low probability to the correct class. By
minimizing this loss with optimization techniques such as gradient descent, the network
gradually improves its accuracy.
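A minimal NumPy sketch of the softmax and cross-entropy computations described here (not part of the original text):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; outputs are positive and sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(probs, true_class):
    # penalizes assigning a low probability to the correct class
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, 0.1])   # raw outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())            # probabilities summing to 1
print(cross_entropy(probs, 0))       # small loss: class 0 has high probability
print(cross_entropy(probs, 2))       # larger loss: class 2 has low probability
```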
To train this network, since the response is qualitative, we look for coefficient estimates that
minimize the negative multinomial log-likelihood (cross-entropy) −∑ᵢ ∑ₘ yᵢₘ log fₘ(xᵢ),
where yᵢₘ is 1 if the ith observation belongs to class m and 0 otherwise.
If the problem had been predicting numerical values instead of classes, then a squared error
loss would have been more suitable. But since this is a classification task, cross-entropy works
best.
When tested on the MNIST dataset, neural networks perform much better than simpler linear
models such as logistic regression and linear discriminant analysis (LDA).
● A neural network with ridge regularization achieves an error rate of about 2.3%.
● A neural network with dropout regularization performs even better, with an error rate
of only 1.8%.
● In comparison, logistic regression has an error rate of 7.2%, and LDA has 12.7%.
This shows that neural networks are much more powerful for pattern recognition tasks
because they can capture nonlinear relationships in the data.
Even though MNIST has 60,000 training examples, the network still has more parameters than
data points. This can easily cause overfitting, where the model memorizes the training data but
fails to generalize to new data.
To avoid overfitting, regularization methods are used. Two common techniques are:
● Ridge Regularization: Adds a penalty for large weights, preventing the model from
becoming too complex.
● Dropout Regularization: Randomly drops out some units during training so the network
does not depend too heavily on specific neurons.
Both of these techniques help the model perform better on test data.
Fitting a neural network is more complex compared to linear or logistic regression. The main
challenge comes from the nonlinear structure, the large number of parameters, and the risk of
overfitting. While modern software makes this process easier, it is important to understand the
principles behind model fitting.
We start with the simple neural network introduced earlier in Section 10.1. In this network, the
parameters include:
● The hidden layer weights for each unit, wₖ = (wₖ₀, wₖ₁, …, wₖₚ), where k = 1, …, K.
● The output layer coefficients βₖ, for k = 0, 1, …, K.
Given training data (xᵢ, yᵢ) for i = 1, …, n, we try to fit the model by minimizing the squared
error loss ½ ∑ᵢ (yᵢ − f(xᵢ))². This loss is nonconvex in the parameters, so two general strategies
are used in practice: slow learning via gradient descent, and regularization.
● Regularization
○ Penalties such as ridge (L2) or lasso (L1) are added to control parameter size.
○ This prevents the model from memorizing noise in the data.
Intuition: Imagine standing in a hilly landscape. Gradient descent is like walking downhill step
by step until reaching a valley (a minimum). Sometimes this valley is global (best solution), other
times it is only local.
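The downhill intuition can be sketched on a one-dimensional toy loss (illustrative only):

```python
# Minimize L(w) = (w - 3)^2 by walking downhill along the gradient.
def grad(w):
    return 2 * (w - 3)   # dL/dw

w = 0.0                  # starting point in the "landscape"
lr = 0.1                 # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)    # step in the direction of steepest descent
print(round(w, 4))       # close to the minimizer w = 3
```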
10.7.1 Backpropagation
The key to making gradient descent work efficiently is backpropagation, which applies the chain
rule of differentiation. Because the network is a composition of simple functions, the derivative
of the loss with respect to any weight can be built up from the derivatives of those pieces.
Here:
● The error at the output, yᵢ − f(xᵢ), is computed first.
● The chain rule propagates this error backward through the network → hence the term
backpropagation.
● A single backward pass yields the gradients for every weight in every layer.
10.7.2 Regularization and Stochastic Gradient Descent (SGD)
● Stochastic gradient descent: instead of computing the gradient over all n observations,
each step uses a small random minibatch, which is far faster for large datasets.
● Regularization during training:
○ Ridge (L2) or lasso (L1) penalties shrink the weights.
○ Dropout randomly removes units during training.
○ Early stopping also acts as a regularizer — stop training when validation error
starts increasing.
With trial and error, networks can achieve very low error rates (e.g., <1% on MNIST digits), but
over-tuning can cause overfitting.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
1. Input shape = (2,) because each input has 2 values (x1, x2).
2. First layer (hidden layer):
o Dense(2) → fully connected layer with 2 neurons.
o activation="sigmoid" → each neuron outputs a value between 0 and 1.
3. Second layer (output layer):
o Dense(1) → 1 neuron output (since XOR has one output).
o activation="sigmoid" → outputs a probability between 0 and 1.
During training, the network repeatedly adjusts its weights by gradient descent so that its
predicted probabilities move closer to the XOR targets.
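The full Keras listing is not reproduced here; as a stand-in, the same 2–2–1 sigmoid network can be trained on XOR with a plain NumPy gradient-descent loop (an illustrative sketch, not the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# 2 inputs -> 2 hidden sigmoid units -> 1 sigmoid output
W1 = rng.normal(size=(2, 2)); b1 = np.zeros((1, 2))
W2 = rng.normal(size=(2, 1)); b2 = np.zeros((1, 1))

losses = []
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)       # hidden activations
    out = sigmoid(h @ W2 + b2)     # network output
    losses.append(np.mean((out - y) ** 2))
    # backpropagation: chain rule through the output and hidden layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(0)

print(losses[0], losses[-1])   # the loss decreases as the weights are learned
```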
What Is a Hyperplane?
In mathematics, a hyperplane is a flat surface that has one dimension less than the space in
which it exists. For example, in two dimensions, a hyperplane is simply a line, and in three
dimensions a hyperplane is a plane. When we move to higher dimensions (p > 3), it becomes
hard to visualize, but the idea is still the same: a hyperplane is always one dimension less than
the space it lives in. In p dimensions, a hyperplane is defined by the equation
β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ = 0.
Any point that satisfies this equation lies on the hyperplane. If the point
does not satisfy the equation exactly, then it lies either on one side of the hyperplane or on the
other side. Specifically, if
then the point lies on one side of the hyperplane. On the other hand, if
then the point lies on the opposite side. In this way, a hyperplane divides the entire space into
two halves.
Once we have such a hyperplane, it can be used as a classifier. A new test observation is
classified depending on which side of the hyperplane it lies. Mathematically, we compute
f(x) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ.
If f(x) > 0, the observation is assigned to class +1. If f(x) < 0, it is assigned to class −1. The size of
|f(x)| also tells us how confident the classification is. If f(x) is large in magnitude, the point lies
far away from the hyperplane, and we are more confident in the class label. If f(x) is close to
zero, the point lies near the hyperplane, and the classification is less certain.
Thus, a classifier based on a separating hyperplane produces a linear decision boundary that
divides the space into two regions, one for each class.
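This decision rule can be sketched with hypothetical coefficients (illustrative only):

```python
import numpy as np

beta0 = -1.0
beta = np.array([1.0, 1.0])      # hypothetical hyperplane: x1 + x2 - 1 = 0

def classify(x):
    f = beta0 + beta @ x         # signed score
    label = 1 if f > 0 else -1   # side of the hyperplane
    return label, abs(f)         # |f| measures confidence

print(classify(np.array([2.0, 2.0])))   # far on the +1 side: high confidence
print(classify(np.array([0.4, 0.5])))   # just on the -1 side: low confidence
```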
Our goal is to develop a classifier using training data that can correctly classify this test
observation.
If it is possible to construct a separating hyperplane, then the two classes can be perfectly
divided. We label the observations from the blue class as yᵢ = +1 and those from the purple class
as yᵢ = −1. For the blue class (yᵢ = +1), the condition is β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ > 0, and for the
purple class (yᵢ = −1) it is β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ < 0. A separating hyperplane therefore has
the property that yᵢ(β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ) > 0 for every training observation. For a test
observation x∗, we compute f(x∗) = β₀ + β₁x₁∗ + ⋯ + βₚxₚ∗.
● If f(x∗) > 0, we assign the test
observation to class +1.
● If f(x∗) < 0, we assign the test
observation to class –1.
A natural choice is the maximal margin hyperplane (also called the optimal separating
hyperplane). This is the hyperplane that is farthest away from the training observations. To
define this, we first calculate the perpendicular distance from each training observation to a
given hyperplane. The smallest of these distances is called the margin. The maximal margin
hyperplane is the one that maximizes this margin, meaning it is the hyperplane with the largest
possible minimum distance to the training points.
Once we have the maximal margin hyperplane, we can classify a test observation based on
which side of this hyperplane it lies on. This method is called the maximal margin classifier. The
idea is that if we build a classifier with a large margin on the training data, it will hopefully also
give a large margin on unseen test data, leading to good classification accuracy. However, when
the number of features p is very large, the maximal margin classifier can sometimes overfit the
data.
If the coefficients of the maximal margin hyperplane are β₀, β₁, ..., βₚ, then classification is
done using the sign of the function f(x∗) = β₀ + β₁x₁∗ + β₂x₂∗ + ⋯ + βₚxₚ∗.
Figure 9.3 shows an example of the maximal margin hyperplane. Compared to the separating
hyperplanes in Figure 9.2, the maximal margin hyperplane clearly results in a larger margin
between the classes. It can be thought of as the mid-line of the widest “slab” that can be placed
between the two classes without touching any training points.
Looking at Figure 9.3, we see that three training observations lie exactly on the dashed lines that
mark the edges of the margin. These points are called support vectors. They are the points that
directly “support” the maximal margin hyperplane. If any of these support vectors were moved
slightly, the position of the hyperplane would change. However, moving any of the other training
observations would not affect the hyperplane if they do not cross into the margin.
This property is very important. It means that the maximal margin hyperplane depends only on
a small subset of the training data (the support vectors), not on all the observations. This idea
will play a central role later when we discuss the support vector classifier and support vector
machines.
The maximal margin hyperplane is the solution to the problem of choosing β₀, β₁, …, βₚ to
maximize M subject to ∑ⱼ βⱼ² = 1. The condition
yᵢ(β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ) ≥ M for all i = 1, …, n
ensures that each observation is on the correct side of the hyperplane with a cushion of at least
margin M (as long as M > 0).
The normalization ∑ⱼ βⱼ² = 1 is needed because if β₀ + β₁X₁ + ⋯ + βₚXₚ = 0
defines a hyperplane, then multiplying the whole equation by any constant k also defines the
same hyperplane. By fixing the sum of squares of the coefficients to 1, we make the margin well-
defined.
Under these constraints, the perpendicular distance from the ith observation to the
hyperplane is given by yᵢ(β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ).
Thus, maximizing M means finding the hyperplane that maximizes the margin between the two
classes. This hyperplane is called the maximal margin hyperplane.
The maximal margin classifier works well when the data can be perfectly separated by a straight
line (or hyperplane in higher dimensions). In such cases, it finds the boundary that leaves the
largest possible margin between the two classes.
However, in many real situations, the data from different classes overlap and no single
hyperplane can separate them. When this happens, the optimization problem for the maximal
margin classifier has no solution,
since it requires a margin greater than
zero.
To handle such cases, the idea of a separating hyperplane is relaxed. Instead of demanding
perfect separation, we allow for some flexibility so that the boundary “almost” separates the
classes. This leads to what is called the soft margin, which forms the basis of the support
vector classifier, a more general version of the maximal margin classifier that works when the
data are not perfectly separable.
In many cases, data points from two classes cannot be perfectly separated by a straight line
(hyperplane). Even if separation is possible, using such a classifier may not always be a good
choice. The reason is that a hyperplane that separates all training points exactly can become
too sensitive to individual observations.
For example, in Figure 9.5 (left panel), we see two classes of points—blue and purple—with a
maximal margin hyperplane. But when a single extra blue point is added (right panel), the
hyperplane shifts dramatically. The new separating line has a very small margin, which makes
it unreliable. Since the distance of points from the hyperplane is a measure of classification
confidence, a tiny margin means we cannot be confident about the predictions. This shows that
the maximal margin classifier can easily overfit, adjusting too much to specific training points
instead of capturing the general pattern.
To solve this issue, it can be useful to allow a classifier that does not perfectly separate the two
classes. By permitting a few misclassifications, we can achieve a boundary that better
represents the overall data. This approach is called the support vector classifier or soft margin
classifier.
The idea is that instead of forcing all points to be on the correct side of both the hyperplane and
the margin, we allow some flexibility. A few observations may fall inside the margin, and some
may even be on the wrong side of the hyperplane. The margin is called “soft” because it can be
violated by certain training points.
For instance, in Figure 9.6 (left panel), most observations lie correctly outside the margin, but a
few fall inside it. In the right panel of Figure 9.6, some points are even misclassified because
they lie on the wrong side of the hyperplane. This situation is expected when no perfect
hyperplane exists. By accepting these violations, the support vector classifier achieves greater
robustness and usually performs better on unseen data.
The support vector classifier decides the class of a new observation based on which side of a
hyperplane it falls. The hyperplane is chosen so that it separates most of the training data
correctly, but it may misclassify a few points.
The support vector classifier is the solution to the optimization problem: maximize M subject to
∑ⱼ βⱼ² = 1, yᵢ(β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ) ≥ M(1 − ξᵢ), ξᵢ ≥ 0, and ∑ᵢ ξᵢ ≤ C (equations 9.12–9.15).
Here, M is the margin we want to maximize, ξᵢ are slack variables, and C is a nonnegative tuning
parameter. After solving this, we classify a new observation x∗ by checking the sign of
f(x∗) = β₀ + β₁x₁∗ + ⋯ + βₚxₚ∗.
The slack variables ξᵢ describe where each observation lies relative to the margin: if ξᵢ = 0, the
observation is on the correct side of the margin; if ξᵢ > 0, it has violated the margin; and if
ξᵢ > 1, it is on the wrong side of the hyperplane itself.
The parameter C controls how many violations are allowed. Equation (9.15) makes sure the
total amount of violation does not exceed C. If C=0, then all ξi=0, so no violations are allowed,
and the problem reduces to the maximal margin hyperplane (which only works if the data are
separable). For C>0, at most C observations can be on the wrong side of the hyperplane, since
that requires ξi>1.
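Once a candidate hyperplane is fixed (and scaled so the margin edges sit at f(x) = ±1), the slack values can be computed directly (hypothetical numbers, illustrative only):

```python
import numpy as np

# f(x) = beta0 + beta . x, with the margin edges at f(x) = +/-1
beta0, beta = 0.0, np.array([1.0, 0.0])

X = np.array([[ 2.0, 0.0],   # correct side, outside the margin
              [ 0.5, 0.0],   # correct side, but inside the margin
              [-0.5, 0.0]])  # wrong side of the hyperplane
y = np.array([1.0, 1.0, 1.0])

f = beta0 + X @ beta
xi = np.maximum(0.0, 1.0 - y * f)  # slack: 0 outside margin, >1 if misclassified
print(xi)                           # [0.  0.5 1.5]
```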
When C is larger, the model allows more violations, and the margin becomes wider. When C is
smaller, fewer violations are allowed, and the margin becomes narrower. Figure 9.7 shows this:
the top-left panel uses the largest value of C, and the margin is wide. As C decreases (top-right,
bottom-left, bottom-right panels), the margin gets narrower.
When C is large, many observations violate the margin, so there are many support vectors. This
makes the classifier depend on many points. In the top-left panel of Figure 9.7, there are many
support vectors. When C is small, there are fewer support vectors, as shown in the bottom-right
panel of Figure 9.7, which has only eight support vectors.
The support vector classifier works well when the boundary between two classes is roughly
linear. But in many real problems, the classes are separated by non-linear boundaries.
For example, in the left panel of Figure 9.8, the data clearly has a non-linear boundary. A
support vector classifier or any linear classifier would perform badly. This is confirmed in the
right panel of Figure 9.8, where the linear support vector classifier does not separate the
classes properly.
A similar situation was seen earlier in Chapter 7, where linear regression struggled with non-
linear relationships. There, we solved the issue by enlarging the feature space using
polynomial terms like quadratic or cubic functions.
We can use the same idea here for the support vector classifier. Instead of just using the original
predictors X₁, X₂, …, Xₚ, we could add polynomial functions of them. For example, we could fit
a classifier with the features X₁, X₁², X₂, X₂², …, Xₚ, Xₚ².
This would double the number of features, giving 2p features instead of just p.
In the enlarged feature space, the decision boundary is still linear. But in the original space, the
boundary comes from a polynomial equation, which is generally non-linear.
We can even add higher-order polynomials (cubic, quartic, etc.) or interaction terms like XⱼXⱼ′. But
if we keep adding more functions, the number of features could become extremely large,
making computation difficult.
This is where the support vector machine (SVM) comes in. It allows us to enlarge the feature
space efficiently using kernels, without explicitly creating all those new features.
The support vector machine (SVM) is an extension of the support vector classifier that handles
non-linear boundaries by using kernels. Kernels allow us to enlarge the feature space efficiently
without having to compute all polynomial or interaction terms directly.
The details of how the support vector classifier is computed are technical, but the key idea is
that the solution involves only the inner products of the observations,
⟨xᵢ, xᵢ′⟩ = ∑ⱼ xᵢⱼ xᵢ′ⱼ. (9.17)
Now, instead of using just the standard inner product (9.17), we can replace it with a kernel
function K(xᵢ, xᵢ′):
● If we use K(xᵢ, xᵢ′) = ∑ⱼ xᵢⱼ xᵢ′ⱼ,
we get back the linear kernel, which is just the regular support vector classifier.
● If we use K(xᵢ, xᵢ′) = (1 + ∑ⱼ xᵢⱼ xᵢ′ⱼ)ᵈ,
we get a polynomial kernel of degree d. With d > 1, this allows non-linear boundaries.
When the support vector classifier is combined with such a kernel, it is called a
support vector machine. The classifier is then f(x) = β₀ + ∑ᵢ∈S αᵢ K(x, xᵢ), where S is the set of
support vectors.
● The left panel of Figure 9.9 shows an SVM with a polynomial kernel of degree 3 applied
to the non-linear data from Figure 9.8. It gives a much better decision boundary than
the linear support vector classifier.
The right panel of Figure 9.9 shows an SVM with a radial kernel on the same data, and it
also gives a good separation.
How does the radial kernel work? The radial (RBF) kernel takes the form
K(x∗, xᵢ) = exp(−γ ∑ⱼ (x∗ⱼ − xᵢⱼ)²), where γ is a positive constant.
● If a test observation x∗ is far (in Euclidean distance) from a training observation xᵢ, then
K(x∗, xᵢ) will be close to 0.
● This means far-away points have little to no influence on the classification of x∗.
So, the radial kernel has a local behavior, where only nearby training points affect the
decision.
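A small sketch of this local behavior (γ = 1, hypothetical points; not from the text):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # shrinks toward 0 as the squared Euclidean distance grows
    return np.exp(-gamma * np.sum((x - z) ** 2))

x_star = np.array([0.0, 0.0])
near = np.array([0.5, 0.0])
far = np.array([5.0, 0.0])
print(rbf_kernel(x_star, near))   # close to 1: strong influence
print(rbf_kernel(x_star, far))    # close to 0: almost no influence
```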
Finally, why use kernels instead of directly enlarging the feature space like in Section 9.3.1?
The main reason is computation. With kernels, we only need to compute K(xi,xi′) for all
distinct pairs of observations, without explicitly working in the enlarged feature space.
This is crucial because in many cases the enlarged feature space is extremely large, or even
infinite (as with the radial kernel). Kernels let us work efficiently without ever constructing that
huge space.
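This efficiency claim can be checked numerically: for p = 2, a degree-2 polynomial kernel equals an ordinary inner product in an explicitly enlarged 6-dimensional feature space (a sketch; φ is the standard feature map for this kernel):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    # kernel trick: work only with the inner product in the original space
    return (1 + x @ z) ** d

def phi(x):
    # explicit degree-2 feature map for p = 2 (6 features)
    s = np.sqrt(2)
    return np.array([1, s * x[0], s * x[1], x[0] ** 2, x[1] ** 2, s * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))   # computed in the original 2-D space -> 4.0
print(phi(x) @ phi(z))     # same value via the enlarged 6-D space -> 4.0
```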
Decision trees and related methods were applied to the Heart data. The goal was to use 13
predictors such as Age, Sex, and Chol to predict whether a person has heart disease. Here, an
SVM is compared to LDA on this data. After removing 6 missing observations, 297 subjects
remain, split into 207 training and 90 test observations.
First, LDA and the support vector classifier (SVM with polynomial kernel of degree 1) were fitted
to the training data. The left panel of Figure 9.10 shows ROC curves for training set predictions
from both methods.
Both calculate scores of the form f̂(X) = β̂₀ + β̂₁X₁ + β̂₂X₂ + ⋯ + β̂ₚXₚ.
For a cutoff t, predictions are made based on whether f̂(X) < t or f̂(X) ≥ t. The ROC curve
is formed by calculating false positive and true positive rates across different values of t. The
best classifier will be close to the top left of the ROC plot. Here, LDA and the support vector
classifier both perform well, though the support vector classifier may be slightly better.
The right panel of Figure 9.10 shows ROC curves for SVMs using a radial kernel with different
values of γ. As γ increases, the fit becomes more non-linear, and the ROC curves improve. With
γ=10^−1, the training ROC curve looks almost perfect. But these are training results, which may
not reflect test performance.
Figure 9.11 shows ROC curves for the 90 test observations. In the left panel, the support vector
classifier has a small advantage over LDA (though not statistically significant). In the right panel,
the SVM with γ=10^−1, which performed best on training data, performs worst on test data.
This shows that more flexible models can lower training error but not necessarily improve test
performance. The SVMs with γ = 10^−2 and γ = 10^−3 perform similarly to the support vector
classifier, and all three do better than the SVM with γ = 10^−1.
One-versus-one classification: In this approach, separate classifiers are trained for every
possible pair of classes, and a test observation is assigned to the class that wins the largest
number of pairwise comparisons.
One-versus-all classification: This method compares each class with all the remaining classes
together.
● Each classifier predicts whether an observation belongs to its specific class (+1) or to
the rest (–1).
● For a test observation:
A score is calculated for each class.
The class with the highest score is chosen as the predicted class.
5.1 Cross-Validation
● Training error → how well the model predicts the data it was trained on.
● Test error → how well the model predicts new, unseen data.
What we really care about is the test error, since it tells us how the model will perform in real
situations. But usually, we don’t have a separate test set available.
The training error is easy to calculate, but it often underestimates the true test error
(sometimes by a lot). That’s why we need ways to estimate the test error using only the training
data.
1. Mathematical adjustment → tweak the training error to better estimate the test error.
2. Resampling methods → hold out part of the training data, train the model on the rest, and
then test it on the held-out part.
In this section, the focus is on the second strategy: estimating test error by holding out subsets
of data.
5.1.1 The Validation Set Approach
The validation set approach is a simple way to estimate how well a model will perform on
unseen data. In this method, the dataset is randomly divided into two parts: a training set and
a validation (hold-out) set. The model is trained using the training set and then evaluated on
the validation set. The prediction error on the validation set, often measured using Mean
Squared Error (MSE) for regression problems, serves as an estimate of the test error.
To illustrate this, consider the Auto dataset, where we want to predict miles per gallon (mpg)
from horsepower using polynomial regression. The dataset of 392 observations was randomly
split into two equal halves: 196 for training and 196 for validation. The results showed that a
quadratic model (including horsepower and horsepower²) had a much lower validation error
compared to a simple linear model. Interestingly, adding a cubic term did not improve
performance; in fact, the cubic model performed slightly worse than the quadratic one. This
suggests that while the linear model was too simple, a quadratic model was sufficient without
needing more complexity.
However, one drawback of this method is that the results can vary depending on how the data
is split. For example, if we split the Auto dataset multiple times, the estimated test error
changes with each split. Although all splits consistently show that the linear model is
inadequate and the quadratic model is effective, there is no agreement about whether higher-
order polynomials offer any real benefit. This highlights the high variability of the validation set
approach.
Another drawback is that the model is trained on only part of the data, not the full dataset. Since
statistical models generally perform better when trained on more data, the validation error may
be larger than the true test error we would expect if the model were trained on the entire
dataset. In other words, the validation set approach tends to be too pessimistic.
Because of these issues, the validation set approach, though simple and easy to implement, is
not always reliable. To overcome its limitations, more advanced methods such as cross-
validation are used, which provide more stable and accurate estimates of test error.
5.1.2 Leave-One-Out Cross-Validation
Leave-One-Out Cross-Validation (LOOCV) is very similar to the validation set approach, but
it improves on its drawbacks. Instead of splitting the data into two large subsets, LOOCV takes
a different approach: each time, it leaves out just one observation to use as the validation set,
and uses the remaining n−1 observations for training. The model is fit on these n−1 data points,
and then it predicts the left-out observation. The squared error for this prediction is recorded.
This process is repeated for every single observation in the dataset. That means if we have n
data points, the model will be trained n times, each time leaving out a different observation. At
the end, all n squared errors are averaged to give the LOOCV estimate of the test error
CV(n) = (1/n) ∑ᵢ MSEᵢ, where MSEᵢ = (yᵢ − ŷᵢ)².
LOOCV has some clear advantages over the simple validation set approach. First, it has low
bias, because each training set contains almost all the data (n−1observations), rather than just
half the dataset as in the basic validation method. This makes its error estimates much closer
to what we would expect if we trained on the full dataset. Second, LOOCV results are
deterministic. Unlike the validation set method, which gives different answers depending on
how we randomly split the data, LOOCV always gives the same result because every possible
split is considered.
For example, in the Auto dataset (predicting mpg from horsepower using polynomial
regression), LOOCV can be used to estimate the test error for models of different polynomial
degrees. The error curve from LOOCV provides insight into which model complexity works best,
without depending on random data splits.
The main drawback of LOOCV is its computational cost. Since the model must be trained n
times, it can be very slow for large datasets or for models that take a long time to train.
Fortunately, for linear regression (including polynomial regression), there is a mathematical
shortcut that makes LOOCV as fast as fitting just one model. This shortcut uses a formula
involving the residuals and “leverage” values from the regression, which adjusts errors
appropriately for each observation.
Finally, LOOCV is a general method that can be applied to almost any predictive model, such
as logistic regression, linear discriminant analysis, and more. However, the special shortcut
formula only works for linear regression; for other models, the full n-times fitting process is
necessary.
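The leverage shortcut can be verified numerically on synthetic data (a sketch assuming ordinary least squares via NumPy; not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])         # design matrix with intercept

# Brute force: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta) ** 2)
loocv_brute = np.mean(errs)

# Shortcut: one fit, then adjust each residual by its leverage h_i
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
h = np.diag(H)                                # leverage values
loocv_fast = np.mean((resid / (1 - h)) ** 2)

print(loocv_brute, loocv_fast)                # the two estimates agree exactly
```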
5.1.3 k-Fold Cross-Validation
In k-fold CV, the observations are randomly divided into k groups (folds) of roughly equal size.
Each fold in turn is treated as a validation set, the model is fit on the remaining k − 1 folds, and
the k resulting error estimates are averaged.
LOOCV can be viewed as a special case of k-fold CV, where k = n (the number of observations).
However, in practice, k is usually set to 5 or 10, leading to 5-fold CV or 10-fold CV. The main
reason for this is computational efficiency: LOOCV requires fitting the model n times, which can
be extremely slow for large datasets or complex models, while 10-fold CV requires fitting only
10 models, making it much more feasible. Beyond computation, 5- or 10-fold CV also has
advantages in terms of balancing bias and variance, as discussed later.
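The fold construction itself can be sketched in a few lines (illustrative helper, not from the text):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # shuffle the indices, then split them into k (nearly) equal folds
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = kfold_indices(20, 5)
print([len(f) for f in folds])   # five folds of 4 observations each
# each fold serves once as a validation set; train on the other k-1 folds
```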
When applied to the Auto dataset, 10-fold CV produced slightly different results depending on
how the data was split into folds, but the variability was far less than with the simple validation
set approach. In general, k-fold CV provides a more stable and reliable estimate of test error
than a single train/validation split.
To better understand how accurate cross-validation is, researchers often turn to simulated
data, where the true test error is known. In such cases, the test MSE curves from LOOCV and
10-fold CV usually look very similar to the true curve, even if they slightly over- or under-
estimate the actual error values. Importantly, even when the error values are not perfect, the
location of the minimum error point, that is, the level of model flexibility that produces the
best performance, is almost always correctly identified by cross-validation.
Thus, k-fold CV not only provides a good estimate of test error but also serves as a practical tool
for model selection, helping us choose the model or level of complexity that will generalize
best to unseen data.
So far, we have seen that LOOCV is almost unbiased because each training set uses nearly all
the data (n−1 observations). In comparison, k-fold CV uses fewer observations in each training
set (around (k−1)n/k observations), so its estimates have a bit more bias. From this perspective,
LOOCV might
look like the better choice.
However, bias is not the only issue; we also need to consider variance. LOOCV tends to have
higher variance than k-fold CV. This happens because in LOOCV, each of the n models is
trained on nearly identical datasets, so their results are highly correlated. Averaging many highly
correlated results does not reduce variance much. In contrast, k-fold CV with k<n trains on
more distinct subsets, so the models are less correlated with each other. This leads to a lower
variance in the final error estimate.
In other words, LOOCV has low bias but high variance, while k-fold CV (with k = 5 or 10) balances
the two by introducing a small amount of bias but reducing variance significantly. Empirical
studies show that 5-fold or 10-fold CV usually gives the most reliable estimates of test
error, avoiding both the high variance of LOOCV and the high bias of the simple validation set
approach.
Up to this point, cross-validation was illustrated in regression problems where the outcome Y
is quantitative, and test error was measured using mean squared error (MSE). However, cross-
validation is equally useful for classification problems where Y is qualitative. In this case,
instead of MSE, the measure of test error is simply the misclassification rate, the proportion
of observations incorrectly predicted. For example, in LOOCV, the error rate is the fraction of
observations misclassified across the n fits, CV(n) = (1/n) ∑ᵢ I(yᵢ ≠ ŷᵢ). The same idea applies
to k-fold CV and
the validation set approach.
In practice, we do not know the Bayes boundary or true test errors, so we rely on cross-
validation to guide model choice. The left panel of Figure 5.8 shows training error, test error, and
10-fold CV error for logistic regression with polynomial terms up to order 10. As expected,
training error decreases with higher flexibility, but test error follows a U-shaped curve: it
decreases initially, then increases due to overfitting. The 10-fold CV curve closely tracks the
test error, even though it slightly underestimates it. Importantly, CV correctly identifies the best
model complexity—here around 3rd or 4th order polynomials.
A similar pattern is seen with the K-nearest neighbors (KNN) classifier (right panel of Figure
5.8). Training error decreases as K becomes smaller (i.e., the model becomes more flexible),
but test error again follows a U-shape. Cross-validation provides a reliable estimate of test error
and helps identify the best value of K. Thus, cross-validation is a powerful tool for model
selection in classification, just as in regression.
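The model selection described above can be sketched for KNN with 10-fold CV; the iris data and the candidate values of K here are illustrative assumptions:

```python
# Choosing K for a KNN classifier by 10-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over the 10 folds; CV error = 1 - accuracy
    score = cross_val_score(knn, X, y, cv=10).mean()
    if score > best_score:
        best_k, best_score = k, score

print("Best K:", best_k, "with CV accuracy:", best_score)
```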
19.7 Measuring Classifier Performance
For classification, especially for two-class problems, a variety of measures have been
proposed. There are four possible cases:
● For a positive example, if the prediction is also positive, this is a true positive (TP).
● If the prediction is negative for a positive example, this is a false negative (FN).
● For a negative example, if the prediction is also negative, this is a true negative (TN).
● For a negative example, if the prediction is positive, this is a false positive (FP).
In some two-class problems, we make a distinction between the two classes and hence between
the two types of error: false positives and false negatives. Different measures, appropriate in
different settings, are given below (with N = TP + FP + TN + FN):
● Error: (FP + FN) / N
● Accuracy: (TP + TN) / N
● TP-rate (hit rate, recall, sensitivity): TP / (TP + FN)
● FP-rate (false alarm rate): FP / (FP + TN)
● Precision: TP / (TP + FP)
● Specificity: TN / (TN + FP)
Let us consider an authentication application where users log in by voice. A false positive
means wrongly logging in an impostor, and a false negative means refusing a valid user. It is
clear that the two types of errors are not equally bad; the former is much worse.
● True Positive Rate (TP-rate / Hit Rate): Proportion of valid users authenticated.
● False Positive Rate (FP-rate / False Alarm Rate): Proportion of impostors wrongly
accepted.
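These two rates can be read off a confusion matrix. A minimal sketch with made-up illustrative labels (1 = valid user, 0 = impostor):

```python
# TP-rate and FP-rate from a confusion matrix (illustrative data).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = valid user, 0 = impostor
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# With labels {0, 1}, ravel() yields counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tp_rate = tp / (tp + fn)  # hit rate: proportion of valid users authenticated
fp_rate = fp / (fp + tn)  # false alarm rate: impostors wrongly accepted
print("TP-rate:", tp_rate, "FP-rate:", fp_rate)
```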
Suppose the system returns a probability of the positive class, ˆP(C1|x). Then, for the negative
class, we have ˆP(C2|x) = 1 − ˆP(C1|x). We choose “positive” if ˆP(C1|x) > θ.
● If θ is close to 1, we rarely choose the positive class. This gives almost no false positives
but also very few true positives.
● If we decrease θ, we increase the number of true positives but risk introducing false
positives.
For different values of θ, we can obtain multiple pairs of (TP-rate, FP-rate). By connecting
them, we get the ROC curve.
● The diagonal corresponds to random guessing; any classifier below the diagonal can be
improved by simply flipping its decisions.
If two ROC curves intersect, one classifier may be better under certain loss conditions, while
the other is better under others.
ROC allows a visual analysis, but we can also reduce it to a single number: the Area Under the
Curve (AUC).
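A minimal sketch of tracing the ROC curve and computing AUC from probability scores; the labels and scores below are made-up illustrative values:

```python
# ROC curve and AUC from predicted positive-class probabilities
# (illustrative data).
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
p_pos = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])  # P(C1|x)

# Each threshold theta gives one (FP-rate, TP-rate) pair on the curve
fpr, tpr, thresholds = roc_curve(y_true, p_pos)
auc = roc_auc_score(y_true, p_pos)
print("AUC:", auc)
```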
As in ROC analysis, for different thresholds, we can plot Precision vs Recall curves.
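A Precision-Recall curve can be obtained from probability scores in the same threshold-sweeping way; the data below is an illustrative assumption:

```python
# Precision-Recall pairs for varying thresholds (illustrative data).
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
p_pos = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])

# One (precision, recall) pair per threshold on the scores
precision, recall, thresholds = precision_recall_curve(y_true, p_pos)
print(list(zip(recall, precision)))
```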
# Naive Bayes on categorical data (age, income) with label encoding
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Features: [age, income]
X = [
    ['young', 'high'],
    ['young', 'high'],
    ['middle', 'medium'],
    ['old', 'low'],
    ['old', 'low'],
    ['old', 'medium'],
    ['middle', 'medium'],
    ['young', 'low']
]
# Labels (target) -- omitted in the original listing; illustrative values
y = ['no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']

le_age = LabelEncoder()
le_income = LabelEncoder()
le_label = LabelEncoder()

# Encode features: each categorical column becomes integer codes
age_encoded = le_age.fit_transform([row[0] for row in X])
income_encoded = le_income.fit_transform([row[1] for row in X])
X_encoded = np.column_stack((age_encoded, income_encoded))

# Encode labels
y_encoded = le_label.fit_transform(y)

model = GaussianNB()
model.fit(X_encoded, y_encoded)

# Predict for a new example: middle-aged person with high income
test = np.array([[le_age.transform(['middle'])[0],
                  le_income.transform(['high'])[0]]])
pred = model.predict(test)
print("Prediction:", le_label.inverse_transform(pred)[0])
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# 1. Load data (Boston housing; the file name here is illustrative)
df = pd.read_csv("boston.csv")
# 2. Features and target (MEDV = median home value)
X = df.drop(columns=["MEDV"])
y = df["MEDV"]
# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# 4. Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# 5. Predict
y_pred = model.predict(X_test)
# 6. Evaluate
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
# 7. Visualization (optional): predicted vs actual values
plt.scatter(y_test, y_pred)
plt.show()
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Example data: y = x^2
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 4, 9, 16, 25])
# Map each x to the features [1, x, x^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
# Predict on new points (X_test was undefined in the original listing)
X_test = np.array([6, 7]).reshape(-1, 1)
X_test_poly = poly.transform(X_test)
y_pred = model.predict(X_test_poly)
print(y_pred)
import numpy as np
from sklearn.linear_model import Lasso
# Example data (X was undefined in the original listing; a simple 1-D feature)
X = np.arange(1, 8).reshape(-1, 1)
y = np.array([1, 3, 2, 5, 7, 8, 7])
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
# Predict on the training points (the original used an undefined X_test)
y_pred = lasso.predict(X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# 1. Load data (Boston housing; the file name here is illustrative)
df = pd.read_csv("boston.csv")
# 2. Features and target
X = df.drop(columns=["MEDV"])
y = df["MEDV"]
# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# 4. Train a ridge model (L2-regularized linear regression)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
# 5. Predict
y_pred = ridge_model.predict(X_test)
# 6. Evaluate
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
# 7. Plot predicted vs actual values
plt.scatter(y_test, y_pred)
plt.xlabel("Actual MEDV")
plt.ylabel("Predicted MEDV")
plt.show()
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a linear-kernel SVM and evaluate on the test set
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
y_pred = svm_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
import numpy as np
# Training data (undefined in the original listing; illustrative 1-D example)
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
# Initialize a single weight randomly
weight = np.random.rand(1)
# Learning rate
lr = 0.01
# Training loop
for i in range(1000):
    y_pred = X * weight
    # Calculate error
    error = y - y_pred