Reading Notes

The document outlines the curriculum for a Machine Learning course as part of a B.A. program, detailing topics such as supervised and unsupervised learning, regression, classification, and clustering. It includes practical guidelines for students, recommended readings, and examples of machine learning applications across various fields. Additionally, it emphasizes the importance of data preprocessing techniques like feature scaling to enhance model performance.


B.A. (PROG.) COMPUTER APPLICATIONS
DSC-MAJOR PAPER
SEMESTER-V
MACHINE LEARNING

All files are for limited circulation and are provided for students to read only. There is no intention to interfere with or alter the content of this reading reference material for any personal or professional use. The University and the faculty do not claim authorship of any of the readings, website links, or videos. The PDFs and videos provided have been taken from open educational resources that are freely available.
Guidelines for [Link] (H) Computer Science VI Semester / B.A. Programme V Semester / Generic Elective VII Semester (NEP UGCF 2022)
MACHINE LEARNING

DSC17/DSC-A5/GE7c

(Effective from Academic Year 2024-25)

Unit 1: Introduction (5 hours)
Basic definitions and concepts, key elements, supervised and unsupervised learning, introduction to reinforcement learning, applications of ML.
Readings: Chapter 1 [3]

Unit 2: Preprocessing (6 hours)
Feature scaling, feature selection methods, dimensionality reduction (Principal Component Analysis).
Readings: Chapter 6 (6.1.1, 6.1.2) [2]; Chapter 10 (10.2) [2]

Unit 3: Regression (12 hours)
Linear regression with one variable, linear regression with multiple variables, gradient descent, over-fitting, regularization, regression evaluation metrics.
Readings: Chapter 3 (3.1, 3.2) [2]; Chapter 6 (6.2.1) [2]

Unit 4: Classification (15 hours)
Decision trees, Naive Bayes classifier, logistic regression, k-nearest neighbor classifier, perceptron, multilayer perceptron, neural networks, Support Vector Machine (SVM), classification evaluation metrics.
Readings: Chapter 3 (3.1, 3.2, 3.3, 3.4) [1]; Chapter 6 (6.1, 6.2, 6.7, 6.9) [1]; Chapter 4 (4.3.1, 4.3.2, 4.3.3, 4.3.4) [2]; Chapter 8 (8.1, 8.2) [2]; Chapter 10 (10.1, 10.2, 10.7) [2]; Chapter 9 (9.1, 9.2, 9.3, 9.4) [2]; Chapter 5 (5.1) [3]; Chapter 19 (19.7)

Unit 5: Clustering (7 hours)
Approaches for clustering, distance metrics, K-means clustering, hierarchical clustering.
Readings: Chapter 12 (12.4.1, 12.4.2) [2]
Essential/recommended readings

1. Mitchell, T.M. Machine Learning, McGraw Hill Education, 2017.


2. James, G., Witten, D., Hastie, T., Tibshirani, R., An Introduction to Statistical Learning with Applications in Python, Springer, 2023.
3. Alpaydin, E., Introduction to Machine Learning, MIT Press, Third Edition.

Additional References

1. Flach, P., Machine Learning: The Art and Science of Algorithms that Make Sense
of Data, Cambridge University Press, 2015.
2. Bishop, C. M., Pattern Recognition and Machine Learning, New York: Springer-Verlag, 2016.
3. Sebastian Raschka, Python Machine Learning, Packt Publishing Ltd, 2019.

Practicals

For practical labs in Machine Learning, students may use software such as MATLAB, Octave, Python, or R. Utilize publicly available datasets from online repositories like [Link] and [Link]

For evaluation of the regression/classification models, perform experiments as follows:

● Split datasets into training and test sets and evaluate the resulting models
● Perform k-fold cross-validation on datasets for evaluation

Report the efficacy of the machine learning models as follows:

● MSE and R² score for regression models

● Accuracy, TP, TN, FP, FN, error, Recall, Specificity, F1-score, AUC for classification models

For relevant datasets make prediction models for the following:

1. Naïve Bayes Classifier

2. Simple Linear Regression

3. Multiple linear regression


4. Polynomial Regression
5. Lasso and Ridge Regression

6. Logistic regression

7. Artificial Neural Network

8. K-NN classifier
9. Decision tree classification

10. SVM classification


11. K-means clustering

12. Hierarchical clustering


1.1 WHAT IS MACHINE LEARNING?

We live in the age of big data. Earlier, only large companies collected and stored data in big
computer centers. But with personal computers, the internet, and smartphones, everyone has
become both a producer and a consumer of data. Every online purchase, movie rental,
website visit, blog post, or even movement tracked by GPS creates data.

People want personalized products and services, like a supermarket recommending the right
products, or a streaming service suggesting the next movie. To do this, computers need to
detect patterns in data.

Normally, computers solve problems using an algorithm—a step-by-step set of instructions.


For example, sorting numbers has well-defined algorithms. But some problems, like predicting
customer behavior or identifying spam emails, are not so straightforward. We know the input
(an email) and the output (spam or not spam), but there’s no clear rule to transform one into
the other.

Here’s where machine learning (ML) comes in. Instead of manually writing algorithms, we let
computers learn from data. For spam detection, thousands of labeled examples (spam or not
spam) can be used to train a model that learns the underlying patterns.

The idea is not to find a perfect explanation, but a useful approximation. If patterns from past
data remain relevant, the model can make reliable predictions about the future.

When ML methods are applied to huge datasets, it’s called data mining—similar to extracting
valuable minerals from tons of raw material. For example, banks use ML for fraud detection
and credit scoring, manufacturers use it for optimization, doctors for diagnosis, telecom
companies for network improvement, and scientists for analyzing vast datasets in physics,
astronomy, and biology.

But machine learning is more than just handling databases—it’s also a core part of artificial
intelligence (AI). An intelligent system must be able to adapt to changes without programmers
manually coding every possible scenario.

For example, in face recognition, humans can easily recognize friends despite changes in
lighting, hairstyle, or angle, but cannot explain the exact steps. A computer, however, can learn
patterns in face images (like symmetry, position of eyes, nose, mouth) and recognize people
by matching those patterns. This falls under pattern recognition.

In technical terms, machine learning is about programming computers to optimize a performance measure using past data or experience. A model is defined with adjustable parameters, and learning means tuning those parameters with training data to improve performance. Models can be:

• Predictive – to make forecasts.

• Descriptive – to extract knowledge.

• Or both.

ML heavily uses statistics (to infer from samples) and computer science (to design efficient
algorithms that can train on huge datasets and make fast predictions). In some cases, the
speed and efficiency of learning and prediction are just as important as their accuracy.

1.2 Examples of Machine Learning Applications

Machine Learning is used in many real-world applications. Let’s go through some important
ones.

1.2.1 Learning Associations (Association Rules)

In retail, one common application is market basket analysis. This means finding patterns in
products that customers often buy together.

For example:

• If people who buy beer also tend to buy chips, then when a customer buys beer, the
system can recommend chips.

• This is useful for cross-selling, i.e., suggesting additional products a customer is likely to buy.

Mathematically, this is expressed as a conditional probability:

• Example: P(chips | beer) = 0.7 → 70% of people who buy beer also buy chips.
• More advanced systems also consider customer attributes (age, gender, location, etc.)
to make more personalized recommendations.

Applications: online shopping recommendations, book suggestions, or predicting which web links a user is likely to click.
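The conditional-probability view above can be sketched in a few lines of Python. The transaction data and the `confidence` helper are invented for illustration; they are not from any particular library:

```python
# Hypothetical market-basket data: each transaction is a set of items bought.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "soda"},
    {"chips", "salsa"},
    {"beer", "chips"},
]

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from transaction data."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

# 3 of the 4 beer baskets also contain chips -> 0.75
print(confidence(transactions, "beer", "chips"))
```

In association-rule terminology this quantity is called the confidence of the rule "beer → chips".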

1.2.2 Classification and Prediction

In classification problems, we assign data into categories (classes).

Example: Credit Scoring in Banking

• A bank needs to decide whether a loan applicant is low-risk or high-risk.

• Inputs: customer attributes like income, savings, job, age, past financial history, etc.

• Output: a class label (low-risk / high-risk) or a probability of default.

A learned rule might look like:

IF income > θ1 AND savings > θ2 THEN Low-Risk
ELSE High-Risk

This rule is called a discriminant function—it separates data points into different classes.

The main goal: Prediction. Once trained on past data, the model can predict the risk of new
loan applicants.
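A minimal sketch of such a discriminant rule in Python. The threshold values θ1 and θ2 are hard-coded here purely for illustration; in practice they would be learned from past loan data:

```python
# Assumed example thresholds (in practice, learned from training data).
THETA1 = 40_000   # income threshold
THETA2 = 10_000   # savings threshold

def classify_applicant(income, savings):
    """Discriminant rule: IF income > θ1 AND savings > θ2 THEN low-risk."""
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(classify_applicant(50_000, 15_000))  # low-risk
print(classify_applicant(30_000, 15_000))  # high-risk
```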

1.2.3 Pattern Recognition

Machine learning is heavily used in pattern recognition, where the goal is to identify objects,
signals, or structures.

• Optical Character Recognition (OCR): Recognizing characters (like zip codes on mail or
amounts on checks) from printed or handwritten text. Since handwriting varies, ML
learns from examples rather than fixed rules.
• Face Recognition: Input is an image, and the task is to assign it to a person’s identity.
Challenges include lighting, pose, or obstructions (glasses, beard).

• Medical Diagnosis: Input is patient data (age, symptoms, test results), and output is a
disease class. Missing data is common, so models must handle uncertainty.

• Speech Recognition: Converts spoken words into text. Inputs are acoustic signals, and
outputs are words. Accents, tone, and speed make this task complex. ML models often
use language models to improve accuracy.

Applications also extend to natural language processing (NLP)—such as spam filtering, text
summarization, sentiment analysis on social media, and machine translation.

1.2.4 Biometrics

Biometrics is about identifying or authenticating people based on physical or behavioral traits.

Examples:

• Physical traits: face, fingerprint, iris, palm print.

• Behavioral traits: voice, handwriting style, gait, typing rhythm.

Machine learning is used in both individual recognizers (e.g., fingerprint scanner) and in
multimodal systems that combine several inputs for more accurate and secure decisions.

1.2.5 Knowledge Extraction and Compression

Learning rules from data is not only about prediction—it also gives insight.

• Example: If a bank learns which customers are low-risk, it gains knowledge about the
characteristics of safe borrowers. This helps in targeted marketing.

• ML models also perform compression by summarizing large, complex datasets into simple rules or patterns that are easy to understand and use.

Regression

In supervised learning, regression is used when the output is a continuous value. For example,
a dataset of used cars where mileage is taken as the input, and price is the output. A simple
linear model is fitted:
Y = w1x + w0

Here, w1 (slope) and w0 (intercept) are parameters that the algorithm optimizes to minimize
the error between predicted and actual prices.

If a straight line is too restrictive, we can use higher-order models such as:

y = w2x² + w1x + w0

or even more complex nonlinear functions.
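The line fit described above can be sketched with NumPy's `polyfit`, which returns the least-squares slope w1 and intercept w0. The mileage and price figures below are invented for illustration:

```python
import numpy as np

# Hypothetical used-car data: mileage (thousands of km) vs price (thousands).
x = np.array([20, 40, 60, 80, 100], dtype=float)
y = np.array([18, 15, 12, 9, 6], dtype=float)

# Degree-1 polyfit returns [w1, w0] minimizing the squared error.
w1, w0 = np.polyfit(x, y, deg=1)
print(w1, w0)           # slope is negative: price drops as mileage grows

# Predict the price of a car with 70k km on it.
print(w1 * 70 + w0)     # → 10.5
```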

Regression is also widely used in optimization tasks. For example, when designing a coffee-
roasting machine, inputs like temperature, roasting time, and bean type affect the taste. By
experimenting with different settings and recording consumer feedback, we can build a
regression model to predict coffee quality. The machine can then iteratively adjust its settings
toward the best configuration. This approach is called response surface design.

Another application is in robotics, such as autonomous driving. The car’s sensors (camera,
GPS, lidar, etc.) provide inputs, and regression predicts outputs like the steering angle needed
to stay on the road without hitting obstacles.

Ranking

Sometimes, instead of predicting exact values, we want to learn relative preferences. This is
called ranking. For example, in a movie recommendation system, we want to order films based
on how much a user is likely to enjoy them. Using attributes like genre, actors, and past user
ratings, a model can learn a ranking function that helps suggest new movies.

Unsupervised Learning

Unlike supervised learning, unsupervised learning has no labeled outputs. The goal is to
discover hidden patterns and structures in the data. This is also known as density estimation.

A common method is clustering, which groups similar inputs together.

Customer Segmentation: Companies analyze customer demographics and purchase history to group similar customers. This helps design targeted products, services, and marketing strategies. It can also identify outliers (unique customers), which may reveal new business opportunities.
Image Compression: Pixels with similar colors can be grouped together. Instead of storing
millions of color shades, we can store only a smaller set of representative colors, saving
storage and transmission space.

Document Clustering: News articles can be grouped into categories such as politics, sports, or
fashion by comparing the words they contain. A predefined lexicon (word list) is used to
represent documents as vectors.

Clustering is also critical in bioinformatics. DNA and proteins are represented as sequences.
By clustering recurring subsequences (called motifs), scientists can discover structural or
functional elements in proteins.

Reinforcement Learning

In reinforcement learning (RL), the goal is not to predict values or labels but to learn a policy—
a sequence of actions that maximizes long-term reward.

Game Playing: In chess, a single move is not important by itself. What matters is whether it
contributes to a winning strategy. RL algorithms learn such strategies by trial and error.

Robotics: A robot navigating a room must learn the correct sequence of moves to reach a
target while avoiding obstacles.

Challenges: RL becomes harder when the agent has incomplete or uncertain information. For
example, a robot with only a camera may know there is a wall nearby but not its exact location.
Multi-agent RL is another challenge, such as a team of robots coordinating to play soccer.
Feature Scaling

Introduction

Feature scaling is a data preprocessing technique used in machine learning to bring all the
features (independent variables) of a dataset onto a comparable scale.

• Many machine learning algorithms are sensitive to the magnitude of features.

• If features are measured in different units (e.g., age in years, income in lakhs, height in
cm), variables with larger values can dominate the model.

• To prevent bias and improve performance, we rescale or normalize features without distorting differences in their ranges or distributions.

Why Feature Scaling is Important

1. Equal Importance of Features: Prevents features with large values from dominating.

2. Faster Convergence: Gradient descent converges faster when features are scaled.

3. Better Distance Calculations: Algorithms like K-Means, KNN, and SVM rely on distance
metrics (Euclidean, Manhattan) that are sensitive to scale.

4. Improved Model Accuracy: Scaling often leads to better model performance.

5. Handles Units of Measurement: Brings different units (kg, cm, INR) to a common scale.

Types of Feature Scaling

1. Min–Max Normalization

• Rescales data to a fixed range [0,1] (or [-1,1]).

• Formula: x' = (x − x_min) / (x_max − x_min)

• Example: If salary ranges from ₹10,000 to ₹1,00,000, then salary = ₹50,000 → normalized value = (50,000 − 10,000) / (1,00,000 − 10,000) ≈ 0.44.

• Use case: Neural networks, deep learning, where values must be bounded.

2. Standardization (Z-Score Normalization)

• Transforms data so that it has mean = 0 and standard deviation = 1.

• Formula: z = (x − μ) / σ, where μ = mean, σ = standard deviation.

• Example: If student heights have μ = 160 cm and σ = 10 cm, a height of 170 cm → z-score = (170 − 160) / 10 = 1.

• Use case: Regression, SVM, PCA, clustering.

3. Robust Scaling

• Uses median and interquartile range (IQR) instead of mean and standard deviation.

• Formula: x' = (x − median) / IQR

• Useful when dataset contains outliers.

• Use case: Financial data, medical data with extreme values.


4. Unit Vector Scaling (Normalization to unit length)

• Scales each feature vector to have length = 1.

• Formula: x' = x / ‖x‖ (divide each sample vector by its Euclidean length)

• Commonly used in text mining and NLP (TF–IDF vectors).
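The four scalers above can be tried side by side with scikit-learn. The toy age/salary matrix below is an assumption for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, normalize

# Toy feature matrix: one column per feature (age in years, salary in rupees).
X = np.array([[25,  50_000],
              [50, 100_000],
              [35,  20_000],
              [45,  80_000]], dtype=float)

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column to mean 0, std 1
print(RobustScaler().fit_transform(X))    # (x - median) / IQR, per column
print(normalize(X))                       # each ROW scaled to unit length
```

Note that the first three scalers operate per column (per feature), while `normalize` operates per row (per sample), matching the unit-vector scaling described above.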

When Feature Scaling is Needed

Required for:

• Gradient Descent–based algorithms (Logistic Regression, Neural Networks).

• Distance-based algorithms (KNN, K-Means, SVM, PCA).

Not required for:

• Decision Trees, Random Forests, Gradient Boosted Trees (scale-invariant algorithms).

Example (Before vs After Scaling)

Person     Age (Years)   Salary (₹)   After Scaling (0–1)
Person A   25            50,000       (0.2, 0.44)
Person B   50            1,00,000     (0.8, 1.0)

Without scaling, salary dominates. With scaling, both features contribute equally.

Practical Considerations

• Fit the scaler on the training set only (after the train-test split) and apply it to the test set, to prevent data leakage.

• Different scaling for different models: Try both Min–Max and Standardization.

• Pipeline Integration: In Python’s scikit-learn, use StandardScaler, MinMaxScaler, or RobustScaler.
If features are in different units (e.g., age in years, salary in lakhs), the feature with larger values can dominate; standardization ensures all features contribute fairly to the model.
FEATURE SELECTION METHODS

Feature selection is a dimensionality reduction technique used in machine learning to select the most relevant input variables (features) for building efficient models.

• In real-world datasets, not all features contribute equally: some may be redundant, irrelevant, or noisy.

• Feature selection improves model performance, training speed, interpretability, and generalization.

It is different from feature extraction:

• Feature Selection → choose a subset of existing features.

• Feature Extraction → create new features by transforming existing ones (e.g., PCA).

Importance of Feature Selection

1. Reduces Overfitting – by eliminating irrelevant features, models generalize better.

2. Improves Accuracy – focuses on the most informative attributes.

3. Faster Training & Prediction – smaller datasets reduce computation.

4. Better Interpretability – selected features provide more meaningful insights.

5. Handles Curse of Dimensionality – useful in high-dimensional datasets (e.g., text, genomics).

Categories of Feature Selection Methods

Feature selection methods are broadly classified into three categories:

1. Filter Methods

• Apply statistical tests to rank features independent of the model.

• Fast and scalable for high-dimensional data.

(a) Correlation Coefficient

• Measures linear relationship between feature x and target y.


• High correlation → feature is useful.

• Example: Pearson correlation.

(b) Chi-Square Test (χ² test)

• Used for categorical features.

• Checks independence between feature and class label.

• Example: In spam detection, word occurrence frequency vs spam/ham classification.


(c) Information Gain / Mutual Information

• Measures reduction in uncertainty about the target when a feature is known.

• Common in decision trees and text classification.

Advantages: Simple, fast, independent of learning algorithm.

Disadvantages: Ignores feature interactions, may select redundant features.
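A filter method can be sketched with scikit-learn's `SelectKBest`, here ranking the four iris features by mutual information with the class label and keeping the top two. The choice of dataset, scoring function, and k = 2 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the target independently of any model
# (a filter method), then keep the top 2.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.get_support())     # boolean mask of the kept features
```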


2. Wrapper Methods

• Use a predictive model to evaluate subsets of features.

• Iteratively select features that improve model performance.

(a) Forward Selection

• Start with no features → add features one by one → keep the one that improves
performance most.

(b) Backward Elimination

• Start with all features → remove least useful one step by step.

(c) Recursive Feature Elimination (RFE)

• Trains a model (e.g., SVM, Logistic Regression), ranks features by importance, removes
weakest features iteratively.

Advantages: Considers interactions between features, often better performance.


Disadvantages: Computationally expensive, slower for large datasets.
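Recursive Feature Elimination can be sketched with scikit-learn's `RFE`. The dataset, estimator, and the choice of keeping 5 features are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scaling helps logistic regression converge

# Fit the model, drop the weakest feature (by coefficient magnitude),
# and repeat until only 5 features remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)

print(rfe.n_features_)    # 5
print(rfe.get_support())  # mask of the selected features
print(rfe.ranking_)       # rank 1 = selected; higher = eliminated earlier
```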

3. Embedded Methods

• Perform feature selection during model training.

• Combine the advantages of filter and wrapper methods.

(a) LASSO (L1 Regularization)

• Shrinks less important feature coefficients to zero, effectively performing selection.

• Useful in linear regression, logistic regression.


(b) Ridge Regression (L2 Regularization)

• Penalizes large coefficients but does not eliminate features completely.

(c) Tree-Based Feature Importance

• Decision Trees, Random Forests, and Gradient Boosted Trees assign feature
importance scores.

• Features with higher importance contribute more to prediction.

Advantages: Efficient, balances bias and variance, works well with high-dimensional data.
Disadvantages: Algorithm-dependent, less interpretable in complex models.
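LASSO-based selection can be sketched on a synthetic dataset where only some features actually drive the target; the data shape and alpha value are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which actually influence y.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# The L1 penalty shrinks uninformative coefficients exactly to zero;
# the surviving non-zero coefficients are the "selected" features.
lasso = Lasso(alpha=5.0).fit(X, y)

print(lasso.coef_)
print("selected feature indices:", np.flatnonzero(lasso.coef_))
```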

Comparison of Methods

Method Type   Examples                     Speed    Accuracy   Suitable For
Filter        Correlation, χ², Info Gain   Fast     Moderate   High dimensions (text, genomics)
Wrapper       Forward, Backward, RFE       Slow     High       Small/medium datasets
Embedded      LASSO, Decision Trees, RF    Medium   High       Large datasets, regularized models

Dimensionality Reduction: Principal Component Analysis (PCA)

In real-world datasets, we often deal with high-dimensional data (many features). High
dimensionality creates several issues:

• Curse of Dimensionality – data becomes sparse in high-dimensional space.

• High Computation Cost – training ML models takes more time.

• Overfitting – irrelevant/redundant features degrade model accuracy.

Dimensionality Reduction is the process of reducing the number of input variables while
preserving as much information as possible.

Two common approaches are:

• Feature Selection → choose a subset of existing features.

• Feature Extraction → create new features by transforming original ones (PCA, LDA).
Principal Component Analysis (PCA) is the most widely used feature extraction technique.

What is PCA?

PCA (Principal Component Analysis) is a dimensionality reduction technique used in data analysis and machine learning. It reduces the number of features in a dataset while keeping the most important information. It transforms the original features into new, uncorrelated features; the first few of these capture most of the important variation in the original data.

PCA is commonly used to preprocess data for machine learning algorithms. It helps remove redundancy, improve computational efficiency, and make data easier to visualize and analyze, especially when dealing with high-dimensional data.

How Principal Component Analysis Works?

PCA uses linear algebra to transform data into new features called principal components. It finds these by calculating eigenvectors (directions) and eigenvalues (importance) from the covariance matrix, then selects the top components with the highest eigenvalues and projects the data onto them to simplify the dataset.

Note: It prioritizes the directions where the data varies the most because more variation =
more useful information.

Imagine you’re looking at a messy cloud of data points like stars in the sky and want to simplify
it. PCA helps you find the "most important angles" to view this cloud so you don’t miss the big
patterns. Here’s how it works step by step:

Step 1: Standardize the Data

Different features may have different units and scales, like salary vs. age. To compare them fairly, PCA first standardizes the data by making each feature have:

• A mean of 0

• A standard deviation of 1
Step 2: Calculate Covariance Matrix

Next, PCA calculates the covariance matrix to see how features relate to each other (whether they increase or decrease together). The covariance between two features x1 and x2 is:

cov(x1, x2) = (1 / (n − 1)) Σᵢ (x1ᵢ − x̄1)(x2ᵢ − x̄2)

The value of covariance can be positive, negative, or zero.

Step 3: Find the Principal Components

PCA identifies new axes where the data spreads out the most:

• 1st Principal Component (PC1): The direction of maximum variance (most spread).

• 2nd Principal Component (PC2): The next best direction, perpendicular to PC1 and so
on.

These directions come from the eigenvectors of the covariance matrix, and their importance is measured by eigenvalues. For a square matrix A, an eigenvector X (a non-zero vector) and its corresponding eigenvalue λ satisfy:

AX = λX

This means:

• When A acts on X, it only stretches or shrinks X by the scalar λ.

• The direction of X remains unchanged; hence eigenvectors define "stable directions" of A.

Eigenvalues help rank these directions by importance.

Step 4: Pick the Top Directions & Transform Data

After calculating the eigenvalues and eigenvectors PCA ranks them by the amount of
information they capture. We then:

1. Select the top k components that capture most of the variance (e.g., 95%).

2. Transform the original dataset by projecting it onto these top components.

This means we reduce the number of features (dimensions) while keeping the important
patterns in the data.
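The four steps above can be sketched directly with NumPy; the correlated two-feature toy data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-feature data with correlated columns (e.g. height and weight).
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]])

# Step 1: standardize (mean 0, std 1 per feature).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data.
C = np.cov(Xs, rowvar=False)

# Step 3: eigenvectors (directions) and eigenvalues (variance captured).
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]      # re-sort descending by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the top k = 1 component and project the data onto it.
W = eigvecs[:, :1]
X_reduced = Xs @ W

print(eigvals / eigvals.sum())  # fraction of variance per component
print(X_reduced.shape)          # (100, 1)
```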

[Figure: a 2D dataset with features "Radius" and "Area" (black axes), with principal components PC₁ and PC₂ overlaid; the task is to transform the 2D data into a 1D representation while preserving as much variance as possible.]

In this figure, the original dataset has two features, "Radius" and "Area", represented by the black axes. PCA identifies two new directions, PC₁ and PC₂, which are the principal components.

• These new axes are rotated versions of the original ones. PC₁ captures the maximum
variance in the data meaning it holds the most information while PC₂ captures the
remaining variance and is perpendicular to PC₁.
• The spread of data is much wider along PC₁ than along PC₂. This is why PC₁ is chosen
for dimensionality reduction. By projecting the data points (blue crosses) onto PC₁ we
effectively transform the 2D data into 1D and retain most of the important structure
and patterns.

Why is PCA Useful?

For example, instead of comparing students on two subjects (say, English and History), we can use a single number (the PC1 score) that summarizes their performance. This is simpler for analysis, ranking, or visualization: PC1 scores can be used to rank students or group them into performance categories.

Implementation of Principal Component Analysis in Python

Hence PCA uses a linear transformation that is based on preserving the most variance in the
data using the least number of dimensions. It involves the following steps:

Step 1: Importing Required Libraries

We import the necessary libraries, such as pandas, numpy, scikit-learn, seaborn, and matplotlib, to process data and visualize results.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Creating Sample Dataset

We make a small dataset with three features (Height, Weight, Age) and a Gender label.
data = {
    'Height': [170, 165, 180, 175, 160, 172, 168, 177, 162, 158],
    'Weight': [65, 59, 75, 68, 55, 70, 62, 74, 58, 54],
    'Age': [30, 25, 35, 28, 22, 32, 27, 33, 24, 21],
    'Gender': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # 1 = Male, 0 = Female
}

df = pd.DataFrame(data)

print(df)

Output: the 10-row DataFrame.

Step 3: Standardizing the Data

Since the features have different scales Height vs Age we standardize the data. This makes all
features have mean = 0 and standard deviation = 1 so that no feature dominates just because
of its units.

X = df.drop('Gender', axis=1)

y = df['Gender']

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

Step 4: Applying PCA algorithm

• We reduce the data from 3 features to 2 new features called principal components.
These components capture most of the original information but in fewer dimensions.

• We split the data into 70% training and 30% testing sets.

• We train a logistic regression model on the reduced training data and predict gender
labels on the test set.

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

Step 5: Evaluating with Confusion Matrix

The confusion matrix compares actual vs predicted labels. This makes it easy to see where
predictions were correct or wrong.

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(5, 4))

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Female', 'Male'], yticklabels=['Female', 'Male'])

plt.xlabel('Predicted Label')

plt.ylabel('True Label')

plt.title('Confusion Matrix')

plt.show()

Output: a heatmap of the confusion matrix.

Advantages of Principal Component Analysis

1. Multicollinearity Handling: Creates new, uncorrelated variables to address issues when original features are highly correlated.

2. Noise Reduction: Eliminates components with low variance, enhancing data clarity.

3. Data Compression: Represents data with fewer components, reducing storage needs and speeding up processing.

4. Outlier Detection: Identifies unusual data points by showing which ones deviate significantly in the reduced space.

Disadvantages of Principal Component Analysis

1. Interpretation Challenges: The new components are combinations of original variables, which can be hard to explain.
2. Data Scaling Sensitivity: Requires proper scaling of data before application or results
may be misleading.

3. Information Loss: Reducing dimensions may lose some important information if too
few components are kept.

4. Assumption of Linearity: Works best when relationships between variables are linear
and may struggle with non-linear data.

5. Computational Complexity: Can be slow and resource-intensive on very large datasets.

6. Risk of Overfitting: Using too many components or working with a small dataset might
lead to models that don't generalize well.

Objectives of PCA

1. Reduce the number of dimensions (features).

2. Minimize information loss.

3. Remove correlation among features.

4. Improve visualization of high-dimensional data.

5. Speed up machine learning algorithms.

Mathematical Working of PCA

Step 1: Standardize the Data

• Scale features so that they have mean = 0 and variance = 1.

• Necessary because PCA is sensitive to different feature scales.

Step 2: Compute the Covariance Matrix

• Measures how features vary with respect to each other.

• For a standardized data matrix X with m samples and n features, the n × n covariance matrix is:

C = (1 / (m − 1)) XᵀX
Step 3: Calculate Eigenvalues and Eigenvectors

• Eigenvectors represent directions of principal components.

• Eigenvalues represent the amount of variance captured by each principal component.

Step 4: Sort Eigenvalues and Select Top k Components

• Arrange eigenvalues in descending order.

• Select top k eigenvectors corresponding to the largest eigenvalues.

Step 5: Transform the Data

• Project original dataset onto new subspace:

Xnew = X⋅W

where W = matrix of top k eigenvectors.


Example

Suppose we have two features: Height (cm) and Weight (kg).

• They are correlated → taller people usually weigh more.

• PCA finds a new axis (principal component) that represents “overall body size.”

• Instead of using both Height & Weight, we can reduce to 1 dimension (PC1) with
minimal loss of information.

Applications of PCA

1. Data Compression – reduce storage and computation.

2. Noise Filtering – ignore components with very low variance.


3. Visualization – reduce high-dimensional data to 2D/3D.

4. Preprocessing – before clustering (K-Means), classification (SVM, Logistic Regression).

5. Image Recognition – eigenfaces for face recognition.

6. Genomics & Bioinformatics – analyzing gene expression data.

SIMPLE LINEAR REGRESSION

Simple Linear Regression is the most basic statistical method used for prediction. It is applied
when we want to study the relationship between a single predictor (independent variable, X)
and a quantitative response (dependent variable, Y).

The key assumption is that the relationship between X and Y is approximately linear.
3.1.2 Assessing Accuracy of Coefficient Estimates

So far, we fit a line assuming the relationship between X and Y is exactly linear.
But in reality the model includes an error term:

Y = β₀ + β₁X + ϵ

where:

• β₀, β₁: intercept and slope coefficients.

• ϵ: Error term (captures everything the model cannot explain).

Why do we need ϵ?

• The true relationship is not perfectly linear.

• Other factors (not included in the model) may influence Y.

• There could be measurement errors in data collection.

Key Assumption:
We usually assume ϵ is independent of X and has mean zero.

This means errors are random noise, not systematic bias.

The fit is summarized by the R² statistic:

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

Where,

yᵢ = actual value,

ŷᵢ = predicted value from regression,

ȳ = mean of the actual values.


In simple linear regression, the R² statistic is equivalent to the squared correlation between
the predictor and the response. However, this relationship does not extend naturally to
multiple linear regression, where several predictors are used simultaneously. Since
correlation only measures the relationship between two variables, it cannot capture the
combined effect of multiple predictors. In this setting, R² serves as the appropriate measure
of how well the predictors, taken together, explain the variability in the response.
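This equivalence is easy to check numerically (a small sketch using the same made-up data as the program below):

```python
import numpy as np

# Made-up simple linear regression data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float)

# Least-squares fit of y = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R^2 = 1 - RSS/TSS
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

# Squared correlation between predictor and response
r_squared_corr = np.corrcoef(x, y)[0, 1] ** 2

print(r2, r_squared_corr)  # the two values agree
```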

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# Dataset
X = np.array([1,2,3,4,5,6,7,8,9,10]).reshape(-1,1)  # Feature
y = np.array([2,4,5,4,5,7,8,9,10,12])               # Target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- sklearn Linear Regression ---
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Evaluation Metrics:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R² Score: {r2:.2f}")

# --- statsmodels for p-value ---
X2 = sm.add_constant(X)  # add intercept term
ols_model = sm.OLS(y, X2).fit()
print("\nOLS Regression Results:")
print(ols_model.summary())  # includes coefficients, p-values, R², etc.

# Plot results
plt.scatter(X, y, color="blue", label="Actual Data")
plt.plot(X, model.predict(X), color="red", label="Regression Line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

OUTPUT

MSE = 0.80

RMSE = 0.89

MAE = 0.74

R² Score = 0.91

• Intercept (β₀) = 1.0667 (p = 0.081 → not statistically significant at 5%)

• Slope (β₁ for X) appears next in the summary with a very small p-value (strong
evidence that X significantly predicts y).

• R² = 0.945 (the model explains ~94.5% of the variance in y).

• Prob (F-statistic) = 2.63e-06 (model is highly significant overall).

Note: the first four metrics are computed on the held-out test set, while the OLS summary
is fitted on the full dataset, which is why the two R² values differ.

OLS stands for Ordinary Least Squares. It’s the most common method used to estimate the
coefficients (β₀, β₁, …) in linear regression.
Why OLS is important

• It gives estimates of intercept and slopes.


• It provides statistical tests (t-tests, p-values, confidence intervals).
• It gives overall model significance (F-test, R²).

When OLS is not ideal

OLS relies on some assumptions. If these assumptions fail, OLS can give biased, inefficient, or
misleading results.

Linearity assumption: OLS assumes the relationship between X and y is linear. If the true
relationship is nonlinear, OLS won’t capture it well.

Multicollinearity (predictors correlated): If predictors are highly correlated, OLS coefficients
become unstable.

Outliers: OLS minimizes squared errors, so it is very sensitive to outliers (a single extreme
point can distort the regression line).

Homoscedasticity assumption: OLS assumes the variance of errors is constant. If variance
changes (heteroscedasticity), OLS standard errors and p-values become unreliable.

Normality assumption of errors: For hypothesis testing, OLS assumes errors are normally
distributed. If not, p-values may be inaccurate.

Alternatives to OLS

Depending on the problem, here are some better approaches:

1. Ridge Regression (L2 Regularization)

• Penalizes large coefficients.


• Helps when predictors are multicollinear.
• Shrinks coefficients but doesn’t set them to zero.

2. Lasso Regression (L1 Regularization)

• Like Ridge, but also performs feature selection (sets some coefficients to zero).
• Useful when you have many predictors and want to identify the important ones.

3. Elastic Net

• Combination of Ridge + Lasso.


• Useful when predictors are many and correlated.

4. Robust Regression

• Example: Huber Regression, RANSAC Regression.


• Less sensitive to outliers compared to OLS.

5. Generalized Linear Models (GLM)

• If y is not continuous (e.g., binary, count data), use alternatives:


o Logistic Regression (binary outcomes).
o Poisson Regression (count data).

6. Non-linear Regression / Machine Learning models

• If relationship isn’t linear:


o Decision Trees, Random Forest, Gradient Boosting.
o Neural Networks.

MULTIPLE LINEAR REGRESSION

Simple linear regression models the response using a single predictor. In practice, we often
have multiple predictors. For example, in the Advertising dataset, sales were modeled using
TV advertising. But we also have data on radio and newspaper advertising, and we want to
know how these media are related to sales.

A naive approach is to fit separate simple regressions for each predictor (e.g., sales vs. radio,
sales vs. newspaper). For instance:

• A $1,000 increase in radio advertising is associated with ~203 units higher sales.
• A $1,000 increase in newspaper advertising is associated with ~55 units higher sales.

With p predictors, the multiple linear regression model is:

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βpXp + ϵ

Simple Regression Results

From the Advertising data, we fit separate simple linear regressions:

• Sales vs. Radio Advertising


o Intercept = 9.312
o Radio coefficient = 0.203
o Interpretation: A $1,000 increase in radio advertising is associated with ~203
additional sales units (since sales are measured in thousands).
• Sales vs. Newspaper Advertising
o Intercept = 12.351
o Newspaper coefficient = 0.055
o Interpretation: A $1,000 increase in newspaper advertising is associated with
~55 additional sales units.

These results show that both radio and newspaper advertising are positively associated with
sales, but radio has a stronger effect.

MULTIPLE LINEAR REGRESSION MODEL

To analyze the effects of TV, radio, and newspaper advertising simultaneously, we use the
multiple regression model

sales = β₀ + β₁ × TV + β₂ × radio + β₃ × newspaper + ϵ

Multiple Regression Coefficient Estimates

Example: Advertising Data

When TV, radio, and newspaper budgets are used together to predict sales, we obtain the
following insights:

• For a fixed level of TV and newspaper spending, increasing radio advertising by
$1,000 increases sales by about 189 units.
• The coefficients for TV and radio are quite similar to their simple regression
counterparts.
• However, for newspaper, the coefficient estimate drops close to zero, and the p-value
(~0.86) indicates no statistical significance.

This shows that simple and multiple regression results can differ substantially.

Why Do Results Differ?

• In simple regression, the slope of newspaper spending measures its association with
sales ignoring TV and radio.
• In multiple regression, the newspaper coefficient measures the effect of newspaper
advertising while holding TV and radio fixed.

In the Advertising dataset:

• Radio and newspaper are correlated (r=0.35).


• If radio advertising drives sales, and newspaper spending tends to rise in markets
where radio spending is also higher, then a simple regression of sales vs. newspaper
will show a positive relationship — even if newspaper has no direct effect on sales.
• In this case, newspaper is acting as a surrogate for radio, “borrowing credit” for radio’s
effect.
General Insight

This phenomenon is very common: when predictors are correlated, simple regression can give
misleading results.

Example (Analogy):

• Suppose we regress shark attacks on ice cream sales at a beach. We would see a
positive relationship.
• But the true driver is temperature: hot weather → more people at the beach → more
ice cream sales and more shark attacks.
• Once we include temperature in a multiple regression, the effect of ice cream
disappears — correctly showing no direct relationship.
The Hat Matrix

• In matrix form, the fitted values are ŷ = Hy, where H = X(XᵀX)⁻¹Xᵀ is called the hat
matrix.
• The Hat Matrix “projects” the observed values y onto the fitted regression line/plane.
• The coefficients β are the "weights" that determine the regression line, while H is the
geometric operator that uses those weights to produce predictions.

Is There a Relationship Between the Response and Predictors?

The first step in regression analysis is to ask:

Do the predictors, taken together, explain variation in the response variable?

This is addressed using hypothesis testing.

Hypotheses

We test whether all regression coefficients (except the intercept) are zero:

H₀: β₁ = β₂ = ⋯ = βₖ = 0   versus   H₁: at least one βⱼ ≠ 0

This is tested with the F-statistic. Worked example:
Suppose you fit a multiple linear regression model with 2 predictors (k=2) and a sample size
of 20 (n=20). After fitting the model, you have:

• SSR (Sum of Squares Regression) = 50

• SSE (Sum of Squares Error) = 30

Step 1: Calculate Degrees of Freedom

• dfreg = k = 2

• dfres= n – k – 1 = 20 – 2 – 1 = 17

Step 2: Calculate Mean Squares

• MSR = SSR / dfreg = 50 / 2 = 25

• MSE = SSE / dfres = 30 / 17 ≈ 1.76

Step 3: Calculate the F-statistic

• F = MSR / MSE = 25 / 1.76 ≈ 14.17

Comparing F ≈ 14.17 with the F-distribution with (2, 17) degrees of freedom gives a very
small p-value, so the predictors, taken together, explain a significant amount of the
variation in the response.

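The numbers in this worked example (SSR = 50, SSE = 30, k = 2, n = 20) can be checked with SciPy, where the survival function of the F-distribution gives the p-value:

```python
from scipy import stats

ssr, sse = 50.0, 30.0   # regression and error sums of squares
k, n = 2, 20            # number of predictors, sample size

df_reg = k              # 2
df_res = n - k - 1      # 17

msr = ssr / df_reg      # mean square regression
mse = sse / df_res      # mean square error
f_stat = msr / mse      # F-statistic

# p-value: area in the upper tail of F(df_reg, df_res)
p_value = stats.f.sf(f_stat, df_reg, df_res)
print(f"F = {f_stat:.2f}, p = {p_value:.5f}")
```

The tiny p-value means we reject H₀ and conclude that at least one predictor is related to the response.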
What is a p-value (p-test) in hypothesis testing?

A p-value is a statistical measure that helps you decide whether your observed results are
likely under a given null hypothesis. In other words, it tells you how probable your data would
be if the null hypothesis—typically that there is no effect or no difference—were true.
• Low p-value (≤ 0.05): Strong evidence against the null hypothesis; you reject the null
and accept the alternative hypothesis.

• High p-value (> 0.05): Weak evidence against the null hypothesis; you fail to reject the
null hypothesis.

Example

Suppose you test whether a new drug lowers blood pressure compared to a placebo. After
gathering your data and performing a statistical test (say, a t-test), you get a p-value of 0.03.

• Interpretation: There is a 3% probability that the observed difference (or a more


extreme one) would occur if there were really no difference between the drug and
placebo groups (i.e., if the null hypothesis were true).

• Decision: Because 0.03 < 0.05, you have statistically significant evidence to reject the
null hypothesis and conclude that the drug likely has an effect

What is a t-test?

A t-test is a statistical hypothesis test used to determine if there is a significant difference
between the means of two groups. It helps assess whether observed differences in sample
data are likely due to chance or whether they reflect a true difference in the population.

Key points:

• Used for comparing the means of two groups.

• Works when the data is approximately normally distributed and the population
variance is unknown.

• Often used for small sample sizes (typically n ≤ 30).

• Types include the independent two-sample t-test (comparing different groups), paired
t-test (comparing related groups), and one-sample t-test (comparing a sample mean
against a known value).

Example of a t-test

Suppose you want to know if a new teaching method affects student test scores.
• Group 1: 10 students using the traditional method; scores: 85, 90, 88, 75, 95, 80, 70,
85, 78, 92

• Group 2: 10 students using the new method; scores: 88, 91, 94, 79, 99, 87, 82, 90, 85,
95

You perform a t-test to compare the means of the two groups:

1. Null hypothesis (H₀): There is no difference in the mean test scores between the two
groups.

2. Alternative hypothesis (H₁): There is a difference in means between the groups.

3. Calculate the t-statistic using the formula (independent two-sample case, with sample
means x̄₁, x̄₂, variances s₁², s₂², and sizes n₁, n₂):

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

A t-test determines if the difference between two group means is statistically significant or
could have happened by random chance. For a one-sample test, t = (x̄ − μ) / (s/√n), with
degrees of freedom df = n − 1.

4. Find the p-value by comparing the calculated t-statistic to the t-distribution with df
degrees of freedom. You use a t-distribution table or statistical software:

• For a two-tailed test, the p-value = 2 × area in the tail beyond the absolute value of the
t-statistic.

• For a one-tailed test, the p-value = area in the tail beyond the calculated t-statistic.

The p-value represents the probability of observing a t-statistic as extreme as (or more
extreme than) the one calculated if the null hypothesis is true.
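The teaching-method example above can be run directly with SciPy's independent two-sample t-test (scores copied from the example):

```python
from scipy import stats

# Test scores from the teaching-method example
traditional = [85, 90, 88, 75, 95, 80, 70, 85, 78, 92]
new_method = [88, 91, 94, 79, 99, 87, 82, 90, 85, 95]

# Independent two-sample t-test (H0: equal means)
result = stats.ttest_ind(traditional, new_method)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")

# Decision at the 5% significance level
if result.pvalue <= 0.05:
    print("Reject H0: the means differ significantly")
else:
    print("Fail to reject H0: the difference could be due to chance")
```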

Example

Say your sample mean x̄ = 8, hypothesized mean μ = 9, sample standard deviation s = 2, and
sample size n = 31.

Calculate:

t = (x̄ − μ) / (s / √n) = (8 − 9) / (2 / √31) ≈ −2.78, with df = n − 1 = 30

Comparing |t| = 2.78 with the t-distribution at 30 degrees of freedom gives a two-tailed
p-value below 0.01, so the null hypothesis is rejected at the 5% level.
DECIDING ON IMPORTANT VARIABLES (VARIABLE SELECTION)

After we confirm (via the F-test) that some predictors matter, the next step is figuring out
which predictors are truly important.

Trying all subsets is impractical, so we use automated methods:

(a) Forward Selection

• Start with the null model (intercept only).


• Add the predictor that gives the biggest improvement (lowest RSS, highest
significance).
• Keep adding predictors one by one.
• Stop when no variable improves the model significantly.

Advantage: Always works, even when p>n.

Disadvantage: Greedy — may include a variable early that later becomes redundant.

(b) Backward Selection

• Start with all predictors in the model.


• Remove the predictor with the largest p-value (least significant).
• Refit the model and repeat until all remaining predictors are statistically significant
(below a threshold).
Advantage: Systematic pruning of unimportant variables.

Limitation: Cannot be used if p>n (more predictors than data points).

(c) Mixed (Stepwise) Selection

• A combination of forward and backward approaches.


• Start like forward selection: add the most useful predictor first.
• But after adding, check all included predictors:
o If any variable has become insignificant (p-value too high), remove it.
• Continue this “add + check + possibly remove” cycle until the final model has only
significant predictors.

Advantage: More flexible — avoids keeping redundant predictors.
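Forward selection can be sketched with scikit-learn's SequentialFeatureSelector, which greedily adds one predictor at a time exactly as described above (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 candidate predictors, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Greedy forward selection of 3 predictors using cross-validated fit
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,
                                direction="forward")
sfs.fit(X, y)
print("Selected feature indices:", np.flatnonzero(sfs.get_support()))
```

Setting direction="backward" gives backward selection instead; note that backward selection requires more observations than predictors, matching the p > n limitation mentioned above.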

3. Model Fit

Once we choose a subset of predictors, we evaluate how well the model fits the data. Two
common measures are:

1. Residual Standard Error (RSE):


o A measure of the typical size of prediction errors.
o Smaller RSE = better fit.

2. R2 (Coefficient of Determination):
o Fraction of variance in the response explained by the model.
o Formula: R² = 1 − RSS/TSS, where RSS = Σ(yᵢ − ŷᵢ)² and TSS = Σ(yᵢ − ȳ)².

4. Predictions

Once the model is fit, we can use it to predict outcomes. But predictions carry three kinds of
uncertainty:

1. Uncertainty in coefficient estimates (reducible error):


o The regression plane we estimate is only an approximation of the true plane.
o We use confidence intervals to quantify how close our estimated mean
prediction Y^ is to the true mean response f(X).
2. Model bias:
o The true relationship might not be linear.
o Even with perfect estimation, a linear model is just the best linear
approximation.
o This is another form of reducible error.
3. Random error in individual outcomes (irreducible error):
o Even if we knew the true regression plane f(X), individual responses Y vary
randomly.
o This randomness limits prediction accuracy.

Confidence Interval vs Prediction Interval

• Confidence Interval (CI):


o Quantifies uncertainty about the average response.
o Example: If TV = $100k and Radio = $20k, the 95% CI for mean sales = [10,985,
11,528].
o Interpretation: If we repeated the study many times, 95% of such intervals
would contain the true mean sales.
• Prediction Interval (PI):
o Quantifies uncertainty about the response for a specific observation.
o Example: For the same ad spend, 95% PI for sales = [7,930, 14,580].
o Always wider than CI because it includes both reducible error (model
uncertainty) and irreducible error (randomness).

Both intervals are centered at the same predicted value (11,256), but PI is wider to reflect
the extra uncertainty.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Example dataset (multiple features)
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [2, 1, 3, 4, 0, 5, 2, 6, 3, 7],
    "y": [5, 6, 7, 10, 8, 13, 9, 15, 11, 18]
}

df = pd.DataFrame(data)

# Features and target
X = df[["x1", "x2"]]
y = df["y"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Regression Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

print("\nEvaluation Metrics:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R² Score: {r2:.2f}")

OUTPUT

Regression Coefficients: [1.44 0.78]

Intercept: 1.36

Evaluation Metrics:

Mean Squared Error (MSE): 0.39

Root Mean Squared Error (RMSE): 0.63

Mean Absolute Error (MAE): 0.56

R² Score: 0.98
GRADIENT DESCENT

Gradient Descent is an optimization algorithm widely used in machine learning to minimize
a cost function (also called loss function). The idea is to iteratively adjust model parameters
(weights) so that the error between predicted and actual values is minimized.

How it works

1. Start with an initial guess for parameters (random values).


2. Compute the gradient of the cost function → this tells us the slope or direction of
steepest increase.
3. Update the parameters in the opposite direction of the gradient (steepest descent).

Repeat until convergence (when changes are very small or cost stops decreasing).

EXAMPLE: CALCULATING GRADIENT DESCENT
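Since the worked example is not reproduced here, the following is our own minimal sketch (illustrative numbers): batch gradient descent fitting y ≈ w·x + b by minimizing the mean squared error, following the three steps above.

```python
import numpy as np

# Illustrative data: roughly y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

w, b = 0.0, 0.0   # Step 1: initial guess for the parameters
lr = 0.02         # learning rate (step size)

for _ in range(2000):
    y_pred = w * x + b
    # Step 2: gradients of the MSE cost J = mean((y_pred - y)^2)
    dw = 2 * np.mean((y_pred - y) * x)
    db = 2 * np.mean(y_pred - y)
    # Step 3: move parameters opposite to the gradient
    w -= lr * dw
    b -= lr * db

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")  # close to the true slope 2 and intercept 1
```

If the learning rate is too large, the updates overshoot and the cost diverges; if it is too small, convergence is very slow. This trade-off is why the stopping criterion checks that the cost has effectively stopped decreasing.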


TYPES OF GRADIENT DESCENT

Batch Gradient Descent – uses the entire dataset to compute the gradient in each step (stable
but slow).

Stochastic Gradient Descent (SGD) – updates parameters after every single training example
(fast, but noisy).

Mini-batch Gradient Descent – uses small batches of data (trade-off between speed and
stability, most commonly used).

Overfitting

Overfitting happens when a model learns the training data too well, including noise and
random fluctuations, instead of learning the underlying pattern.
Symptoms of Overfitting

• Very low training error but very high test error.


• The model is too complex (e.g., too many predictors, too many layers in neural
networks).

Causes

• Too many parameters compared to the amount of data.


• Lack of regularization.
• Noisy data.

Solutions

• Simplify the model (fewer variables, fewer layers).


• Regularization (penalize complexity).
• Cross-validation to tune hyperparameters.
• Get more data.
• Early stopping in iterative training methods like gradient descent.

REGULARIZATION

Regularization is a technique to prevent overfitting by adding a penalty term to the cost
function. This discourages the model from fitting noise by keeping parameter values small.

Types of Regularization

1. L1 Regularization (Lasso Regression):

o Adds the sum of absolute values of coefficients to the cost function.
o Can shrink some coefficients exactly to zero, effectively performing feature
selection.

2. L2 Regularization (Ridge Regression):

o Adds the sum of squared coefficients to the cost function.
o Shrinks coefficients but rarely to zero. Keeps all variables but with smaller
influence.

3. Elastic Net:

o Combination of L1 and L2. Useful when predictors are correlated.

Key idea: Regularization balances the bias–variance tradeoff, improving generalization on
unseen data.
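The difference between the two penalties is easy to see with scikit-learn (synthetic data; note how Lasso drives some coefficients exactly to zero while Ridge only shrinks them):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 10 predictors, only 3 of which are informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=5.0).fit(X, y)   # L1 penalty: can zero coefficients out

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
```

Increasing `alpha` strengthens the penalty: Ridge coefficients shrink further toward (but not to) zero, while Lasso eliminates more and more of the uninformative predictors.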

Regression Evaluation Metrics

Once a regression model is trained, we need metrics to evaluate how well it performs.

Common Metrics

1. Mean Absolute Error (MAE):


o Average of absolute differences between predicted and actual values:
MAE = (1/n) Σ |yᵢ − ŷᵢ|. Smaller values indicate a better fit.
UNIT 3
Classification
3.1 DECISION TREE

Decision tree learning is a widely used approach for approximating discrete-valued target
functions, where the outcome of the learning process is represented in the form of a decision
tree. These trees can also be converted into sets of if-then rules, which improves readability
and interpretability. Among inductive inference algorithms, decision tree methods are
considered highly popular due to their efficiency and applicability across diverse domains.
They have been successfully applied in areas such as medical diagnosis and credit risk
assessment for loan applicants.

3.2 Decision Tree Representation

Decision trees classify examples by guiding them from the root node of the tree to a leaf node,
which provides the final classification. Each internal node represents a test on a particular
attribute, and each branch corresponds to one of the possible values of that attribute. An
example is classified by starting at the root, applying the test specified at that node, and
moving down the branch that matches the attribute value of the instance. This process
continues until a leaf node is reached, where the classification is assigned.
Figure 3.1 illustrates a decision tree for the concept PlayTennis. It determines whether a
Saturday morning is suitable for playing tennis. For instance, the case
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
follows the leftmost branch of the tree and is classified as a negative example (PlayTennis =
No).

In general, a decision tree represents a disjunction of conjunctions of attribute constraints.


Each path from the root to a leaf corresponds to a conjunction of attribute tests, while the
complete tree represents a disjunction of these paths. For example, the PlayTennis tree
corresponds to the following logical expression:

• (Outlook = Sunny ∧ Humidity = Normal)

• ∨ (Outlook = Overcast)

• ∨ (Outlook = Rain ∧ Wind = Weak)

3.3 Appropriate Problems for Decision Tree Learning

Although different variations of decision tree algorithms exist, the method is particularly
effective for problems that share the following characteristics:

Instances described by attribute-value pairs: Each example is defined by a fixed set of
attributes (e.g., Temperature) and their values (e.g., Hot, Mild, Cold). The simplest scenario
involves attributes with a small set of discrete values. However, extensions of the basic
algorithm can also handle real-valued attributes (e.g., numerical representations of
temperature).

Discrete-valued target functions: Decision trees commonly assign Boolean outcomes such as
yes or no, but the method can be extended to handle multiple classes. Although less common,
extensions also exist for continuous-valued outputs.

Disjunctive concepts: Decision trees naturally represent target concepts that require
disjunctive expressions.

Noisy data: The method is robust to errors in both class labels and attribute values.
Incomplete data: Decision trees can be applied even when some attribute values are missing.
Techniques for handling such situations allow the model to make use of partially observed
examples.

Due to these properties, decision tree learning has been applied to many practical problems,
such as classifying patients by disease, diagnosing equipment malfunctions, and predicting
whether loan applicants are likely to default. These types of tasks, which involve assigning
examples to one of several discrete categories, are referred to as classification problems.

3.4 The Basic Decision Tree Learning Algorithm

The ID3 algorithm builds a decision tree step by step. It starts by asking: Which attribute
should be tested at the root of the tree? To answer this, each attribute in the dataset is tested
using a statistical method to measure how well it separates the training examples. The
attribute with the best classification ability is chosen as the root.

After that, branches are created for each possible value of this attribute. The training examples
are then distributed to these branches based on their attribute values. The same process is
repeated at every branch: the best attribute is selected, and further branches are created. This
continues until the decision tree is complete.

The search is called greedy because once an attribute is chosen, the algorithm does not go
back to reconsider earlier choices.

3.4.1 Choosing the Best Attribute

The most important step in ID3 is deciding which attribute to test at each node. The best
attribute is the one that separates the data most effectively. To measure this, the algorithm
uses a statistical concept called information gain.

3.4.1.1 Entropy – Measuring Homogeneity

Before defining information gain, it is necessary to understand entropy.

• Entropy is a measure from information theory that shows how mixed (or pure) a collection
of examples is.

• If all examples belong to the same class, entropy is 0 (perfectly pure).


• If examples are evenly split between classes, entropy is 1 (most impure).
For example, consider a dataset S with 14 examples: 9 positive and 5 negative (written as [9+,
5-]). The entropy is calculated based on the proportion of positive and negative examples.
• If all examples are positive (p = 1, n = 0), entropy = 0.
• If examples are evenly split (p = 0.5, n = 0.5), entropy = 1.
• For other distributions, entropy lies between 0 and 1.
Mathematically, for a dataset S with proportion pᵢ of examples belonging to class i, entropy is:

Entropy(S) = − Σᵢ₌₁ᶜ pᵢ log₂ pᵢ

where c is the number of possible classes.


Entropy can also be seen as the average number of bits required to encode the classification
of a random example.
3.4.1.2 Information Gain – Reduction in Entropy
Information gain tells how much an attribute helps in classifying the data. It is the expected
reduction in entropy after splitting the dataset according to an attribute.
Mathematically:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)
The first term is the entropy of the whole dataset. The second term is the weighted average
entropy after the split. The difference gives the information gain.
The attribute with the highest information gain is selected at each step.
Example: Attribute "Wind"
Suppose the dataset has 14 examples [9+, 5-]. The attribute Wind can take values {Weak,
Strong}.
• For Wind = Weak: 6 positive, 2 negative → [6+, 2-]
• For Wind = Strong: 3 positive, 3 negative → [3+, 3-]
After calculating entropy for each branch and taking weighted average, the information gain
for Wind is:
Gain(S,Wind)=0.048
Example: Attribute "Humidity"
The same dataset can be split by Humidity.
• For Humidity = High: [3+, 4-]
• For Humidity = Normal: [6+, 1-]
Information gain for Humidity is:
Gain(S,Humidity)=0.151
Thus, Humidity is a better classifier than Wind because it provides higher information gain.
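These gains can be verified directly with a short Python sketch that computes entropy and information gain for the [9+, 5−] dataset:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def gain(parent, splits):
    """Information gain: parent entropy minus weighted average child entropy."""
    total = sum(p + n for p, n in splits)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(*parent) - weighted

# S = [9+, 5-]; Wind splits it into Weak [6+, 2-] and Strong [3+, 3-]
g_wind = gain((9, 5), [(6, 2), (3, 3)])

# Humidity splits S into High [3+, 4-] and Normal [6+, 1-]
g_humidity = gain((9, 5), [(3, 4), (6, 1)])

print(f"Gain(S, Wind)     = {g_wind:.3f}")      # ≈ 0.048
print(f"Gain(S, Humidity) = {g_humidity:.3f}")  # close to the quoted 0.151 (exact ≈ 0.1518)
```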

3.4.2 Illustrative Example – PlayTennis Dataset


Table 3.2 shows training examples for the concept PlayTennis, with attributes Outlook,
Temperature, Humidity, and Wind. The target variable is PlayTennis (Yes or No).

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes



D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

Information Gain Calculations for All Attributes


• Gain(S, Outlook) = 0.246
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
The attribute with the highest gain is Outlook, so it is selected as the root of the tree.
Building the Tree
1. Root: Outlook
o Outlook = Overcast → always PlayTennis = Yes → becomes a leaf node.
o Outlook = Sunny → needs further splitting (nonzero entropy).
o Outlook = Rain → needs further splitting (nonzero entropy).
The process continues until:
1. All attributes are used, or
2. All examples in a branch have the same classification (entropy = 0).
The final tree successfully classifies all training examples.
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Decision Tree model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print evaluation metrics
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

print("Accuracy Score:", accuracy_score(y_test, y_pred))

Advantages

• Very easy to explain to people, even easier than linear regression.

• Mimics human decision-making processes, making it intuitive.

• Can be displayed graphically and interpreted easily, even by non-experts.

• Naturally handle qualitative predictors without dummy variable coding.

Disadvantages

• Generally less accurate than other regression and classification methods.

• Can be non-robust: small changes in data may lead to very different trees.

What is Overfitting?

Overfitting happens when a decision tree learns the training data too well, including its noise,
outliers, or random fluctuations. As a result, it performs very well on training data but poorly
on new, unseen data (test data). For example,

• Suppose you're building a decision tree to classify whether an email is spam or not.
• If the tree is too deep and complex, it might learn that “email sent on Tuesday at 3:27
PM” means spam just because it happened in the training data. But that is not a
generalizable pattern—it's just noise.

Key Signs of Overfitting in Decision Trees:

• The tree is too deep with many branches.


• It creates very specific rules that don’t apply to most data.
• Training accuracy is very high, but test accuracy is much lower.

What is Pruning?

Pruning means cutting back the size of a decision tree by removing parts that don’t help
improve performance on new data. The idea is to simplify the tree so it doesn’t overfit. There
are two types of pruning:

• Pre-pruning (early stopping): Stop growing the tree early based on a condition (like max
depth, min samples per leaf).
• Post-pruning (cost complexity pruning): Let the tree grow fully, then cut back branches
that don’t help.

Cost Complexity Pruning (Weakest Link Pruning)

This is a popular post-pruning method used in algorithms like CART (Classification and
Regression Trees). It adds a penalty for large trees. It balances between:

• How well the tree fits the data (low error)


• How complex the tree is (number of terminal nodes or leaves)

The goal is to minimize:

Cost(T) = Training Error(T) + α × ∣Terminal Nodes in T∣

Where:

• T = the tree
• α = penalty parameter (controls trade-off between accuracy and complexity)
• ∣Terminal Nodes∣ = number of leaf nodes (model complexity)

• If α = 0 → no penalty for complexity → full tree (possibly overfitting).
• If α is large → heavier penalty on complexity → smaller, simpler tree (more pruning).

Just like Lasso Regression shrinks coefficients by adding a penalty, cost complexity pruning
shrinks the tree by penalizing unnecessary splits.
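The idea above can be seen in action with a short sketch. This assumes scikit-learn, which exposes cost complexity pruning through the ccp_alpha parameter of its tree estimators and can report the sequence of "effective" alpha values for a fitted tree:

```python
# Minimal sketch (assumes scikit-learn): grow a full tree, then refit it
# with increasing values of the penalty alpha and watch it shrink.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Grow the full tree, then recover the sequence of effective alphas.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha: a larger alpha penalizes leaves more heavily,
# so the pruned tree gets smaller (the largest alpha leaves a single leaf).
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.2f}")
```

In practice, the best alpha among these candidates is usually chosen by cross-validation.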

What Are Ensemble Methods in Machine Learning?

Ensemble methods are techniques that combine predictions from multiple models to make a
final, better prediction than any single model on its own. The idea is similar to the saying that two heads are better than one.

Why Use Ensemble Methods?

Single models (like one decision tree or logistic regression) can:

• Overfit or underfit
• Be sensitive to noise
• Make biased decisions

Ensembles reduce these problems by:

• Improving accuracy
• Reducing variance
• Reducing bias
• Handling noisy data better

Bagging (Bootstrap Aggregating)

Bagging means training many models on different random subsets of the data and then
combining their predictions. The goal is to reduce variance (i.e., make the model more stable
and less sensitive to noise in the data).
How it works:

1. Take multiple random samples of your dataset with replacement (bootstrap samples).
2. Train a separate model (usually the same type, like decision trees) on each sample.
3. Combine the outputs:
o For classification: use majority vote
o For regression: use average

Example:

Imagine you're asking 5 friends to guess how many jelly beans are in a jar.

• Each friend looks at a different random sample of the beans (not the whole jar).
• Each friend makes a guess (a model's prediction).
• You take the average (or majority vote) of all the guesses as your final answer.

Common Bagging Algorithm: Random Forest (more on this below)


Simple Python Program: Classification using Bagging
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Create base classifier
base_clf = DecisionTreeClassifier()

# Create Bagging Classifier
# (on scikit-learn versions before 1.2, use base_estimator= instead)
bagging_clf = BaggingClassifier(estimator=base_clf,
                                n_estimators=10,
                                random_state=42)

# Train the model
bagging_clf.fit(X_train, y_train)

# Predict
y_pred = bagging_clf.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

What's happening here?

• DecisionTreeClassifier is the base model.
• BaggingClassifier trains 10 decision trees on random subsets of the training data.
• The predictions are combined using majority voting.
• We evaluate the model on the test data using accuracy.
2. Boosting: Boosting builds models one after another, and each new model focuses on the
mistakes made by the previous ones. The goal is to reduce bias and build a strong model from
many weak models.

How it works:

1. Start with a simple model (a weak learner, usually a small decision tree).
2. See what it got wrong.
3. Build a new model that tries to fix those errors.
4. Repeat the process, adding more models that focus on correcting previous mistakes.
5. Combine all models in a weighted way.

Imagine you're learning to shoot arrows.

• First shot: you miss the target.


• Second shot: you adjust based on your last miss.
• Third shot: adjust again.
• Over time, you get better because you learn from your past mistakes.

Common Boosting Algorithms: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost

Classification using Boosting (AdaBoost)
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Create a base estimator (a shallow decision tree, i.e. a decision stump)
base_estimator = DecisionTreeClassifier(max_depth=1)

# Create the AdaBoost classifier
# (on scikit-learn versions before 1.2, use base_estimator= instead)
boosting_clf = AdaBoostClassifier(estimator=base_estimator,
                                  n_estimators=50,
                                  learning_rate=1.0,
                                  random_state=42)

# Train the boosting classifier
boosting_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = boosting_clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

• DecisionTreeClassifier(max_depth=1) is used as the weak learner (a decision stump).
• AdaBoostClassifier builds an ensemble of weak learners sequentially, focusing more on previously misclassified examples.
• n_estimators=50 means we build 50 weak learners.
• learning_rate controls the contribution of each classifier.

3. Random Forest: Random Forest is a bagging method using decision trees, with an extra
twist: each tree uses a random subset of features when splitting. The goal is to improve
accuracy and prevent overfitting (like regular bagging, but better for trees).

How it works:

1. Take many bootstrap samples of the data.
2. Train a decision tree on each sample.
3. When building each tree, only consider a random subset of features at each split.
4. Combine the results (majority vote or average).

Example:

Suppose you're trying to guess someone’s favorite food.

• You ask 100 different people (the trees).
• Each person asks different questions (random features).
• Everyone gives their guess.
• You go with the most common answer.

Compared to plain bagging:

• Regular bagging: full feature set per tree.
• Random Forest: random features + random data per tree → more diversity → better
performance.
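The recipe above maps directly onto scikit-learn's implementation; a minimal sketch on the Iris data (the max_features="sqrt" setting is the per-split random feature subset described in step 3):

```python
# Minimal sketch (assumes scikit-learn): a Random Forest is bagging over
# decision trees plus a random subset of features at each split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,      # 100 trees, each trained on a bootstrap sample
    max_features="sqrt",   # random subset of features at each split
    random_state=42)
forest.fit(X_train, y_train)

# Classification predictions are the majority vote across the 100 trees.
print(f"Test accuracy: {forest.score(X_test, y_test):.2f}")
```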

Summary Table

Feature Bagging Boosting Random Forest


Purpose Reduce variance Reduce bias Reduce variance & overfit
Models trained In parallel Sequentially In parallel
Focus Stability Correcting mistakes Stability with randomness
Common Models Decision Trees Small Decision Trees Decision Trees
Example Algo Bagged Trees AdaBoost, XGBoost Random Forest

What is Out-of-Bag (OOB) Error Estimation?

When using bagging methods like Random Forest, we train multiple models on random
subsets of the training data (these are called bootstrap samples).

Each of these samples leaves out some data points — and we can use these left-out
observations as a test set to estimate the model's performance. That’s called the Out-of-Bag
(OOB) Error Estimation.
Why is it useful?

• You don’t need a separate validation set or cross-validation run.
• It gives a reliable estimate of test error.
• It’s built-in and efficient, especially with Random Forests.

Key Concepts

1. Bootstrap Sampling:

• For each tree in a bagging model (like Random Forest), we draw a random sample with
replacement from the training set.
• This means some data points are repeated, and others are left out.

2. Out-of-Bag (OOB) Data:

• On average, about 1/3 of the original data is not selected in each bootstrap sample.
• These unselected data points are called OOB observations for that tree.

3. How OOB Error is Computed:

Let’s say we train 100 trees in a Random Forest:

1. For each training point, there will be many trees where that point was OOB (not used for training).
2. Pass the data point through only those trees (where it was OOB).
3. Aggregate their predictions (e.g., majority vote for classification, average for regression).
4. Compare this OOB prediction to the true label.
5. Repeat for all data points.
6. Compute the overall OOB error = the percentage of incorrect OOB predictions.
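This whole procedure is built into scikit-learn's Random Forest; a minimal sketch (assuming scikit-learn and the Iris data):

```python
# Sketch (assumes scikit-learn): oob_score=True makes the forest score
# each training point using only the trees that did NOT see it.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=42)
forest.fit(X, y)

# OOB accuracy is a built-in estimate of test accuracy; no held-out set
# or cross-validation was needed.
print(f"OOB accuracy: {forest.oob_score_:.2f}")
print(f"OOB error:    {1 - forest.oob_score_:.2f}")
```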

Analogy

Imagine you're training 100 students (trees) to classify emails as spam or not.
• Each student is trained on different emails (random samples).
• Some emails are never seen by a student.
• Later, you ask each student to classify the emails they didn't train on.
• You collect the answers from all students who didn’t see a particular email — and
compare it to the actual label.

That gives you a good idea of how well your group performs on unseen data, without needing
a separate test.

What is Bayesian Additive Regression Trees (BART)?

BART is a machine learning algorithm that uses a sum of many small decision trees (like
boosting), and applies Bayesian methods to make predictions and quantify uncertainty. So,
you can think of it as:

Boosting-style model + Bayesian statistics = BART

• Instead of using one big tree, BART uses many small trees.
• These trees add together (just like in boosting) to model the output.
• But instead of training them in a greedy way like XGBoost or Gradient Boosting, BART:
o Uses a Bayesian framework to model the trees.
o Samples from a posterior distribution over possible tree structures and
predictions using MCMC (Markov Chain Monte Carlo).
o Produces probabilistic predictions, not just point estimates.

Why Use BART?

• It gives accurate predictions, like boosting.
• But it also gives you uncertainty estimates (confidence intervals, posterior
distributions).
• It’s non-parametric, meaning it can fit very complex, nonlinear relationships.
• Useful in applications where trust and interpretability are important (e.g. medicine,
economics, causal inference).
How Does BART Work?

The target variable y is modeled as:

Y = f(x) + ϵ

Where f(x) is the sum of K small decision trees:

f(x) = f1(x) + f2(x) + ... + fK(x)

with each fk(x) a shallow tree that contributes a small piece of the overall fit.
Bayesian Twist:

• BART places priors on:
  o The structure of each tree (shallow trees are more likely)
  o The predicted values at the leaves (usually centered around 0)
  o The overall noise level σ
• Then, it uses MCMC to sample from the posterior distribution of all trees and leaf values,
given the data.

What Do You Get from BART?

Predictions: Like boosting, but from a sum of trees

Uncertainty: You get full distributions, not just point estimates

Interpretability: You can examine tree structures, variable importance

Flexibility: Handles regression, classification, and even causal inference


Simple Analogy

Imagine you have a committee of doctors (trees), and each one gives a small opinion (a
number) about a patient's health risk. Instead of one doctor making a big decision, you
average all their opinions.

But instead of just giving a number, each doctor also says how confident they are. The final
decision is not just a prediction — it includes uncertainty.

That’s what BART does.

BART vs Other Tree Methods

Feature             Decision Tree (single)  Random Forest     Gradient Boosting     BART

Uses Multiple Trees No                      Yes               Yes                   Yes
Combines Trees      N/A                     Voting/Averaging  Sequential Additive   Bayesian Additive
Uncertainty Output  No                      No                No                    Yes (posterior dist.)
Bayesian?           No                      No                No                    Yes
Regularization      Manual (pruning)        Bagging           Learning rate, depth  Priors on trees
Training Method     Greedy                  Random sampling   Gradient-based        MCMC sampling

3.5 Bayesian Theorem

Bayesian learning methods hold a significant place in the study of machine learning for two
main reasons. First, algorithms based on Bayesian principles, such as the Naive Bayes classifier,
are among the most practical and effective approaches for certain types of learning problems.
For instance, studies conducted by Michie et al. (1994) compared the Naive Bayes classifier
with other learning algorithms, including decision tree and neural network models. The results
showed that Naive Bayes performs competitively in many situations and, in some cases, even
outperforms these alternatives. Because of this reliability, the Naive Bayes classifier is often
considered one of the most effective algorithms, especially in tasks like classifying text
documents such as electronic news articles.

The second reason Bayesian methods are important lies in the perspective they provide for
understanding a wide range of learning algorithms, even those that do not explicitly deal with
probabilities. For example, algorithms like FIND-S and Candidate-Elimination can be analyzed
under Bayesian principles to identify conditions in which they produce the most probable
hypothesis based on training data. Similarly, Bayesian reasoning helps explain design choices
in neural network learning, such as minimizing the sum of squared errors when exploring
possible network structures. It also provides justification for using alternative error functions,
such as cross-entropy, when predicting probabilities. Beyond this, Bayesian analysis sheds
light on the inductive biases of decision tree algorithms that favor shorter trees and relates
closely to the Minimum Description Length (MDL) principle. A basic understanding of Bayesian
methods is therefore crucial for analyzing and characterizing many algorithms in machine
learning.

Key features of Bayesian learning methods include:

• Each training example can either increase or decrease the estimated probability that a
hypothesis is correct, allowing for a flexible approach rather than discarding hypotheses
after a single inconsistency.

• Prior knowledge can be incorporated with observed data to calculate the probability of a
hypothesis. This is achieved by assigning prior probabilities to candidate hypotheses and
defining probability distributions for observed data under each hypothesis.

• Hypotheses can provide probabilistic predictions, such as "a patient has a 93% chance of
full recovery."

• New instances can be classified by combining predictions from multiple hypotheses, with
each weighted according to its probability.

• Even in cases where Bayesian methods are computationally complex, they serve as a
standard of optimal decision-making against which other algorithms can be compared.

Despite these advantages, Bayesian methods face practical challenges. One difficulty is the
need for prior knowledge of many probabilities, which are often unavailable. In such cases,
these probabilities are estimated using background knowledge, existing data, or assumptions
about underlying distributions. Another challenge is the high computational cost of
determining the Bayes optimal hypothesis, which grows linearly with the number of candidate
hypotheses. However, in specialized situations, this cost can be reduced significantly.

3.6 Bayes Theorem


In machine learning, one common task is to identify the best hypothesis from a set of possible
hypotheses H, given some observed training data D. A natural way to define the “best”
hypothesis is to select the one with the highest probability, taking into account both the
observed data and any prior knowledge about the hypotheses. Bayes theorem provides a
systematic method for calculating such probabilities.

Formally, Bayes theorem combines the prior probability of a hypothesis, the likelihood of the
observed data given the hypothesis, and the overall probability of the data, to produce the
posterior probability of the hypothesis.

• P(h): The prior probability of hypothesis h, before considering the training data. It reflects
any background knowledge about how likely h is to be correct. If no prior information is
available, equal probability can be assigned to all candidate hypotheses.

• P(D): The probability of observing the data D, without considering which hypothesis is
correct.

• P(D | h): The probability of observing the data D assuming hypothesis h holds. This is called
the likelihood.

• P(h | D): The posterior probability of hypothesis h, after observing data D. This value
indicates the updated confidence in h.

Bayes theorem can be written as:

P(h | D) = [ P(D | h) × P(h) ] / P(D)

Intuitively, the posterior probability increases if the hypothesis was already likely (high prior
probability) or if the hypothesis strongly predicts the observed data (high likelihood). On the
other hand, if the data is very common regardless of the hypothesis, the posterior probability
decreases.

MAP and ML Hypotheses


When dealing with multiple candidate hypotheses in H, the most probable one given the data
is called the maximum a posteriori (MAP) hypothesis. Using Bayes theorem, the MAP
hypothesis can be expressed as:

hMAP = argmax over h ∈ H of P(h | D) = argmax over h ∈ H of [ P(D | h) × P(h) ]

Since P(D) is the same for every hypothesis, it is dropped from the final comparison.

If all hypotheses are assumed to be equally probable a priori, the problem simplifies to finding
the hypothesis that maximizes the likelihood of the data:

hML = argmax over h ∈ H of P(D | h)

Such a hypothesis is called the maximum likelihood (ML) hypothesis.

Bayes theorem is not limited to machine learning; it applies to any set of mutually exclusive
propositions whose probabilities sum to one. However, in the context of learning, the
hypotheses generally represent possible target functions, and the data corresponds to training
examples.

Example: Medical Diagnosis

Consider a medical case with two possible hypotheses:

1. The patient has a particular form of cancer.

2. The patient does not have this cancer.

The available information is based on a laboratory test with two possible results: positive (+)
or negative (−). The prior probability of having this cancer in the general population is 0.008
(0.8%). The test is not perfectly accurate:

• If the disease is present, the test shows positive in 98% of cases.

• If the disease is absent, the test correctly shows negative in 97% of cases.
Now suppose a patient’s test result is positive. To decide which hypothesis is most probable,
the MAP approach is used. Applying Bayes theorem shows that although the posterior
probability of cancer increases significantly compared to its prior probability, the more
probable hypothesis is still that the patient does not have cancer.
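The arithmetic behind this conclusion can be checked directly; the sketch below just applies Bayes theorem to the numbers given above:

```python
# Bayes theorem applied to the cancer test example above.
p_cancer = 0.008                  # prior P(cancer)
p_no_cancer = 1 - p_cancer        # prior P(no cancer) = 0.992
p_pos_given_cancer = 0.98         # P(+ | cancer)
p_pos_given_no_cancer = 0.03      # P(+ | no cancer) = 1 - 0.97

# Unnormalized posteriors: P(+ | h) * P(h)
cancer_score = p_pos_given_cancer * p_cancer            # 0.00784
no_cancer_score = p_pos_given_no_cancer * p_no_cancer   # 0.02976

# Normalize by P(+) to get the posterior probabilities.
p_pos = cancer_score + no_cancer_score
print(f"P(cancer | +)    = {cancer_score / p_pos:.3f}")     # approx. 0.21
print(f"P(no cancer | +) = {no_cancer_score / p_pos:.3f}")  # approx. 0.79
```

Even after a positive test, "no cancer" remains the MAP hypothesis, because the disease's low prior outweighs the test's accuracy.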

This example highlights two important points:

• Bayesian inference depends strongly on prior probabilities, which must be known or


estimated.

• Hypotheses are not completely accepted or rejected. Instead, their probabilities are
updated as new evidence becomes available.

3.7 Bayes Optimal Classifier

• The Bayes Optimal Classifier is a method used in machine learning to make the best
possible prediction about a new instance, based on all the available hypotheses and data.
• Often, attention is given to the MAP (Maximum a Posteriori) hypothesis, which is the
single hypothesis with the highest probability after seeing the training data. However,
relying only on this one hypothesis may not always give the most accurate classification.

• For example,
Imagine there are three hypotheses with probabilities 0.4, 0.3, and 0.3. The first one (0.4)
is the MAP hypothesis. If this hypothesis says a new example is positive, but the other
two say it is negative, then overall the chance of it being positive is 0.4, while the chance
of it being negative is 0.6. The MAP hypothesis would say positive, but the Bayes Optimal
Classifier would say negative, because that has the higher probability when all hypotheses
are taken into account.

• In general, the Bayes Optimal Classifier works by combining the predictions of all
hypotheses, giving each hypothesis a weight according to its probability. The classification
chosen is the one with the highest overall probability after this weighted combination.
• This method is guaranteed to perform as well as possible, on average, given the training
data, the hypothesis space, and prior knowledge. In other words, no other classifier using
the same information can consistently do better.
• An interesting point is that the final classification produced by the Bayes Optimal Classifier
does not always match the predictions of any single hypothesis. Instead, it may act as if
there were a new hypothesis created from a combination of the existing ones.
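The 0.4 / 0.3 / 0.3 example above can be written out directly: weight each hypothesis's vote by its posterior probability and pick the label with the larger total.

```python
# Bayes optimal classification for the example above: three hypotheses
# with posterior probabilities 0.4, 0.3, 0.3 and their votes.
hypotheses = [
    (0.4, "positive"),   # the MAP hypothesis
    (0.3, "negative"),
    (0.3, "negative"),
]

# Sum the posterior probability behind each possible classification.
totals = {"positive": 0.0, "negative": 0.0}
for prob, vote in hypotheses:
    totals[vote] += prob

map_prediction = "positive"                  # what the single MAP hypothesis says
bayes_optimal = max(totals, key=totals.get)  # weighted vote: "negative" (0.6 > 0.4)
print("MAP:", map_prediction, " Bayes optimal:", bayes_optimal, totals)
```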

3.8 Naïve Bayes Classifier

The Naive Bayes Classifier is a method used to predict a class based on data. It is called naive
because it assumes that all features are independent from each other, even if in reality they
may not be.
It uses probabilities to decide the class of a new example. The method looks at:

• The overall chance of each class (prior probability).


• The chance of each feature given that class (conditional probability).

The final prediction is the class with the highest probability after multiplying these values
together.

The class with the highest value is chosen.

Example: Play Tennis

We have training data of 14 days.

Target: PlayTennis = Yes or No

Attributes: Outlook, Temperature, Humidity, Wind

Step 1: Prior Probabilities

Of the 14 training days, 9 have PlayTennis = Yes and 5 have PlayTennis = No:

P(Yes) = 9/14 ≈ 0.64        P(No) = 5/14 ≈ 0.36

Step 2: New Instance

(Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)

Step 3: Conditional Probabilities

From training data:

P(Sunny | Yes) = 2/9        P(Sunny | No) = 3/5
P(Cool | Yes) = 3/9         P(Cool | No) = 1/5
P(High | Yes) = 3/9         P(High | No) = 4/5
P(Strong | Yes) = 3/9       P(Strong | No) = 3/5

Step 4: Apply Formula

For Yes: 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.0053

For No: 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ 0.0206

Since 0.0206 > 0.0053, the prediction is:

PlayTennis = No
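Assuming the classic 14-day Play Tennis counts behind this example (9 Yes days, 5 No days, with the per-attribute counts listed in Step 3), the two scores can be checked with plain arithmetic:

```python
# Hand computation of the Play Tennis example: each factor is a
# count / class-size ratio from the 14-day training table.
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(Yes) * P(Sunny|Yes) * ... * P(Strong|Yes)
p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(No)  * P(Sunny|No)  * ... * P(Strong|No)

print(f"score(Yes) = {p_yes:.4f}")     # 0.0053
print(f"score(No)  = {p_no:.4f}")      # 0.0206
print("Prediction:", "Yes" if p_yes > p_no else "No")
```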

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Initialize the Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Predict labels for test set
y_pred = gnb.predict(X_test)

# Print evaluation metrics
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                            target_names=iris.target_names))

print("Accuracy Score:", accuracy_score(y_test, y_pred))


3.9 Logistic Regression

In many situations, the response variable of interest is categorical rather than continuous. For
instance, in the Default dataset, the response variable default takes on two possible
outcomes: Yes or No. Instead of attempting to model the outcome directly, logistic regression
focuses on estimating the probability that the outcome belongs to a particular category.

For the credit default data, the goal is to estimate the probability that a customer will default,
given their balance. This probability can be expressed as:

Pr(default = Yes | balance)

For convenience, denote this probability by p(balance). Since probabilities must always lie
between 0 and 1, the model should produce values in this range for all possible balances.
Once these probabilities are estimated, they can be converted into classifications. For
example, if p(balance)>0.5, then the customer may be predicted to default. Alternatively, if a
financial institution prefers to be more cautious, it might adopt a lower threshold, such as
p(balance)>0.1, when labeling a customer as likely to default.

3.9.1 The Logistic Model

A key question is how to model the relationship between the predictor X and the probability
p(X) = Pr(Y=1 | X). One naive approach is to use linear regression, so that

p(X) = β0 + β1X

This method, however, produces probabilities that are not always valid. In particular, the
fitted line may predict negative probabilities for small values of X or probabilities greater than
one for large values of X. Such predictions are nonsensical because probabilities must always
remain within the interval [0, 1]. This problem occurs generally whenever a straight line is
fitted to a binary response variable.

To overcome this issue, logistic regression employs a transformation that always returns
probabilities between 0 and 1. The model uses the logistic function, defined as:

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
The logistic curve naturally produces an S-shaped relationship between X and the predicted
probability. For very small values of X, the probability approaches zero, but never falls below
it; for very large values, the probability approaches one, but never exceeds it. Thus,
predictions remain meaningful across the entire range of inputs. In practice, logistic
regression is estimated using the method of maximum likelihood, which identifies the
parameters β0 and β1 that make the observed outcomes most probable.

Odds and Log-Odds

The logistic function can be rearranged to highlight the role of odds in logistic regression:

p(X) / (1 − p(X)) = e^(β0 + β1X)

The ratio p(X)/[1−p(X)] is known as the odds. Odds range from zero to infinity, with values
close to zero indicating a very low probability and large values indicating a very high
probability. For instance, if p(X)=0.2, the odds are 0.2/0.8=0.25, meaning one default is
expected for every four non-defaults. If p(X)=0.9, then the odds are 0.9/0.1=9, corresponding
to nine defaults for every non-default.

Taking the logarithm of both sides of this equation yields:

log( p(X) / (1 − p(X)) ) = β0 + β1X

The left-hand side is called the log-odds or logit. Logistic regression is therefore a linear model
in terms of the log-odds.
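A quick numeric check of these two facts, with made-up coefficients (β0 = −4 and β1 = 0.002, chosen purely for illustration): the logistic function keeps every prediction inside (0, 1), and the log-odds of those predictions are exactly linear in X.

```python
import math

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -4.0, 0.002

def p(x):
    """Logistic function: maps the linear score into (0, 1)."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1 + math.exp(z))

# For any x, p(x) stays strictly between 0 and 1, and
# log(p / (1 - p)) recovers the linear part beta0 + beta1 * x.
for x in [-5000, 0, 2000, 10000]:
    prob = p(x)
    log_odds = math.log(prob / (1 - prob))
    print(f"x={x:6d}  p={prob:.6f}  log-odds={log_odds:.3f}")
```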

3.9.2 Estimating the Regression Coefficients

The coefficients β0 and β1 in the logistic model are unknown and must be estimated from the
training data. In linear regression, the least squares method is commonly used to estimate
coefficients. Although logistic regression could, in principle, be fitted using a nonlinear least
squares procedure, the method of maximum likelihood is preferred because it provides more
reliable statistical properties.

The intuition behind maximum likelihood in this context is straightforward: the estimates β0
and β1 are chosen so that the predicted probabilities align as closely as possible with the observed
outcomes. For individuals who actually defaulted, the fitted probability should be close to
one, while for those who did not default, the fitted probability should be close to zero.

This reasoning can be formalized using the likelihood function:

ℓ(β0, β1) = ∏ over i with yi = 1 of p(xi) × ∏ over i' with yi' = 0 of [1 − p(xi')]

where p(xi) is given by the logistic model. The values β0 and β1 are obtained by maximizing
this likelihood function.

Maximum likelihood is a general estimation principle and is widely used across statistical
modeling. In fact, least squares estimation in linear regression can be viewed as a special case
of maximum likelihood.

The table below presents the estimated coefficients for the Default dataset when predicting default
from balance. The estimated slope coefficient is β1 = 0.0055. This means that for every one-unit
increase in balance, the log-odds of default increase by 0.0055 units.

Logistic Regression on Balance

            Coefficient   Std. Error   z-statistic   p-value

Intercept   -10.6513      0.3612       -29.5         <0.0001

Balance     0.0055        0.0002       24.9          <0.0001

In this example, the p-value for balance is extremely small, which strongly rejects the null
hypothesis and confirms that default probability is indeed related to balance. The estimated
intercept has less practical interpretive value, serving mainly to ensure that predicted
probabilities are properly centered to match the observed overall default rate in the data.

3.9.3 Making Predictions

Once the coefficients are estimated, predicted probabilities can be calculated for any given
input. For example, using the estimates above, the probability of default for an individual
with a balance of $1,000 is:

p̂(1000) = e^(−10.6513 + 0.0055 × 1000) / (1 + e^(−10.6513 + 0.0055 × 1000)) ≈ 0.00576

This corresponds to a probability of less than 1%. In contrast, the probability for a balance of
$2,000 is much higher:

p̂(2000) = e^(−10.6513 + 0.0055 × 2000) / (1 + e^(−10.6513 + 0.0055 × 2000)) ≈ 0.586

or approximately 58.6%.
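These two predictions can be reproduced directly from the fitted intercept and slope:

```python
import math

# Fitted coefficients from the Default data (intercept and balance slope).
beta0, beta1 = -10.6513, 0.0055

def p_default(balance):
    """Predicted probability of default from the logistic model."""
    z = beta0 + beta1 * balance
    return math.exp(z) / (1 + math.exp(z))

print(f"p(1000) = {p_default(1000):.5f}")   # under 1%
print(f"p(2000) = {p_default(2000):.3f}")   # about 0.586
```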

Incorporating Qualitative Predictors


Logistic regression can also handle qualitative predictors through the use of dummy variables.
In the Default dataset, the variable student indicates whether an individual is a student or
not. Coding this as 1 for students and 0 for non-students allows it to be included in the
regression model.

The results of fitting such a model are shown in the table below. The coefficient for student[Yes] is
positive and statistically significant, suggesting that students are associated with a higher
probability of default compared to non-students.

Logistic Regression on Student Status

               Coefficient   Std. Error   z-statistic   p-value

Intercept      -3.5041       0.0707       -49.55        <0.0001

Student [Yes]  0.4049        0.1150       3.52          0.0004

Using these estimates, the predicted probabilities are:

• For a student: p̂ = e^(−3.5041 + 0.4049) / (1 + e^(−3.5041 + 0.4049)) ≈ 0.0431

• For a non-student: p̂ = e^(−3.5041) / (1 + e^(−3.5041)) ≈ 0.0292

This confirms that students, on average, have a slightly higher default risk compared to non-students.

3.9.4 Multiple Logistic Regression

The logistic regression framework naturally extends from a single predictor to multiple
predictors. By analogy with the transition from simple to multiple linear regression, the
single-predictor model generalizes to:

log( p(X) / (1 − p(X)) ) = β0 + β1X1 + β2X2 + ... + βpXp
Example: Balance, Income, and Student Status

The table below reports the results of fitting a logistic regression model to the Default data, using
balance, income (measured in thousands of dollars), and student status (coded as a dummy
variable) as predictors.

Logistic Regression with Three Predictors

               Coefficient   Std. Error   z-statistic   p-value

Intercept      -10.8690      0.4923       -22.08        <0.0001

Balance        0.0057        0.0002       24.74         <0.0001

Income         0.0030        0.0082       0.37          0.7115

Student [Yes]  -0.6468       0.2362       -2.74         0.0062

Several points stand out from this output:

• The coefficient for balance is positive and highly significant, confirming that larger
balances are associated with higher probabilities of default.
• The coefficient for income is small and not statistically significant, indicating little
evidence of an association once balance and student status are included.
• Surprisingly, the coefficient for student status is negative, suggesting that students are
less likely to default than non-students, holding balance and income fixed.
This result appears contradictory to the earlier single-predictor model, where student status was associated with a higher
probability of default when used as the only predictor.

Understanding the Apparent Paradox

Example: Smoking, Exercise, and Heart Disease

Imagine you are studying the relationship between exercise and the risk of developing heart
disease.

You collect data and find that:

• People who exercise more have a higher rate of heart disease.

This seems surprising — isn’t exercise supposed to be good for your heart?

But here's the key: you forgot to consider smoking.

Step 1: Simple Model (No Smoking Variable)

You run a logistic regression:

logit(Heart Disease) = β0 + β1⋅Exercise

And find:

• β1>0: more exercise → higher heart disease risk.

This is misleading, because you’re missing a critical factor.

Step 2: Understand What’s Really Going On

Let’s look deeper:

• Smokers are more likely to have heart disease.


• Smokers might also be told to exercise more by their doctors.
• So in your data, people who exercise more are more likely to be smokers.
That means:

• Exercise is associated with smoking.


• Smoking causes heart disease.

This makes smoking a confounder — it creates a false relationship between exercise and heart
disease when it's not included in the model.

Step 3: Full Logistic Regression (Include Confounder)

Now you run a new logistic regression:

logit(Heart Disease) = β0 + β1⋅Exercise + β2⋅Smoking

Now you find:

• β1<0: exercise reduces heart disease risk (as expected).


• β2>0: smoking increases heart disease risk.

This model shows the true relationship — once you account for smoking, the positive link
between exercise and heart disease disappears and reverses.

This example shows why it's essential to include confounding variables like smoking. If you
don’t, logistic regression may give you a wrong sign or strength for your predictors. Just like
in the earlier student default example, adding the right variable (like balance or smoking)
reveals the true effect of another variable (like student status or exercise) on the outcome.

Making Predictions

With the estimated coefficients above, probabilities can be calculated for specific
individuals.

• Student with a balance of $1,500 and income of $40,000 (income is measured in thousands, so income = 40):

p̂ = e^(−10.869 + 0.0057 × 1500 + 0.003 × 40 − 0.6468 × 1) / (1 + e^(−10.869 + 0.0057 × 1500 + 0.003 × 40 − 0.6468 × 1)) ≈ 0.058

• Non-student with the same balance and income:

p̂ = e^(−10.869 + 0.0057 × 1500 + 0.003 × 40) / (1 + e^(−10.869 + 0.0057 × 1500 + 0.003 × 40)) ≈ 0.105

Thus, conditional on balance and income, the student is less likely to default than the non-student.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=1000, solver='liblinear')

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Print evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                            target_names=data.target_names))
3.10 INTRODUCTION – INSTANCE-BASED LEARNING

1. Basic Concept – Instance-based learning stores training examples instead of building a
global model, and predicts outcomes by retrieving the most similar instances when a new
query appears.
2. Local Approximations – Instead of one model for all data, these methods construct simple
models around the query point, making them effective for complex target functions that
can be explained locally.
3. Representation Flexibility – Instances may be represented numerically or symbolically,
with case-based learning extending this to symbolic reasoning for domains like legal cases,
scheduling, or help desk systems.
4. Advantages – Can approximate highly complex functions, allows both numeric and
symbolic representations, and supports problem-solving by reusing past experiences.
5. Disadvantages – Computationally expensive at prediction time, needs efficient retrieval
methods, and similarity measures may mislead when irrelevant attributes dominate.
6. Key Methods – Includes k-Nearest Neighbor (k-NN), Locally Weighted Regression (LWR),
Radial Basis Function (RBF) networks, and Case-Based Reasoning (CBR), often contrasted
with eager learners like decision trees and neural networks.

The k-nearest neighbor (k-NN) algorithm is one of the simplest machine learning techniques.
It works by storing all the training examples and predicting new cases by checking which
stored examples are closest to them. Every data point is treated as a position in an
n-dimensional space, where each attribute represents one axis. The closeness between two
points is usually measured using the Euclidean distance. For two instances xi and xj,
where ar(x) is the value of the r-th attribute of instance x, the distance is calculated as:

d(xi, xj) = √( Σr=1..n ( ar(xi) − ar(xj) )² )

Here, ar(xi) is the value of the r-th attribute (or feature/component) of data point xi.

When a new query point xq is given, the algorithm finds the k nearest training examples. If
the target function is categorical (like yes/no or red/blue), the algorithm assigns the most
common class among those k neighbors. If the target is numerical, the algorithm calculates
the average value of the neighbors. For example, if k = 1, the class of the nearest neighbor is
directly assigned. If k > 1, the predicted class is:

f̂(xq) = argmax over v of Σi=1..k δ(v, f(xi)), where δ(a, b) = 1 if a = b and 0 otherwise.

A more refined version is distance-weighted k-NN, where closer neighbors are given more
importance than distant ones. A common method is to weight each neighbor using the inverse
square of its distance:

wi = 1 / d(xq, xi)²

For classification, the prediction is then based on the weighted votes of neighbors. For
regression (when the output is a number), the predicted value is the weighted average of the
k nearest neighbors:

f̂(xq) = ( Σi=1..k wi f(xi) ) / ( Σi=1..k wi )

This ensures that points closer to the query have a stronger influence on the final prediction.
If the query exactly matches a training point, then the algorithm directly uses that point’s
class.
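The distance-weighted rule above can be sketched directly in NumPy. The data points, target values, and k = 3 below are made-up illustrative choices.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=3):
    """Distance-weighted k-NN regression with inverse-square weights."""
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distances
    idx = np.argsort(d)[:k]                              # k nearest neighbors
    if np.isclose(d[idx[0]], 0):                         # exact match: use its value
        return float(y_train[idx[0]])
    w = 1.0 / d[idx] ** 2                                # w_i = 1 / d(x_q, x_i)^2
    return float(np.sum(w * y_train[idx]) / np.sum(w))   # weighted average

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([1.0, 2.0, 2.0, 10.0])

# Query near (0, 0): its close neighbor dominates the weighted average
print(weighted_knn_predict(X, y, np.array([0.1, 0.1]), k=3))
```

The printed prediction lands just above 1.0: the nearest point (value 1.0) carries far more weight than the two more distant neighbors, exactly the "stronger influence" behavior described above.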

The strength of k-NN is that it is very easy to understand and works well when there is enough
training data. It also handles noisy data better when distance-weighting is used, as it smooths
out the effect of random errors. However, k-NN also faces some challenges. One big issue is
the curse of dimensionality. When there are many attributes, especially irrelevant ones, the
distance measure can become misleading. Two instances that are very similar in the
important attributes may still appear far apart when all irrelevant attributes are considered.
To solve this, attributes can be given different weights or irrelevant ones can be removed.
Another problem is that k-NN can be slow when the dataset is very large, since each new
query requires calculating distances to all training examples. To make it faster, data structures
like kd-trees can be used to organize the examples and quickly find nearest neighbors.

In short, k-NN is a lazy learning algorithm because it does not build a general model
beforehand. Instead, it makes decisions only when a query is given, based on the nearest
stored examples. With distance-weighting and proper feature selection, it becomes a very
effective method for both classification and regression tasks.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the KNN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Print evaluation metrics
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print("Accuracy Score:", accuracy_score(y_test, y_pred))


12.4.1 K-Means Clustering

K-means clustering is a simple and widely used method for partitioning a dataset into K distinct,
non-overlapping groups. The user first specifies the desired number of clusters, K, and the
algorithm then assigns each observation to exactly one of those clusters. The idea behind
K-means is intuitive: a good clustering is one in which observations within the same cluster are
as similar as possible, while being as different as possible from observations in other clusters.
This is achieved by minimizing the within-cluster variation, which measures how much the
observations in a cluster differ from each other.

Mathematically, the goal is to partition the dataset into K clusters so that the total within-cluster
variation, summed across all clusters, is as small as possible. In practice, this variation is
usually measured using the squared Euclidean distance between observations. Thus, K-means
clustering seeks to minimize the sum of squared distances of each observation from the mean
(centroid) of its assigned cluster.
Finding the exact solution to this optimization problem is extremely difficult because the
number of possible ways to split n observations into K clusters grows exponentially with n.
Instead, K-means uses an iterative algorithm that converges to a local optimum. The algorithm
works as follows:

(1) each observation is randomly assigned to one of the K clusters.

(2) the cluster centroids (the mean of all observations in the cluster) are calculated.

(3) each observation is reassigned to the cluster whose centroid is closest.

Steps (2) and (3) are repeated until the assignments stop changing. At every step, the objective
function (within-cluster variation) decreases, so the algorithm always improves until it
stabilizes.

The within-cluster variation for cluster Ck, measured with squared Euclidean distance, is

W(Ck) = (1/|Ck|) Σi,i′∈Ck Σj=1..p ( xij − xi′j )² = 2 Σi∈Ck Σj=1..p ( xij − x̄kj )²,

where |Ck| is the number of observations in cluster Ck and x̄kj is the mean of feature j in
cluster Ck. This identity relates two different but equivalent ways of measuring how spread out
(or "scattered") the points in a cluster are: the average pairwise squared distance, and the
squared distances to the cluster centroid.

Because K-means only guarantees a local optimum, the result depends on the initial random
assignments. This means different runs of the algorithm may produce different clusterings. To
address this, it is common to run K-means multiple times with different starting points and then
choose the best solution, i.e., the one with the lowest within-cluster variation.
A practical challenge in applying K-means is deciding on the number of clusters K. The choice
of K is not straightforward, and different values can lead to very different results. This issue,
along with other practical considerations such as initialization strategies and computational
efficiency, is an important part of applying K-means clustering effectively.
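As a rough illustration of choosing K, one can compare the within-cluster variation (exposed as the inertia_ attribute in scikit-learn's KMeans) across several candidate values of K. The two-blob toy data below is an illustrative assumption, and n_init=10 implements the multiple-restart strategy described above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated blobs of 30 points each (illustrative toy data)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

# n_init=10 reruns K-means from 10 random starts and keeps the best solution
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (1, 2, 3, 4)}

for k, w in inertias.items():
    print(f"K={k}: within-cluster variation = {w:.1f}")
# The large drop from K=1 to K=2, followed by only small drops, suggests K=2
```

Plotting these values against K gives the familiar "elbow" curve: the bend marks the point past which extra clusters buy little reduction in within-cluster variation.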

12.4.2 Hierarchical Clustering

One limitation of K-means clustering is that it requires us to pre-specify the number of clusters,
K. Hierarchical clustering is an alternative approach that does not require committing to a
specific value of K in advance. A key advantage of hierarchical clustering is that it produces a
tree-based representation of the observations, called a dendrogram, which provides a visual
summary of the data’s clustering structure.

The most common form of hierarchical clustering is agglomerative, or bottom-up, clustering. In
this method, the dendrogram is built starting from individual observations, which are gradually
merged into larger clusters until all observations belong to a single group. The dendrogram is
usually displayed as an upside-down tree, where the leaves represent individual observations,
and fusions higher up in the tree indicate the merging of clusters.

To interpret a dendrogram, we look at the height at which observations or clusters fuse.
Observations that merge at the bottom of the tree are very similar, whereas those that fuse near
the top are less similar. The vertical axis of the dendrogram represents the dissimilarity between
observations, and the height of each fusion shows how different the groups being merged are.
Importantly, the horizontal arrangement of observations in the dendrogram carries no meaning;
only the vertical fusion heights matter.

Clusters can be identified by making a horizontal “cut” across the dendrogram at a chosen
height. All the groups formed below the cut line represent clusters. For example, cutting at a
higher level may result in just two broad clusters, while cutting lower down can reveal more
detailed groupings. This makes hierarchical clustering flexible: a single dendrogram can provide
multiple levels of clustering, allowing analysts to choose the number of clusters based on the
problem at hand.
The hierarchical clustering algorithm itself is straightforward. First, each observation starts as
its own cluster. Then, at each step, the two clusters that are most similar are merged, reducing
the number of clusters by one. This process continues until all observations form a single
cluster. However, to perform this merging, we need a definition of “dissimilarity” between
clusters. This is where linkage methods come in.

In hierarchical clustering, particularly agglomerative (bottom-up) clustering, linkage methods
determine how the distance between two clusters is computed when deciding which pair to
merge, based on the distances between their individual points.
Single Linkage: Tends to form long, chain-like clusters → sensitive to noise and outliers. Looks
for the shortest bridge between the two groups.

Complete Linkage: Tends to form compact, spherical clusters → avoids chaining effect. Looks
for the longest bridge.

Average Linkage: A compromise between single and complete; balances compactness and
chaining. Averages all "bridges".

Centroid Linkage: Can result in inversions (clusters being merged that are not closest).
Measures distance between centers.

Ward's Method: Most commonly used for hierarchical clustering of real-valued data; closely
related to k-means clustering. Measures increase in total cluster variance if merged.

The choice of dissimilarity measure is also crucial. Euclidean distance is the most commonly used, but in
some cases, correlation-based distance may be more appropriate. For example, in online
shopping data, using Euclidean distance would group shoppers who purchase few items
overall, regardless of their preferences. In contrast, correlation-based distances would cluster
shoppers with similar purchasing patterns, even if some buy more frequently than others. Thus,
correlation-based distance may be better for applications where the goal is to group based on
patterns rather than volume.
Another important consideration is whether to scale variables before computing dissimilarities.
If features are on different scales (e.g., centimeters vs. kilometers) or if some features vary
much more than others (e.g., frequent purchases like socks vs. rare purchases like computers),
they can dominate the clustering process. Scaling the variables to have standard deviation one
gives all features equal importance, which is often desirable. However, whether scaling is
appropriate depends on the context of the analysis.

Overall, hierarchical clustering offers flexibility by avoiding the need to specify the number of
clusters in advance, while providing a rich, interpretable dendrogram. At the same time, results
can be strongly influenced by the choice of linkage method, dissimilarity measure, and scaling,
so careful consideration of these factors is essential.
Write a program to perform k-means clustering

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create some sample data points (each row is one point)
X = np.array([
    [1, 2],
    [2, 1],
    [3, 4],
    [5, 8],
    [8, 8],
    [9, 10]
])

# Set the number of clusters
k = 2

# Create the KMeans model
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)

# Fit the model to the data and predict cluster labels
labels = kmeans.fit_predict(X)

# Print the cluster centers
print("Cluster centers:\n", kmeans.cluster_centers_)

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X')  # Show centroids
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Write a program to perform hierarchical clustering

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample data points (each row is a point with features)
data = np.array([
    [1, 2],
    [1, 4],
    [1, 0],
    [4, 2],
    [4, 4],
    [4, 0]
])

# Perform hierarchical/agglomerative clustering using Ward's method
linked = linkage(data, method='ward')

# Plot the dendrogram to show cluster merges
plt.figure(figsize=(8, 5))
dendrogram(linked, labels=[f'Pt{i+1}' for i in range(len(data))])
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
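The horizontal "cut" described earlier can also be performed programmatically with scipy's fcluster, using the same six toy points as the dendrogram program. Cutting the tree so that exactly two clusters remain recovers the two columns of points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same toy data as in the dendrogram program
data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
linked = linkage(data, method='ward')

# Cut the dendrogram so that exactly 2 flat clusters remain
labels = fcluster(linked, t=2, criterion='maxclust')
print(labels)  # points 1-3 share one label, points 4-6 share the other
```

With criterion='distance' instead, t would specify the cut height directly, mirroring the horizontal line drawn across the dendrogram.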
10.1 Single Layer Neural Networks

A neural network is a model that takes an input vector consisting of multiple variables and
builds a nonlinear function to predict a given response. Suppose we have an input vector of p
variables:

X = (X1, X2, ..., Xp)

The goal is to build a function f(X) that predicts the response Y. Earlier, we studied nonlinear
models such as decision trees, boosting, and generalized additive models. What makes neural
networks different from these methods is their specific layered structure, which mimics how
information flows in the brain.

Let us consider a simple feed-forward neural network. Suppose we have p=4 predictors. In
neural network terminology:

● The predictors X1, X2, X3, X4 make up the input layer.


● These inputs connect to a set of hidden units. We can decide how many hidden units
to use (say K=5).
● Each input connects to every hidden unit through weighted connections.
● Finally, the outputs of the hidden units combine to produce the final prediction.

The general form of the model is:

f(X) = β0 + Σk=1..K βk hk(X)

where each hidden unit function hk(X) is given by:

hk(X) = g( wk0 + Σj=1..p wkj Xj )

Here:
• f(X) is the final prediction or output the model gives for input X.

• β₀ is a starting value called the bias.

• The ∑ (sum) symbol means "add up" for each hidden unit (from 1 to K).

• βₖ is a coefficient (weight) for each hidden unit's output.

• hₖ(X) is the output from the k-th hidden unit for input X.

• Each hₖ(X) is itself a function: hₖ(X) = g(wₖ₀ + Σⱼ wₖⱼ Xⱼ), where:

• g is the activation function (like turning the hidden unit "on" or "off" depending on
the signal).

• wₖ₀ is a bias term for each hidden unit.

• wₖⱼ represents the strength of the connection (weight) between input Xⱼ and
hidden unit k.

• The sum inside g(...) adds up these weighted inputs.

So, this formula says:

1. Each hidden unit takes the inputs (X), combines them with its own weights and adds a
bias.

2. It passes this sum through an activation function g.

3. All these outputs from each hidden unit are weighted by βₖ and added together with the
main bias β₀.

4. The result is the final prediction f(X).

In summary, the network transforms the original inputs into new nonlinear features through the
activation function, and then fits a linear regression on them.
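The four-step summary above can be sketched as a NumPy forward pass. This is a minimal illustration, not a trained model: the weight values are random stand-ins, and ReLU is used as the activation g.

```python
import numpy as np

def relu(z):
    # Activation function g(z) = max(0, z)
    return np.maximum(0, z)

def forward(X, W, w0, beta, beta0):
    """f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj * X_j)."""
    H = relu(X @ W.T + w0)   # hidden-unit outputs h_k(X), shape (n, K)
    return beta0 + H @ beta  # linear combination of the hidden units

p, K = 4, 5                  # 4 predictors, 5 hidden units (as in the text)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, p))  # w_kj: input-to-hidden weights
w0 = rng.normal(size=K)      # w_k0: hidden-unit biases
beta = rng.normal(size=K)    # beta_k: hidden-to-output weights
beta0 = 0.5                  # beta_0: output bias

X = rng.normal(size=(3, p))  # three example observations
print(forward(X, W, w0, beta, beta0))  # one prediction per observation
```

The two matrix products make the "new nonlinear features, then linear regression" structure explicit: H holds the derived features, and the final line is an ordinary linear model on them.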

The key component of neural networks is the activation function g(z), which introduces
nonlinearity. Without it, the entire model would collapse into a simple linear regression.
Common activation functions:

● Sigmoid Function

g(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z))

○ Maps values into the range (0, 1).
○ Historically popular, especially in logistic regression.
○ Used to represent "firing" of neurons: values close to 1 mean active, values
close to 0 mean silent.

● ReLU (Rectified Linear Unit)

g(z) = max(0, z)

○ Very popular in modern deep learning.
○ Simpler and computationally efficient.
○ Thresholds at zero but still allows for flexibility due to the bias term wk0.

The choice of activation is crucial: sigmoid helps with probabilistic interpretation, while ReLU is
preferred for efficiency and handling complex patterns.

To understand how nonlinear activations capture interactions, let’s consider a simple example:

● Inputs: p=2, so X=(X1,X2).


● Hidden units: K=2.
● Activation function: g(z) = z² (just for illustration).

Parameters: β0 = 0, β1 = 1/4, β2 = −1/4, and w10 = 0, w11 = 1, w12 = 1, w20 = 0, w21 = 1,
w22 = −1.

Now: h1(X) = (X1 + X2)² and h2(X) = (X1 − X2)².

Final function: f(X) = (1/4)(X1 + X2)² − (1/4)(X1 − X2)² = X1X2.

This example shows that even simple nonlinear functions of linear combinations can produce
interaction effects between input variables.

However, in practice, we do not use quadratic activations, since they always lead to
second-degree polynomials. Functions like sigmoid and ReLU are more powerful because they
do not impose such strict limitations.

The nonlinearity of the activation function is what gives neural networks their power. If g(z) were
simply the identity function (i.e., g(z) = z), then:

f(X) = β0 + Σk=1..K βk ( wk0 + Σj=1..p wkj Xj ),

which is nothing more than a linear regression model in X1, ..., Xp.

With nonlinear activations:


● The model can represent complex nonlinear relationships.
● It can capture interactions between features.
● It goes beyond traditional linear models, making it suitable for high-dimensional and
real-world problems.

The final step is fitting the model to data. This requires estimating the unknown parameters:

● Output weights: β0,β1,...,βK


● Hidden layer weights: wk0,wk1,...,wkp

For regression tasks, the most common choice of loss function is the squared error:

Σi=1..n ( yi − f(xi) )²

Here, the parameters are chosen to minimize the total squared difference between predicted
and actual values.

10.2 Multilayer Neural Networks

Modern neural networks often use more than one hidden layer. Although a single hidden layer
with enough units can theoretically approximate almost any function, training such a large
single-layer network is very difficult. Using multiple layers with fewer units makes the learning
task easier. This approach allows the network to learn step by step, building up more abstract
features at each layer.

Example: Handwritten Digit Recognition (MNIST)

A well-known example for testing neural networks is the MNIST dataset. It contains images of
handwritten digits from 0 to 9. Each image is 28 by 28 pixels, which makes 784 pixels in total.
Every pixel has a grayscale value between 0 and 255 that shows how dark it is.

To use these images in a neural network, the pixels are arranged into an input vector of length
784. The correct output is the digit label, which is represented using one-hot encoding. For
example, if the digit is “3,” the output vector will have a 1 in the fourth position and 0 in all others.
There are 60,000 training images and 10,000 test images in this dataset.

Digit recognition might look simple for humans because our brains are highly adapted to
recognize patterns. However, for machines this task was not easy, and research on digit
recognition in the late 1980s helped push the
development of modern neural networks.

The network used for MNIST has three main parts: the
input layer, two hidden layers, and the output layer.

● The input layer has 784 units (one for each pixel).
● The first hidden layer has 256 units.
● The second hidden layer has 128 units.
● The output layer has 10 units, one for each digit from 0 to 9.

In total, this network has around 235,000 parameters (weights) that need to be learned. Each
hidden layer applies a transformation to its input, using an activation function such as ReLU or
sigmoid, so that the network can detect more complex patterns as the data flows through the
layers.
Finally, the output layer produces ten values, one for each digit. To convert these values into
probabilities, the softmax function is used. This ensures that all outputs are non-negative and
add up to one. The digit with the highest probability is chosen as the prediction.
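The softmax step can be sketched as follows; the ten raw scores are made-up numbers standing in for the output layer's values.

```python
import numpy as np

def softmax(z):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Illustrative raw scores, one per digit 0-9
scores = np.array([1.2, 0.3, 4.0, -1.0, 0.0, 0.5, 2.1, -0.5, 0.8, 0.1])
probs = softmax(scores)

print(probs.argmax())  # index 2: the digit "2" gets the highest probability
print(probs.sum())     # sums to 1 (up to floating-point rounding)
```

Subtracting the maximum score before exponentiating changes nothing mathematically but prevents overflow when scores are large, which is the standard way softmax is implemented.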

The first hidden layer computes its activations exactly as in the single-layer network of
Section 10.1, Aₖ(1) = g(wₖ₀(1) + Σⱼ wₖⱼ(1) Xⱼ), and the second hidden layer applies the
same kind of transformation to the first layer's activations.

To train the network, we need a loss function that measures how well the predicted probabilities
match the true labels. For classification tasks like this, the most common choice is the
cross-entropy loss. This loss punishes the model if it assigns a low probability to the correct
class. By minimizing this loss with optimization techniques such as gradient descent, the
network gradually improves its accuracy.

To train this network, since the response is qualitative, we look for coefficient estimates that
minimize the negative multinomial log-likelihood (the cross-entropy):

− Σi=1..n Σm=0..9 yim log fm(xi),

where yim = 1 if the true class of image i is digit m (and 0 otherwise), and fm(xi) is the
predicted probability that image i shows digit m.

If the problem had been predicting numerical values instead of classes, then a squared error
loss would have been more suitable. But since this is a classification task, cross-entropy works
best.
When tested on the MNIST dataset, neural networks perform much better than simpler linear
models such as logistic regression and linear discriminant analysis (LDA).

● A neural network with ridge regularization achieves an error rate of about 2.3%.
● A neural network with dropout regularization performs even better, with an error rate
of only 1.8%.
● In comparison, logistic regression has an error rate of 7.2%, and LDA has 12.7%.

This shows that neural networks are much more powerful for pattern recognition tasks
because they can capture nonlinear relationships in the data.

Even though MNIST has 60,000 training examples, the network still has more parameters than
data points. This can easily cause overfitting, where the model memorizes the training data but
fails to generalize to new data.

To avoid overfitting, regularization methods are used. Two common techniques are:

● Ridge Regularization: Adds a penalty for large weights, preventing the model from
becoming too complex.
● Dropout Regularization: Randomly drops out some units during training so the network
does not depend too heavily on specific neurons.

Both of these techniques help the model perform better on test data.

10.7 Fitting a Neural Network

Fitting a neural network is more complex compared to linear or logistic regression. The main
challenge comes from the nonlinear structure, the large number of parameters, and the risk of
overfitting. While modern software makes this process easier, it is important to understand the
principles behind model fitting.

We start with the simple neural network introduced earlier in Section 10.1. In this network, the
parameters include:

● The weights β = (β₀, β₁, …, βₖ).

● The hidden layer weights for each unit wₖ = (wₖ₀, wₖ₁, …, wₖₚ) where k = 1, …, K.
Given training data (xᵢ, yᵢ) for i = 1, …, n, we try to fit the model by minimizing the squared
error loss:

(1/2) Σi=1..n ( yi − f(xi) )²

This looks simple, but the optimization is non-convex. This means:

● There may be multiple solutions (local minima and global minima).


● The problem becomes harder for deeper networks.
● Overfitting is very likely without proper regularization.

Two major strategies are used to make fitting practical:

● Slow Learning (Gradient Descent)


○ The model is trained iteratively using gradient descent.
○ Learning is slow and controlled to prevent unstable updates.
○ Training stops early when overfitting is detected (early stopping).

● Regularization

○ Penalties such as ridge (L2) or lasso (L1) are added to control parameter size.
○ This prevents the model from memorizing noise in the data.

The loss function for gradient descent can be written as:

L(θ) = (1/2) Σi=1..n ( yi − fθ(xi) )²

where θ represents all parameters (weights and biases).

The gradient descent algorithm works as follows:

1. Start with an initial guess θ₀ for all parameters.


2. At each step t:
○ Compute the gradient of the loss with respect to the parameters:

∇L(θᵗ) = ∂L(θ)/∂θ evaluated at θ = θᵗ

○ Update the parameters in the opposite direction of the gradient:

θᵗ⁺¹ = θᵗ − η ∇L(θᵗ)

where η is the learning rate.


3. Continue until the loss stops decreasing.

Intuition: Imagine standing in a hilly landscape. Gradient descent is like walking downhill step
by step until reaching a valley (a minimum). Sometimes this valley is global (best solution), other
times it is only local.
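The update rule can be demonstrated on a one-parameter toy loss, L(θ) = (θ − 3)². This is an illustrative stand-in for the network's loss; its minimum is known to be at θ = 3, so we can watch the iterates walk downhill to it.

```python
def grad(theta):
    # Derivative of the toy loss L(theta) = (theta - 3)^2
    return 2 * (theta - 3)

theta = 0.0  # initial guess theta_0
eta = 0.1    # learning rate
for t in range(100):
    theta = theta - eta * grad(theta)  # step opposite the gradient

print(round(theta, 4))  # 3.0: the minimum of the toy loss
```

Because this toy loss is convex, the walk always reaches the global minimum; a real network's non-convex loss is exactly the setting where the walk may instead settle in a local valley.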

10.7.1 Backpropagation
The key to making gradient descent work efficiently is backpropagation, which applies the chain
rule of differentiation.

For one training example, the loss is:

Lᵢ = (1/2) ( yᵢ − f(xᵢ) )²

where f(xᵢ) = β₀ + Σk βk g(zik) and zik = wk0 + Σj wkj xij. Applying the chain rule gives:

● With respect to βₖ:

∂Li/∂βk = −( yᵢ − f(xᵢ) ) · g(zik)

● With respect to wₖⱼ:

∂Li/∂wkj = −( yᵢ − f(xᵢ) ) · βk · g′(zik) · xij

Here:

● A portion of the error (yᵢ – f(xᵢ)) is assigned to each hidden unit.

● The chain rule propagates this error backward through the network → hence the term
backpropagation.
10.7.2. Regularization and Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD)


○ Instead of computing gradients on the full dataset, we use minibatches (small
random subsets of data).
○ This speeds up training and helps avoid poor local minima.
○ One pass through all minibatches is called an epoch.
Regularization
Adds penalty terms to prevent overfitting.
○ For example, ridge regularization modifies the loss to L(θ) + λ Σⱼ θⱼ², where
λ ≥ 0 controls the strength of the penalty.
○ Lasso regularization uses absolute values instead of squares.


○ Early stopping also acts as a regularizer — stop training when validation error
starts increasing.

10.7.3 Dropout Learning

● Dropout is another regularization technique.


● During training, some units (neurons) are randomly dropped (set to 0) with probability
p.
● Surviving units are scaled by 1 / (1 - p) to maintain balance.
● This prevents over-specialization and improves generalization.
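This inverted-dropout recipe (drop each unit with probability p, rescale survivors by 1/(1 − p)) can be sketched as a random mask; the constant activations below are an illustrative assumption that makes the averages easy to check.

```python
import numpy as np

def dropout(activations, p, rng):
    """Zero each unit with probability p; rescale survivors by 1/(1-p)."""
    mask = rng.random(activations.shape) >= p  # True = unit survives
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(10000)               # constant activations, for illustration
out = dropout(a, p=0.5, rng=rng)

print((out == 0).mean())         # about 0.5: half the units were dropped
print(out.mean())                # about 1.0: the expected activation is preserved
```

The 1/(1 − p) rescaling is what lets the same network be used at test time with no mask at all: the expected signal reaching the next layer is unchanged.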

10.7.4 Network Tuning

Building a good neural network requires careful tuning:

● Number of hidden layers and number of units per layer.


● Regularization strength (dropout rate, ridge/lasso penalties).
● SGD parameters (batch size, learning rate, number of epochs).
● Data augmentation in some cases.

With trial and error, networks can achieve very low error rates (e.g., <1% on MNIST digits), but
over-tuning can cause overfitting.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Sample dataset (X: inputs, y: labels)
X = [[0,0], [0,1], [1,0], [1,1]]  # XOR inputs
y = [0, 1, 1, 0]                  # XOR outputs

# Build the model
model = keras.Sequential([
    layers.Dense(2, activation="sigmoid", input_shape=(2,)),  # hidden layer
    layers.Dense(1, activation="sigmoid")                     # output layer
])

# Compile the model
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Train the model
model.fit(X, y, epochs=100, verbose=0)

# Test the model
print("Predictions:")
print(model.predict(X))

This creates the neural network structure:

1. Input shape = (2,) because each input has 2 values (x1, x2).
2. First layer (hidden layer):
o Dense(2) → fully connected layer with 2 neurons.
o activation="sigmoid" → each neuron outputs a value between 0 and 1.
3. Second layer (output layer):
o Dense(1) → 1 neuron output (since XOR has one output).
o activation="sigmoid" → outputs a probability between 0 and 1.

So architecture is: 2 inputs → [2 hidden neurons] → [1 output neuron]

optimizer="adam" → a smart version of gradient descent that adjusts learning rates
automatically.

loss="binary_crossentropy" → since XOR is a binary classification problem (0 or 1).

metrics=["accuracy"] → we also want to track accuracy while training.

Training the model:

• X = inputs, y = expected outputs.


• epochs=100 means repeat training 100 times through the dataset.
• verbose=0 hides detailed logs (you can set it to 1 to see progress).

During training:

• The model does forward pass (predicts output).


• Computes loss (error between predicted and actual output).

Uses backpropagation + Adam optimizer to update weights and biases.

What Is a Hyperplane?
In mathematics, a hyperplane is a flat surface that has one dimension less than the space in
which it exists. For example, in two dimensions, a hyperplane is simply a line, and in three
dimensions a hyperplane is a plane. When we move to higher dimensions (p > 3), it becomes
hard to visualize, but the idea is still the same: a hyperplane always has one dimension less
than the space that contains it.

The general mathematical equation of a hyperplane in two dimensions is:

β0 + β1X1 + β2X2 = 0

In p-dimensional space, this extends to:

β0 + β1X1 + β2X2 + ... + βpXp = 0

Any point X = (X1, X2, ..., Xp) that satisfies this equation lies on the hyperplane. If the point
does not satisfy the equation exactly, then it lies either on one side of the hyperplane or on the
other side. Specifically, if

β0 + β1X1 + β2X2 + ... + βpXp > 0,

then the point lies on one side of the hyperplane. On the other hand, if

β0 + β1X1 + β2X2 + ... + βpXp < 0,

then the point lies on the opposite side. In this way, a hyperplane divides the entire space into
two halves.

Once we have such a hyperplane, it can be used as a classifier. A new test observation is
classified depending on which side of the hyperplane it lies on. Mathematically, we compute

f(x) = β0 + β1x1 + β2x2 + ... + βpxp
If f(x) > 0, the observation is assigned to class +1. If f(x)<0, it is assigned to class −1. The size of
∣f(x)∣ also tells us how confident the classification is. If f(x)is large in magnitude, the point lies
far away from the hyperplane, and we are more confident in the class label. If f(x) is close to
zero, the point lies near the hyperplane, and the classification is less certain.
Thus, a classifier based on a separating hyperplane produces a linear decision boundary that
divides the space into two regions, one for each class.
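Classification by the sign of f(x) can be sketched in a few lines; the coefficients β0 = −1, β1 = 2, β2 = 3 are arbitrary illustrative values, not from the text.

```python
import numpy as np

beta0 = -1.0
beta = np.array([2.0, 3.0])  # hyperplane: -1 + 2*x1 + 3*x2 = 0

def classify(x):
    f = beta0 + beta @ x     # f(x) = beta0 + sum_j beta_j * x_j
    label = 1 if f > 0 else -1  # side of the hyperplane decides the class
    return label, abs(f)     # |f(x)| indicates how confident we are

print(classify(np.array([1.0, 1.0])))  # (1, 4.0): far on the +1 side, confident
print(classify(np.array([0.1, 0.1])))  # (-1, ~0.5): near the hyperplane, less certain
```

The second point illustrates the confidence remark above: its small |f(x)| means it sits close to the decision boundary, so the class label is less trustworthy.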

9.1.2 Classification Using a Separating Hyperplane

Now suppose we have an n × p data matrix X that contains n training observations in
p-dimensional space. Each observation can be written as a vector of p features. For example:

x1 = (x11, x12, ..., x1p)ᵀ, ..., xn = (xn1, xn2, ..., xnp)ᵀ

These observations belong to two different classes. We label them as:

y1, y2, ..., yn ∈ {+1, –1}

Here +1 represents one class and –1 represents the other class.

We also have a test observation, which is a p-dimensional feature vector:

x = (x_1, x_2, ..., x_p)T

Our goal is to develop a classifier using training data that can correctly classify this test
observation.

If it is possible to construct a separating hyperplane, then the two classes can be perfectly
divided. Labeling the observations from the blue class as yi = +1 and those from the purple
class as yi = −1, a separating hyperplane has the property that

β0 + β1xi1 + β2xi2 + ... + βpxip > 0 if yi = +1, and

β0 + β1xi1 + β2xi2 + ... + βpxip < 0 if yi = −1;

equivalently, yi (β0 + β1xi1 + ... + βpxip) > 0 for every training observation. To classify a test
observation x, we compute f(x) = β0 + β1x1 + ... + βpxp:

● If f(x) > 0, we assign the test observation to class +1.

● If f(x) < 0, we assign the test observation to class –1.

9.1.3 The Maximal Margin Classifier


In general, if our data can be perfectly separated by a hyperplane, then there will usually be an
infinite number of such hyperplanes. This is because a separating hyperplane can often be
shifted slightly up or down, or rotated a little, without touching any of the training points. For
example, in Figure 9.2, the left panel shows three possible separating hyperplanes. To build a
classifier, we need a clear rule to decide which hyperplane to use among these many
possibilities.

A natural choice is the maximal margin hyperplane (also called the optimal separating
hyperplane). This is the hyperplane that is farthest away from the training observations. To
define this, we first calculate the perpendicular distance from each training observation to a
given hyperplane. The smallest of these distances is called the margin. The maximal margin
hyperplane is the one that maximizes this margin, meaning it is the hyperplane with the largest
possible minimum distance to the training points.

Once we have the maximal margin hyperplane, we can classify a test observation based on
which side of this hyperplane it lies on. This method is called the maximal margin classifier. The
idea is that if we build a classifier with a large margin on the training data, it will hopefully also
give a large margin on unseen test data, leading to good classification accuracy. However, when
the number of features p is very large, the maximal margin classifier can sometimes overfit the
data.

If the coefficients of the maximal margin hyperplane are β0, β1, ..., βp, then classification is
done using the function:

f(x) = β0 + β1 * x_1 + β2 * x_2 + ... + βp * x_p

● If f(x) > 0, we assign the observation to class +1.


● If f(x) < 0, we assign the observation to class –1.

Figure 9.3 shows an example of the maximal margin hyperplane. Compared to the separating
hyperplanes in Figure 9.2, the maximal margin hyperplane clearly results in a larger margin
between the classes. It can be thought of as the mid-line of the widest “slab” that can be placed
between the two classes without touching any training points.
Looking at Figure 9.3, we see that three training observations lie exactly on the dashed lines that
mark the edges of the margin. These points are called support vectors. They are the points that
directly “support” the maximal margin hyperplane. If any of these support vectors were moved
slightly, the position of the hyperplane would change. However, moving any of the other training
observations would not affect the hyperplane if they do not cross into the margin.

This property is very important. It means that the maximal margin hyperplane depends only on
a small subset of the training data (the support vectors), not on all the observations. This idea
will play a central role later when we discuss the support vector classifier and support vector
machines.

9.1.4 Construction of the Maximal Margin Classifier


We now consider the task of constructing the maximal margin hyperplane based on a set of n
training observations x_1, x_2, ..., x_n and associated class labels y_1, y_2, ..., y_n ∈ {+1, –1}.
The condition

y_i (β0 + β1 * x_i1 + β2 * x_i2 + ... + βp * x_ip) ≥ M

ensures that each observation is on the correct side of the hyperplane with a cushion of at least
margin M (as long as M > 0).

Without this margin, we would only need

y_i (β0 + β1 * x_i1 + β2 * x_i2 + ... + βp * x_ip) > 0

to separate the classes. But requiring ≥ M gives a stronger condition.

The normalization constraint

β1^2 + β2^2 + ... + βp^2 = 1

is used to scale the coefficients, because if

β0 + β1 * x_i1 + β2 * x_i2 + ... + βp * x_ip = 0

defines a hyperplane, then multiplying the whole equation by any constant k also defines the
same hyperplane. By fixing the sum of squares of the coefficients to 1, we make the margin well-
defined.

Under these constraints, the perpendicular distance from the ith observation to the
hyperplane is given by:

y_i (β0 + β1 * x_i1 + β2 * x_i2 + ... + βp * x_ip)

Thus, maximizing M means finding the hyperplane that maximizes the margin between the two
classes. This hyperplane is called the maximal margin hyperplane.

9.1.5 The Non-separable Case

The maximal margin classifier works well when the data can be perfectly separated by a straight
line (or hyperplane in higher dimensions). In such cases, it finds the boundary that leaves the
largest possible margin between the two classes.

However, in many real situations, the data from different classes overlap and no single
hyperplane can separate them. When this happens, the optimization problem for the maximal
margin classifier has no solution, since it requires a margin greater than zero.

For example, in Figure 9.4 we see two groups of points—blue and purple—that cannot be split
perfectly by a straight line. Because of this, the maximal margin classifier cannot be applied.

To handle such cases, the idea of a separating hyperplane is relaxed. Instead of demanding
perfect separation, we allow for some flexibility so that the boundary “almost” separates the
classes. This leads to what is called the soft margin, which forms the basis of the support
vector classifier, a more general version of the maximal margin classifier that works when the
data are not perfectly separable.

9.2 Support Vector Classifiers


9.2.1 Overview of the Support Vector Classifier

In many cases, data points from two classes cannot be perfectly separated by a straight line
(hyperplane). Even if separation is possible, using such a classifier may not always be a good
choice. The reason is that a hyperplane that separates all training points exactly can become
too sensitive to individual observations.

For example, in Figure 9.5 (left panel), we see two classes of points—blue and purple—with a
maximal margin hyperplane. But when a single extra blue point is added (right panel), the
hyperplane shifts dramatically. The new separating line has a very small margin, which makes
it unreliable. Since the distance of points from the hyperplane is a measure of classification
confidence, a tiny margin means we cannot be confident about the predictions. This shows that
the maximal margin classifier can easily overfit, adjusting too much to specific training points
instead of capturing the general pattern.

To solve this issue, it can be useful to allow a classifier that does not perfectly separate the two
classes. By permitting a few misclassifications, we can achieve a boundary that better
represents the overall data. This approach is called the support vector classifier or soft margin
classifier.

The idea is that instead of forcing all points to be on the correct side of both the hyperplane and
the margin, we allow some flexibility. A few observations may fall inside the margin, and some
may even be on the wrong side of the hyperplane. The margin is called “soft” because it can be
violated by certain training points.

For instance, in Figure 9.6 (left panel), most observations lie correctly outside the margin, but a
few fall inside it. In the right panel of Figure 9.6, some points are even misclassified because
they lie on the wrong side of the hyperplane. This situation is expected when no perfect
hyperplane exists. By accepting these violations, the support vector classifier achieves greater
robustness and usually performs better on unseen data.

9.2.2 Details of the Support Vector Classifier

The support vector classifier decides the class of a new observation based on which side of a
hyperplane it falls. The hyperplane is chosen so that it separates most of the training data
correctly, but it may misclassify a few points.

To find this hyperplane, we solve the following optimization problem: maximize M over
β0, β1, ..., βp and ξ1, ..., ξn, subject to

β1^2 + β2^2 + ... + βp^2 = 1,
y_i (β0 + β1 * x_i1 + ... + βp * x_ip) ≥ M(1 − ξi) for all i,
ξi ≥ 0, and ξ1 + ξ2 + ... + ξn ≤ C.

Here, M is the margin we want to maximize, ξi are slack variables, and C is a nonnegative tuning
parameter. After solving this, we classify a new observation x* by checking the sign of
f(x*) = β0 + β1 * x*_1 + ... + βp * x*_p.
The slack variables ξi describe where each observation lies relative to the margin:

● If ξi = 0, the observation is on the correct side of the margin.
● If ξi > 0, the observation is inside the margin, so it has violated the margin.
● If ξi > 1, the observation is on the wrong side of the hyperplane.

For example, in Figure 9.6 (left panel):

● Purple points 3, 4, 5, and 6 are on the correct side of the margin.
● Purple point 2 is exactly on the margin.
● Purple point 1 is on the wrong side of the margin.
● Blue points 7 and 10 are on the correct side of the margin.
● Blue point 9 is on the margin.
● Blue point 8 is on the wrong side of the margin.

None of these points are on the wrong side of the hyperplane.
In the right panel of Figure 9.6, two extra points (11 and 12) are added. These points are on the
wrong side of both the margin and the hyperplane.

The parameter C controls how many violations are allowed. Equation (9.15) makes sure the
total amount of violation does not exceed C. If C=0, then all ξi=0, so no violations are allowed,
and the problem reduces to the maximal margin hyperplane (which only works if the data are
separable). For C>0, at most C observations can be on the wrong side of the hyperplane, since
that requires ξi>1.

When C is larger, the model allows more violations, and the margin becomes wider. When C is
smaller, fewer violations are allowed, and the margin becomes narrower. Figure 9.7 shows this:
the top-left panel uses the largest value of C, and the margin is wide. As C decreases (top-right,
bottom-left, bottom-right panels), the margin gets narrower.

Another important detail is that not all observations affect the hyperplane. Only those that lie
on the margin or violate it influence the classifier. Observations that are strictly on the correct
side of the margin have no effect—moving them around would not change the classifier. Points
that lie on the margin or on the wrong side are called support vectors.

When C is large, many observations violate the margin, so there are many support vectors. This
makes the classifier depend on many points. In the top-left panel of Figure 9.7, there are many
support vectors. When C is small, there are fewer support vectors, as shown in the bottom-right
panel of Figure 9.7, which has only eight support vectors.

9.3 Support Vector Machines

9.3.1 Classification with Non-Linear Decision Boundaries

The support vector classifier works well when the boundary between two classes is roughly
linear. But in many real problems, the classes are separated by non-linear boundaries.

For example, in the left panel of Figure 9.8, the data clearly has a non-linear boundary. A
support vector classifier or any linear classifier would perform badly. This is confirmed in the
right panel of Figure 9.8, where the linear support vector classifier does not separate the
classes properly.

A similar situation was seen earlier in Chapter 7, where linear regression struggled with non-
linear relationships. There, we solved the issue by enlarging the feature space using
polynomial terms like quadratic or cubic functions.

We can use the same idea here for the support vector classifier. Instead of just using the original
predictors X1, X2,…, Xp, we could add polynomial functions of them. For example, we could fit
a classifier with features:

X1, X1^2, X2, X2^2, ..., Xp, Xp^2

This would double the number of features, giving 2p features instead of just p.

The optimization problem would then change so that each predictor enters through two
coefficients, one for Xj and one for Xj^2, with each observation required to satisfy

y_i (β0 + β11 * x_i1 + β12 * x_i1^2 + ... + βp1 * x_ip + βp2 * x_ip^2) ≥ M


Why does this give a non-linear boundary?

In the enlarged feature space, the decision boundary is still linear. But in the original space, the
boundary comes from a polynomial equation, which is generally non-linear.

We can even add higher-order polynomials (cubic, quartic, etc.) or interaction terms of the
form XjXj′ between pairs of predictors. But if we keep adding more functions, the number of
features could become extremely large, making computation difficult.

This is where the support vector machine (SVM) comes in. It allows us to enlarge the feature
space efficiently using kernels, without explicitly creating all those new features.

9.3.2 The Support Vector Machine

The support vector machine (SVM) is an extension of the support vector classifier that handles
non-linear boundaries by using kernels. Kernels allow us to enlarge the feature space efficiently
without having to compute all polynomial or interaction terms directly.
The details of how the support vector classifier is computed are technical, but the key idea is
that the solution involves only the inner products of the observations, not the observations
themselves.

It can be shown that:

● The linear support vector classifier can be written as

f(x) = β0 + Σ αi * ⟨x, xi⟩, with one parameter αi per training observation.

● The αi are nonzero only for the support vectors, so the sum really runs over the support
vectors alone.

This greatly reduces the computation.

Now, instead of using just the standard inner product (9.17), we can replace it with a kernel
function K(xi, xi′).

A kernel measures the similarity between two observations.

● If we use K(xi, xi′) = Σ_j xij * xi′j (the standard inner product), we get back the linear
kernel, which is just the regular support vector classifier.

● If we use K(xi, xi′) = (1 + Σ_j xij * xi′j)^d, we get a polynomial kernel of degree d.
With d > 1, this allows non-linear boundaries.

When the support vector classifier is combined with such a kernel, it is called a
support vector machine. The classifier is then:

f(x) = β0 + Σ αi * K(x, xi), where the sum runs over the support vectors.

● The left panel of Figure 9.9 shows an SVM with a polynomial kernel of degree 3 applied
to the non-linear data from Figure 9.8. It gives a much better decision boundary than
the linear support vector classifier.

● Another popular choice is the radial kernel:

K(xi, xi′) = exp(−γ * Σ_j (xij − xi′j)^2), where γ is a positive constant.

The right panel of Figure 9.9 shows an SVM with a radial kernel on the same data, and it
also gives a good separation.

How does the radial kernel work?
● If a test observation x∗ is far (in Euclidean distance) from a training observation xi, then
K(x∗,xi) will be close to 0.
● This means far-away points have little to no influence on the classification of x∗.
So, the radial kernel has a local behavior, where only nearby training points affect the
decision.
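Both kernels can be tried with scikit-learn's SVC. The sketch below uses hypothetical data with a circular class boundary rather than the data of Figures 9.8 and 9.9:

```python
# A hedged sketch of polynomial and radial kernels with scikit-learn's SVC
# on synthetic data whose true boundary is a circle (non-linear).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # class depends on distance from origin

poly_svm = SVC(kernel="poly", degree=3, coef0=1).fit(X, y)  # polynomial kernel, d = 3
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)            # radial kernel

print("poly training accuracy:", poly_svm.score(X, y))
print("rbf training accuracy:", rbf_svm.score(X, y))
```

A linear kernel cannot capture a circular boundary, but both non-linear kernels fit it well; the rbf kernel's local behavior makes it a common default choice.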

Finally, why use kernels instead of directly enlarging the feature space like in Section 9.3.1?
The main reason is computation. With kernels, we only need to compute K(xi,xi′) for all
distinct pairs of observations, without explicitly working in the enlarged feature space.

This is crucial because in many cases the enlarged feature space is extremely large, or even
infinite (as with the radial kernel). Kernels let us work efficiently without ever constructing that
huge space.

9.3.3 An Application to the Heart Disease Data

Decision trees and related methods were applied to the Heart data. The goal was to use 13
predictors such as Age, Sex, and Chol to predict whether a person has heart disease. Here, an
SVM is compared to LDA on this data. After removing 6 missing observations, 297 subjects
remain, split into 207 training and 90 test observations.

First, LDA and the support vector classifier (SVM with polynomial kernel of degree 1) were fitted
to the training data. The left panel of Figure 9.10 shows ROC curves for training set predictions
from both methods.
Both calculate scores of the form ˆf(X) = ˆβ0 + ˆβ1 * X1 + ˆβ2 * X2 + ... + ˆβp * Xp for each
observation. For a cutoff t, predictions are made based on whether ˆf(X) < t or ˆf(X) ≥ t. The ROC
curve is formed by calculating false positive and true positive rates across different values of t.
The best classifier will be close to the top left of the ROC plot. Here, LDA and the support vector
classifier both perform well, though the support vector classifier may be slightly better.

The right panel of Figure 9.10 shows ROC curves for SVMs using a radial kernel with different
values of γ. As γ increases, the fit becomes more non-linear, and the ROC curves improve. With
γ=10^−1, the training ROC curve looks almost perfect. But these are training results, which may
not reflect test performance.

Figure 9.11 shows ROC curves for the 90 test observations. In the left panel, the support vector
classifier has a small advantage over LDA (though not statistically significant). In the right panel,
the SVM with γ=10^−1, which performed best on training data, performs worst on test data.
This shows that more flexible models can lower training error but not necessarily improve test
performance. The SVMs with γ=10^−2 and γ=10^−3 perform similarly to the support vector
classifier, and all three do better than the SVM with γ=10^−1.

9.4 SVMs with More than Two Classes


Support Vector Machines (SVMs) are mainly used for binary classification problems where data
is divided into two classes. However, many real-world problems involve more than two classes,
and in such cases, the direct idea of separating hyperplanes does not work naturally. To handle
this, two common strategies are applied: one-versus-one and one-versus-all classification.

9.4.1 One-Versus-One Classification

In this approach, separate classifiers are trained for every possible pair of classes.

● For K classes, a total of K(K−1)/2 classifiers is created, one for each pair of classes.

● Each classifier distinguishes one class as +1 and the other as –1.


● For a new test observation:
Each classifier predicts one of the two classes.
A voting system is used, where the class that gets the maximum votes is selected.

9.4.2 One-Versus-All Classification

This method compares each class with all the remaining classes together.

● For K classes, a total of K classifiers is trained

● Each classifier predicts whether an observation belongs to its specific class (+1) or to
the rest (–1).
● For a test observation:
A score is calculated for each class.
The class with the highest score is chosen as the predicted class.
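Both strategies can be sketched with scikit-learn's explicit wrapper classes; the four-class data below is synthetic:

```python
# A hedged sketch of one-versus-one and one-versus-all classification
# using scikit-learn's wrapper classes on synthetic four-class data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# One-versus-one: K(K-1)/2 = 6 pairwise classifiers, combined by voting
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
# One-versus-all: K = 4 classifiers, the highest score wins
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

print("one-vs-one classifiers:", len(ovo.estimators_))
print("one-vs-all classifiers:", len(ovr.estimators_))
```

With K = 4 classes, the one-versus-one wrapper fits 6 binary classifiers while one-versus-all fits only 4, matching the counts given above.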

5.1 Cross-Validation

When we train a model, we usually look at two kinds of error:

● Training error → how well the model predicts the data it was trained on.
● Test error → how well the model predicts new, unseen data.
What we really care about is the test error, since it tells us how the model will perform in real
situations. But usually, we don’t have a separate test set available.

The training error is easy to calculate, but it often underestimates the true test error
(sometimes by a lot). That’s why we need ways to estimate the test error using only the training
data.

There are two main strategies:

1. Mathematical adjustment → tweak the training error to better estimate the test error.
2. Resampling methods → hold out part of the training data, train the model on the rest, and
then test it on the held-out part.

In this section, the focus is on the second strategy: estimating test error by holding out subsets
of data.

5.1.1 The Validation Set Approach

The validation set approach is a simple way to estimate how well a model will perform on
unseen data. In this method, the dataset is randomly divided into two parts: a training set and
a validation (hold-out) set. The model is trained using the training set and then evaluated on
the validation set. The prediction error on the validation set, often measured using Mean
Squared Error (MSE) for regression problems, serves as an estimate of the test error.

To illustrate this, consider the Auto dataset, where we want to predict miles per gallon (mpg)
from horsepower using polynomial regression. The dataset of 392 observations was randomly
split into two equal halves: 196 for training and 196 for validation. The results showed that a
quadratic model (including horsepower and horsepower²) had a much lower validation error
compared to a simple linear model. Interestingly, adding a cubic term did not improve
performance; in fact, the cubic model performed slightly worse than the quadratic one. This
suggests that while the linear model was too simple, a quadratic model was sufficient without
needing more complexity.
However, one drawback of this method is that the results can vary depending on how the data
is split. For example, if we split the Auto dataset multiple times, the estimated test error
changes with each split. Although all splits consistently show that the linear model is
inadequate and the quadratic model is effective, there is no agreement about whether higher-
order polynomials offer any real benefit. This highlights the high variability of the validation set
approach.

Another drawback is that the model is trained on only part of the data, not the full dataset. Since
statistical models generally perform better when trained on more data, the validation error may
be larger than the true test error we would expect if the model were trained on the entire
dataset. In other words, the validation set approach tends to be too pessimistic.

Because of these issues, the validation set approach, though simple and easy to implement, is
not always reliable. To overcome its limitations, more advanced methods such as cross-
validation are used, which provide more stable and accurate estimates of test error.
5.1.2 Leave-One-Out Cross-Validation

Leave-One-Out Cross-Validation (LOOCV) is very similar to the validation set approach, but
it improves on its drawbacks. Instead of splitting the data into two large subsets, LOOCV takes
a different approach: each time, it leaves out just one observation to use as the validation set,
and uses the remaining n−1 observations for training. The model is fit on these n−1 data points,
and then it predicts the left-out observation. The squared error for this prediction is recorded.

This process is repeated for every single observation in the dataset. That means if we have n
data points, the model will be trained n times, each time leaving out a different observation. At
the end, all n squared errors are averaged to give the LOOCV estimate of the test error:

CV(n) = (1/n) * (MSE_1 + MSE_2 + ... + MSE_n)

LOOCV has some clear advantages over the simple validation set approach. First, it has low
bias, because each training set contains almost all the data (n−1observations), rather than just
half the dataset as in the basic validation method. This makes its error estimates much closer
to what we would expect if we trained on the full dataset. Second, LOOCV results are
deterministic. Unlike the validation set method, which gives different answers depending on
how we randomly split the data, LOOCV always gives the same result because every possible
split is considered.
For example, in the Auto dataset (predicting mpg from horsepower using polynomial
regression), LOOCV can be used to estimate the test error for models of different polynomial
degrees. The error curve from LOOCV provides insight into which model complexity works best,
without depending on random data splits.

The main drawback of LOOCV is its computational cost. Since the model must be trained n
times, it can be very slow for large datasets or for models that take a long time to train.
Fortunately, for linear regression (including polynomial regression), there is a mathematical
shortcut that makes LOOCV as fast as fitting just one model. This shortcut uses a formula
involving the residuals and “leverage” values from the regression, which adjusts errors
appropriately for each observation.

Finally, LOOCV is a general method that can be applied to almost any predictive model, such
as logistic regression, linear discriminant analysis, and more. However, the special shortcut
formula only works for linear regression; for other models, the full n-times fitting process is
necessary.
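A hedged sketch of LOOCV with scikit-learn's LeaveOneOut splitter, on synthetic linear data:

```python
# A hedged sketch of LOOCV: cross_val_score performs the n separate fits,
# one per held-out observation, and we average the n squared errors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + rng.randn(50)  # linear truth + unit-variance noise

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
loocv_mse = -scores.mean()  # average of the n squared errors
print("fits:", len(scores), "LOOCV MSE:", round(loocv_mse, 3))
```

Note that the result is deterministic: there is no random split, so repeating the computation always gives the same estimate.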

5.1.3 k-Fold Cross-Validation

An alternative to leave-one-out cross-validation (LOOCV) is k-fold cross-validation (CV). In this


approach, the dataset is randomly divided into k equally sized groups, or folds. Each fold takes
a turn acting as the validation set, while the remaining k−1 folds are used for training. The
model’s error is measured on the validation fold, and this process is repeated until every fold
has been used once as validation. The k-fold CV estimate of test error is then the average of
these k error values.

LOOCV can be viewed as a special case of k-fold CV, where k = n (the number of observations).
However, in practice, k is usually set to 5 or 10, leading to 5-fold CV or 10-fold CV. The main
reason for this is computational efficiency: LOOCV requires fitting the model n times, which can
be extremely slow for large datasets or complex models, while 10-fold CV requires fitting only
10 models, making it much more feasible. Beyond computation, 5- or 10-fold CV also has
advantages in terms of balancing bias and variance, as discussed later.

When applied to the Auto dataset, 10-fold CV produced slightly different results depending on
how the data was split into folds, but the variability was far less than with the simple validation
set approach. In general, k-fold CV provides a more stable and reliable estimate of test error
than a single train/validation split.

To better understand how accurate cross-validation is, researchers often turn to simulated
data, where the true test error is known. In such cases, the test MSE curves from LOOCV and
10-fold CV usually look very similar to the true curve, even if they slightly over- or under-
estimate the actual error values. Importantly, even when the error values are not perfect, the
location of the minimum error point, that is, the level of model flexibility that produces the
best performance, is almost always correctly identified by cross-validation.

Thus, k-fold CV not only provides a good estimate of test error but also serves as a practical tool
for model selection, helping us choose the model or level of complexity that will generalize
best to unseen data.
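A minimal sketch of 10-fold CV for model selection, again on synthetic data in place of the Auto dataset:

```python
# A minimal sketch of 10-fold CV for choosing a polynomial degree:
# only 10 model fits per candidate degree, instead of n as in LOOCV.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = 1 - X[:, 0] ** 2 + rng.randn(300)  # quadratic truth + noise

cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_mse = {}
for degree in [1, 2, 5]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()  # average error over the 10 folds
    print("degree", degree, "CV MSE:", round(cv_mse[degree], 3))
```

The degree with the lowest CV error (here degree 2, since the truth is quadratic) is the one cross-validation would select, even though the exact error values vary slightly with the fold assignment.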

5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation

So far, we have seen that LOOCV is almost unbiased because each training set uses nearly all
the data (n−1 observations). In comparison, k-fold CV uses fewer observations in each training
set (around (k−1)n/k), so its estimates have a bit more bias. From this perspective, LOOCV might
look like the better choice.

However, bias is not the only issue; we also need to consider variance. LOOCV tends to have
higher variance than k-fold CV. This happens because in LOOCV, each of the n models is
trained on nearly identical datasets, so their results are highly correlated. Averaging many highly
correlated results does not reduce variance much. In contrast, k-fold CV with k<n trains on
more distinct subsets, so the models are less correlated with each other. This leads to a lower
variance in the final error estimate.

In other words, LOOCV has low bias but high variance, while k-fold CV (with k = 5 or 10) balances
the two by introducing a small amount of bias but reducing variance significantly. Empirical
studies show that 5-fold or 10-fold CV usually gives the most reliable estimates of test
error, avoiding both the high variance of LOOCV and the high bias of the simple validation set
approach.

5.1.5 Cross-Validation on Classification Problems

Up to this point, cross-validation was illustrated in regression problems where the outcome Y
is quantitative, and test error was measured using mean squared error (MSE). However, cross-
validation is equally useful for classification problems where Y is qualitative. In this case,
instead of MSE, the measure of test error is simply the misclassification rate, the proportion
of observations incorrectly predicted. For example, in LOOCV, the error rate is the average
number of misclassified observations across all folds. The same idea applies to k-fold CV and
the validation set approach.

To illustrate, logistic regression models were fitted on a simulated two-dimensional


classification dataset. A standard logistic regression with a linear decision boundary gave a test
error rate of 0.201, which is higher than the Bayes error rate of 0.133, indicating that the model
lacked flexibility to capture the true boundary. Adding polynomial terms to the logistic
regression improves flexibility: a quadratic model reduced the error rate slightly (to 0.197), while
a cubic model showed a much larger improvement (0.160). However, going further to a quartic
model slightly increased the error rate again (0.162), showing that greater complexity does not
always mean better performance.

In practice, we do not know the Bayes boundary or true test errors, so we rely on cross-
validation to guide model choice. The left panel of Figure 5.8 shows training error, test error, and
10-fold CV error for logistic regression with polynomial terms up to order 10. As expected,
training error decreases with higher flexibility, but test error follows a U-shaped curve: it
decreases initially, then increases due to overfitting. The 10-fold CV curve closely tracks the
test error, even though it slightly underestimates it. Importantly, CV correctly identifies the best
model complexity—here around 3rd or 4th order polynomials.
A similar pattern is seen with the K-nearest neighbors (KNN) classifier (right panel of Figure
5.8). Training error decreases as K becomes smaller (i.e., the model becomes more flexible),
but test error again follows a U-shape. Cross-validation provides a reliable estimate of test error
and helps identify the best value of K. Thus, cross-validation is a powerful tool for model
selection in classification, just as in regression.
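The KNN model selection just described can be sketched as follows (synthetic two-class data, not the simulated data of Figure 5.8):

```python
# A hedged sketch: 10-fold CV misclassification rates for KNN with a few
# values of K, used to pick the best neighborhood size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, random_state=0)

cv_error = {}
for k in [1, 5, 25]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    cv_error[k] = 1 - acc.mean()  # misclassification rate = 1 - accuracy
    print("K =", k, "CV error:", round(cv_error[k], 3))
```

The CV error here is the misclassification rate averaged over the 10 folds; the K with the lowest value would be chosen, exactly as the U-shaped test-error curve suggests.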
19.7 Measuring Classifier Performance

For classification, especially for two-class problems, a variety of measures have been
proposed. There are four possible cases:

● For a positive example, if the prediction is also positive, this is a true positive (TP).
● If the prediction is negative for a positive example, this is a false negative (FN).
● For a negative example, if the prediction is also negative, this is a true negative (TN).

● If we predict a negative example as positive, this is a false positive (FP).

In some two-class problems, we make a distinction between the two classes and hence between
the two types of errors: false positives and false negatives. Different measures, appropriate in
different settings, are described below.
Let us consider an authentication application where users log in by voice. A false positive
means wrongly logging in an impostor, and a false negative means refusing a valid user. It is
clear that the two types of errors are not equally bad; the former is much worse.

● True Positive Rate (TP-rate / Hit Rate): Proportion of valid users authenticated.
● False Positive Rate (FP-rate / False Alarm Rate): Proportion of impostors wrongly
accepted.

Suppose the system returns a probability of the positive class, ˆP(C1|x). Then, for the negative
class, we have ˆP(C2|x) = 1 − ˆP(C1|x). We choose “positive” if ˆP(C1|x) > θ.

● If θ is close to 1, we rarely choose the positive class. This gives almost no false positives
but also very few true positives.
● If we decrease θ, we increase the number of true positives but risk introducing false
positives.

For different values of θ, we can obtain multiple pairs of (TP-rate, FP-rate). By connecting
them, we get the ROC curve.

● Ideally, a classifier has a TP-rate = 1 and FP-rate = 0.


● A classifier is better if its ROC curve is closer to the upper-left corner.
● On the diagonal line, we make as many true decisions as false ones – this is the worst
one can do.

● Any classifier below the diagonal can be improved by flipping its decisions.
If two ROC curves intersect, one classifier may be better under certain loss conditions, while
the other is better under others.
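The threshold behavior described above can be traced with a small hand-made example (hypothetical scores and labels):

```python
# A small worked example of how (TP-rate, FP-rate) pairs trace out an ROC
# curve as the threshold θ on ˆP(C1|x) decreases (hand-made scores).
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])  # ˆP(C1|x)
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0])    # 1 = positive
P, N = labels.sum(), (1 - labels).sum()

roc_points = []
for theta in [0.85, 0.5, 0.1]:
    pred = (scores > theta).astype(int)  # choose "positive" if ˆP(C1|x) > θ
    tp_rate = ((pred == 1) & (labels == 1)).sum() / P
    fp_rate = ((pred == 1) & (labels == 0)).sum() / N
    roc_points.append((tp_rate, fp_rate))
    print("theta =", theta, "TP-rate =", tp_rate, "FP-rate =", fp_rate)
```

As θ drops, both rates rise together: a high threshold gives almost no false positives but few true positives, while a very low threshold accepts everything.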

ROC allows a visual analysis, but we can also reduce it to a single number: the Area Under the
Curve (AUC).

● An ideal classifier has AUC = 1.


● Comparing AUC values of different classifiers gives us their general performance
averaged across loss conditions.

In information retrieval, a system searches a database for relevant records:

● True Positives: Relevant records that are retrieved.


● False Negatives: Relevant records not retrieved.
● False Positives: Non-relevant records that are retrieved.
● Precision = (Retrieved & Relevant) / (Total Retrieved).
● Recall = (Retrieved & Relevant) / (Total Relevant).

As in ROC analysis, for different thresholds, we can plot Precision vs Recall curves.

● Sensitivity (Recall, TP-rate): TP / P → How well positives are identified.


● Specificity: TN / N → How well negatives are identified.
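These measures can be checked with a small worked example, using hypothetical counts for the voice-authentication setting (100 valid users, 100 impostors):

```python
# A minimal worked example of the two-class measures from given counts
# (hypothetical numbers, chosen only to illustrate the formulas).
TP, FN, FP, TN = 90, 10, 5, 95   # P = TP + FN = 100, N = FP + TN = 100

tp_rate = TP / (TP + FN)         # hit rate / recall / sensitivity
fp_rate = FP / (FP + TN)         # false alarm rate
precision = TP / (TP + FP)
specificity = TN / (TN + FP)

print("TP-rate:", tp_rate)       # proportion of valid users authenticated
print("FP-rate:", fp_rate)       # proportion of impostors wrongly accepted
print("Precision:", precision)
print("Specificity:", specificity)
```

Here 90% of valid users are accepted while 5% of impostors slip through; in this application, lowering the FP-rate matters more than raising the TP-rate, since admitting an impostor is the costlier error.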
When K > 2 classes, we can use the class confusion matrix. This is a K × K matrix where entry
(i, j) contains the number of instances belonging to class Ci but assigned to class Cj.

● Ideally, all off-diagonal values are zero (no misclassification).


● The confusion matrix helps identify which classes are most frequently confused.
● Alternatively, one can define K separate two-class problems, each one separating one
class from the rest (one-vs-all classification).
PROGRAM FOR NAÏVE BAYES CLASSIFICATION

from sklearn.naive_bayes import GaussianNB

from [Link] import LabelEncoder

# Sample data (Age, Income)

X = [

['young', 'high'],

['young', 'high'],

['middle', 'medium'],

['old', 'low'],

['old', 'low'],

['old', 'medium'],

['middle', 'medium'],

['young', 'low']

# Labels (target)

y = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes']

# Convert categorical data to numeric using LabelEncoder

le_age = LabelEncoder()

le_income = LabelEncoder()

le_label = LabelEncoder()

# Encode features

age_encoded = le_age.fit_transform([row[0] for row in X])

income_encoded = le_income.fit_transform([row[1] for row in X])

# Combine features into single array

import numpy as np

X_encoded = [Link](list(zip(age_encoded, income_encoded)))

# Encode labels
y_encoded = le_label.fit_transform(y)

# Train Naive Bayes model

model = GaussianNB()

[Link](X_encoded, y_encoded)

# Predict a new sample

# Example: ['middle', 'high']

test = [Link]([[le_age.transform(['middle'])[0],
le_income.transform(['high'])[0]]])

pred = model.predict(test)

# Decode and print result

print("Prediction:", le_label.inverse_transform(pred)[0])

CODE FOR MULTIPLE LINEAR REGRESSION

import numpy as np

import pandas as pd

from [Link] import fetch_openml

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, mean_absolute_error

import matplotlib.pyplot as plt

# 1. Load data

boston = fetch_openml(name="boston", version=1, as_frame=True)

df = boston.frame

# 2. Prepare features (X) and target (y)

X = df.drop(columns=["MEDV"])  # 'MEDV' is the target: median house value

y = df["MEDV"]

# 3. Split into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Train multiple linear regression model

model = LinearRegression()

model.fit(X_train, y_train)

# 5. Make predictions on test set

y_pred = model.predict(X_test)

# 6. Evaluate model performance

mse = mean_squared_error(y_test, y_pred)

mae = mean_absolute_error(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")

print(f"Mean Absolute Error (MAE): {mae:.2f}")

# 7. Visualization (optional)

plt.scatter(y_test, y_pred, alpha=0.7, color="blue")

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color="red", lw=2)

plt.xlabel("True MEDV ($1000s)")

plt.ylabel("Predicted MEDV ($1000s)")

plt.title("True vs Predicted House Prices")

plt.show()

CODE FOR POLYNOMIAL REGRESSION

import numpy as np

from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import PolynomialFeatures

# Example data

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 4, 9, 16, 25])

# Transform features to polynomial features

poly = PolynomialFeatures(degree=2)

X_poly = poly.fit_transform(X)

# Fit polynomial regression model

model = LinearRegression()

model.fit(X_poly, y)

# Predict on new data

X_test = np.array([6, 7, 8]).reshape(-1, 1)

X_test_poly = poly.transform(X_test)

y_pred = model.predict(X_test_poly)

print(y_pred) # Predicted values for X_test

CODE FOR LASSO REGRESSION

import numpy as np

from sklearn.linear_model import Lasso

# Create sample feature data (X) and target values (y)

X = np.array([[1], [6], [2], [7], [8], [9], [3]])

# Feature: e.g., some measurement

y = np.array([1, 3, 2, 5, 7, 8, 7])

# Target: e.g., outcome to predict

# Instantiate the Lasso model

# 'alpha' controls the regularization strength; higher means more regularization

lasso = Lasso(alpha=0.1)

# Fit the model to the data

lasso.fit(X, y)

# Prepare new data for prediction


X_test = np.array([[4], [5]])

# Predict outcomes for new, unseen data

y_pred = lasso.predict(X_test)

print(y_pred)  # Display the predicted values for X=4 and X=5

CODE FOR RIDGE REGRESSION

import numpy as np

import pandas as pd

from sklearn.datasets import fetch_openml

from sklearn.model_selection import train_test_split

from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error, mean_absolute_error

import matplotlib.pyplot as plt

# 1. Load Boston Housing data

boston = fetch_openml(name="boston", version=1, as_frame=True)

df = boston.frame

# 2. Features (X) and Target (y)

X = df.drop(columns=["MEDV"])

y = df["MEDV"]

# 3. Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=42)

# 4. Create and train Ridge Regression model

ridge_model = Ridge(alpha=1.0) # You can tune alpha

ridge_model.fit(X_train, y_train)

# 5. Predict

y_pred = ridge_model.predict(X_test)

# 6. Evaluate
mse = mean_squared_error(y_test, y_pred)

mae = mean_absolute_error(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")

print(f"Mean Absolute Error (MAE): {mae:.2f}")

# 7. Optional: Plot actual vs predicted

plt.scatter(y_test, y_pred, color='blue', alpha=0.6)

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linewidth=2)

plt.xlabel("Actual MEDV")

plt.ylabel("Predicted MEDV")

plt.title("Ridge Regression: Actual vs Predicted")

plt.show()

CODE FOR SVM

from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score

# Load sample dataset (Iris dataset)

iris = datasets.load_iris()

X = iris.data

y = iris.target

# Split dataset into training and testing sets (80% train, 20% test)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=42)

# Create SVM classifier with a linear kernel

svm_classifier = SVC(kernel='linear')

# Train the classifier


svm_classifier.fit(X_train, y_train)

# Predict labels for test set

y_pred = svm_classifier.predict(X_test)

# Calculate and print accuracy

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

CODE FOR NEURAL NETWORK

import numpy as np

# Training data: X input and y output

X = np.array([[0], [1], [2], [3], [4], [5]])

y = np.array([[0], [2], [4], [6], [8], [10]])

# Initialize weight randomly

weight = np.random.rand(1)

# Learning rate

lr = 0.01

# Training loop

for i in range(1000):

    # Prediction: simple linear model y = weight * x
    y_pred = X * weight

    # Calculate error
    error = y - y_pred

    # Update weight with gradient descent
    # (X.T @ error is a 1x1 matrix; .ravel() keeps weight's 1-D shape)
    weight += lr * (X.T @ error).ravel()

print("Learned weight:", weight)

print("Prediction for 6:", weight * 6)
