What is Machine Learning?
Definition
Machine learning is the practice of teaching a computer to learn from data.
Instead of writing if-else rules, we give the computer many examples,
and it figures out the patterns.
Example:
We give it thousands of house listings with size, location, number of
rooms → it learns to predict price.
We give it thousands of labeled emails (spam/not spam) → it learns
what makes an email spammy.
Why Does This Matter?
ML allows us to:
Make predictions (weather, sales)
Classify things (images, diseases, emails)
Find hidden patterns (clustering customers)
Automate decisions (credit approvals, recommendations)
Applications in the Real World:
YouTube: recommends videos based on watch history.
Banks: detect fraud in credit card transactions.
Healthcare: predict patient risk from medical records.
ML vs. Traditional Programming
Traditional: If X, do Y → hard to write rules for every case.
ML: Give data, let the system learn rules itself.
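As a sketch of the difference, here is a hypothetical loan-approval example (assuming scikit-learn is available; the numbers and the helper name are made up for illustration): the traditional version hard-codes the rule, while the ML version infers a rule from past decisions.

```python
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: we write the rule ourselves.
def approve_loan_rule_based(income: float, debt: float) -> bool:
    return income > 50_000 and debt < 10_000

# Machine learning: we show past decisions (data) and let a model infer the rule.
past_applicants = [[60_000, 5_000], [30_000, 2_000], [80_000, 20_000], [45_000, 1_000]]
past_decisions  = [1, 0, 0, 1]  # 1 = approved, 0 = rejected (toy data)

model = DecisionTreeClassifier(random_state=0).fit(past_applicants, past_decisions)

print(approve_loan_rule_based(55_000, 3_000))   # decision from the hand-written rule
print(model.predict([[55_000, 3_000]])[0])      # decision from the learned rule
```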
Datasets — Understanding the Heart of ML
What is a Dataset?
A dataset is structured data we use to teach our machine learning
model.
It’s usually in table format:
◦ Rows → individual examples (data points, records)
◦ Columns → features (input variables) + target (output we want to predict)
Example: House Prices Dataset
Size (sqft) Bedrooms Location Price ($)
1200 3 Suburban 250,000
1500 4 Urban 300,000
900 2 Rural 150,000
•Features (inputs): Size, Bedrooms, Location
•Target (output): Price
In ML, we train the model using these
input → output pairs,
so the model can later predict price for a new house.
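As a minimal sketch (assuming pandas is available), the same listings can be stored as a table and split into features and target; the column names here are just illustrative.

```python
import pandas as pd

# Each row is one house (a data point); each column is a feature or the target.
houses = pd.DataFrame({
    "size_sqft": [1200, 1500, 900],
    "bedrooms":  [3, 4, 2],
    "location":  ["Suburban", "Urban", "Rural"],
    "price":     [250_000, 300_000, 150_000],
})

X = houses[["size_sqft", "bedrooms", "location"]]  # features (inputs)
y = houses["price"]                                # target (output to predict)
print(X.shape, y.shape)  # (3, 3) (3,)
```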
Types of Data in Datasets
Numerical data: Sizes, prices, ages, weights.
Categorical data: Locations (Urban, Suburban, Rural), colors, brands.
Text data: Product reviews, tweets, emails.
Image data: Pixel values.
Understanding Data Quality
•Are there missing values?
→ E.g., house without a listed location.
•Are there outliers?
→ E.g., a tiny house listed at $10 million.
•Is the data balanced?
→ For classification tasks, do we have roughly equal numbers of positive and negative samples?
Example Problem: Predicting Student Performance
Study Hours Attendance (%) Final Grade (Pass/Fail)
20 95 Pass
5 50 Fail
15 80 Pass
Dataset:
Features → Study Hours, Attendance
Target → Final Grade
We can train a model to predict if a student will pass
based on their study habits.
Supervised Learning — Regression
What is Regression?
We predict continuous outcomes (numbers, not categories).
Example:
Predicting house prices.
Predicting temperatures.
Predicting sales revenue.
Linear Regression Explained
We try to fit a line (or plane) through the data points
to minimize prediction error.
Mathematically:
Y = mX + b
We want to find the best m (slope) and b (intercept) such that the predictions are close to the actual values.
Example: Predicting Student Exam Scores Based on Study Hours
Study Hours Exam Score (%)
1 50
2 55
3 65
4 70
5 75
6 80
7 85
8 90
9 95
Linear Regression Formula:
Score = m ⋅ (Hours) + b
Calculations (least squares on the nine points above, with Σx = 45, Σy = 665, Σxy = 3660, Σx² = 285):
m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = (9·3660 − 45·665) / (9·285 − 45²) = 3015 / 540 ≈ 5.58
b = ȳ − m·x̄ = 665/9 − 5.583·5 ≈ 45.97
Final Linear Regression Equation:
y ≈ 5.58x + 45.97
Plug in a value of x (study hours) to get the predicted score.
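As a quick check, the same least-squares fit can be reproduced with NumPy (a sketch, assuming NumPy is available); np.polyfit with degree 1 fits exactly this kind of line.

```python
import numpy as np

hours  = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
scores = np.array([50, 55, 65, 70, 75, 80, 85, 90, 95])

# Least-squares fit of Score = m * Hours + b
m, b = np.polyfit(hours, scores, deg=1)
print(f"Score = {m:.2f} * Hours + {b:.2f}")  # approximately 5.58 * Hours + 45.97

# Plug in a new value of x (e.g., 6.5 study hours) to get a predicted score.
print(m * 6.5 + b)
```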
Supervised Learning — Classification
We predict categories or classes.
Example:
Spam vs. not spam.
Healthy vs. sick.
Pass vs. fail.
Common Algorithms
Logistic Regression: Despite the name, it’s for classification.
Decision Trees: Tree structure making splits at each feature.
K-Nearest Neighbors (KNN): Looks at closest neighbors to classify.
Example: Predicting Pass/Fail
Study Hours Attendance Pass
4 90 0
8 95 1
7 90 1
Logistic Regression (Despite the name, it's for classification)
Purpose: Used for binary or multiclass classification, not regression.
How it works:
It estimates the probability that a given input belongs to a certain class.
It uses the logistic (sigmoid) function to convert any real-valued
number into a value between 0 and 1.
Example (Binary Classification):
Input: Features of an email (e.g., contains "free", all caps, etc.)
Output: Probability it's spam.
If the probability is > 0.5, predict spam (class 1); else, not spam (class 0).
Mathematically:
P(y = 1 | x) = σ(z) = 1 / (1 + e^(−z)), where z = β0 + β1x1 + … + βnxn
Why use it?
It's fast, interpretable, and works well with linearly separable data.
Predict if a student passes (1) or fails (0) based on hours studied
•One feature: hours studied (x)
•Model parameters:
◦ Intercept β0 = −4
◦ Coefficient β1 = 1.5
Logistic (sigmoid) function
σ(z) = 1 / (1 + e^(−z)), with z = β0 + β1·x = −4 + 1.5·(hours studied)

Example 1: x = 1 → z = −4 + 1.5·1 = −2.5 → σ(−2.5) ≈ 0.076 → Fail (0)
Example 2: x = 3 → z = −4 + 1.5·3 = 0.5 → σ(0.5) ≈ 0.622 → Pass (1)
Example 3: x = 5 → z = −4 + 1.5·5 = 3.5 → σ(3.5) ≈ 0.970 → Pass (1)

Summary Table
Hours Studied (x) | z Value | Sigmoid Output (Probability) | Prediction
1 | −2.5 | 0.076 | Fail (0)
3 | 0.5 | 0.622 | Pass (1)
5 | 3.5 | 0.970 | Pass (1)
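The three worked examples can be reproduced in a few lines of plain Python (a sketch; β0 = −4 and β1 = 1.5 are the parameters given above).

```python
import math

beta0, beta1 = -4.0, 1.5  # intercept and coefficient from the example

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for hours in (1, 3, 5):
    z = beta0 + beta1 * hours
    p = sigmoid(z)
    label = "Pass (1)" if p > 0.5 else "Fail (0)"
    print(f"hours={hours}  z={z:+.1f}  probability={p:.3f}  ->  {label}")
# hours=1  z=-2.5  probability=0.076  ->  Fail (0)
# hours=3  z=+0.5  probability=0.622  ->  Pass (1)
# hours=5  z=+3.5  probability=0.970  ->  Pass (1)
```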
Decision Trees (Tree structure making splits at each feature)
Purpose: Can be used for classification or regression.
How it works:
It splits the dataset based on the value of input features.
Each node in the tree represents a decision rule (e.g., "age < 30?").
Each leaf node represents a final class label or value.
Example:
Root: Is income > 50k?
◦ Yes → Is age > 35?
   ◦ Yes → Class A
   ◦ No → Class B
◦ No → Class C
Splitting criteria:
Classification: uses entropy (information gain) or Gini impurity to decide the best split.
Regression: Uses Mean Squared Error (MSE).
Advantages:
Easy to understand and interpret.
Can handle both numerical and categorical data.
Disadvantages:
Can overfit (mitigated by pruning or by ensembles such as Random Forests).
Scenario: Should we play tennis?
We have weather data for 14 days, and we want to decide if we should play tennis based
on the weather conditions.
The dataset looks like this:
Day Weather Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Cloudy Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Cloudy Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Cloudy Mild High Strong Yes
13 Cloudy Hot Normal Weak Yes
14 Rain Mild High Strong No
Step 1: Calculate overall entropy
We have 14 examples:
9 Yes
5 No
Entropy(S) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) ≈ 0.940
Calculate information gain for “Weather”
Split by Weather:
Sunny (5): 2 Yes, 3 No → entropy ≈ 0.971
Cloudy (4): 4 Yes, 0 No → entropy = 0
Rain (5): 3 Yes, 2 No → entropy ≈ 0.971
Gain(Weather) = 0.940 − [(5/14)·0.971 + (4/14)·0 + (5/14)·0.971] ≈ 0.247
Compare with other attributes
You would similarly compute the information gain for Temperature,
Humidity, and Wind.
It turns out Weather has the highest information gain, so the decision
tree will split on Weather at the root.
Build the tree

           [Weather]
          /    |     \
      Sunny  Cloudy   Rain
        |      |       |
 (check more) Yes  (check more)

Entropy measures the impurity or uncertainty in the dataset.
Information Gain tells us how much knowing an attribute reduces entropy.
We choose the attribute with the highest information gain to split at each node.
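The entropy and information-gain numbers above can be verified with a short script (a sketch in plain Python; the class counts are read off the 14-day table).

```python
import math

def entropy(counts):
    """Entropy of a label distribution, e.g. [9, 5] for 9 Yes / 5 No."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

overall = entropy([9, 5])                                   # ≈ 0.940

# Split the 14 days by Weather and weight each branch's entropy by its size.
weighted = 5/14 * entropy([2, 3]) + 4/14 * entropy([4, 0]) + 5/14 * entropy([3, 2])
gain_weather = overall - weighted                           # ≈ 0.247

print(round(overall, 3), round(gain_weather, 3))
```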
K-Nearest Neighbors (KNN) — Looks at closest neighbors to classify
Purpose: Classification (or regression) based on proximity to training data.
How it works:
For a new input, KNN:
◦ Calculates the distance from this point to all points in the training set.
◦ Finds the K closest points (neighbors).
◦ Assigns the most common class (for classification) or average value (for
regression) among those K neighbors.
Distance metrics commonly used:
Euclidean distance (for continuous features)
Manhattan distance
Minkowski distance
Example (K=3):
You're trying to classify a fruit as apple or orange.
You look at the 3 closest fruits in your dataset.
If 2 are apples and 1 is orange, you classify the new fruit as apple.
Advantages:
Simple and intuitive.
No training phase — all work happens during prediction.
Disadvantages:
Slow with large datasets (since it calculates distance to all training
points).
Sensitive to irrelevant or unscaled features.
Example: Predicting Fruit Type (Apple = 0, Orange = 1)
Point (ID) | Feature 1 (Weight) | Feature 2 (Color Score) | Class (Label)
A | 150 | 0.8 | 0 (Apple)
B | 170 | 0.9 | 0 (Apple)
C | 140 | 0.7 | 0 (Apple)
D | 130 | 0.4 | 1 (Orange)
E | 160 | 0.6 | 1 (Orange)
New point to predict:
•Weight = 155, Color Score = 0.65
Calculate Euclidean distance

Point | Weight difference² | Color score difference² | Distance d
A (150, 0.8) | (155 − 150)² = 25 | (0.65 − 0.8)² = 0.0225 | √(25 + 0.0225) ≈ 5.00
B (170, 0.9) | (155 − 170)² = 225 | (0.65 − 0.9)² = 0.0625 | √(225 + 0.0625) ≈ 15.00
C (140, 0.7) | (155 − 140)² = 225 | (0.65 − 0.7)² = 0.0025 | √(225 + 0.0025) ≈ 15.00
D (130, 0.4) | (155 − 130)² = 625 | (0.65 − 0.4)² = 0.0625 | √(625 + 0.0625) ≈ 25.00
E (160, 0.6) | (155 − 160)² = 25 | (0.65 − 0.6)² = 0.0025 | √(25 + 0.0025) ≈ 5.00
Select K nearest neighbors (let's use K = 3)
Sorted by distance:
A → 5.0 → Class 0
E → 5.0 → Class 1
B and C → 15.0 → Class 0 (tie: either can be the third neighbor)
So, the 3 neighbors are:
A (Apple, 0)
E (Orange, 1)
B or C (Apple, 0)
Majority vote
Neighbors’ classes: 0, 1, 0
→ Majority: Apple (0)
Final Prediction:
✅ The new point is classified as Apple (0)
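The same prediction can be reproduced with scikit-learn (a sketch, assuming scikit-learn is available); with K = 3 it returns the Apple class.

```python
from sklearn.neighbors import KNeighborsClassifier

# Training points: [weight, color score] with labels 0 = Apple, 1 = Orange.
X = [[150, 0.8], [170, 0.9], [140, 0.7], [130, 0.4], [160, 0.6]]
y = [0, 0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[155, 0.65]])[0])  # 0 -> Apple
# Note: in practice, scale the features first; weight (in the hundreds) dominates
# color score (0 to 1) in the raw Euclidean distance.
```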
Summary Table

Algorithm | Type | Key Idea | Strengths | Weaknesses
Logistic Regression | Classification | Uses probability with sigmoid | Fast, interpretable | Assumes linear decision boundary
Decision Tree | Classification / Regression | Splits data using feature thresholds | Easy to interpret, handles mixed data | Prone to overfitting
KNN | Classification / Regression | Uses nearest neighbors | Simple, no training needed | Slow prediction, sensitive to scaling
Bayes' Theorem and Naive Bayes in Machine Learning
What is Bayes' Theorem?
Bayes’ Theorem is a method of calculating conditional probabilities:
P(A|B) = P(B|A) · P(A) / P(B)
Where:
P(A∣B) is the posterior probability: probability of A given B.
P(B∣A) is the likelihood.
P(A) is the prior probability.
P(B) is the evidence.
What is Naive Bayes?
Naive Bayes is a supervised machine learning algorithm based on
Bayes’ Theorem, with the naive assumption that features are
independent given the class label.
This simplifies the computation:
P(C | X) ∝ P(C) · P(x1 | C) · P(x2 | C) · ⋯ · P(xn | C)
Where:
•C is the class.
•X=(x1,x2,...,xn) are the features.
Real-Life Example 1: Email Spam Detection
Feature | Description
Words | Presence of words like "Buy", "Free", "Money"
Class | Spam or Not Spam
Naive Bayes:
•Learns probability of words occurring in spam and non-spam.
•Classifies new email based on presence of words.
Why Naive Bayes?
Even though words aren't truly independent (some often appear together), the
assumption works well in practice.
Real-Life Example 2: Weather Prediction

Weather | Temp | Humidity | Wind | Play
Sunny | Hot | High | Weak | No
Sunny | Cool | Normal | Strong | Yes
Rainy | Mild | Normal | Weak | Yes

Given a new condition (Sunny, Cool, Normal, Weak), Naive Bayes computes
P(Play = Yes | condition) and P(Play = No | condition) and predicts the class with the higher probability.
Real-Life Example 3: Customer Purchase Behavior
Suppose you're analyzing online shoppers.
Feature | Description
Time on website | Short or Long
Referral source | Google, Facebook, Email
Purchase made | Yes or No
Naive Bayes can predict:
Given a user from Facebook who spent a long time on the website,
will they buy?
Suggested Student Activity for Lab
Ask students to collect 10 sample emails and manually label them as
"Spam" or "Not Spam." Use word frequency as features and build a
simple Naive Bayes classifier (manually or in code).
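As a starting point for the lab, here is a minimal sketch using scikit-learn (assumed available); the emails and labels below are made up and should be replaced with the students' own samples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled emails (1 = Spam, 0 = Not Spam) -- replace with real samples.
emails = [
    "Win free money now", "Limited offer, buy now",
    "Meeting moved to 3pm", "Can you review my report?",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()            # word-frequency features
X = vectorizer.fit_transform(emails)

clf = MultinomialNB().fit(X, labels)      # learns P(word | class) and P(class)
print(clf.predict(vectorizer.transform(["free offer, buy now"])))  # [1] -> spam
```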
Random Forest
Introduction
Random Forest is an ensemble machine learning algorithm that builds
multiple decision trees and merges their outputs for better accuracy
and stability.
It is used for classification and regression tasks.
Why Random Forest?
Decision Trees are easy to interpret but prone to overfitting.
Random Forest mitigates overfitting by using multiple trees and
averaging their outputs.
“Wisdom of the crowd” – combining many models leads to better
generalization.
Core Concepts
Ensemble Learning
Combines predictions of multiple models (weak learners) to form a stronger predictor.
Two main types:
◦ Bagging (Bootstrap Aggregating): Random Forest uses this.
◦ Boosting: trains models sequentially, each new model focusing on the errors of the previous ones (e.g., AdaBoost, Gradient Boosting).
Bagging in Random Forest
Data is bootstrapped: sampled with replacement to create multiple datasets.
A decision tree is trained on each bootstrap sample.
Each tree makes a prediction → majority vote (classification) or average (regression).
Random Feature Selection
At each split in the tree, only a random subset of features is considered.
This ensures decorrelated trees, improving accuracy and reducing overfitting.
How Random Forest Works – Step-by-Step
Choose the number of trees (e.g., 100).
For each tree:
◦ Sample data with replacement (bootstrapping).
◦ Build a decision tree using a random subset of features at each split.
Make predictions:
◦ For classification → take majority vote.
◦ For regression → take the mean prediction.
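A minimal sketch with scikit-learn (assumed available), using a synthetic dataset purely for illustration; n_estimators is the number of trees and max_features controls the random feature subset at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, just to demonstrate the API.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each trained on a bootstrap sample, with a random subset of features per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_.round(2))
```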
Advantages of Random Forest
High accuracy, especially for large datasets.
Handles missing data and categorical features well.
Reduces variance and is less prone to overfitting than a single tree.
Provides feature importance.
Scales well to large datasets and high-dimensional data.
Limitations
Slower and more memory-intensive than a single decision tree.
Less interpretable than individual trees.
Might overfit on noisy data (if not tuned properly).
Applications
•Medical diagnosis
•Financial risk prediction
•Fraud detection
•Customer segmentation
•Stock market prediction
•Image classification (with feature vectors)
Bagging – Bootstrap Aggregating
Introduction
Bagging, short for Bootstrap Aggregating, is an ensemble learning
technique designed to improve the accuracy and stability of machine
learning algorithms, especially high-variance models like decision trees.
What is Bagging?
Bagging is an ensemble technique that trains multiple instances of the
same model on different random subsets of the training data (with
replacement) and combines their predictions.
It's mainly used to reduce variance and avoid overfitting.
How Bagging Works (Step-by-Step)
•Create multiple bootstrap samples from the original dataset.
•Each sample is created by random sampling with replacement.
•Train a separate model (e.g., a decision tree) on each bootstrap
sample.
•Combine predictions from all models:
•Classification: Use majority voting.
•Regression: Use average of predictions.
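The same steps as a sketch with scikit-learn's BaggingClassifier (assuming a recent scikit-learn, where the base model argument is named estimator): several decision trees, each trained on a bootstrap sample, combined by majority vote.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # built-in binary classification dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # model trained on each bootstrap sample
    n_estimators=50,                     # number of bootstrap samples / models
    bootstrap=True,                      # sample with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("test accuracy:", bagging.score(X_test, y_test))
```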
Real-World Applications
Fraud detection (less bias, high generalization)
Medical diagnosis (low variance needed)
Financial forecasting
Email spam classification
Multiple Linear Regression (Multi-Feature Regression)
What is Multiple Linear Regression?
Multiple Linear Regression is a statistical method used to model the
relationship between one dependent variable and two or more
independent variables by fitting a linear equation to observed data.
Equation Form
The general equation for multiple linear regression is:
y = β0 + β1x1 + β2x2 + ⋯ + βnxn + ε
Where:
y = dependent variable (target)
x1, x2, …, xn = independent variables (features)
β0 = intercept
β1, β2, …, βn = coefficients
ε = error term
Use Cases
•Predicting house prices using size, location, age, etc.
•Estimating sales using marketing budget, season, etc.
•Forecasting salary based on education, experience, city, etc.
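A sketch of fitting such a model with scikit-learn (assumed available); the houses below are made-up numbers that mirror the house-price use case (size and age as features, price as target).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses: columns are size (sqft) and age (years); target is price ($).
X = np.array([[1200, 10], [1500, 5], [900, 30], [1800, 2], [1100, 20]])
y = np.array([250_000, 320_000, 150_000, 400_000, 210_000])

model = LinearRegression().fit(X, y)
print("intercept (beta0):", round(model.intercept_, 1))
print("coefficients (beta1, beta2):", model.coef_.round(1))
print("predicted price for a 1300 sqft, 15-year-old house:", model.predict([[1300, 15]])[0])
```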
Assumptions of Multiple Linear Regression
Linearity: The relationship between dependent and independent
variables is linear.
Independence: Observations are independent of each other.
Homoscedasticity: Constant variance of errors.
No Multicollinearity: Independent variables are not highly correlated.
Normality: Errors are normally distributed.
Support Vector Machines (SVM)
Introduction
Support Vector Machine (SVM) is a powerful supervised learning
algorithm used for classification and regression tasks, but it's mostly
known for classification.
SVM aims to find the optimal decision boundary (hyperplane) that best
separates classes in the feature space.
Intuition Behind SVM
Imagine plotting two categories of data in space. There are many possible
lines that can separate them, but SVM looks for the best one: the one that
maximizes the margin, i.e., the distance between the boundary and the
closest points of each class (the support vectors).
SVM Terminology
•Hyperplane: A decision boundary that separates classes.
•Support Vectors: Data points closest to the hyperplane.
•Margin: The gap between support vectors of different classes.
•Kernel: A function used to transform non-linearly separable data into a
higher dimension where it becomes linearly separable.
SVM Objective (Mathematics)
For binary classification with labels yᵢ ∈ {−1, +1}, the (hard-margin) SVM finds the weights w and bias b that
minimize (1/2)·‖w‖²  subject to  yᵢ(w·xᵢ + b) ≥ 1 for every training point,
which is equivalent to maximizing the margin 2/‖w‖ between the two classes.
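A minimal classification sketch with scikit-learn's SVC (assumed available), using synthetic 2-D points for illustration; kernel="linear" corresponds to the linear hyperplane described above.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of 2-D points standing in for two classes.
X, y = make_blobs(n_samples=100, centers=2, random_state=7)

# A linear kernel searches for the separating hyperplane with the widest margin.
svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)

print("support vectors per class:", svm.n_support_)  # the points that define the margin
print("training accuracy:", svm.score(X, y))
```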