HO CHI MINH CITY UNIVERSITY OF TRANSPORT
INSTITUTE OF INFORMATION TECHNOLOGY, ELECTRICAL AND ELECTRONIC ENGINEERING

Chapter 2
Supervised Learning

Nguyen Thi Khanh Tien, Ph.D.


Introduction to Supervised Learning
Supervised learning is a machine learning technique
where a model is trained on labeled data. This means
that the model is provided with both input data and its
corresponding correct output. By analyzing these
examples, the model learns to predict the output for new,
unseen data.

Supervised learning algorithms aim to identify a pattern or relationship between the input and output variables, effectively mapping input data (x) to output data (y). This learned mapping function can then be used to make predictions on new data.

Real-world applications of supervised learning include tasks such as risk assessment, image classification, fraud detection, spam filtering, and more.
How Supervised Learning Works
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (data held out from training), and then it predicts the output.
Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and hexagons. The first step is to train the model for each shape:
• If the given shape has four sides, and all the sides are equal, then it will be labelled as a square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
After training, we test our model using the test set, and the task of the model is to identify the shape.
The machine is already trained on all types of shapes, so when it encounters a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
Steps Involved in Supervised Learning

1. Data Collection and Preparation:
● Gather data: Collect a dataset that is relevant to the problem you want to solve.
● Clean and preprocess: Remove noise, handle missing values, and normalize or standardize the data to ensure consistency.
● Labeling: Assign correct labels or outputs to each data point.

2. Choose a Model:
● Select an algorithm: Consider factors like the nature of your data (e.g., numerical, categorical), the desired output (e.g., classification, regression), and computational resources.
● Common algorithms: Examples include linear regression, logistic regression, decision trees, random forests, support vector machines (SVMs), and neural networks.

3. Split the Data:
● Training set: Use this portion of the data to train the model.
● Validation set: Evaluate the model's performance during training and fine-tune hyperparameters.
● Test set: Assess the model's final performance on unseen data.

4. Training:
● Iterative process: The model learns from the training data by adjusting its parameters to minimize the difference between its predicted outputs and the actual labels.
● Optimization: Use techniques like gradient descent to find the optimal parameter values.

5. Evaluation:
● Metrics: Use appropriate metrics to measure the model's performance on the validation and test sets.
● Common metrics: Examples include accuracy, precision, recall, F1-score, mean squared error (MSE), and root mean squared error (RMSE).

6. Deployment:
● Integration: Once satisfied with the model's performance, deploy it into a production environment.
● Monitoring: Continuously monitor the model's performance and retrain it if necessary to adapt to changing data or conditions.
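The sketch below illustrates the split/train/evaluate workflow with scikit-learn. The iris dataset, the 60/20/20 split ratios, and the choice of logistic regression are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of the supervised learning workflow, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# First carve out a held-out test set, then split the rest into train/validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)   # step 2: choose a model
model.fit(X_train, y_train)                 # step 4: training

print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```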
Advantages and Disadvantages of Supervised Learning

Advantages:
● Supervised learning resolves various computation issues encountered in the real world, including spam detection, object and image identification, and many more.
● Supervised learning uses past experience to optimize performance and predict outputs.
● The training data can be reused unless there is any feature change.

Disadvantages:
● Computation time (running time) can be huge for supervised learning.
● Supervised learning models frequently need updates.
● Pre-processing of data is a big challenge for predicting the output.
● Supervised algorithms can easily be overfit; this happens when a statistical model matches its training data too closely.
Types of Supervised Learning
Based on the given dataset, supervised machine learning problems are categorized into two types: Classification and Regression.

Classification vs. Regression:
● Goal — Classification: to predict categorical outcomes. Regression: to predict continuous numerical values.
● Output type — Classification: categorical (e.g., yes/no, class labels). Regression: continuous (e.g., numbers).
● Evaluation metrics — Classification: accuracy, precision, recall, F1-score, ... Regression: mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE), R-squared.
● Examples — Classification: email spam detection (spam or not spam), image recognition (cat, dog, or other), customer churn prediction (churn or not churn). Regression: house price prediction, stock price forecasting, sales prediction.

Common algorithms:
● Classification: K-Nearest Neighbors, Logistic Regression, Support Vector Machines, Naive Bayes Classifier, Decision Trees, Random Forest, Neural Networks.
● Regression: Linear Regression, Support Vector Regression, Neural Network Regression, Decision Tree Regression, Lasso Regression, Ridge Regression.
Machine Learning for Classification
Classification is a supervised learning technique
used to predict the category or class of new data
points based on a set of labeled training
examples. In classification, a model learns
patterns from the training data to categorize
unseen observations into predefined classes or
groups.
There are two types of Classifications:
• Binary classification: Predicting outcomes
with only two possible classes (e.g., spam or
not spam, male or female, yes or no).
• Multi-class classification: Predicting
outcomes with more than two possible classes
(e.g., classifying types of crops, classifying
types of music).
Classification algorithms are widely used
in various fields, including:
• Healthcare: Diagnosing diseases
• Finance: Fraud detection
• Marketing: Customer segmentation
Classification Metrics in Machine Learning

Confusion Matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table with combinations of predicted and actual values.
True Positive (TP): We predicted positive and it's true.
True Negative (TN): We predicted negative and it's true.
False Positive (FP, Type 1 Error): We predicted positive and it's false.
False Negative (FN, Type 2 Error): We predicted negative and it's false.

Precision explains how many of the predicted positive cases actually turned out to be positive.
Precision = TP / (TP + FP)

Recall explains how many of the actual positive cases we were able to predict correctly with our model.
Recall = TP / (TP + FN)

F1 Score is the harmonic mean of precision and recall.
F1 = 2 * (Precision * Recall) / (Precision + Recall)

F1 Score can be an effective evaluation metric in the following cases:
- When FP and FN are equally costly.
- Adding more data doesn't effectively change the outcome.
- True Negatives are high.
Classification Metrics in Machine Learning

ROC / AUC. The Receiver Operating Characteristic (ROC) curve provides an elegant way of presenting the multiple confusion matrices produced at different thresholds. A ROC curve plots the relationship between the true positive rate and the false positive rate.

True positive rate = Recall = Sensitivity = TP / (TP + FN)
False positive rate = 1 − Specificity = FP / (FP + TN)
Classification Metrics in Machine Learning
Scikit-learn metrics
Classification Algorithms

Common Classification Algorithms by Type

Linear Classifiers:
● Logistic Regression: A statistical model that predicts the probability of a categorical outcome.
● Support Vector Machines (SVMs): A set of supervised learning methods that create hyperplanes to separate data points into different classes.

Decision Trees and Ensembles:
● Decision Trees: A tree-like model where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.
● Random Forests: An ensemble of decision trees, where each tree is built on a random subset of the data and features.
● Gradient Boosting Machines (GBM): An ensemble method that builds models sequentially, each correcting the errors of the previous model.

Naive Bayes:
● Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming independence between features.

K-Nearest Neighbors (KNN):
● K-Nearest Neighbors: A non-parametric classification algorithm that assigns a class to a data point based on the majority class of its k nearest neighbors.

Neural Networks:
● Artificial Neural Networks (ANNs): A computational model inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers.
● Convolutional Neural Networks (CNNs): Specialized ANNs for processing and analyzing image data.
● Recurrent Neural Networks (RNNs): ANNs designed to handle sequential data, such as text or time series.
Scikit-learn classifiers
K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a simple yet effective
supervised machine learning algorithm that classifies or
predicts data points based on their proximity to nearby
examples in the training data.

How KNN Works:
1. Given a new data point: calculate its distance to all data points in the training set.
2. Select the k nearest neighbors: identify the k closest training examples to the new data point.
3. Assign a class or predict a value: for classification, assign the class that is most frequent among the k nearest neighbors.

Distance metrics and formulas:
● Euclidean distance (Minkowski with p = 2): d(x, y) = √(Σ(xi − yi)²)
● Manhattan distance (Minkowski with p = 1): d(x, y) = Σ|xi − yi|
● Minkowski distance: d(x, y) = (Σ|xi − yi|^p)^(1/p)
● Hamming distance: the number of positions at which the corresponding components differ (used for categorical or binary features).

Key Parameters:
● k-value: the number of neighbors considered for classification or prediction. A smaller k-value can lead to overfitting, while a larger k-value can result in underfitting.
● Distance metric: the method used to calculate the distance between data points. Common metrics include Euclidean distance, Manhattan distance, and Minkowski distance.
K-Nearest Neighbors

Advantages of KNN:
● Simplicity: Easy to understand and implement.
● Non-parametric: Doesn't make assumptions about the data distribution.
● Versatility: Can be used for both classification and regression.
● Accuracy: Can achieve high accuracy for well-structured data.

Disadvantages of KNN:
● Computational cost: Can be slow for large datasets.
● Sensitivity to k-value: Choosing the right k-value is crucial.
● Curse of dimensionality: Can be less effective in high-dimensional spaces.
● Sensitive to noise: Noisy data can influence the results.

Applications of KNN:
● Image recognition: Classifying images based on similar features.
● Recommendation systems: Suggesting items based on user preferences.
● Customer segmentation: Grouping customers based on their behavior.
● Medical diagnosis: Predicting diseases based on patient data.

In scikit-learn:
1. Nearest neighbors classifiers (KNeighborsClassifier, NearestCentroid) → [Link]
   The iris dataset → [Link]
2. Example: how to use KNeighborsClassifier → [Link]
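Below is a minimal sketch of KNN classification with scikit-learn (not the linked example); the iris dataset, k = 5, and the 80/20 split are illustrative assumptions.

```python
# KNN classification on iris; metric="minkowski" with p=2 gives Euclidean distance.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)                       # stores the training points
print("test accuracy:", knn.score(X_test, y_test))
```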
Logistic Regression
Logistic regression is a statistical model used to predict the probability of a binary outcome (e.g., yes/no, true/false, 1/0). It's a widely used technique in various fields, including machine learning, statistics, and data science.

How Logistic Regression Works:
- Input: The model takes a set of predictor variables (features) as input.
- Linear Combination: The model calculates a linear combination of the input features, weighted by coefficients.
- Sigmoid Function: The result of the linear combination is passed through a sigmoid function (also known as a logistic function). This function maps any real value to a value between 0 and 1.
- Probability Estimation: The output of the sigmoid function is interpreted as the probability of the positive outcome.

The Sigmoid Function:
sigmoid(x) = 1 / (1 + e^(-x))
where x is the linear combination of the input features and e is Euler's number (approximately 2.71828).

Applications of Logistic Regression:
- Predicting customer churn: Determining whether a customer is likely to stop using a product or service.
- Credit scoring: Assessing the risk of a loan default.
- Medical diagnosis: Predicting the presence or absence of a disease.
- Email spam filtering: Identifying spam emails based on their content.
- Sentiment analysis: Determining the sentiment (positive, negative, or neutral) of a text.
Logistic Regression

Advantages of Logistic Regression:
- Simplicity: Logistic regression is relatively easy to understand and implement.
- Interpretability: The coefficients of the model can be interpreted to understand the importance of each predictor variable.
- Efficiency: The model is computationally efficient, making it suitable for large datasets.
- Versatility: Logistic regression can be extended to handle multi-class classification problems using techniques like one-vs-rest or multinomial logistic regression.
- Robustness: It's less sensitive to outliers compared to some other classification algorithms.

Disadvantages of Logistic Regression:
- Assumption of linearity: Logistic regression assumes a linear relationship between the predictor variables and the log odds of the outcome. If this assumption is violated, the model's performance may suffer.
- Limited to binary or categorical outcomes: Logistic regression is primarily designed for binary or categorical outcomes. For continuous outcomes, other regression techniques like linear regression or generalized linear models might be more appropriate.
- Can't handle multicollinearity: If the predictor variables are highly correlated (multicollinearity), it can lead to unstable coefficients and difficulty in interpreting the model.
- May not perform well with non-linear relationships: If the relationship between the predictors and the outcome is highly non-linear, logistic regression might not capture the underlying patterns effectively.
- Sensitive to outliers: While generally robust, logistic regression can still be affected by outliers, especially if they are influential points.

In scikit-learn:
- class sklearn.linear_model.LogisticRegression → [Link]
- Example: Logistic Regression 3-class Classifier → [Link]
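A minimal sketch of binary logistic regression with scikit-learn follows; the breast-cancer dataset and the feature-scaling step are illustrative choices, not part of the slides.

```python
# Binary logistic regression: the sigmoid is applied internally to w·x + b.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]     # estimated probability of the positive class
print("test accuracy:", clf.score(X_test, y_test))
print("first 5 predicted probabilities:", probs[:5].round(3))
```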
Support Vector Machines (SVMs)
Support Vector Machines (SVMs) are a powerful machine learning algorithm used for classification and regression tasks. They are particularly effective in high-dimensional spaces and are known for their ability to handle complex decision boundaries. The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM. The dimensions of the hyperplane depend on the features present in the dataset: if there are 2 features, the hyperplane is a straight line; if there are 3 features, the hyperplane is a 2-dimensional plane.

Support Vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Support Vector Machines (SVMs)
How SVMs Work:

1. Feature Mapping: The SVM maps the input data into a higher-dimensional feature space. This mapping can be linear or non-linear, depending on the kernel function used.
2. Hyperplane Separation: The SVM finds the optimal
hyperplane (a decision boundary) that separates the data points
into different classes with the maximum margin. This margin is
the distance between the hyperplane and the nearest data points,
known as support vectors.
3. Classification: New data points are classified based on which
side of the hyperplane they fall on.

Kernel Functions:

The choice of kernel function determines the type of mapping into the higher-dimensional
feature space. Common kernel functions include:

● Linear kernel: A simple kernel that maps the data linearly.


● Polynomial kernel: A kernel that introduces polynomial terms into the feature space.
● Radial basis function (RBF) kernel: A non-linear kernel that maps the data into an
infinite-dimensional feature space.
Support Vector Machines (SVMs)
Non-Linear SVM. If data is linearly arranged, we can separate it with a straight line, but for non-linear data we cannot draw a single straight line. To separate such data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we add a third dimension z, which can be calculated as: z = x² + y².
With this extra dimension, the SVM divides the dataset into classes with a separating hyperplane. Since we are in 3-D space, this boundary looks like a plane parallel to the x-axis; if we convert it back to 2-D space with z = 1, it becomes a circle of radius 1.
Support Vector Machines (SVMs)

Advantages of SVMs:
- Effective in high-dimensional spaces: SVMs can handle complex decision boundaries in high-dimensional data.
- Robust to outliers: SVMs are less sensitive to outliers due to their focus on the support vectors.
- Versatile: SVMs can be used for both classification and regression tasks.
- Efficient: SVMs can be efficient for large datasets, especially when using kernel tricks.

Disadvantages of SVMs:
- Computational complexity: Training SVMs can be computationally expensive for large datasets, especially with non-linear kernels.
- Choice of kernel: Selecting the appropriate kernel function can be challenging.
- Sensitivity to hyperparameters: SVMs have hyperparameters (like the kernel function and regularization parameter) that need to be tuned for optimal performance.

In scikit-learn:
- SVMs for classification: SVC, NuSVC and LinearSVC → [Link]
- Examples:
  1. Plot different SVM classifiers in the iris dataset → [Link]
  2. SVM with custom kernel → [Link]
  3. RBF SVM parameters → [Link]
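The sketch below shows an SVM classifier with an RBF kernel in scikit-learn; the iris dataset and the C/gamma values are illustrative assumptions and are not tuned.

```python
# SVC with an RBF kernel; features are standardized because the kernel is distance-based.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```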
Naive Bayes Classification
Naive Bayes is a probabilistic classification algorithm based on
Bayes' theorem, which assumes that features are independent given
the class. While this independence assumption is often violated in
real-world data, Naive Bayes can still perform surprisingly well in
many cases.

How Naive Bayes Works

1. Calculate Probabilities:
- Prior probability: The probability of each class occurring
independently of the features.
- Conditional probability: The probability of a feature
occurring given a particular class.
2. Apply Bayes' Theorem:
- Using Bayes' theorem, calculate the posterior probability of
each class given the observed features.
- The class with the highest posterior probability is predicted as
the most likely class.
The primary types of Naive Bayes classifiers

1. Gaussian Naive Bayes
● Assumes features are normally distributed.
● Suitable for continuous numerical data.
● Calculates conditional probabilities based on the mean and standard deviation of each feature within each class.

2. Multinomial Naive Bayes
● Designed for count data (e.g., word frequencies in text documents).
● Assumes features follow a multinomial distribution.
● Calculates conditional probabilities based on the frequency of each feature within each class.

3. Bernoulli Naive Bayes
● Suitable for binary features (e.g., presence or absence of a word in a document).
● Assumes features follow a Bernoulli distribution.
● Calculates conditional probabilities based on the probability of a feature being present or absent within each class.
Naive Bayes Classification

Advantages of Naive Bayes:
- Simplicity: Easy to implement and understand.
- Efficiency: Can handle large datasets efficiently.
- Robustness: Can perform well even with noisy or missing data.

Disadvantages of Naive Bayes:
- Independence Assumption: The assumption of feature independence can be violated in many real-world scenarios.
- Sensitivity to Zero Counts: If a feature-class combination has zero occurrences in the training data, the conditional probability becomes zero, leading to an incorrect prediction.

Applications of Naive Bayes:
- Text classification: Spam filtering, sentiment analysis, topic modeling
- Recommendation systems: Suggesting items based on user preferences
- Medical diagnosis: Predicting diseases based on symptoms
- Weather prediction: Forecasting weather conditions

In scikit-learn:
→ Naive Bayes → [Link]
→ Example: classifier comparison → [Link]
Decision Tree Classification
Decision Trees are a popular machine learning
algorithm often used for both classification and regression
tasks.
In the context of classification, they create a tree-like
model where each internal node represents a test on an
attribute (e.g., "Is age greater than 30?"), each branch
represents the possible outcomes of the test, and each leaf
node represents a class label.
The decision tree is a distribution-free or
non-parametric method which does not depend upon
probability distribution assumptions.
Decision trees can handle high-dimensional data with
good accuracy.

Practice:
→ [Link]
Decision Tree Algorithm

The basic idea behind any decision tree algorithm is as follows:

1. Select the best attribute using an Attribute Selection Measure (ASM) to split the records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Build the tree by repeating this process recursively for each child until one of these conditions is met:
● All the tuples belong to the same class value.
● There are no more remaining attributes.
● There are no more instances.
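Below is a minimal sketch of a decision tree classifier in scikit-learn; the iris dataset, the Gini criterion as the ASM, and max_depth=3 are illustrative assumptions.

```python
# A small decision tree; export_text prints the learned splits as text.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # textual view of the attribute tests at each node
```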
Random Forest Classification
Random Forest is a powerful ensemble learning method that combines multiple
decision trees to make predictions. It's particularly effective for classification tasks
due to its ability to handle large datasets, reduce overfitting, and provide feature
importance.

How Does Random Forest Work?


- Bootstrap Sampling: The algorithm randomly selects subsets of data from the
original dataset with replacement. This creates multiple "bootstrap samples."
- Decision Tree Growth: Each bootstrap sample is used to grow a decision tree. The
trees are grown to their maximum size without pruning.
- Prediction: To make a prediction for a new data point, the algorithm passes it
through each decision tree and collects the predicted class from each.
- Voting: The most frequent class among all the predictions from the individual trees
becomes the final prediction.

Key Parameters in Random Forest


- Number of Trees: The number of decision trees in the forest.
- Maximum Depth: The maximum depth of each decision tree.
- Minimum Samples Split: The minimum number of samples required to split an
internal node.
- Minimum Samples Leaf: The minimum number of samples required to be at a
leaf node.
- Bootstrap: Whether bootstrap sampling is used.
Practice:
→ [Link]
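As a minimal sketch (separate from the linked practice), the snippet below trains a random forest with scikit-learn; the breast-cancer dataset and the parameter values are illustrative, and n_estimators=100 matches scikit-learn's default.

```python
# Random forest: bootstrap samples + per-split feature subsets, majority vote at predict time.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=None, bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
print("largest feature importance:", rf.feature_importances_.max().round(3))
```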
Difference Between Random Forest and Decision Tree
Objective Function in Regression
The objective function in regression is a mathematical expression that quantifies the "error" or "distance" between the predicted values and the actual values. It serves as a target that the regression model aims to minimize.

The choice of objective function depends on the specific characteristics of the regression problem and the desired properties of the model. Consider the following factors:
● Sensitivity to outliers: If your data contains outliers, MAE or Huber Loss might be more suitable than MSE.
● Interpretability: RMSE is often preferred for interpretability as it is in the same units as the target variable.
● Optimization: MSE is generally easier to optimize due to its differentiability.

Common Objective Functions:

1. Mean Squared Error (MSE):
- Calculates the average squared difference between predicted and actual values.
- Formula: MSE = 1/n * Σ(yi - ŷi)²
- Advantages: Easy to compute, differentiable, and widely used.
- Disadvantages: Sensitive to outliers due to squaring.

2. Mean Absolute Error (MAE):
- Calculates the average absolute difference between predicted and actual values.
- Formula: MAE = 1/n * Σ|yi - ŷi|
- Advantages: Less sensitive to outliers than MSE.
- Disadvantages: Not differentiable at zero, making optimization more challenging.

3. Root Mean Squared Error (RMSE):
- The square root of the MSE.
- Formula: RMSE = √(1/n * Σ(yi - ŷi)²)
- Advantages: Interpretable in the same units as the target variable.
- Disadvantages: Same as MSE regarding sensitivity to outliers.

4. Huber Loss:
- Combines the advantages of MSE and MAE by using a quadratic loss for small errors and a linear loss for large errors.
- Formula (per data point, with threshold δ):
  Huber(yi, ŷi) = 1/2 * (yi - ŷi)²        if |yi - ŷi| ≤ δ
  Huber(yi, ŷi) = δ * (|yi - ŷi| - δ/2)   otherwise
- Advantages: Robust to outliers while maintaining differentiability.
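The sketch below implements the four objective functions above with NumPy; the y_true/y_pred arrays and the threshold delta = 1.0 are illustrative assumptions.

```python
# MSE, MAE, RMSE and Huber loss implemented directly from their formulas.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.5])

def mse(y, yhat):  return np.mean((y - yhat) ** 2)
def mae(y, yhat):  return np.mean(np.abs(y - yhat))
def rmse(y, yhat): return np.sqrt(mse(y, yhat))

def huber(y, yhat, delta=1.0):
    err = np.abs(y - yhat)
    quad = 0.5 * err ** 2                  # quadratic branch for small errors
    lin  = delta * (err - 0.5 * delta)     # linear branch for large errors
    return np.mean(np.where(err <= delta, quad, lin))

print("MSE:", mse(y_true, y_pred), "MAE:", mae(y_true, y_pred),
      "RMSE:", round(rmse(y_true, y_pred), 4), "Huber:", round(huber(y_true, y_pred), 4))
```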
Overfitting and Underfitting in Machine Learning
Overfitting
Overfitting occurs when a model learns the training data too well, capturing noise and irrelevant
patterns. This leads to a model that performs exceptionally well on the training data but poorly on
new, unseen data.
Characteristics of Overfitting:
- High performance on training data: The model achieves very high accuracy or low error on
the training set.
- Poor performance on validation/test data: The model's performance significantly drops
when evaluated on unseen data.
- Complex model: The model may have too many parameters or features, making it prone to
memorizing the training data instead of learning underlying patterns.
Underfitting
Underfitting happens when a model is too simple to capture the underlying patterns in the data. This
results in a model that performs poorly on both the training and validation/test data.
Characteristics of Underfitting:
- Poor performance on both training and validation/test data: The model consistently
achieves low accuracy or high error on both sets.
- Simple model: The model may have too few parameters or features, limiting its ability to
learn complex relationships.
Addressing Overfitting and Underfitting

1. Regularization:
- L1 regularization (Lasso): Introduces a penalty term that encourages sparsity,
meaning many model parameters become zero. This can help prevent overfitting
by reducing the complexity of the model.
- L2 regularization (Ridge): Adds a penalty term that discourages large
parameter values, preventing the model from becoming too sensitive to small
changes in the input data.
2. Cross-validation: Splits the data into multiple folds and trains the model on
different subsets to evaluate its performance on unseen data. This helps identify
overfitting or underfitting early in the modeling process.
3. Early stopping: Monitors the model's performance on a validation set during
training and stops the training process when performance starts to deteriorate,
preventing overfitting.
4. Feature engineering: Creating new features or transforming existing ones can
improve the model's ability to capture relevant patterns and reduce overfitting.
5. Simpler model: If the model is overfitting, consider using a simpler model with
fewer parameters.
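To illustrate point 2 (cross-validation) together with regularization strength, here is a minimal sketch; the dataset, the L2-regularized logistic regression, and the C values are illustrative assumptions.

```python
# 5-fold cross-validation to compare weakly vs strongly regularized models.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# C is the inverse of the regularization strength: small C = strong L2 penalty.
for C in (100.0, 1.0, 0.01):
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"C={C:>6}: mean CV accuracy = {scores.mean():.3f}")
```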
Loss Function in Machine Learning
A loss function, also known as a cost function or objective function, is a
mathematical function that quantifies the "error" or "distance" between the
predicted values and the actual values in a machine learning model. It serves as a
target that the model aims to minimize during the training process.
Commonly-used loss functions in machine learning

Choosing the Right Loss Function:


The choice of loss function depends on the specific characteristics of the
problem and the desired properties of the model. Consider the following factors:
- Type of problem: Regression or classification?
- Sensitivity to outliers: Is the data noisy or prone to outliers?
- Interpretability: Do you need the error to be in the same units as the
target variable?
Regularized Loss Minimization
Regularized loss minimization is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function. This penalty term is designed to discourage overly complex models, which are more prone to overfitting.

Loss Function
A loss function quantifies the "error" or "distance" between the predicted values and the actual values. Common loss functions include:
● Mean Squared Error (MSE)
● Mean Absolute Error (MAE)
● Cross-Entropy Loss

Regularization Term
The regularization term is added to the loss function to penalize large parameter values. The two most common types of regularization are:
1. L1 Regularization (Lasso):
○ Adds the absolute value of the parameters to the loss function.
○ Encourages sparsity, meaning many parameters become zero.
○ Can be useful for feature selection.
2. L2 Regularization (Ridge):
○ Adds the square of the parameters to the loss function.
○ Discourages large parameter values.
○ Can help prevent overfitting by reducing the variance of the model.

Overall Objective
The goal of regularized loss minimization is to minimize the following objective function:
Loss(w) + λ * R(w)
where:
● Loss(w) is the loss function.
● λ is the regularization parameter, which controls the strength of the regularization.
● R(w) is the regularization term (L1 or L2).

Benefits of Regularized Loss Minimization
● Prevents overfitting: By penalizing large parameter values, regularization can help prevent models from becoming too complex and memorizing the training data.
● Improves generalization: Regularized models tend to generalize better to unseen data.
● Feature selection: L1 regularization can be used for feature selection by setting many parameters to zero.
Hyper-parameter Tuning
Hyperparameters are settings that are not learned from the data but are set before training a machine learning model. They control the behavior of the learning algorithm. Tuning these hyperparameters can significantly impact a model's performance.

Common Hyperparameters:
● Learning rate: Controls how quickly the model adjusts its parameters during training.
● Regularization strength: Controls the amount of regularization applied to the model (e.g., L1 or L2 regularization).
● Number of hidden layers and neurons: Determines the complexity of a neural network.
● Batch size: The number of samples used in each training iteration.
● Epochs: The number of times the entire dataset is passed through the model during training.

Hyperparameter Tuning Techniques:
1. Grid Search:
○ Defines a grid of hyperparameter values and trains a model for each combination.
○ Time-consuming for large search spaces.
2. Random Search:
○ Randomly samples hyperparameter values from a specified distribution.
○ Often more efficient than grid search for large search spaces.
3. Bayesian Optimization:
○ Uses a probabilistic model to build a surrogate function of the objective function.
○ Explores the search space more intelligently by focusing on promising regions.
4. Evolutionary Algorithms:
○ Inspired by biological evolution, these algorithms use concepts like selection, crossover, and mutation to evolve a population of hyperparameter configurations.
Hyperparameter Tuning Techniques with Scikit-learn
1. Grid Search
2. Random Search
3. Bayesian Optimization
4. Evolutionary Algorithms
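A minimal sketch of the first two techniques with scikit-learn follows; the SVC estimator, the parameter grid, and the sampled distributions are illustrative assumptions.

```python
# Grid search vs. random search over SVC hyperparameters with 5-fold CV.
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print("grid search best:", grid.best_params_, round(grid.best_score_, 3))

rand = RandomizedSearchCV(SVC(),
                          param_distributions={"C": loguniform(1e-2, 1e2),
                                               "gamma": loguniform(1e-3, 1e1)},
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)
print("random search best:", rand.best_params_, round(rand.best_score_, 3))
```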
Reference:

1) [Link]
2) [Link]
Machine learning for Regression
Regression is a statistical method used to model the
relationship between a dependent variable (the outcome)
and one or more independent variables (predictors). In
machine learning, regression algorithms are employed to
predict continuous numerical values.
Regression evaluation metrics are used to evaluate the performance of regression models, such as MSE, RMSE, MAE, MAPE, and others. They quantify how well a model's predictions align with the actual values.
Regression Applications
- Predicting Sales: Forecasting future sales based on
historical data.
- Stock Price Prediction: Predicting future stock prices.
- Demand Forecasting: Predicting the demand for a
product or service.
- Real Estate Price Prediction: Estimating the price of a
property based on its features.
- Customer Lifetime Value Prediction: Predicting the
total revenue a customer will generate over their lifetime.
Regression evaluation metrics

Mean Squared Error (MSE): MSE = 1/n * Σ(yi - ŷi)²
Measures the average squared difference between predicted and actual values. Lower MSE indicates better performance.

Root Mean Squared Error (RMSE): RMSE = √(MSE)
The square root of MSE, often preferred because it is in the same units as the target variable.

Mean Absolute Error (MAE): MAE = 1/n * Σ|yi - ŷi|
Measures the average absolute difference between predicted and actual values. Less sensitive to outliers than MSE.

R-squared (R²): R² = 1 - SSres / SStot
Measures the proportion of variance in the dependent variable explained by the independent variables. Higher R² indicates a better fit.

Adjusted R-squared: Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
Penalizes R² for adding unnecessary predictors.

Mean Absolute Percentage Error (MAPE): MAPE = 100/n * Σ|yi - ŷi| / |yi|
Measures the average percentage error between predicted and actual values. Useful for comparing models on different scales.

Weighted Mean Squared Error (WMSE): WMSE = 1/n * Σ wi * (yi - ŷi)²
Assigns different weights to errors based on the importance of the data points.
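As a minimal sketch, the metrics above can be computed with scikit-learn; the y_true/y_pred arrays are small illustrative values.

```python
# Regression metrics with sklearn.metrics; MAPE is returned as a fraction, so x100 for percent.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE: ", round(mse, 3))
print("RMSE:", round(np.sqrt(mse), 3))
print("MAE: ", round(mean_absolute_error(y_true, y_pred), 3))
print("MAPE:", round(100 * mean_absolute_percentage_error(y_true, y_pred), 2), "%")
print("R²:  ", round(r2_score(y_true, y_pred), 3))
```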
Machine Learning Regression Algorithm

Common Regression Algorithms

● Linear Regression: A simple model that assumes a linear


relationship between the dependent and independent variables.
● Polynomial Regression: A more flexible model that can
capture non-linear relationships by fitting a polynomial curve
to the data.
● Ridge Regression: A regularization technique that adds a
penalty term to the loss function to prevent overfitting.
● Lasso Regression: Another regularization technique that uses
a different penalty term to encourage sparsity in the model,
meaning it can select the most important features.
● Elastic Net: A combination of Ridge and Lasso regression,
offering the benefits of both.
● Decision Trees: Can be used for regression by making
predictions based on the average values of the target variable
in the leaf nodes.
● Random Forest: An ensemble method that combines multiple
decision trees to improve accuracy and reduce overfitting.
● Support Vector Machines (SVM): Can be used for
regression by finding a hyperplane that maximizes the margin
between the data points.
● Neural Networks: Deep learning models that can learn
complex non-linear relationships between the features and the
target variable.
Linear Regression: A Foundation in Machine Learning
Linear Regression is a statistical method used to model the relationship between a dependent
variable (the outcome) and one or more independent variables (predictors). It assumes a linear
relationship between the variables and aims to find the best-fitting line to represent this
relationship.
The equation of a linear regression model is:
y = mx + b
where:
- y is the dependent variable (outcome)
- x is the independent variable (predictor)
- m is the slope of the line
- b is the y-intercept
Finding the Best-Fit Line. The goal of linear regression is to find the values of m and b that
minimize the error between the predicted values and the actual values. This is often done
using the least squares method, which calculates the line that minimizes the sum of the
squared differences between the predicted and actual values.

Types of Linear Regression:
- Simple Linear Regression: Involves a single independent variable.
  ● Equation: y = mx + b
  ● Example: Predicting house prices based on the size of the house.
- Multiple Linear Regression: Involves multiple independent variables.
  ● Equation: y = b0 + b1*x1 + b2*x2 + ... + bn*xn
  ● Example: Predicting car prices based on factors like mileage, age, brand, and features.
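The sketch below fits a simple linear regression with scikit-learn; the synthetic data (roughly y = 3x + 4 plus noise) is an illustrative assumption.

```python
# Simple linear regression: recover slope m and intercept b from noisy data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3 * x[:, 0] + 4 + rng.normal(scale=1.0, size=50)

reg = LinearRegression().fit(x, y)
print("slope m ≈", round(reg.coef_[0], 2), " intercept b ≈", round(reg.intercept_, 2))
print("prediction at x=5:", round(reg.predict([[5.0]])[0], 2))
```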
Ordinary Least Squares (OLS)
Ordinary Least Squares (OLS) is the most common and simplest method of linear
regression. It aims to find the best-fitting line that minimizes the sum of the squared
differences between the predicted and actual values.
In OLS, the goal is to minimize the sum of squared residuals (SSR):

SSR = Σ(yi - ŷi)²


where:
yi - the observed value of the dependent variable.
ŷi - the predicted value of the dependent variable.

The OLS estimates of the regression coefficients (slope and intercept) are obtained by solving the normal equations. In matrix form, with design matrix X and target vector y, the coefficient vector is:

ŵ = (XᵀX)⁻¹ Xᵀ y

and the predicted values are given by ŷ = X ŵ.

Using scikit-learn:
- LinearRegression → [Link]
- Linear Regression example → [Link]
- A Comprehensive Guide to OLS Regression → [Link]
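Below is a minimal sketch of OLS solved directly via the normal equations with NumPy; the synthetic data (roughly y = 2x + 1 plus noise) is an illustrative assumption.

```python
# OLS from the normal equations: solve (X'X) w = X'y instead of inverting X'X explicitly.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=40)
y = 2 * x + 1 + rng.normal(scale=0.3, size=40)

X = np.column_stack([np.ones_like(x), x])    # design matrix with an intercept column
w = np.linalg.solve(X.T @ X, X.T @ y)
print("intercept ≈", round(w[0], 2), " slope ≈", round(w[1], 2))

y_hat = X @ w
print("SSR =", round(np.sum((y - y_hat) ** 2), 3))
```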
Lasso Regression
LASSO (Least Absolute Shrinkage and Selection Operator) is a type of regression analysis that uses L1 regularization to prevent overfitting and potentially select important features. It's particularly useful when dealing with high-dimensional data where many features may be irrelevant or redundant.
Regularization: LASSO adds a penalty term to the loss function to prevent overfitting.
Feature Selection: LASSO can automatically select important features by setting some coefficients to zero.

The LASSO regression objective function is given by: J(w) = Σ(yi - ŷi)² + α * Σ|wj|
where:
- yi is the observed value.
- ŷi is the predicted value.
- α is the regularization parameter.
- wj are the regression coefficients.
- α * Σ|wj| is the L1 regularization term. It penalizes the absolute values of the coefficients. As α increases, more coefficients are shrunk towards zero, leading to feature selection.

Advantages of LASSO
- Feature Selection: LASSO automatically selects important features, leading to more interpretable models.
- Prevention of Overfitting: The regularization term helps prevent overfitting by reducing the complexity of the model.
- Computational Efficiency: LASSO can be computationally efficient for large datasets.

Disadvantages of LASSO
- Bias: LASSO can introduce bias into the model, especially when the true model is dense (has many non-zero coefficients).
- Inconsistent Feature Selection: The features selected by LASSO can be inconsistent across different runs or datasets.

Using scikit-learn:
→ Lasso Regression → [Link]
→ Practice → [Link]
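The sketch below shows LASSO driving irrelevant coefficients to exactly zero; α = 0.1 and the synthetic data (only 2 of 10 features informative) are illustrative assumptions.

```python
# Lasso with L1 penalty: coefficients of uninformative features are set to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 8 irrelevant features

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("coefficients:", lasso.coef_.round(2))
print("features kept:", np.flatnonzero(lasso.coef_ != 0))
```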
Ridge Regression
Ridge Regression is another regularization technique used to prevent overfitting in linear regression models. It's similar to LASSO but uses a different penalty term, which leads to different properties.
Regularization: Ridge adds a penalty term to the loss function to prevent overfitting.

The Ridge regression objective function is given by: J(w) = Σ(yi - ŷi)² + α * Σ(wj²)
where:
- yi is the observed value.
- ŷi is the predicted value.
- α is the regularization parameter.
- wj are the regression coefficients.
- α * Σ(wj²) is the L2 regularization term.

Advantages of Ridge Regression
- Prevention of Overfitting: The regularization term helps prevent overfitting by reducing the variance of the coefficients.
- Numerical Stability: Ridge regression is numerically stable, even when the features are highly correlated.
- No Feature Selection: Ridge regression does not set any coefficients to zero, which can be useful when all features are believed to be relevant.

Disadvantages of Ridge Regression
- No Feature Selection: Ridge regression does not perform feature selection, which can be a disadvantage when dealing with high-dimensional data.
- Bias: Ridge regression can introduce bias into the model, especially when the true model is sparse (has many zero coefficients).

In scikit-learn:
→ Ridge Regression → [Link]
→ Practice → [Link]
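For contrast with the LASSO sketch above, the snippet below fits Ridge on the same kind of synthetic data; α = 1.0 is an illustrative choice.

```python
# Ridge with L2 penalty: coefficients shrink toward zero but none become exactly zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("coefficients:", ridge.coef_.round(2))
print("coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))
```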
Lasso Regression vs Ridge Regression
LASSO (Least Absolute Shrinkage and Selection Operator) and Ridge Regression are both regularization techniques used to prevent overfitting in linear regression models. While they share similarities, they have distinct characteristics and applications.

Feature comparison:
- Penalty term: LASSO uses Σ|wj|; Ridge uses Σ(wj²).
- Feature selection: LASSO yes; Ridge no.
- Coefficient shrinkage: LASSO sets some coefficients to zero; Ridge shrinks all coefficients towards zero.
- Bias-variance trade-off: LASSO higher bias, lower variance; Ridge lower bias, higher variance.

When to Use Which
- LASSO:
  ○ When feature selection is desired.
  ○ When the true model is believed to be sparse (has many zero coefficients).
- Ridge Regression:
  ○ When all features are believed to be relevant.
  ○ When overfitting is a concern and feature selection is not necessary.

Hybrid Approaches
- Elastic Net: Combines the L1 and L2 penalties, offering a balance between feature selection and shrinkage.
- Adaptive LASSO: Adapts the regularization parameter for each feature based on their initial estimates.
Polynomial Regression
Polynomial Regression is a form of regression analysis in which the relationship between the independent variables and the dependent variable is modeled as an nth-degree polynomial. For multiple variables, the model is expressed and solved in matrix form.

Cost Function of Polynomial Regression.


Cost function measures a model's performance by calculating the average error
between its predictions and actual values (loss function for each data point). Think
of cost function as the overall "grade" and loss function as individual "scores."

Polynomial regression can improve model fit by creating a curved line that
better matches your data, potentially reducing the cost function's value.
Higher-order polynomials can achieve even more precise fits.
To minimize the cost function and improve model performance, we can use
gradient descent. This algorithm iteratively adjusts the model's weights to
reduce the error between predicted and actual values.
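The slide mentions gradient descent; as a minimal sketch, the pipeline below instead uses scikit-learn's PolynomialFeatures with the closed-form least-squares fit of LinearRegression. The degree-2 setting and the synthetic quadratic data are illustrative assumptions.

```python
# Polynomial regression: expand x into [1, x, x^2], then fit a linear model on those features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + 2 + rng.normal(scale=0.3, size=60)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("R² on training data:", round(model.score(x, y), 3))
```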
Polynomial Regression
Gradient Descent for Polynomial Regression → [Link]
Support Vector Regression
Exercises
