Chapter 2 Supervised Learning
Confusion Matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table of combinations of predicted and actual values.
- True Positive (TP): we predicted positive and it is true.
- True Negative (TN): we predicted negative and it is true.
- False Positive (FP, Type 1 Error): we predicted positive and it is false.
- False Negative (FN, Type 2 Error): we predicted negative and it is false.
ROC / AUC. The Receiver Operating Characteristic (ROC) graph provides an elegant way of presenting multiple confusion matrices produced at different thresholds. A ROC curve plots the relationship between the true positive rate and the false positive rate; the Area Under the Curve (AUC) summarizes the curve in a single number.
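As a minimal illustration (not taken from the slides), the sketch below computes a confusion matrix and an ROC/AUC score with scikit-learn; the synthetic dataset and the logistic-regression model are assumptions chosen only for demonstration.

# Minimal sketch: confusion matrix and ROC/AUC (illustrative data and model).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# Synthetic binary classification data (assumption: any labeled dataset works here).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows = actual classes, columns = predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_test, y_pred))

# ROC uses predicted probabilities at many thresholds; AUC summarizes the curve.
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC:", roc_auc_score(y_test, y_score))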
Linear Classifiers
● Logistic Regression: A statistical model that predicts the probability of a categorical outcome.
● Support Vector Machines (SVMs): A set of supervised learning methods that create hyperplanes to separate data points into different classes.

Decision Trees and Ensembles
● Decision Trees: A tree-like model where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.
● Random Forests: An ensemble of decision trees, where each tree is built on a random subset of the data and features.
● Gradient Boosting Machines (GBM): An ensemble method that builds models sequentially, each correcting the errors of the previous model.

Naive Bayes
● Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming independence between features.

K-Nearest Neighbors (KNN)
● K-Nearest Neighbors: A non-parametric classification algorithm that assigns a class to a data point based on the majority class of its k nearest neighbors.

Neural Networks
● Artificial Neural Networks (ANNs): A computational model inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers.
● Convolutional Neural Networks (CNNs): Specialized ANNs for processing and analyzing image data.
● Recurrent Neural Networks (RNNs): ANNs designed to handle sequential data, such as text or time series.
Scikit-learn classifiers
K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a simple yet effective
supervised machine learning algorithm that classifies or
predicts data points based on their proximity to nearby
examples in the training data.
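A minimal usage sketch follows; the Iris dataset and k = 5 are illustrative assumptions, not choices from the slides.

# Minimal sketch: KNeighborsClassifier on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# k = 5 neighbors; a test point gets the majority class among its 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))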
Logistic Regression

Advantages:
- Simplicity: Logistic regression is relatively easy to understand and implement.
- Interpretability: The coefficients of the model can be interpreted to understand the importance of each predictor variable.
- Efficiency: The model is computationally efficient, making it suitable for large datasets.
- Versatility: Logistic regression can be extended to handle multi-class classification problems using techniques like one-vs-rest or multinomial logistic regression.
- Robustness: It's less sensitive to outliers compared to some other classification algorithms.

Disadvantages:
- Assumption of linearity: Logistic regression assumes a linear relationship between the predictor variables and the log odds of the outcome. If this assumption is violated, the model's performance may suffer.
- Limited to binary or categorical outcomes: Logistic regression is primarily designed for binary or categorical outcomes. For continuous outcomes, other regression techniques like linear regression or generalized linear models might be more appropriate.
- Can't handle multicollinearity: If the predictor variables are highly correlated (multicollinearity), it can lead to unstable coefficients and difficulty in interpreting the model.
- May not perform well with non-linear relationships: If the relationship between the predictors and the outcome is highly non-linear, logistic regression might not capture the underlying patterns effectively.
- Sensitive to outliers: While generally robust, logistic regression can still be affected by outliers, especially if they are influential points.

In scikit-learn:
- class sklearn.linear_model.LogisticRegression
  → [Link]#sklearn.linear_model.LogisticRegression
- Example: Logistic Regression 3-class Classifier
  → [Link]#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py
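In the same spirit as the 3-class iris example referenced above, here is a minimal sketch using sklearn.linear_model.LogisticRegression; the dataset split and max_iter value are illustrative assumptions.

# Minimal sketch: multi-class logistic regression on Iris.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multinomial logistic regression; the coefficients can be inspected for interpretability.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Coefficients shape:", clf.coef_.shape)   # (n_classes, n_features)
print("Test accuracy:", clf.score(X_test, y_test))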
Support Vector Machines (SVMs)
Support Vector Machines (SVMs) are a powerful machine learning algorithm used for classification and regression tasks. They are particularly effective in high-dimensional spaces and are known for their ability to handle complex decision boundaries. The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM. The dimensions of the hyperplane depend on the features present in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.

Support Vectors: The data points (vectors) that are closest to the hyperplane and that affect its position are called support vectors. Because these vectors "support" the hyperplane, they are called support vectors.
How SVMs Work:

Kernel Functions: The choice of kernel function determines the type of mapping into the higher-dimensional feature space. Common kernel functions include the linear, polynomial, radial basis function (RBF), and sigmoid kernels.

(Figure omitted: non-linearly separable 2-D data mapped into a third dimension, where the separating hyperplane appears as a plane parallel to the x-axis; projected back into 2-D, the decision boundary becomes a curve.)
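To make the effect of the kernel concrete, the following sketch compares the standard SVC kernels on data that is not linearly separable; the make_circles dataset and default hyperparameters are illustrative assumptions and do not reproduce the slides' figure.

# Minimal sketch: comparing SVM kernels on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    # The RBF kernel typically separates the concentric circles well; the linear kernel cannot.
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))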
Advantages of SVMs:
- Effective in high-dimensional spaces: SVMs can handle complex decision boundaries in high-dimensional data.
- Robust to outliers: SVMs are less sensitive to outliers due to their focus on the support vectors.
- Versatile: SVMs can be used for both classification and regression tasks.
- Efficient: SVMs can be efficient for large datasets, especially when using kernel tricks.

Disadvantages of SVMs:
- Computational complexity: Training SVMs can be computationally expensive for large datasets, especially with non-linear kernels.
- Choice of kernel: Selecting the appropriate kernel function can be challenging.
- Sensitivity to hyperparameters: SVMs have hyperparameters (like the kernel function and regularization parameter) that need to be tuned for optimal performance.

In scikit-learn:
- SVMs for classification: SVC, NuSVC and LinearSVC
  → [Link]
- Examples:
  1. Plot different SVM classifiers in the iris dataset
     → [Link]#sphx-glr-auto-examples-svm-plot-iris-svc-py
  2. SVM with custom kernel
     → [Link]#sphx-glr-auto-examples-svm-plot-custom-kernel-py
  3. RBF SVM parameters
     → [Link]
Naive Bayes Classification
Naive Bayes is a probabilistic classification algorithm based on
Bayes' theorem, which assumes that features are independent given
the class. While this independence assumption is often violated in
real-world data, Naive Bayes can still perform surprisingly well in
many cases.
1. Calculate Probabilities:
- Prior probability: The probability of each class occurring
independently of the features.
- Conditional probability: The probability of a feature
occurring given a particular class.
2. Apply Bayes' Theorem:
- Using Bayes' theorem, calculate the posterior probability of
each class given the observed features.
- The class with the highest posterior probability is predicted as
the most likely class.
The primary types of Naive Bayes classifiers are Gaussian Naive Bayes (for continuous features), Multinomial Naive Bayes (for count data such as word frequencies), and Bernoulli Naive Bayes (for binary features).
Advantages of Naive Bayes
- Simplicity: Easy to implement and understand.
- Efficiency: Can handle large datasets efficiently.
- Robustness: Can perform well even with noisy or missing data.

Disadvantages of Naive Bayes
- Independence Assumption: The assumption of feature independence can be violated in many real-world scenarios.
- Sensitivity to Zero Counts: If a feature-class combination has zero occurrences in the training data, the conditional probability becomes zero, leading to an incorrect prediction.

Applications
- Text classification: Spam filtering, sentiment analysis, topic modeling.
- Recommendation systems: Suggesting items based on user preferences.
- Medical diagnosis: Predicting diseases based on symptoms.
- Weather prediction: Forecasting weather conditions.

In scikit-learn:
→ [Link]
→ Example: [Link]r_comparison.html
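A minimal sketch with scikit-learn's GaussianNB follows; the Iris dataset is an illustrative assumption, and the closing comment points at MultinomialNB's Laplace smoothing as one way to handle the zero-count issue mentioned above.

# Minimal sketch: Gaussian Naive Bayes on Iris (continuous features).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()              # class priors and per-class feature likelihoods are estimated from the data
nb.fit(X_train, y_train)
print("Class priors:", nb.class_prior_)
print("Test accuracy:", nb.score(X_test, y_test))

# For word counts in text classification, MultinomialNB (with Laplace smoothing,
# alpha=1.0 by default) avoids the zero-count problem.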
Decision Tree Classification
Decision Trees are a popular machine learning
algorithm often used for both classification and regression
tasks.
In the context of classification, they create a tree-like
model where each internal node represents a test on an
attribute (e.g., "Is age greater than 30?"), each branch
represents the possible outcomes of the test, and each leaf
node represents a class label.
The decision tree is a distribution-free or
non-parametric method which does not depend upon
probability distribution assumptions.
Decision trees can handle high-dimensional data with
good accuracy.
Practice:
→ [Link]
Decision Tree Algorithm
1. Select the best attribute using an Attribute Selection Measure (ASM) to split the records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Build the tree by repeating this process recursively for each child until one of the following conditions is met:
● All the tuples belong to the same attribute value (class).
● There are no remaining attributes.
● There are no remaining instances.
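A minimal sketch of this procedure with scikit-learn follows; the Iris dataset, the entropy criterion, and max_depth=3 are illustrative assumptions.

# Minimal sketch: DecisionTreeClassifier with an attribute-selection measure (Gini or entropy).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# criterion="entropy" corresponds to information gain; "gini" is the default ASM.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(iris.feature_names)))   # text view of the learned splits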
Random Forest Classification
Random Forest is a powerful ensemble learning method that combines multiple
decision trees to make predictions. It's particularly effective for classification tasks
due to its ability to handle large datasets, reduce overfitting, and provide feature
importance.
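As a minimal sketch of these points, the example below trains a random forest and reads off its feature importances; the breast-cancer dataset and 100 trees are illustrative assumptions.

# Minimal sketch: RandomForestClassifier with feature importances.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample and considers a random subset of features at each split.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
print("Largest feature importance:", rf.feature_importances_.max())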
Mean Squared Error (MSE):
- Calculates the average squared difference between predicted and actual values.
- Formula: MSE = 1/n * Σ(yi - ŷi)²
- Advantages: Easy to compute, differentiable, and widely used.
- Disadvantages: Sensitive to outliers due to squaring.

Huber Loss:
- Combines the advantages of MSE and MAE by using a quadratic loss for small errors and a linear loss for large errors.
- Formula: Huber Loss = 1/n * Σ L_δ(yi, ŷi), where
  L_δ(yi, ŷi) = (yi - ŷi)² / 2, if |yi - ŷi| ≤ δ
  L_δ(yi, ŷi) = δ * (|yi - ŷi| - δ/2), otherwise
- Advantages: Robust to outliers while maintaining differentiability.
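A small NumPy sketch of the two losses above follows; the sample values and δ = 1.0 are illustrative assumptions.

# Minimal sketch: MSE and Huber loss implemented with NumPy, matching the formulas above.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    quadratic = 0.5 * error ** 2                      # used when |error| <= delta
    linear = delta * (np.abs(error) - 0.5 * delta)    # used for large errors
    return np.mean(np.where(np.abs(error) <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print("MSE:", mse(y_true, y_pred))       # penalizes the larger errors quadratically
print("Huber:", huber(y_true, y_pred))   # grows only linearly for large errors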
Overfitting and Underfitting in Machine Learning
Overfitting
Overfitting occurs when a model learns the training data too well, capturing noise and irrelevant
patterns. This leads to a model that performs exceptionally well on the training data but poorly on
new, unseen data.
Characteristics of Overfitting:
- High performance on training data: The model achieves very high accuracy or low error on
the training set.
- Poor performance on validation/test data: The model's performance significantly drops
when evaluated on unseen data.
- Complex model: The model may have too many parameters or features, making it prone to
memorizing the training data instead of learning underlying patterns.
Underfitting
Underfitting happens when a model is too simple to capture the underlying patterns in the data. This
results in a model that performs poorly on both the training and validation/test data.
Characteristics of Underfitting:
- Poor performance on both training and validation/test data: The model consistently
achieves low accuracy or high error on both sets.
- Simple model: The model may have too few parameters or features, limiting its ability to
learn complex relationships.
Addressing Overfitting and Underfitting
1) [Link]
2) [Link]n-model-hyper-parameters-tuning/
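A minimal sketch of how over- and underfitting show up in practice follows: it compares train and test accuracy for decision trees of increasing depth. The synthetic dataset and the chosen depths are illustrative assumptions.

# Minimal sketch: diagnosing under/overfitting via train-vs-test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 3, 10, None]:   # None lets the tree grow until the leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"depth={depth}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")

# A very shallow tree underfits (both scores low); an unconstrained tree tends to overfit
# (train score near 1.0, test score noticeably lower).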
Machine learning for Regression
Regression is a statistical method used to model the
relationship between a dependent variable (the outcome)
and one or more independent variables (predictors). In
machine learning, regression algorithms are employed to
predict continuous numerical values.
Regression evaluation metrics, such as MSE, RMSE, MAE, and MAPE, are used to evaluate the performance of regression models. They quantify how well a model's predictions align with the actual values.
Regression Applications
- Predicting Sales: Forecasting future sales based on
historical data.
- Stock Price Prediction: Predicting future stock prices.
- Demand Forecasting: Predicting the demand for a
product or service.
- Real Estate Price Prediction: Estimating the price of a
property based on its features.
- Customer Lifetime Value Prediction: Predicting the
total revenue a customer will generate over their lifetime.
Regression evaluation metrics
Mean Squared Error (MSE): MSE = 1/n * Σ(yi - ŷi)²
Measures the average squared difference between predicted and actual values. Lower MSE indicates better performance.

Mean Absolute Percentage Error (MAPE): MAPE = 100 * 1/n * Σ|yi - ŷi| / |yi|
Measures the average percentage error between predicted and actual values. Useful for comparing models on different scales.
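These metrics are available in scikit-learn; the sketch below computes them for a few illustrative values (the numbers are assumptions, not data from the slides).

# Minimal sketch: regression metrics with scikit-learn.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 205.0, 240.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("MAPE:", 100 * mean_absolute_percentage_error(y_true, y_pred), "%")  # returned as a fraction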
Linear Regression (Ordinary Least Squares)

The OLS estimates of the regression coefficients (slope and intercept) are obtained by solving the normal equations:
(XᵀX) β = Xᵀ y, which gives the closed-form solution β̂ = (XᵀX)⁻¹ Xᵀ y,
where X is the design matrix, y is the vector of observed values, and β is the coefficient vector.
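A NumPy sketch of this closed-form solution follows; the synthetic data (true intercept 2, slope 3) is an illustrative assumption.

# Minimal sketch: solving the normal equations (X^T X) beta = X^T y with NumPy.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)    # true intercept 2, slope 3, plus noise

X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]; the column of ones gives the intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # closed-form OLS estimate
print("Estimated intercept and slope:", beta_hat)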
Using scikit-learn:
- LinearRegression
  [Link]
- Linear Regression example
  [Link]examples-linear-model-plot-ols-py
- A Comprehensive Guide to OLS Regression
  [Link]on-part-1/
Lasso Regression
LASSO (Least Absolute Shrinkage and Selection Operator) is a type of regression analysis that uses L1 regularization to prevent overfitting and potentially select important features. It's particularly useful when dealing with high-dimensional data where many features may be irrelevant or redundant.

Regularization: LASSO adds a penalty term to the loss function to prevent overfitting.
Feature Selection: LASSO can automatically select important features by setting some coefficients to zero.

The LASSO regression objective function is given by: J(w) = Σ(yi - ŷi)² + α * Σ|wj|
where:
- yi is the observed value.
- ŷi is the predicted value.
- α is the regularization parameter.
- wj are the regression coefficients.
- α * Σ|wj| is the L1 regularization term. It penalizes the absolute values of the coefficients. As α increases, more coefficients are shrunk towards zero, leading to feature selection.

Using scikit-learn:
→ Lasso Regression: [Link]
→ Practice: [Link]
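The sketch below shows the feature-selection effect in scikit-learn's Lasso; the synthetic dataset and alpha=1.0 are illustrative assumptions.

# Minimal sketch: Lasso shrinking irrelevant coefficients to zero.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 5 of the 20 features are actually informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0)        # alpha is the regularization strength (α above)
lasso.fit(X, y)
print("Non-zero coefficients:", (lasso.coef_ != 0).sum(), "out of", X.shape[1])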
Advantages of LASSO
- Feature Selection: LASSO automatically selects important features, leading to
more interpretable models.
- Prevention of Overfitting: The regularization term helps prevent overfitting by
reducing the complexity of the model.
- Computational Efficiency: LASSO can be computationally efficient for large
datasets.
Disadvantages of LASSO
- Bias: LASSO can introduce bias into the model, especially when the true
model is dense (has many non-zero coefficients).
- Inconsistent Feature Selection: The features selected by LASSO can be
inconsistent across different runs or datasets.
Ridge Regression
Ridge Regression is another regularization technique used to prevent overfitting in linear regression models. It's similar to LASSO but uses a different penalty term, which leads to different properties.

Regularization: Ridge adds a penalty term to the loss function to prevent overfitting.

The Ridge regression objective function is given by: J(w) = Σ(yi - ŷi)² + α * Σ(wj²)
where:
- yi is the observed value.
- ŷi is the predicted value.
- α is the regularization parameter.
- wj are the regression coefficients.
- α * Σ(wj²) is the L2 regularization term.

In scikit-learn:
→ Ridge Regression: [Link]
→ Practice: [Link]
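The sketch below shows Ridge shrinking coefficients as α grows, on nearly collinear features; the synthetic data and the alpha values are illustrative assumptions.

# Minimal sketch: Ridge with correlated features; coefficients shrink but stay non-zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=200)

for alpha in [0.01, 1.0, 100.0]:             # larger alpha => stronger shrinkage
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: coefficients={ridge.coef_}")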
Advantages of Ridge Regression
- Prevention of Overfitting: The regularization term helps prevent overfitting by
reducing the variance of the coefficients.
- Numerical Stability: Ridge regression is numerically stable, even when the
features are highly correlated.
- Keeps all features: Ridge regression does not set any coefficients to zero, which can be useful when all features are believed to be relevant.
Disadvantages of Ridge Regression
- No Feature Selection: Ridge regression does not perform feature selection,
which can be a disadvantage when dealing with high-dimensional data.
- Bias: Ridge regression can introduce bias into the model, especially when the
true model is sparse (has many zero coefficients).
Lasso Regression vs Ridge Regression
LASSO (Least Absolute Shrinkage and Selection Operator) and Ridge Regression are both regularization techniques used to prevent overfitting in linear regression models. While they share similarities, they have distinct characteristics and applications.

Penalty term: LASSO uses Σ|wj| (L1), while Ridge Regression uses Σ(wj²) (L2).
Polynomial regression can improve model fit by creating a curved line that
better matches your data, potentially reducing the cost function's value.
Higher-order polynomials can achieve even more precise fits.
To minimize the cost function and improve model performance, we can use
gradient descent. This algorithm iteratively adjusts the model's weights to
reduce the error between predicted and actual values.
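A minimal NumPy sketch of this idea follows: batch gradient descent on the MSE cost for a degree-2 polynomial. The synthetic data, learning rate, and iteration count are illustrative assumptions.

# Minimal sketch: fitting a degree-2 polynomial with batch gradient descent on the MSE cost.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, size=200)

X = np.column_stack([np.ones_like(x), x, x**2])   # polynomial features [1, x, x^2]
w = np.zeros(3)
lr = 0.1                                          # learning rate (assumed value)

for _ in range(5000):
    error = X @ w - y
    grad = 2 / len(y) * X.T @ error               # gradient of MSE with respect to the weights
    w -= lr * grad                                # step against the gradient

print("Learned coefficients:", w)                 # should approach [1, 2, 3]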
Polynomial Regression

Gradient Descent for Polynomial Regression:
→ [Link]-scratch-279db2936fe9

Polynomial Regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial in x.