
Prepared by

Prince Thomas M.E., PhD


Associate Professor
Chapter 3 Supervised Learning: Nonlinear Models
K-Nearest Neighbors (K-NN)
- Distance Metrics (Euclidean, Manhattan)
- K-Value Selection and Cross-Validation
Neural Networks and Multilayer Perceptrons (MLPs)
- Structure of Neural Networks (Input, Hidden, and Output Layers)
- Activation Functions (ReLU, Sigmoid, Softmax)
- Backpropagation and Gradient Descent in Neural Networks
Decision Trees
- Splitting Criteria: Gini Index, Entropy, and Information Gain
- Overfitting in Decision Trees and Pruning
Random Forests (Introduction to Ensembles)
- Concept of Weak Learners
- Bagging and Random Subspace Sampling
Boosting Techniques (e.g., AdaBoost)
- Boosting Algorithms: AdaBoost, Gradient Boosting
Stacking and Voting Methods
- Model Combination Techniques
- Hard vs. Soft Voting
Nonlinear Models in Machine Learning
- Nonlinear models capture relationships between inputs and outputs that are not linear.
- They can model complex patterns in data that linear models cannot.
Key Characteristics of Nonlinear Models:
Flexibility: Capable of modeling complex, nonlinear relationships.
Complex Patterns: Can fit data with intricate patterns and interactions between variables.
Non-Linear Boundaries: Able to create decision boundaries that are not straight lines, which
is crucial for solving complex classification problems.
Examples of Nonlinear Models:
K-Nearest Neighbors (K-NN):
Overview: Classifies a data point based on the classification of its nearest neighbors.
Nonlinearity: The decision boundary can be highly nonlinear depending on the distribution of
training data and the value of K.
Neural Networks:
Overview: Consist of layers of neurons that can learn complex representations of data.
Nonlinearity: Layers with nonlinear activation functions (like ReLU, Sigmoid) allow modeling
very complex relationships.
Decision Trees:
Overview: Splits the data based on feature values to make predictions.
Nonlinearity: Creates complex, piecewise constant decision boundaries that adapt to
intricate data patterns.

Random Forests:
Overview: An ensemble of decision trees, each trained on different subsets of the data.
Nonlinearity: Combines multiple nonlinear trees to create a more robust model with
improved generalization ability.

Advantages:
Model Complex Relationships: Capable of capturing complex patterns in data.
High Accuracy: Often achieve higher accuracy compared to linear models, especially in
real-world applications with complex data.

Disadvantages:
Computationally Intensive: Training nonlinear models can be resource-intensive.
Risk of Overfitting: More prone to overfitting, especially if not properly regularized.
Decision Tree
• Decision Tree is a Supervised learning technique
• It can be used for both Classification and Regression problems, but it is mostly preferred for
solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision
nodes are used to make decisions and have multiple branches, whereas leaf nodes are
the outputs of those decisions and do not contain any further branches.
• In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
Why use Decision Trees?
• Decision Trees usually mimic human thinking ability while making a decision.
• The logic of a decision tree can be easily understood because it has a tree-like structure.
Decision Tree Terminologies
Root Node: The decision tree starts from the root node. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: A leaf node is a final output node; the tree cannot be split further once a leaf
node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A subtree formed by splitting part of the main tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node, and the
sub-nodes are called its child nodes.
How does the Decision Tree algorithm Work?
Step-1 Begin the tree with the root node, say S, which contains the complete dataset.
Step-2 Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3 Divide S into subsets that contain the possible values of the best attribute.
Step-4 Generate the decision tree node that contains the best attribute.
Step-5 Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; such final nodes are called leaf nodes.
Attribute Selection Measures:
• While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes.
• To solve this problem, a technique called the Attribute Selection Measure (ASM) is used.
• Popular ASM techniques are: Information Gain and the Gini Index.
1. Information Gain:
• Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize information gain: the node/attribute with
the highest information gain is split first.
It can be calculated using the below formula; information gain is a measure of this change in entropy:
Information Gain(S, A) = Entropy(S) − ∑ (|Sv| / |S|) × Entropy(Sv), summed over v ∈ Values(A)
Where,
• S is the set of instances (the whole dataset),
• A is an attribute,
• Values(A) is the set of all possible values of A,
• v is an individual value that attribute A can take,
• Sv is the subset of S for which attribute A has value v.

Entropy: Entropy is a metric that measures the impurity of a given set of samples. It specifies
the randomness in the data, i.e., how mixed the class labels are. Entropy can be
calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = the set of samples, P(yes) = probability of yes, P(no) = probability of no
Example: entropy calculation on a sample dataset (see the worked example below).
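As a worked example, assume a toy dataset of 14 samples in which 9 are labelled "yes" and 5 are labelled "no" (these numbers are assumptions used only for illustration):
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.410 + 0.530 ≈ 0.94
A value close to 1 indicates a highly mixed (impure) set, while a set containing only one class has entropy 0.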
2. Gini Index:
• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high Gini index,
because a low Gini index indicates less impurity, leading to better decision tree splits.
• It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
• Gini index can be calculated using the below formula:
Gini Index = 1 − ∑j Pj²
Pj: the probability of a sample being classified into the j-th class.
∑j Pj²: the sum of the squared probabilities over all classes.
Range: The Gini index ranges from 0 up to 1 − 1/k for k classes (approaching 1 as the number of classes grows).
0: the node is pure (all samples belong to a single class).
Maximum value: samples are spread evenly across the classes (maximum impurity).
Higher Gini Index: indicates greater impurity in the node.
Lower Gini Index: indicates higher purity.
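As a short worked example (assumed numbers, used only for illustration), consider a node with 10 samples, 7 of class A and 3 of class B:
Gini = 1 − (0.7² + 0.3²) = 1 − (0.49 + 0.09) = 0.42
A pure node with all 10 samples in one class would give Gini = 1 − 1² = 0.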
Overfitting in Decision Trees
Definition: Overfitting happens when the model captures too much detail from the
training data, including noise and outliers.
Impact: Leads to poor generalization, meaning the model performs well on training data
but poorly on unseen data.
Cause: Often results from a highly complex tree with many nodes and branches, trying to
perfectly fit the training data.
Consequence: High risk of making incorrect predictions on new data due to the overly
specific patterns learned from the training data.
Controlling Overfitting Through Pruning
Pruning helps reduce complexity by removing branches that don’t contribute significantly to
model accuracy.
1. Pre-pruning (Early Stopping):
• Stops tree growth early based on certain criteria, preventing it from becoming overly
complex.
Common parameters:
Max depth: Limits the depth of the tree.
Minimum samples per leaf: Sets a minimum number of samples for each leaf node.
Minimum samples to split: Specifies the minimum samples required to split a node.
Maximum leaf nodes: Limits the total number of leaf nodes.
Pros: Faster, reduces complexity upfront.
Cons: Risk of underfitting if the tree stops growing too early.
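A minimal sketch of pre-pruning with scikit-learn, reusing the X_train/y_train split from the decision-tree sketch above; the specific limits below are arbitrary illustrative values, not recommendations.

from sklearn.tree import DecisionTreeClassifier

early_stopped_tree = DecisionTreeClassifier(
    max_depth=4,           # Max depth
    min_samples_leaf=5,    # Minimum samples per leaf
    min_samples_split=10,  # Minimum samples to split
    max_leaf_nodes=20,     # Maximum leaf nodes
    random_state=42,
)
early_stopped_tree.fit(X_train, y_train)   # X_train, y_train as defined in the earlier sketch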
2. Post-pruning (Cost Complexity Pruning):
• Prunes the tree after it has fully grown by removing branches to simplify it.
Techniques:
• Reduced Error Pruning: Removes a branch if doing so does not worsen accuracy on a validation
set.
• Cost Complexity Pruning: Adds a penalty for each node to balance accuracy and
complexity, tuned with ccp_alpha in libraries like scikit-learn.
Pros: Allows exploration of deeper patterns before simplifying.
Cons: Computationally intensive and requires careful parameter selection.
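A minimal sketch of cost complexity pruning with scikit-learn's ccp_alpha, again reusing the earlier X_train/y_train split; the choice of alpha below (the middle of the candidate list) is purely illustrative.

from sklearn.tree import DecisionTreeClassifier

# Grow the full tree first, then inspect the effective alphas along the pruning path.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit with a non-zero ccp_alpha: larger values prune more aggressively.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # illustrative pick, not a recommendation
pruned_tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
pruned_tree.fit(X_train, y_train)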
Additional Tips for Controlling Overfitting
Cross-Validation: Apply cross-validation to fine-tune pruning parameters and other
hyperparameters, achieving a balance between bias and variance.
Ensemble Methods: Use methods like Random Forests or Gradient Boosted Trees, which
combine multiple trees to reduce overfitting through averaging.
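As a sketch of how cross-validation can tune these pruning parameters (the grid values below are assumptions chosen only for illustration):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [2, 4, 6, None],
    "ccp_alpha": [0.0, 0.005, 0.01, 0.05],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)            # reuses the split from the earlier sketch
print("Best parameters:", search.best_params_)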
Advantages of the Decision Tree
• It is simple to understand, as it follows the same process that a human follows while
making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
• The decision tree contains lots of layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
• For more class labels, the computational complexity of the decision tree may increase.
Random Forest Algorithm
What is the Random Forest Algorithm?
Random Forest = Decision Tree + Column Sampling/Row Sampling

Random Forest is a popular machine learning algorithm that belongs to the family of
ensemble learning methods.
• Random Forest is a tree-based ensemble learning algorithm used in machine learning
for classification and regression.
• It constructs multiple Decision Trees during training, each using a random subset of the
dataset.
• Each tree considers a random subset of features at each split, increasing variability and
reducing overfitting.
• Prediction is made by aggregating the results of all trees:
• Voting for classification tasks.
• Averaging for regression tasks.
• This ensemble approach leads to stable and precise results.
• Random Forests can handle complex data effectively and are widely used in various
applications for their reliability in predictions.
What are Ensemble Learning models?
• Ensemble learning combines the predictions of several base models to produce a single,
stronger model.
• The collective strength of multiple models overcomes individual limitations, leading to
more robust predictions.
• Ensemble models are commonly used in classification and regression tasks.
• Popular ensemble models include:
• Bagging: Reduces variance by training multiple versions of a model.
• Random Forest: Builds multiple decision trees on random data subsets.
• Boosting: Sequentially improves models by focusing on errors (e.g., AdaBoost,
XGBoost, LightGBM).
• Voting: Combines predictions by taking a majority or average vote across models.
Bagging (Bootstrap Aggregating)
Goal: Reduce variance and avoid overfitting by combining predictions from multiple
models.
How it works:
• Creates multiple subsets of the training data by sampling with replacement.
• Trains a separate model on each subset (often using decision trees).
Aggregates predictions:
For regression: Takes the average of predictions.
For classification: Uses majority voting.
Example: Random Forest is a popular bagging method that combines many decision trees.
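A minimal sketch of bagging with scikit-learn's BaggingClassifier, reusing the earlier data split; note the base model parameter is named estimator in recent scikit-learn versions (base_estimator in older ones), and the number of trees is an arbitrary illustrative choice.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each of the 50 trees is trained on a bootstrap sample drawn with replacement;
# the final class is decided by majority voting across the trees.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)           # reuses the earlier X_train/y_train split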
Boosting
Goal: Improve model accuracy by focusing on difficult-to-predict cases.
How it works:
• Trains models sequentially, with each new model correcting the errors of the previous
ones.
• Adjusts weights to emphasize data points that were misclassified earlier.
• Final prediction combines all models, often with weighted voting.
Example: AdaBoost and XGBoost are popular boosting methods that iteratively refine
predictions.

Both bagging and boosting aim to create a stronger overall model by combining the
strengths of individual models.
How does the Random Forest algorithm work?

Step 1 Select K random data points from the training set.
Step 2 Build a decision tree associated with the selected data points (subset).
Step 3 Choose the number N of decision trees that you want to build.
Step 4 Repeat Steps 1 and 2 until N trees have been built.
Step 5 For a new data point, obtain the prediction of each decision tree and assign the new
data point to the category that wins the majority vote.
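A minimal sketch of these steps with scikit-learn's RandomForestClassifier, reusing the earlier data split; n_estimators plays the role of N, and max_features controls the random subset of features tried at each split. The values are illustrative assumptions.

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each grown on a bootstrap sample and restricted to sqrt(n_features)
# candidate features at every split; prediction is by majority vote.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)
print(forest.predict(X_test[:5]))       # majority-vote predictions for five new points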
What is a Weak Learner?
• A Weak Learner is a model that performs just slightly better than random guessing on a
given problem.
• In binary classification, a weak learner has an accuracy of just above 50% (i.e., better than
chance).
• For regression, it performs only marginally better than guessing the average value.
Characteristics of Weak Learners
• Simple models: Often, weak learners are simple models, such as small decision trees
(stumps with one or two splits) or simple linear models.
• High bias: Weak learners typically have limited complexity, so they’re biased and may
underfit the data if used alone.
• Low predictive power individually: On their own, weak learners may not capture all
patterns or relationships in the data.
Why Use Weak Learners?
• Combining Weak Learners in Ensembles: While a weak learner on its own is not powerful,
combining many weak learners can lead to a strong model.
• In Boosting, each weak learner corrects the errors of the previous ones, resulting in a
progressively better model.
• In Bagging (like Random Forest), the weak learners are trained independently, and their
predictions are averaged or voted upon, reducing variance.
Efficiency: Weak learners are computationally simpler and faster to train, making them suitable
for use in large ensemble methods where many learners are needed.
Controlled overfitting: Because weak learners are limited in complexity, they can help keep the
ensemble model from overfitting, especially in Boosting methods.
Examples of Weak Learners
• Decision Stumps: Decision trees with only one or two splits.
• Shallow Trees: Decision trees with low depth, typically limited to a few levels.
• Simple Linear Models: Models that only capture linear relationships without complex
transformations.
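As a sketch, a decision stump is simply a tree restricted to a single split (max_depth=1 in scikit-learn); the data split is reused from the earlier example and is an illustrative assumption.

from sklearn.tree import DecisionTreeClassifier

stump = DecisionTreeClassifier(max_depth=1)   # one split only: a classic weak learner
stump.fit(X_train, y_train)
print("Stump accuracy:", stump.score(X_test, y_test))  # often only modestly above chance on hard problems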
Boosting Technique:
1. AdaBoost (Adaptive Boosting)
How it works:
• AdaBoost builds models sequentially, where each new model focuses on the mistakes
made by the previous one.
• After each model, misclassified data points are given more weight, so the next model
will focus more on those points.
• The final prediction is made by combining the results from all models, with more weight
given to models that performed better.
Key idea: AdaBoost adjusts itself based on what it learns from the errors of earlier models.
Common use: It works well for both classification and regression tasks.
Strength: AdaBoost is simple and effective, but it can be sensitive to noisy data.
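A minimal sketch of AdaBoost with scikit-learn, using decision stumps as the weak learners and reusing the earlier data split; the base model parameter is named estimator in recent scikit-learn versions (base_estimator in older ones), and the values are illustrative.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 stumps trained sequentially; misclassified points receive higher weights so that
# later stumps concentrate on them, and better-performing stumps get more say in the final vote.
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)               # reuses the earlier X_train/y_train split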
2. Gradient Boosting
• Like AdaBoost, Gradient Boosting also builds models sequentially, but with a key difference: each
new model is trained to predict the residual errors (the difference between the actual and predicted
values) of the previous models.
• Each model tries to minimize a loss function (such as mean squared error) by making small
corrections to the previous models' predictions.
• The predictions of all models are combined, usually by weighted summing.
Key idea: It focuses on correcting errors by directly improving the predictions in small steps.
Common use: It is widely used for both classification and regression when accuracy is a priority.
Strength: It is powerful and flexible but can be prone to overfitting if not tuned properly.

AdaBoost: Focuses on improving errors by adjusting the weights of misclassified data points.
Gradient Boosting: Focuses on improving the model by reducing prediction errors through gradient
descent.
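A minimal sketch of gradient boosting with scikit-learn's GradientBoostingClassifier, reusing the earlier data split; the hyperparameter values are illustrative assumptions.

from sklearn.ensemble import GradientBoostingClassifier

# Each new shallow tree is fitted to the residual errors (the negative gradient of the loss)
# of the current ensemble; its contribution is scaled by learning_rate before being added.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=42)
gbm.fit(X_train, y_train)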
