2179-Unit-3
Table of Contents
1) DECISION TREE LEARNING - Decision tree learning algorithm
2) Inductive bias
3) Inductive inference with decision trees
4) Entropy and information theory, Information gain, ID-3 Algorithm
5) Issues in Decision tree learning.
6) INSTANCE-BASED LEARNING – k-Nearest Neighbor Learning
7) Locally Weighted Regression,
8) Radial basis function networks
9) Case-based learning.
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple branches,
whereas Leaf nodes are the outputs of those decisions and do not contain any further
branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
o The diagram below shows the general structure of a decision tree:
A decision tree can contain categorical data (YES/NO) as well as numeric data.
There are various algorithms in machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are two reasons for using a decision tree:
o Decision Trees mimic human decision-making, so they are easy to understand.
o The logic behind a decision tree can be easily understood because it shows a tree-like
structure.
How does the Decision Tree algorithm work?
In a decision tree, to predict the class of a given record, the algorithm starts from the
root node of the tree. It compares the value of the root attribute with the corresponding
attribute of the record (real dataset) and, based on the comparison, follows the branch and
jumps to the next node.
At the next node, the algorithm again compares the attribute value with those of the other sub-nodes
and moves further. It continues this process until it reaches a leaf node of the tree. The
complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets containing the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; these final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept it or not. To solve this problem, the decision tree starts with the root
node (the Salary attribute, chosen by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node, based on the corresponding labels. The
next decision node further splits into one decision node (cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the diagram below:
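To make this workflow concrete, here is a minimal, hypothetical sketch of training a CART-style decision tree with scikit-learn on a toy version of the job-offer example; the feature names, encoded values, and labels below are invented purely for illustration.

# A hypothetical toy dataset for the job-offer example.
# Features: salary_lakhs, distance_km, cab_facility (1 = yes, 0 = no)
# Label: 1 = accept offer, 0 = decline offer
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [12, 5, 1],   # high salary, near office, cab available
    [12, 30, 0],  # high salary, far from office, no cab
    [6, 5, 1],    # low salary
    [15, 25, 1],  # high salary, far from office, cab available
    [7, 10, 0],   # low salary
]
y = [1, 0, 0, 1, 0]

# criterion="gini" is the CART default; "entropy" would use information gain instead.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned decision rules as text.
print(export_text(tree, feature_names=["salary_lakhs", "distance_km", "cab_facility"]))

# Predict for a new candidate: salary 13 lakhs, 8 km from the office, cab available.
print(tree.predict([[13, 8, 1]]))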
While implementing a decision tree, the main issue is how to select the best attribute for the
root node and for the sub-nodes. Two popular Attribute Selection Measures (ASM) are used for this:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of
a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
the node/attribute having the highest information gain is split first. It can be calculated
using the formula below:
Information Gain = Entropy(S) − [(Weighted Average) × Entropy(each feature)]
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where:
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
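As a rough illustration (not part of the original notes), entropy and information gain for a set of class labels can be computed as follows; the example labels and the "outlook" feature values are made up.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, e.g. ["yes", "no", "yes"]."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, feature_values):
    """Entropy(S) minus the weighted average entropy of each subset
    produced by splitting on the given feature values."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [lab for lab, v in zip(labels, feature_values) if v == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Example: 9 "yes" and 5 "no" samples split by a hypothetical "outlook" feature.
labels  = ["yes"] * 9 + ["no"] * 5
outlook = ["sunny"] * 5 + ["rain"] * 5 + ["overcast"] * 4
print(entropy(labels))                    # about 0.940
print(information_gain(labels, outlook))  # gain from splitting on outlook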
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the formula below:
Gini Index = 1 − Σj (Pj)²
where Pj is the proportion of samples belonging to class j in the node.
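A matching sketch of the Gini index computation, using the same kind of made-up label list as above:

from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini_index(["yes"] * 9 + ["no"] * 5))  # about 0.459
print(gini_index(["yes"] * 7))               # 0.0, a pure node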
Advantages of the Decision Tree algorithm:
1. Interpretability: Decision trees are easily interpretable and can be visualized, making
them suitable for explaining the decision-making process to stakeholders and
domain experts. The decision rules learned by decision trees are simple to
understand and can be represented graphically.
2. No Assumptions about Data Distribution: Decision trees make no assumptions
about the underlying distribution of the data, unlike parametric models such as linear
regression. They can handle both numerical and categorical data without requiring
data transformation.
3. Implicit Feature Selection: Decision trees automatically perform feature selection by
selecting the most informative features at each split. Important features tend to
appear closer to the root of the tree, making it easy to identify key predictors.
4. Handling Non-Linear Relationships: Decision trees can capture non-linear
relationships between features and the target variable without the need for complex
transformations or feature engineering. They can model complex decision boundaries
with multiple splits.
5. Easy to Handle Missing Values: Decision trees can handle missing values in the data
by splitting the data based on available features. They do not require imputation or
deletion of missing values, simplifying preprocessing steps.
Limitations:
1. Overfitting: Decision trees are prone to overfitting, especially when the tree is deep
or when the dataset is noisy or contains irrelevant features. Deep trees can capture
noise in the training data, leading to poor generalization performance on unseen
data.
2. Instability: Small changes in the training data can result in different tree structures,
leading to high variance and instability. Decision trees are sensitive to the training
data, and slight variations in the dataset can produce different trees.
3. Bias Toward Dominant Classes: Decision trees tend to favor splits that result in pure
subsets, leading to a bias toward dominant classes in the dataset. Imbalanced
datasets may result in biased trees that perform poorly on minority classes.
4. Greedy Nature: Decision trees use a greedy approach to construct the tree by
selecting the best split at each node based on local information. This may not always
lead to the globally optimal tree structure and may result in suboptimal solutions.
5. Limited Expressiveness: Decision trees may not be expressive enough to capture
complex relationships in the data, especially when the decision boundaries are highly
non-linear. Ensemble methods such as Random Forests or Gradient Boosting can be
used to improve expressiveness.
ID3 Algorithm
o The ID3 algorithm was developed by Ross Quinlan in 1986. It builds a decision tree
by recursively partitioning the dataset into smaller subsets until all data points in each
subset belong to the same class.
o It employs a top-down approach, selecting features to split the dataset based on
information gain.
o ID3 primarily deals with categorical properties, making it suitable for problems with
discrete input features.
o One of its strengths is its ability to generate interpretable decision trees.
o However, ID3 can be sensitive to noisy data and prone to overfitting.
o ID3’s resulting tree structure is easily understood and visualized, providing insight
into the decision-making process.
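The following is a compact, hypothetical sketch of the ID3 recursion for categorical features; it selects the attribute with the highest information gain at each step and is meant only to illustrate the idea, not to be a complete implementation (it omits handling of unseen attribute values, noise, and pruning).

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def id3(rows, labels, features):
    """rows: list of dicts of categorical feature values,
    labels: list of class labels, features: feature names still available."""
    # Base case 1: all labels identical -> return a class leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no features left -> return a majority-class leaf.
    if not features:
        return Counter(labels).most_common(1)[0][0]

    # Pick the feature with the highest information gain.
    def gain(f):
        g = entropy(labels)
        for value in set(r[f] for r in rows):
            idx = [i for i, r in enumerate(rows) if r[f] == value]
            g -= (len(idx) / len(rows)) * entropy([labels[i] for i in idx])
        return g

    best = max(features, key=gain)

    # Recurse on each value of the best feature.
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                remaining)
    return tree

# Hypothetical toy data: decide "play" from "outlook".
rows = [{"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "overcast"},
        {"outlook": "sunny"}, {"outlook": "overcast"}]
labels = ["no", "yes", "yes", "no", "yes"]
print(id3(rows, labels, ["outlook"]))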
Inductive Bias
In decision tree learning, the inductive bias refers to the assumptions and preferences
encoded into the algorithm that guide the construction of the decision tree. These
biases influence the choice of attributes for splitting, the structure of the resulting
tree, and the generalization capabilities of the model. In ID3-style learners the bias
manifests in two main ways:
o Preference for shorter trees: the greedy, top-down search stops growing the tree as
soon as the training data are classified, so shorter trees are preferred over larger ones.
o Preference for informative attributes near the root: attributes with the highest
information gain are placed closest to the root, so trees that test the most
discriminative features first are favoured.
This is a preference (search) bias rather than a restriction bias: the hypothesis space can
represent any finite discrete-valued function, but the search explores it in an order that
favours simple, high-gain trees.
INSTANCE-BASED LEARNING
Instead of creating explicit models, instance-based learning compares new problem instances
with instances seen during training, which are stored in memory. The points below contrast
instance-based (lazy) learning with model-based (eager) learning:
o Processing: Instance-based learning defers processing of the training data until
prediction time; model-based learning learns a model during the training phase and
uses it for predictions.
o Memory Usage: High for instance-based learning, as it stores the entire training
dataset for future reference; lower for model-based learning, as it only needs to store
the learned model parameters.
o Adaptability: Instance-based learning adapts quickly to changes in the data
distribution or to new instances; model-based learning is less adaptable, as the model
needs to be retrained.
o Sensitivity to Noise: Instance-based learning can be sensitive to noisy data, outliers,
and irrelevant features; model-based learning may be more robust to noise, depending
on the chosen model.
k-Nearest Neighbor (KNN) Learning
KNN is one of the most basic yet essential classification algorithms in machine
learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning
it does not make any underlying assumptions about the distribution of the data (as
opposed to other algorithms such as GMM, which assume a Gaussian
distribution of the given data). We are given some prior data (also called
training data), which classifies coordinates into groups identified by an
attribute.
Now, given another set of data points (also called testing data), the task is to allocate
these points to a group by analyzing the training set. (In the accompanying figure, the
unclassified points are marked in white.)
o It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead, it stores the dataset and, at the time of classification,
performs an action on the dataset.
o During the training phase the KNN algorithm just stores the dataset; when it gets new
data, it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and
a dog, and we want to know whether it is a cat or a dog. For this identification we can
use the KNN algorithm, as it works on a similarity measure. Our KNN model will find
the features of the new image that are most similar to the cat and dog images and,
based on the most similar features, will put it in either the cat or the dog category.
Suppose we have a new data point and we need to put it in the required category.
Consider the image below:
o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. It can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the image
below:
o As three of the five nearest neighbors are from category A, the new data point must
belong to category A.
New Point:
Now, let's say we have a new point with coordinates (4, 5).
k-NN Algorithm:
1. Calculate Distance: Calculate the distance between the new point and each point in
the dataset. For simplicity, let's use the Euclidean distance:
Distance = √((X1_new − X1_i)² + (X2_new − X2_i)²)
2. Find Nearest Neighbors: Choose the value of k. Let's say k = 3. Select the three
nearest neighbors based on the calculated distances:
• Nearest neighbors: Points 2, 3, and 1.
3. Majority Vote: Determine the majority class among the nearest neighbors. In this
case, two neighbors belong to class A, and one belongs to class B. Therefore, the
predicted class for the new point is A.
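A minimal from-scratch sketch of the procedure just described; the five training points and their classes are invented, while the query point (4, 5) and k = 3 follow the walk-through above.

import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Compute the Euclidean distance from the query to every training point.
    distances = [(math.dist(query, p), label)
                 for p, label in zip(train_points, train_labels)]
    # 2. Keep the k nearest neighbours.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Majority vote among their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data (coordinates and classes are made up).
train_points = [(2, 4), (4, 6), (4, 4), (6, 2), (7, 5)]
train_labels = ["A", "A", "A", "B", "B"]

print(knn_predict(train_points, train_labels, query=(4, 5), k=3))  # -> "A"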
Advantages:
1. Simplicity: k-NN is easy to understand and implement, making it suitable for
beginners.
2. Non-parametric: It makes no assumptions about the underlying data distribution,
making it versatile and adaptable to various types of data.
3. Flexibility: k-NN can handle multi-class classification and regression tasks.
4. Locally Adaptive: It can capture complex decision boundaries and adapt to the local
structure of the data.
Limitations:
1. Computational Complexity: As the size of the training dataset grows, the
computational cost of finding the nearest neighbors increases.
2. Memory Usage: k-NN requires storing the entire training dataset in memory, which
can be memory-intensive for large datasets.
3. Sensitive to Noise: It can be sensitive to noisy or irrelevant features, as it relies on
the similarity between instances.
4. Curse of Dimensionality: Performance may degrade in high-dimensional spaces due
to the increased sparsity of data points.
Applications:
1. Classification: k-NN is used in various domains such as image recognition, text
categorization, and medical diagnosis.
2. Regression: It can be applied to regression tasks such as predicting house prices,
stock prices, or weather forecasts.
3. Anomaly Detection: k-NN can be used for outlier detection or anomaly detection
tasks.
Radial Basis Function Networks
A popular type of feed-forward network is the radial basis function (RBF) network.
It has two layers, not counting the input layer, and differs from a multilayer
perceptron in the way the hidden units perform their computations.
The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is one of
the most widely used kernel functions. It operates by measuring the similarity
between data points based on their Euclidean distance in the input space.
Mathematically, the RBF kernel between two data points x and x' is defined as:
K(x, x') = exp(−‖x − x'‖² / (2σ²))
where ‖x − x'‖² represents the squared Euclidean distance between the two data points
and σ controls the width of the kernel.
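A short sketch of this kernel in code, with an assumed width parameter sigma:

import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    """Gaussian similarity: exp(-||x - x'||^2 / (2 * sigma^2))."""
    sq_dist = np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points -> 1.0
print(rbf_kernel([1.0, 2.0], [4.0, 6.0]))  # far-apart points -> close to 0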
Each hidden unit essentially defines a particular point in input space, and its output,
or activation, for a given instance depends on the distance between its point and the
instance, which is itself another point. The closer these two points are, the stronger the
activation.
The parameters that such a network learns are the centers and widths of the
RBFs and the weights used to form the linear combination of the outputs obtained from
the hidden layer. An essential benefit over multilayer perceptrons is that the first group
of parameters can be determined independently of the second group while still producing
accurate classifiers.
One method to determine the first group of parameters is to use clustering. The simple
k-means clustering algorithm can be applied, clustering each class independently to
obtain k basis functions for each class.
A limitation of RBF networks is that they give every attribute the same weight, because
all attributes are treated equally in the distance computation, unless attribute-weight
parameters are included in the overall optimization process.
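Below is a rough sketch of a small RBF network along these lines: k-means chooses the centers (the first group of parameters), a common width is assumed for every basis function, and the output weights (the second group) are found by linear least squares. The toy sine-curve data and all parameter values are invented for illustration.

import numpy as np
from sklearn.cluster import KMeans

def rbf_design_matrix(X, centers, width):
    """One Gaussian activation per (sample, center) pair."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * width ** 2))

# Toy 1-D regression problem: y = sin(x) plus noise (made up for illustration).
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 80).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# Step 1: choose the RBF centers with k-means (first group of parameters).
centers = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_
width = 0.5  # assumed common width for all basis functions

# Step 2: solve for the linear output weights (second group of parameters).
Phi = rbf_design_matrix(X, centers, width)
weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predict at a new input.
x_new = np.array([[1.5]])
print(rbf_design_matrix(x_new, centers, width) @ weights)  # roughly sin(1.5)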
Advantages:
1. Non-Linearity: RBF networks can capture non-linear relationships between input and
output variables.
2. Interpretability: The centers of the radial basis functions can provide insight into the
regions of input space that are most important for prediction.
3. Generalization: RBF networks tend to generalize well to unseen data when properly
trained.
Limitations:
1. Scalability: RBF networks may struggle with scalability for high-dimensional data or
large datasets, as the number of parameters increases with the number of dimensions
and data points.
2. Center Selection: The selection of prototype or reference points can impact the
performance of the network, and choosing appropriate centers can be challenging.
3. Overfitting: RBF networks are prone to overfitting if not properly regularized or if
the number of radial basis functions is too high relative to the size of the training
data.
Applications:
1. Function Approximation: RBF networks are used for function approximation tasks in
engineering, finance, and physics.
2. Time Series Prediction: They can be applied to time series prediction tasks in
finance, weather forecasting, and other domains.
3. Classification: RBF networks can be used for classification tasks in pattern
recognition, medical diagnosis, and image processing
Locally Weighted Regression
Locally Weighted Regression (LWR), also known as Locally Weighted Scatterplot Smoothing
(LOWESS), is a non-parametric regression method used for fitting a regression line to a dataset.
Unlike traditional regression methods that fit a global model to the entire dataset, LWR fits a
separate model to each data point, giving more weight to points that are closer to the point being
predicted.
Within machine learning and regression analysis, Locally Weighted Linear Regression
(LWLR) is a notable approach that improves predictive accuracy through local adaptation.
In contrast to conventional linear regression models, which assume a single global
relationship among the variables, LWLR recognizes the localized patterns and relationships
present in the data, which allows it to model complex datasets more faithfully.
Fundamentally, LWLR is a non-parametric regression algorithm that models the connection
between a dependent variable and one or more independent variables. Its distinctiveness
comes from its dynamic adaptability: it assigns a different weight to each data point
depending on its proximity to the target point being predicted. In other words, the
algorithm gives greater importance to nearby data points, treating them as more influential
contributors to the prediction.
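A minimal sketch of locally weighted linear regression for a single query point, assuming a Gaussian weighting kernel with bandwidth τ and made-up one-dimensional data:

import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Fit a weighted least-squares line around x_query and evaluate it there."""
    # Add a bias column so the local model has an intercept.
    Xb = np.column_stack([np.ones(len(X)), X])
    xq = np.array([1.0, x_query])

    # Gaussian weights: points near the query matter most.
    w = np.exp(-((X - x_query) ** 2) / (2.0 * tau ** 2))
    W = np.diag(w)

    # Solve the weighted normal equations: (X^T W X) theta = X^T W y.
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xq @ theta

# Toy data: noisy sine curve (made up for illustration).
rng = np.random.default_rng(1)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * rng.standard_normal(100)

print(lwr_predict(X, y, x_query=np.pi / 2, tau=0.5))  # roughly sin(pi/2) = 1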
Advantages:
1. Flexibility: LWR can capture non-linear relationships between input and output
variables.
2. Local Adaptation: It adapts the regression model to the local structure of the data,
giving more weight to nearby points.
3. Robustness: LWR is robust to outliers and noise in the data, as it focuses on the local
neighborhood of each data point.
Limitations:
1. Computational Complexity: LWR can be computationally expensive, especially for
large datasets, as it requires fitting a separate model to each data point.
2. Overfitting: LWR may overfit the training data if the bandwidth parameter τ is too
small or if there are too few data points in the local neighborhood.
3. Bandwidth Selection: Choosing an appropriate value for the bandwidth parameter
τ can be challenging and may require cross-validation or other optimization
techniques.
Applications:
1. Time Series Forecasting: LWR can be used for time series forecasting tasks, where
the relationship between input and output variables may vary over time.
2. Anomaly Detection: It can be applied to anomaly detection tasks, where outliers or
unusual patterns need to be identified.
3. Function Approximation: LWR can be used for function approximation tasks in
various domains such as engineering, finance, and physics.
Case-Based Learning
Case-based reasoning (CBR) solves new problems by retrieving and adapting the solutions
of previously solved, similar problems (cases) stored in memory.
Basis of CBR:
Here, we will discuss the key assumptions on which CBR rests.
1. Regularity – Identical actions executed under the same circumstances tend
to have the same or similar outcomes.
2. Typicality – Experiences tend to repeat themselves.
3. Consistency – Minor changes in the circumstances require only small changes in the
interpretation and in the solution.
4. Adaptability – When situations recur, the differences tend to be small, and
the small differences are easy to compensate for.
The CBR cycle involves the following steps:
• Case retrieval –
After the problem situation has been assessed, the best-matching case is
searched for in the case base and an approximate solution is retrieved.
• Case adaptation –
The retrieved solution is adapted to better fit the new problem.
• Solution evaluation –
The adapted solution can be evaluated either before it is applied to the
problem or after it has been applied; if the result is unsatisfactory, the
solution must be adapted again or additional cases should be retrieved.
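To make the retrieve-adapt-evaluate cycle concrete, here is a small, entirely hypothetical sketch: cases are stored with their problem features and solution, the best-matching case is retrieved by a nearest-neighbour distance, and a simple made-up adaptation rule adjusts its solution for the new problem.

import math

# A tiny hypothetical case base for pricing a used car:
# each case stores the problem features and the solution (price in lakhs).
case_base = [
    {"age_years": 2, "mileage_km": 20000, "price": 8.0},
    {"age_years": 5, "mileage_km": 60000, "price": 5.0},
    {"age_years": 8, "mileage_km": 90000, "price": 3.0},
]

def retrieve(query):
    """Case retrieval: return the stored case closest to the new problem."""
    def distance(case):
        return math.hypot(case["age_years"] - query["age_years"],
                          (case["mileage_km"] - query["mileage_km"]) / 10000)
    return min(case_base, key=distance)

def adapt(case, query):
    """Case adaptation: a made-up rule that knocks 0.5 off per extra year of age."""
    return case["price"] - 0.5 * (query["age_years"] - case["age_years"])

query = {"age_years": 3, "mileage_km": 25000}
best = retrieve(query)         # retrieve the best-matching case
solution = adapt(best, query)  # adapt its solution to the new problem
print(best, solution)          # evaluation/retention of the new case would follow here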