Chapter 4
Machine Learning Algorithms for Classification,
Activation Functions And Perceptron
I. Classification Algorithms in Machine Learning
1. Decision Trees
§Decision Tree is a supervised learning technique that can be
used for both classification and regression problems, but it is
mostly preferred for solving classification problems.
§It is a tree-structured classifier, where internal nodes represent
the features of a dataset, branches represent the decision
rules, and each leaf node represents the outcome.
§The decisions or tests are performed on the basis of the features
of the given dataset.
§It is a graphical representation for getting all the possible
solutions to a problem/decision based on given conditions.
Decision Trees…Cont’D
§It is called a decision tree because, similar to a tree, it starts
with the root node, which expands on further branches and
constructs a tree-like structure.
§In order to build a tree, we use the CART algorithm, which
stands for Classification and Regression Tree algorithm.
§A decision tree simply asks a question, and based on the
answer (Yes/No), it further splits the tree into subtrees.
Decision Trees…Cont’D
§The following diagram explains the general structure of
a decision tree:
Decision Trees…Cont’D
Why Do We Use Decision Trees?
§There are various algorithms in machine learning, so
choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a
machine learning model. The following are two
reasons for using the decision tree:
ØDecision trees usually mimic human thinking while
making a decision, so they are easy to understand.
ØThe logic behind the decision tree can be easily understood
because it shows a tree-like structure.
Decision Trees…Cont’D
Decision Tree Terminologies
Root Node: The root node is where the decision tree starts. It
represents the entire dataset, which further gets divided into two
or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree
cannot be segregated further after reaching a leaf node.
Splitting: The process of dividing the decision node/root node
into sub-nodes according to the given conditions.
Branch/Sub-Tree: A tree formed by splitting the main tree.
Pruning: The process of removing unwanted
branches from the tree.
Parent/Child Node: The root node of the tree is called the
parent node, and the other nodes are called the child nodes.
Decision Trees…Cont’D
How does the Decision Tree Algorithm Work?
§ In a decision tree, for predicting the class of the given dataset, the
algorithm starts from the root node of the tree.
§ The complete process can be better understood using the following
algorithm:
Ø Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
Ø Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
Ø Step-3: Divide S into subsets that contain the possible values of the
best attribute.
Ø Step-4: Generate the decision tree node which contains the best
attribute.
Ø Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3, until the nodes can no longer be split; such
final nodes are the leaf nodes (see the sketch below).
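To make the procedure concrete, here is a minimal sketch using scikit-learn's
DecisionTreeClassifier, which implements a CART-style algorithm; the tiny
job-offer dataset and feature names below are invented for illustration:

    # Minimal sketch: fitting a CART-style decision tree with scikit-learn.
    # The tiny job-offer dataset below is hypothetical, purely for illustration.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Features: [salary, distance_from_office_km, cab_facility (1 = yes)]
    X = [[40, 5, 1], [15, 20, 0], [35, 8, 1], [10, 25, 0], [45, 30, 1], [20, 3, 0]]
    y = [1, 0, 1, 0, 1, 0]  # 1 = offer accepted, 0 = offer declined

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # entropy as ASM
    tree.fit(X, y)

    # Print the learned rules: root node, decision nodes, and leaf nodes
    print(export_text(tree, feature_names=["salary", "distance", "cab"]))
    print(tree.predict([[30, 10, 1]]))  # classify a new candidate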
Decision Trees…Cont’D
Example: Suppose there is a candidate who has a job offer and wants to
decide whether he should accept the offer or not. To solve this
problem, the decision tree starts with the root node (the Salary attribute,
chosen by ASM). The root node splits further into the next decision node
(Distance from the office) and one leaf node based on the corresponding
labels. The next decision node further splits into one decision node (Cab
facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offer and Declined offer). Consider the following diagram:
Entropy in Machine Learning
§Machine learning contains many algorithms and
concepts that solve complex problems, and one of
them is entropy.
§Almost everyone has heard the word entropy at some point
during their school or college days in physics and chemistry.
§The concept of entropy comes from physics, where it is
defined as a measurement of disorder, randomness,
unpredictability, or impurity in a system.
Entropy …Cont’D
§From the machine learning side, entropy is defined as the
randomness or disorder of the information being processed.
§In other words, entropy is a machine learning metric
that measures the unpredictability or impurity in the system.
Entropy …Cont’D
§When information is processed in the system, every piece of
information has a specific value and can be used to draw
conclusions from it.
§If it is easy to draw a valuable conclusion from a piece of
information, then entropy is low; if
entropy is high, then it is difficult to draw any conclusion
from that piece of information.
§Entropy is a useful tool in machine learning for understanding
concepts such as feature selection, building decision trees, and
fitting classification models.
§As a machine learning engineer or professional data scientist,
you must have in-depth knowledge of entropy in machine learning.
Entropy …Cont’D
§We can understand the term entropy with a simple example:
flipping a coin.
§When we flip a coin, there are two possible outcomes. However, it
is difficult to predict the exact outcome of a coin
flip, because there is no direct relation between flipping a
coin and its outcome.
§There is a 50% probability of both outcomes; in such
scenarios, entropy is high. This is the essence of entropy in
machine learning.
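As a quick numeric check of this intuition, plugging the coin's probabilities
into the entropy formula given on the next slide:
E = -(0.5·log2(0.5) + 0.5·log2(0.5)) = 0.5 + 0.5 = 1,
the maximum possible entropy for two equally likely outcomes.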
Entropy …Cont’D
§Consider a dataset having a total number of N classes; then the
entropy (E) can be determined with the formula:
E(S) = -∑i Pi·log2(Pi), summed over the N classes
Where Pi = probability of randomly selecting an example in class i.
§For a two-class dataset, entropy lies between 0 and 1; however,
depending on the number of classes, it can be greater than 1
(the maximum is log2(N)).
§Let's understand it with an example where we have a dataset
having three colors of fruits: red, green, and yellow.
Entropy …Cont’D
§Suppose we have 2 red, 2 green, and 4 yellow observations
throughout the dataset. Then as per the above equation:
E = -(Pr·log2(Pr) + Pg·log2(Pg) + Py·log2(Py))
Where:
Pr = probability of choosing red fruits, Pg = probability of choosing
green fruits, and Py = probability of choosing yellow fruits.
Pr = 2/8 = 1/4, Pg = 2/8 = 1/4 and Py = 4/8 = 1/2
Now our final equation becomes:
E = -((1/4)·log2(1/4) + (1/4)·log2(1/4) + (1/2)·log2(1/2))
  = 0.5 + 0.5 + 0.5
So, the entropy will be 1.5.
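The same calculation in code, as a small self-contained sketch using the
fruit counts above:

    # Sketch: computing entropy from class counts (fruit example above).
    from math import log2

    def entropy(counts):
        """Entropy E = -sum(p_i * log2(p_i)) over non-empty classes."""
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    print(entropy([2, 2, 4]))  # 1.5 for 2 red, 2 green, 4 yellow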
Attribute Selection Measures (ASM)
§While implementing a decision tree, the main issue
is how to select the best attribute for the root
node and for the sub-nodes.
§To solve such problems, there is a technique
called the Attribute Selection Measure (ASM). Using this
measure, we can easily select the best attribute for
the nodes of the tree. There are two popular ASM
techniques:
ØInformation Gain
ØGini Index
ASM…Cont’D
I. Information Gain:
§Information gain is the measurement of the change in entropy after
the segmentation of a dataset based on an attribute.
§It calculates how much information a feature provides us about the
class.
§According to the value of information gain, we split the node and
build the decision tree.
§A decision tree algorithm always tries to maximize the value of
information gain, and the node/attribute having the highest
information gain is split first. It can be calculated using the
following formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
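This formula in code, as a minimal self-contained sketch; the example split
below is hypothetical:

    # Sketch: information gain of a split from class counts.
    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    def information_gain(parent_counts, child_counts_list):
        """IG = Entropy(S) - weighted average of child entropies."""
        total = sum(parent_counts)
        weighted = sum((sum(c) / total) * entropy(c) for c in child_counts_list)
        return entropy(parent_counts) - weighted

    # Hypothetical split: 5 positives / 5 negatives split into [4,1] and [1,4].
    print(information_gain([5, 5], [[4, 1], [1, 4]]))  # ~0.278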
ASM…Cont’D
II. Gini Index:
§Gini index is a measure of impurity or purity used while creating a
decision tree in the CART algorithm.
§An attribute with a low Gini index should be preferred over one
with a high Gini index.
§It only creates binary splits, and the CART algorithm uses the Gini
index to create those binary splits.
§The Gini index can be calculated using the following formula:
Gini Index = 1 - ∑j Pj², where Pj is the probability of an object being
classified to a particular class.
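The Gini formula in code, as a minimal sketch using the same fruit counts
as the earlier entropy example:

    # Sketch: Gini index from class counts (same fruit counts as before).
    def gini_index(counts):
        """Gini = 1 - sum(p_j^2) over all classes."""
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    print(gini_index([2, 2, 4]))  # 0.625 for 2 red, 2 green, 4 yellow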
Pruning in Getting an Optimal Decision tree
§ Pruning is a process of deleting the unnecessary nodes from a
tree in order to get the optimal decision tree.
§ A too-large tree increases the risk of overfitting, and a small
tree may not capture all the important features of the dataset.
§ Pruning is thus a technique that decreases the size of the learning
tree without reducing accuracy.
§ There are mainly two tree pruning techniques in use:
Ø Cost Complexity Pruning
Ø Reduced Error Pruning
Advantages And Disadvantages of the Decision Tree
Advantages of the Decision Tree
§ It is simple to understand, as it follows the same process a
human follows while making a decision in real life.
§ It can be very useful for solving decision-related problems.
§ It helps in thinking about all the possible outcomes for a problem.
§ It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
§ The decision tree may contain many layers, which makes it complex.
§ It may have an overfitting issue, which can be mitigated using
the Random Forest algorithm.
§ With more class labels, the computational complexity of the
decision tree may increase.
2. Bayes Theorem in Machine Learning
§Bayes theorem is also known by other names such as Bayes
rule or Bayes law. Bayes theorem helps to determine the
probability of an event with uncertain knowledge.
§An important concept based on Bayes theorem, the Bayesian method,
is used to calculate conditional probability in machine learning
applications that include classification tasks.
§Further, a simplified version of Bayes theorem (the Naïve Bayes
classifier) is also used to reduce computation time and the average
cost of projects.
§It is used to calculate the probability of one event occurring
given that another has already occurred.
§It is a good method for relating conditional probability and marginal
probability.
Bayes Theorem …Cont’D
§Bayes theorem is one of the most popular machine learning concepts;
it helps to calculate the probability of one event occurring, with
uncertain knowledge, given that another has already occurred.
§Bayes' theorem can be derived using the product rule and the
conditional probability of event X given a known event Y:
P(X|Y) = P(Y|X) * P(X) / P(Y)
§Here, X plays the role of the hypothesis and Y the role of the
observed evidence; the theorem relates the two conditional
probabilities P(X|Y) and P(Y|X).
§The above equation is called Bayes Rule or Bayes Theorem.
Bayes Theorem …Cont’D
§P(X|Y) is called the posterior, which we need to calculate. It is
defined as the updated probability after considering the evidence.
§P(Y|X) is called the likelihood. It is the probability of the evidence
when the hypothesis is true.
§P(X) is called the prior probability, the probability of the hypothesis
before considering the evidence.
§P(Y) is called the marginal probability. It is defined as the
probability of the evidence under any consideration.
§Hence, Bayes Theorem can be written as:
Posterior = likelihood * prior / evidence
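A tiny numeric sketch of this rule; the spam-filter numbers below are made
up purely for illustration:

    # Sketch: Bayes rule with hypothetical spam-filter numbers.
    p_spam = 0.2                 # prior P(X): fraction of mail that is spam
    p_word_given_spam = 0.6      # likelihood P(Y|X): word appears in spam
    p_word = 0.25                # evidence P(Y): word appears in any mail

    # Posterior P(X|Y) = likelihood * prior / evidence
    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(p_spam_given_word)     # 0.48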
Bayes Theorem …Cont’D
§The Naïve Bayes classifier is also a supervised algorithm, which is
based on Bayes theorem and used to solve classification problems.
§It is one of the simplest and most effective classification algorithms
in machine learning, and it enables us to build various ML models
for quick predictions.
§It is a probabilistic classifier, which means it predicts on the basis of
the probability of an object. Some popular applications of Naïve Bayes
are spam filtering, sentiment analysis, and classifying articles.
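As a quick sketch, here is how a Naïve Bayes classifier could be fit with
scikit-learn's GaussianNB; the toy data is hypothetical:

    # Sketch: Gaussian Naive Bayes with scikit-learn on hypothetical data.
    from sklearn.naive_bayes import GaussianNB

    X = [[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]]  # toy features
    y = [0, 0, 1, 1]                                       # toy class labels

    model = GaussianNB()
    model.fit(X, y)
    print(model.predict([[1.1, 2.0]]))        # predicted class
    print(model.predict_proba([[1.1, 2.0]]))  # posterior probabilities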
Advantages And Disadvantages of Naïve Bayes Classifier
in Machine Learning
Advantages
§ It is one of the simplest and most effective methods for calculating
conditional probability, and it works well for text classification problems.
§ A Naïve Bayes classifier performs better than many other models when
the assumption of independent predictors holds true.
§ It is easier to implement than other models.
§ It requires a small amount of training data to estimate the test data,
which minimizes the training time.
§ It can be used for binary as well as multi-class classification.
Disadvantage
The main disadvantage is its assumption of independent
predictors: it implicitly assumes that all attributes are independent
or unrelated, but in real life it is rarely feasible to get mutually
independent attributes.
II. Activation Functions in Neural Networks
§ What is an Activation Function?
§ It is simply a function used to get the output of a node. It
is also known as a Transfer Function.
§ Why do we use activation functions with neural networks?
§ An activation function is used to determine the output of a neural
network, e.g., yes or no.
§ It maps the resulting values into a range such as 0 to 1 or -1 to 1
(depending upon the function).
§ Activation functions can basically be divided into two types:
Ø Linear Activation Function
Ø Non-linear Activation Functions
II. Activation Functions …Cont’D
Linear or Identity Activation Function
§ The function is a line, i.e., linear. Therefore, the output
of the function is not confined to any range.
Equation: f(x) = x    Range: (-infinity, infinity)
§It does not help the network model the complexity of the usual
data that is fed to neural networks.
II. Activation Functions …Cont’D
Non-linear Activation Function
§The nonlinear activation functions are the most widely used
activation functions. Nonlinearity makes the graph of the
function a curve rather than a straight line.
II. Activation Functions …Cont’D
Non-linear Activation Function
§It makes it easy for the model to generalize or adapt to a variety of
data and to differentiate between the outputs. The main
terminologies needed to understand nonlinear functions are:
§Derivative or Differential: The change in the y-axis w.r.t. the change
in the x-axis. It is also known as the slope.
§Monotonic function: A function which is either entirely non-
increasing or entirely non-decreasing.
§The nonlinear activation functions are mainly divided on the
basis of their range or curves.
II. Activation Functions …Cont’D
1. Sigmoid or Logistic Activation Function
§The Sigmoid Function curve looks like an S-shape.
Sigmoid or Logistic …Cont’D
§The main reason we use the sigmoid function is that its output
exists between (0, 1). Therefore, it is especially used for models
where we have to predict a probability as the output. Since the
probability of anything exists only in the range of 0 to
1, sigmoid is the right choice.
§The function is differentiable. That means we can find the slope
of the sigmoid curve at any point.
§The function is monotonic, but the function's derivative is not.
§The logistic sigmoid function can cause a neural network to get
stuck during training (the vanishing gradient problem).
§The softmax function is a more generalized logistic activation
function which is used for multiclass classification.
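A minimal sketch of the sigmoid and its derivative in code; the standard
formula f(x) = 1/(1 + e^-x) is assumed, since the slide does not state it:

    # Sketch: sigmoid activation f(x) = 1 / (1 + e^-x) and its derivative.
    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_derivative(x):
        s = sigmoid(x)
        return s * (1.0 - s)  # maximal at x = 0, hence non-monotonic

    print(sigmoid(0.0))             # 0.5
    print(sigmoid_derivative(0.0))  # 0.25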
2. Tanh or hyperbolic tangent Activation Function
§Tanh is similar to the logistic sigmoid but often works better. The
range of the tanh function is (-1, 1).
§Tanh is also sigmoidal (S-shaped).
Tanh …Cont’D
§ The advantage is that negative inputs are mapped
strongly negative and zero inputs are mapped near
zero on the tanh graph.
§ The function is differentiable.
§ The function is monotonic, while its derivative is not
monotonic.
§ The tanh function is mainly used for classification between two
classes.
§ Both the tanh and logistic sigmoid activation functions are used
in feed-forward nets.
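A small sketch of tanh and its derivative, using Python's built-in math.tanh:

    # Sketch: tanh activation and its derivative 1 - tanh(x)^2.
    import math

    def tanh_derivative(x):
        t = math.tanh(x)
        return 1.0 - t * t  # peaks at x = 0, so the derivative is not monotonic

    print(math.tanh(-2.0))        # ~ -0.964: strongly negative
    print(math.tanh(0.0))         # 0.0: zero maps to near zero
    print(tanh_derivative(0.0))   # 1.0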
3. ReLU (Rectified Linear Unit) Activation Function
§ReLU is the most widely used activation function in the
world right now, since it is used in almost all convolutional
neural networks and deep learning models.
ReLU …Cont’D
§ ReLU is half rectified (from the bottom): f(z) is
zero when z is less than zero, and f(z) is equal to z when z is greater
than or equal to zero.
§ Range: [0, infinity)
§ The function and its derivative are both monotonic.
§ The issue is that all negative values become zero
immediately, which decreases the ability of the model to fit or train
on the data properly.
§ That means any negative input given to the ReLU activation
function turns into zero immediately, which
in turn affects the resulting mapping by not representing negative
values appropriately (the "dying ReLU" problem).
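A minimal sketch of ReLU:

    # Sketch: ReLU activation f(z) = max(0, z).
    def relu(z):
        return max(0.0, z)

    print(relu(3.5))   # 3.5: positive inputs pass through unchanged
    print(relu(-2.0))  # 0.0: all negative inputs are clipped to zero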
4. Leaky ReLU
§It is an attempt to solve the dying ReLU problem.
Leaky ReLU …Cont’D
§ The leak helps to increase the range of the ReLU function.
Usually, the value of a is 0.01 or so.
§ When a is not 0.01, it is called a Randomized ReLU.
§ Therefore, the range of the Leaky ReLU is (-infinity, infinity).
§ Both Leaky and Randomized ReLU functions are monotonic in
nature. Also, their derivatives are monotonic in nature.
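A minimal sketch of Leaky ReLU; the slope a = 0.01 follows the slide's
usual value:

    # Sketch: Leaky ReLU f(z) = z if z > 0 else a*z, with the usual a = 0.01.
    def leaky_relu(z, a=0.01):
        return z if z > 0 else a * z

    print(leaky_relu(3.5))   # 3.5: positive inputs unchanged
    print(leaky_relu(-2.0))  # -0.02: negative inputs leak through, scaled by a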
III. Perceptron in Machine Learning
§In machine learning and artificial intelligence, Perceptron is one of
the most commonly used terms.
§It is the primary step in learning machine learning and deep
learning technologies, and it consists of a set of weights, input
values or scores, and a threshold.
§The Perceptron is a building block of an Artificial Neural Network.
§Initially, in the mid-20th century, Frank Rosenblatt invented
the Perceptron for performing certain calculations to detect input
data capabilities or business intelligence.
§The Perceptron is a linear machine learning algorithm used for
supervised learning of various binary classifiers.
§This algorithm enables neurons to learn and process elements in
the training set one at a time.
Perceptron …Cont’D
§Further, a Perceptron can also be understood as an artificial neuron
or neural network unit that helps to detect certain input data
computations in business intelligence.
§The Perceptron model is also treated as one of the best and simplest
types of Artificial Neural Networks (ANNs).
§It is a supervised learning algorithm for binary classifiers.
§Hence, we can consider it a single-layer neural network with
four main parameters, i.e., input values, weights and bias, net sum,
and an activation function.
Perceptron …Cont’D
§ In simple words, we can understand it as a classification algorithm
that makes its predictions based on a linear predictor function
combining a set of weights with the feature vector.
§ Basic Components of a Perceptron: Frank Rosenblatt invented
the perceptron model as a binary classifier containing three
main components. These are as follows:
Perceptron …Cont’D
§Input Nodes or Input Layer: This is the primary component of the
Perceptron; it accepts the initial data into the system for further
processing. Each input node holds a real numerical value.
§Weight and Bias: The weight parameter represents the strength of the
connection between units. This is another important parameter of the
Perceptron's components.
Ø Weight is directly proportional to the strength of the associated input
neuron in deciding the output. Further, the bias can be considered as
the intercept in a linear equation.
§Activation Function: This is the final and most important component;
it helps to determine whether the neuron will fire or not. The activation
function can be considered primarily as a step function.
How does Perceptron Work?
§ In machine learning, the Perceptron is considered a single-layer
neural network that consists of four main parameters: input
values (input nodes), weights and bias, net sum, and an activation
function.
§ The perceptron model begins by multiplying all input
values by their weights, then adds these values together to create
the weighted sum.
§ This weighted sum is then applied to the activation function 'f' to
obtain the desired output. This activation function is also known as
the step function and is represented by 'f'.
How does Perceptron Work…Cont’D
§The perceptron model works in two important steps, as follows:
§Step-1: In the first step, multiply all input values by their
corresponding weight values and then add them up to determine the
weighted sum.
§Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn. Then add a special term
called the bias 'b' to this weighted sum to improve the model's
performance: ∑wi*xi + b
§Step-2: In the second step, an activation function is applied to
the above-mentioned weighted sum, which gives us an output either
in binary form or as a continuous value, as follows:
Y = f(∑wi*xi + b)
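These two steps in code, as a minimal sketch; the unit-step activation,
weights, bias, and inputs below are illustrative assumptions:

    # Sketch: the two perceptron steps with a unit-step activation.
    # Weights, bias, and inputs are hypothetical example values.
    def step(z):
        return 1 if z > 0 else 0  # activation function f

    def perceptron(x, w, b):
        weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b  # Step-1
        return step(weighted_sum)                                # Step-2

    print(perceptron([1.0, 0.5], [0.4, -0.2], b=-0.1))  # 1, since 0.4-0.1-0.1 > 0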
Types of Perceptron Models
Based on the layers, Perceptron models are divided into
two types. These are as follows:
I. Single-layer Perceptron Model
II. Multi-layer Perceptron model
§Single-Layer Perceptron Model: This is one of the simplest types of
artificial neural networks (ANNs).
§A single-layer perceptron model consists of a feed-forward network
and also includes a threshold transfer function inside the model.
§The main objective of the single-layer perceptron model is to
analyze linearly separable objects with binary outcomes.
§Multi-Layer Perceptron Model: This will be discussed in the next
chapter.
Perceptron Function
§The perceptron function f(x) is obtained as output by
multiplying the input vector 'x' with the learned weight vector
'w'. Mathematically, we can express it as follows:
f(x) = 1 if w·x + b > 0; otherwise, f(x) = 0
§'x' represents the vector of input values.
§'w' represents the real-valued weight vector.
§'b' represents the bias.
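Since the next slide notes that the weight coefficients are learned
automatically, here is a minimal sketch of the classic perceptron learning
rule, w <- w + lr*(target - y)*x; the AND-gate dataset, learning rate, and
epoch count are assumptions for illustration:

    # Sketch: the classic perceptron learning rule on a tiny AND-gate dataset.
    # The dataset, learning rate, and epoch count are illustrative assumptions.
    data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
    w, b, lr = [0.0, 0.0], 0.0, 0.1

    for _ in range(20):  # a few passes over the data
        for x, target in data:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            error = target - y
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]  # update weights
            b += lr * error                                     # update bias

    print(w, b)  # learned parameters
    print([1 if sum(wi*xi for wi, xi in zip(w, x)) + b > 0 else 0
           for x, _ in data])  # [0, 0, 0, 1] once converged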
Characteristics of Perceptron
The perceptron model has the following characteristics.
1. The Perceptron is a machine learning algorithm for supervised
learning of binary classifiers.
2. In a Perceptron, the weight coefficients are learned automatically.
3. Initially, weights are multiplied by the input features, and a
decision is made as to whether the neuron fires or not.
4. The activation function applies a step rule to check whether the
weighted sum of inputs is greater than zero.
5. A linear decision boundary is drawn, enabling the distinction
between the two linearly separable classes +1 and -1.
6. If the sum of all input values exceeds the threshold, the neuron
emits an output signal; otherwise, no output is shown.
Limitations of Perceptron Model
A perceptron model has the following limitations:
Ø The output of a perceptron can only be a binary number (0
or 1), due to the hard-limit transfer function.
Ø A perceptron can only be used to classify linearly
separable sets of input vectors. If the input vectors are not
linearly separable, it is not easy to classify them properly.