All Units
(AU REG.2021)
E-MATERIAL
(VERSION-1)
PREPARED BY
Ms.K.ABHIRAMI
SYLLABUS
Introduction to machine learning – Linear Regression Models: Least squares, single &
multiple variables, Bayesian linear regression, gradient descent, Linear Classification
Models: Discriminant function – Probabilistic discriminative model - Logistic regression,
Probabilistic generative model – Naive Bayes, Maximum margin classifier – Support
vector machine, Decision Tree, Random forests
UNIT-I
PROBLEM SOLVING
TOPIC 1: INTRODUCTION
Artificial Intelligence exists when a machine can exhibit human-like skills such as
learning, reasoning, and problem solving.
Following are some advantages of Artificial Intelligence:
o High accuracy with fewer errors: AI systems are less error-prone and more
accurate, because they take decisions based on prior experience and information.
o High speed: AI systems can make decisions at very high speed; this is why an AI
system can beat a chess champion at chess.
o High reliability: AI machines are highly reliable and can perform the same
action multiple times with high accuracy.
o Useful for risky areas: AI machines can be helpful in situations such as defusing
a bomb or exploring the ocean floor, where employing a human can be risky.
o Digital assistant: AI can provide digital assistants to users; for example, various
e-commerce websites currently use AI technology to show products according to
customer requirements.
o Useful as a public utility: AI can be very useful for public utilities, such as self-
driving cars which can make our journeys safer and hassle-free, facial recognition
for security purposes, and natural language processing to communicate with
humans in human language.
Every technology has some disadvantages, and the same goes for Artificial Intelligence.
Despite being such an advantageous technology, it still has some disadvantages which we
need to keep in mind while creating an AI system. Following are the disadvantages of AI:
o No original creativity: Humans are creative and can imagine new ideas, but AI
machines cannot match this power of human intelligence; they cannot be creative
and imaginative.
AI VS HUMAN INTELLIGENCE
TOPIC 2: APPLICATIONS
Following are some sectors which have the application of Artificial Intelligence:
1. AI in Astronomy
2. AI in Healthcare
o In the last five to ten years, AI has become more advantageous for the healthcare
industry and is going to have a significant impact on it.
o Healthcare industries are applying AI to make better and faster diagnoses than
humans. AI can help doctors with diagnoses and can inform them when patients are
worsening, so that medical help can reach the patient before hospitalization.
3. AI in Gaming
o AI can be used for gaming purposes. AI machines can play strategic games
like chess, where the machine needs to think about a large number of possible
positions.
4. AI in Finance
o AI and finance industries are the best matches for each other. The finance
industry is implementing automation, chatbot, adaptive intelligence, algorithm
trading, and machine learning into financial processes.
5. AI in Data Security
o The security of data is crucial for every company, and cyber-attacks are growing
very rapidly in the digital world. AI can be used to make data safer and more
secure. Examples such as the AEG bot and the AI2 platform are used to detect
software bugs and cyber-attacks more effectively.
6. AI in Social Media
o Social Media sites such as Facebook, Twitter, and Snapchat contain billions of
user profiles, which need to be stored and managed in a very efficient way. AI
can organize and manage massive amounts of data. AI can analyze lots of data to
identify the latest trends, hashtag, and requirement of different users.
8. AI in Automotive Industry
o Some automotive industries are using AI to provide virtual assistants to their users
for better performance. For example, Tesla has introduced TeslaBot, an intelligent
virtual assistant.
o Various industries are currently working on developing self-driving cars which
can make journeys safer and more secure.
9. AI in Robotics:
10. AI in Entertainment
o We are currently using some AI based applications in our daily life with some
entertainment services such as Netflix or Amazon. With the help of ML/AI
algorithms, these services show the recommendations for programs or shows.
11. AI in Agriculture
o Agriculture is an area which requires various resources, labor, money, and time
for the best results. Nowadays agriculture is becoming digital, and AI is emerging in
this field. Agriculture is applying AI for agricultural robotics, soil and crop
monitoring, and predictive analysis. AI in agriculture can be very helpful for farmers.
12. AI in E-commerce
13. AI in education:
o AI can automate grading so that tutors have more time to teach. AI chatbots
can communicate with students as teaching assistants.
o In the future, AI may work as a personal virtual tutor for students, easily
accessible at any time and any place.
2.1.APPLICATION IN DETAIL
AI IN AGRICULTURE
AI helps the agriculture sector cope with factors such as climate change,
population growth, employment issues in this field, and food safety.
Today's agriculture system has reached a different level due to AI. Artificial
Intelligence has improved crop production as well as real-time monitoring, harvesting,
processing, and marketing.
Different hi-tech computer-based systems are designed to determine various
important parameters such as weed detection, yield detection, crop quality, and
many more.
1. Weather & price forecasting: It is difficult for farmers to take the
right decisions about harvesting, sowing seeds, and soil preparation
due to climate change. With the help of AI weather forecasting,
farmers can obtain weather analysis and accordingly plan the type of
crop to grow, the seeds to sow, and when to harvest. With price
forecasting, farmers get a better idea of the price of crops over the
next few weeks, which can help them earn maximum profit.
2. Health Monitoring of Crops:
The quality of a crop largely depends on the type and nutrition of
the soil. With the increasing rate of deforestation, soil quality is
degrading day by day, and it is hard to determine.
To resolve this issue, AI has come up with a new application to
identify the deficiencies in soil, including plant pests and diseases.
With the help of this application, farmers can get an idea to use better
fertilizer which can improve the harvest quality. In this app, AI's
image recognition technology is used by which farmers can capture
the images of plants and get information about the quality.
3. Agriculture Robotics:
Robotics is being widely used in different sectors, mainly in
manufacturing, to perform complex tasks. Nowadays, different AI
companies are developing robots to be employed in the Agriculture
sector. These AI robots are developed in such a way that they can
perform multiple tasks in farming.
AI robots are also trained in checking the quality of crops, detecting
and controlling weeds, and harvesting crops faster than a human.
4. Intelligent Spraying
With AI sensors, weeds and weed-affected areas can be detected
easily. In such areas, herbicides can be sprayed precisely, which
reduces herbicide use and saves time and crops.
Different AI companies are building robots with AI and computer
vision which can spray weeds precisely. The use of AI sprayers can
greatly reduce the amount of chemicals used on fields, and hence
improves crop quality and also saves money.
5. Disease Diagnosis
With AI predictions, farmers can learn about diseases easily and
diagnose them on time with a proper strategy, saving plants as well as
the farmer's time. To do this, images of plants are first pre-processed
using computer vision technology, which ensures that the plant images
are properly divided into diseased and non-diseased parts. After
detection, the diseased part is cropped and sent to the labs for further
diagnosis. This technique also helps in the detection of pests, nutrient
deficiencies, and much more.
6. Precision Farming
Precision farming is all about "the right place, the right time, and the
right products". It is a more accurate and controlled approach that can
replace the labour-intensive, repetitive parts of farming. One example
of precision farming is the identification of stress levels in plants.
This can be obtained using high-resolution images and data from
different sensors on the plants. The data obtained from the sensors is
then fed as input to a machine learning model for stress recognition.
2.2. AI IN BANKING
Almost every industry, including banking and finance, has been significantly disrupted
by artificial intelligence. This industry is now more customer-centric and
technologically relevant thanks to the use of AI in banking applications and services.
Large numbers of online payments happen every day as consumers use applications
with account information to pay bills, withdraw money, deposit checks, and much more.
As a result, every financial system must strengthen its cybersecurity and fraud-detection
operations.
This is where artificial intelligence in finance enters the picture. Artificial intelligence can
help banks reduce risks, track system flaws, and enhance the security of online financial
transactions. AI and machine learning can quickly spot potential fraud and notify both
consumers and banks.
2. Chatbots
Chatbots also learn from a specific customer's usage patterns, which aids in their
effective comprehension of user expectations.
By introducing bots within existing banking apps, banks can guarantee that they remain
accessible to their customers 24 hours a day. Additionally, chatbots can provide focused
customer attention and make appropriate financial service and product recommendations
by understanding consumer behaviour.
In order to make better, safer, and more profitable loan and credit choices, banks are
trying to implement AI-based solutions. Presently, most banks still judge a person's or
business's creditworthiness only on the basis of credit history, credit scores, and
customer references.
However, one cannot ignore the fact that these credit reporting systems frequently contain
inaccuracies, omit real transaction histories, and misclassify creditors.
An AI-based loan and credit system can analyse the behavioural patterns of consumers
with little payment history to assess their creditworthiness. Additionally, this
technology alerts banks to behaviours that may raise the likelihood of default. In short,
these innovations are significantly altering the way consumer borrowing will be
conducted in the future.
Thanks to artificial intelligence, banks can analyse huge amounts of data and forecast
the latest trends in the economy, commodities, and equities. Modern machine learning
methods offer financial suggestions and help evaluate market sentiment.
AI for banking additionally recommends when and how to buy equities and issues alerts
when there is potential risk.
Every day, financial and banking institutions record millions of transactions. Because of
the vast amount of data generated, it becomes challenging for staff to collect and
register it all.
AI-based approaches can aid in effective data collection and analysis in such
circumstances, improving the overall user experience. Additionally, the data may be
utilised to detect fraud or to make credit decisions.
6. Customer experience
TOPIC 3: AGENTS IN ARTIFICIAL INTELLIGENCE
An AI system is composed of an agent and its environment. The agents act in their
environment. The environment may contain other agents.
3.1.Types of AI Agents
Agents can be grouped into five classes based on their degree of perceived intelligence
and capability, and all of them can improve their performance and generate better
actions over time:
o Simple Reflex Agent
o Model-Based Reflex Agent
o Goal-Based Agent
o Utility-Based Agent
o Learning Agent
1. Simple Reflex Agent
o The Simple reflex agents are the simplest agents. These agents take decisions on
the basis of the current percepts and ignore the rest of the percept history.
o The Simple reflex agent does not consider any part of percepts history during
their decision and action process.
o The Simple reflex agent works on the condition-action rule, which means it maps
the current state to an action. An example is a room-cleaner agent: it works only
if there is dirt in the room.
2. Model-Based Reflex Agent
o These agents have a model, "which is knowledge of the world", and based on
this model they perform actions.
3. Goal-Based Agent
o The agent needs to know its goal, which describes desirable situations.
o These agents may have to consider a long sequence of possible actions before
deciding whether the goal is achieved or not. Such considerations of different
scenario are called searching and planning, which makes an agent proactive.
4. Utility-Based Agent
o These agents are similar to the goal-based agent but provide an extra component
of utility measurement, which makes them different by providing a measure of
success at a given state.
o Utility-based agents act based not only on goals but also on the best way to achieve
the goal.
o The utility-based agent is useful when there are multiple possible alternatives and
the agent has to choose the best action.
o The utility function maps each state to a real number to check how efficiently
each action achieves the goals.
5. Learning Agent
o A learning agent in AI is the type of agent that can learn from its past
experiences; that is, it has learning capabilities.
o It starts with basic knowledge and is then able to act and adapt automatically
through learning.
b. Critic: The learning element takes feedback from the critic, which describes
how well the agent is doing with respect to a fixed performance standard.
Hence, learning agents are able to learn, analyze performance, and look for new ways to
improve the performance.
Search tree: A tree representation of a search problem is called a search tree. The root of
the search tree is the root node, which corresponds to the initial state.
Actions: It gives the description of all the available actions to the agent.
Solution: It is an action sequence which leads from the start node to the goal node.
Optimal Solution: A solution is optimal if it has the lowest cost among all solutions.
Following are the four essential properties of search algorithms used to compare their
efficiency:
o Completeness
o Optimality
o Time Complexity
o Space Complexity
Based on the search problems we can classify the search algorithms into uninformed
(Blind search) search and informed search (Heuristic search) algorithms.
Uninformed/Blind Search:
The uninformed search does not use any domain knowledge, such as
closeness to or the location of the goal.
It operates in a brute-force way, as it only includes information about how to
traverse the tree and how to identify leaf and goal nodes.
Uninformed search explores the search tree without any information about the
search space beyond the initial state, the operators, and the goal test, so it is
also called blind search.
It examines each node of the tree until it reaches the goal node.
It can be divided into the following main types:
o Breadth-first search
o Uniform cost search
o Depth-first search
o Depth-limited search
o Iterative deepening depth-first search
o Bidirectional search
Informed Search
Informed search uses domain knowledge in the form of a heuristic and can solve much
more complex problems that could not be solved otherwise.
Breadth-first Search:
o Breadth-first search is the most common search strategy for traversing a tree or
graph. This algorithm searches breadthwise in a tree or graph, so it is called
breadth-first search.
o The BFS algorithm starts searching from the root node of the tree and expands all
successor nodes at the current level before moving to the nodes of the next level.
o The breadth-first search algorithm is an example of a general graph-search
algorithm.
o Breadth-first search is implemented using a FIFO queue data structure.
Advantages:
o BFS will provide a solution if any solution exists.
o If there is more than one solution for a given problem, then BFS will provide
the minimal solution, i.e. the one that requires the least number of steps.
Disadvantages:
o It requires lots of memory since each level of the tree must be saved into
memory to expand the next level.
o BFS needs lots of time if the solution is far away from the root node.
Example:
In the below tree structure, we have shown the traversing of the tree using BFS
algorithm from the root node S to goal node K. BFS search algorithm traverse in layers,
so it will follow the path which is shown by the dotted arrow, and the traversed path
will be:
1. S---> A--->B---->C--->D---->G--->H--->E---->F---->I---->K
Time Complexity: The time complexity of BFS is given by the number of nodes traversed
until the shallowest goal node: T(b) = 1 + b + b² + ... + b^d = O(b^d), where d is the depth
of the shallowest solution and b is the branching factor (the number of successors of each
node).
Space Complexity: The space complexity of BFS is given by the memory size of the
frontier, which is O(b^d).
Completeness: BFS is complete, which means if the shallowest goal node is at some
finite depth, then BFS will find a solution.
Optimality: BFS is optimal if the path cost is a non-decreasing function of the depth of the
node.
2. Depth-first Search
o Depth-first search is a recursive algorithm for traversing a tree or graph data
structure.
o It is called the depth-first search because it starts from the root node and follows
each path to its greatest depth node before moving to the next path.
o DFS uses a stack data structure for its implementation.
o The process of the DFS algorithm is similar to the BFS algorithm.
Note: Backtracking is an algorithm technique for finding all possible solutions using
recursion.
Advantage:
o DFS requires much less memory, as it only needs to store a stack of the nodes on
the path from the root node to the current node.
o It can take less time than BFS to reach the goal node (if it happens to traverse
the right path).
Disadvantage:
o There is the possibility that many states keep re-occurring, and there is no
guarantee of finding the solution.
o The DFS algorithm searches deep down and sometimes it may go into an
infinite loop.
Example:
In the below search tree, we have shown the flow of depth-first search, and it will follow
the order as:
Root node--->Left node ----> right node.
It will start searching from root node S and traverse A, then B, then D and E. After
traversing E it will backtrack, as E has no other successor and the goal node has still
not been found. After backtracking it will traverse node C and then G, where it will
terminate as it has found the goal node.
Completeness: DFS is complete within a finite state space, as it will expand every node
within a limited search tree.
Time Complexity: The time complexity of DFS is equivalent to the number of nodes
traversed by the algorithm: T(n) = 1 + b + b² + ... + b^m = O(b^m),
where m is the maximum depth of any node, which can be much larger than d (the depth
of the shallowest solution).
Space Complexity: DFS needs to store only a single path from the root node, hence its
space complexity is equivalent to the size of the fringe set, which is O(bm).
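A minimal recursive DFS sketch in Python, assuming the same adjacency-dictionary representation as the BFS example; the graph below is a hypothetical one that mirrors the S–A–B–D–E–C–G walk described above.

```python
def dfs(graph, goal, path, visited=None):
    """Depth-first search: follows each path to its greatest depth before backtracking."""
    if visited is None:
        visited = set(path)
    node = path[-1]
    if node == goal:
        return path
    for successor in graph.get(node, []):
        if successor not in visited:
            visited.add(successor)
            result = dfs(graph, goal, path + [successor], visited)
            if result is not None:
                return result              # goal found below this successor
    return None                            # dead end: backtrack

# Hypothetical graph mirroring the S-A-B-D-E-C-G walk described above
graph = {'S': ['A', 'C'], 'A': ['B', 'D'], 'B': [], 'D': ['E'], 'C': ['G'], 'E': []}
print(dfs(graph, 'G', ['S']))              # ['S', 'C', 'G']
```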
3. Depth-Limited Search
A depth-limited search works like depth-first search with a predetermined depth limit ℓ;
the node at the depth limit is treated as if it has no further successor nodes. Depth-limited
search can be terminated with two conditions of failure:
o Standard failure value: It indicates that problem does not have any solution.
o Cutoff failure value: It defines no solution for the problem within a given depth
limit.
Advantages:
Disadvantages:
Example:
Completeness: The DLS algorithm is complete if the solution lies within the depth limit ℓ.
Time Complexity: The time complexity of the DLS algorithm is O(b^ℓ).
Space Complexity: The space complexity of the DLS algorithm is O(b × ℓ).
Optimal: Depth-limited search can be viewed as a special case of DFS, and it is not
optimal even if ℓ > d.
4. Uniform-Cost Search
Uniform-cost search is used for traversing a weighted tree or graph; it expands the node
with the lowest cumulative path cost g(n) and is implemented using a priority queue.
Advantages:
o Uniform cost search is optimal because at every state the path with the least cost
is chosen.
Disadvantages:
o It does not care about the number of steps involved in the search and is only
concerned with path cost, due to which the algorithm may get stuck in an
infinite loop.
Example:
Completeness:
Uniform-cost search is complete: if there is a solution, UCS will find it.
Time Complexity:
Let C* be the cost of the optimal solution and ε the minimum cost of each step towards
the goal. Then the worst-case number of steps is C*/ε + 1 (we add 1 because we start
from state 0 and end at C*/ε). Hence the worst-case time complexity of uniform-cost
search is O(b^(1 + ⌊C*/ε⌋)).
Space Complexity:
By the same logic, the worst-case space complexity of uniform-cost search is also
O(b^(1 + ⌊C*/ε⌋)).
Optimal:
Uniform-cost search is always optimal as it only selects a path with the lowest path cost.
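A minimal Python sketch of uniform-cost search using a priority queue, assuming a weighted graph stored as an adjacency dictionary of (successor, step-cost) pairs; the graph and costs below are illustrative.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Expand the node with the lowest cumulative path cost g(n) using a priority queue."""
    frontier = [(0, start, [start])]           # (path_cost, node, path)
    best_cost = {start: 0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path                  # cheapest path found
        for successor, step_cost in graph.get(node, []):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(successor, float('inf')):
                best_cost[successor] = new_cost
                heapq.heappush(frontier, (new_cost, successor, path + [successor]))
    return None

# Hypothetical weighted graph: edges are (successor, step cost)
graph = {'S': [('A', 1), ('B', 5)], 'A': [('B', 2), ('G', 9)], 'B': [('G', 1)]}
print(uniform_cost_search(graph, 'S', 'G'))    # (4, ['S', 'A', 'B', 'G'])
```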
5. Iterative Deepening Depth-First Search
The iterative deepening algorithm combines DFS and BFS: it performs depth-limited
search repeatedly, increasing the depth limit on each iteration until the goal is found.
Advantages:
o It combines the benefits of BFS and DFS search algorithm in terms of fast search
and memory efficiency.
Disadvantages:
o The main drawback of IDDFS is that it repeats all the work of the previous phase.
Example:
The following tree structure shows iterative deepening depth-first search. The IDDFS
algorithm performs successive iterations until it finds the goal node. The iterations
performed by the algorithm are:
1st Iteration ----> A
2nd Iteration ----> A, B, C
3rd Iteration ----> A, B, D, E, C, F, G
4th Iteration ----> A, B, D, H, I, E, C, F, K, G
In the fourth iteration, the algorithm will find the goal node.
Completeness:
This algorithm is complete if the branching factor is finite.
Time Complexity:
Let b be the branching factor and d the depth of the shallowest goal; then the worst-case
time complexity is O(b^d).
Space Complexity:
The space complexity of IDDFS will be O(bd).
Optimal:
The IDDFS algorithm is optimal if the path cost is a non-decreasing function of the depth
of the node.
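A minimal Python sketch of iterative deepening, assuming a tree shaped like the A…K example above; the `depth_limited_search` and `iterative_deepening` function names are illustrative.

```python
def depth_limited_search(graph, node, goal, limit, path):
    """DFS that treats nodes at the depth limit as having no successors."""
    if node == goal:
        return path
    if limit == 0:
        return None                            # cutoff reached
    for successor in graph.get(node, []):
        result = depth_limited_search(graph, successor, goal, limit - 1, path + [successor])
        if result is not None:
            return result
    return None

def iterative_deepening(graph, start, goal, max_depth=20):
    """Run depth-limited search with limits 0, 1, 2, ... until the goal is found."""
    for limit in range(max_depth + 1):
        result = depth_limited_search(graph, start, goal, limit, [start])
        if result is not None:
            return result
    return None

# Hypothetical tree matching the A..K iterations listed above
graph = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
         'D': ['H', 'I'], 'E': [], 'F': ['K'], 'G': []}
print(iterative_deepening(graph, 'A', 'K'))    # ['A', 'C', 'F', 'K']
```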
6. Bidirectional Search Algorithm:
The bidirectional search algorithm runs two simultaneous searches, one from the initial
state, called the forward search, and the other from the goal node, called the backward
search, to find the goal node.
Bidirectional search replaces a single search graph with two small subgraphs, one
starting the search from the initial vertex and the other starting from the goal
vertex. The search stops when these two graphs intersect each other.
Bidirectional search can use search techniques such as BFS, DFS, DLS, etc.
Advantages:
o Bidirectional search is fast.
o Bidirectional search requires less memory
Disadvantages:
o Implementation of the bidirectional search tree is difficult.
Example:
In the below search tree, bidirectional search algorithm is applied. This algorithm
divides one graph/tree into two sub-graphs. It starts traversing from node 1 in the
forward direction and starts from goal node 16 in the backward direction.
Heuristics function:
A heuristic is a function used in informed search that finds the most
promising path.
It takes the current state of the agent as its input and produces an estimate of
how close the agent is to the goal.
The heuristic method might not always give the best solution, but it is
guaranteed to find a good solution in reasonable time.
The heuristic function estimates how close a state is to the goal. It is represented by
h(n), and it estimates the cost of an optimal path from the given state to the goal
state. The value of the heuristic function is always positive.
The admissibility condition on the heuristic is
h(n) ≤ h*(n),
where h(n) is the heuristic cost and h*(n) is the actual (optimal) cost. Hence the heuristic
cost should be less than or equal to the actual cost.
Pure heuristic search is the simplest form of heuristic search algorithm. It expands
nodes based on their heuristic value h(n). It maintains two lists, an OPEN list and a
CLOSED list. In the CLOSED list it places the nodes which have already been expanded,
and in the OPEN list it places the nodes which have not yet been expanded.
On each iteration, the node n with the lowest heuristic value is expanded, all its
successors are generated, and n is placed in the CLOSED list. The algorithm continues
until a goal state is found.
In informed search we will discuss two main algorithms, which are given below:
o Greedy Best-First Search
o A* Search
1. Greedy Best-First Search
The greedy best-first search algorithm always selects the path which appears best at that
moment. It is the combination of depth-first search and breadth-first search algorithms.
It uses the heuristic function to guide the search. Best-first search allows us to take
advantage of both algorithms. With the help of best-first search, at each step we can
choose the most promising node. In the greedy best-first search algorithm, we expand the
node which is closest to the goal node, where the closeness is estimated by the heuristic
function, i.e.
1. f(n) = h(n).
Advantages:
o Best first search can switch between BFS and DFS by gaining the advantages of
both the algorithms.
o This algorithm is more efficient than BFS and DFS algorithms.
Disadvantages:
o It can behave as an unguided depth-first search in the worst case scenario.
o It can get stuck in a loop as DFS.
o This algorithm is not optimal.
Example:
Consider the below search problem, and we will traverse it using greedy best-first
search. At each iteration, each node is expanded using evaluation function f(n)=h(n) ,
which is given in the below table.
In this search example, we are using two lists which are OPEN and CLOSED Lists.
Following are the iteration for traversing the above example.
Time Complexity: The worst-case time complexity of greedy best-first search is O(b^m).
Space Complexity: The worst-case space complexity of greedy best-first search is
O(b^m), where m is the maximum depth of the search space.
Complete: Greedy best-first search is also incomplete, even if the given state space is
finite.
2. A* Search Algorithm
A* search is the most commonly known form of best-first search. It uses the heuristic
function h(n) and the cost g(n) to reach the node n from the start state.
It combines features of UCS and greedy best-first search, by which it solves the
problem efficiently. The A* search algorithm finds the shortest path through the search
space using the heuristic function.
This search algorithm expands a smaller search tree and provides an optimal result
faster. A* is similar to UCS except that it uses g(n) + h(n) instead of g(n).
In the A* search algorithm we use the search heuristic as well as the cost to reach the
node. Hence we can combine both costs as follows; this sum is called the fitness number:
f(n) = g(n) + h(n)
At each point in the search space, only the node with the lowest value of f(n) is expanded,
and the algorithm terminates when the goal node is found.
Algorithm of A* search:
Step 1: Place the starting node in the OPEN list.
Step 2: Check if the OPEN list is empty or not; if the list is empty then return failure and
stop.
Step 3: Select the node from the OPEN list which has the smallest value of the evaluation
function (g + h). If node n is the goal node, then return success and stop; otherwise
Step 4: Expand node n, generate all of its successors, and put n into the CLOSED list.
For each successor n', check whether n' is already in the OPEN or CLOSED list; if not,
then compute the evaluation function for n' and place it into the OPEN list.
Step 5: Else, if node n' is already in OPEN or CLOSED, then it should be attached to the
back pointer which reflects the lowest g(n') value.
Step 6: Return to Step 2.
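A minimal Python sketch of the A* algorithm, expanding the node with the lowest f(n) = g(n) + h(n). The weighted graph and the heuristic values h are assumed illustrative data, not the table from the figure below.

```python
import heapq

def a_star(graph, h, start, goal):
    """A* search: expand the node with the lowest f(n) = g(n) + h(n)."""
    open_list = [(h[start], 0, start, [start])]     # (f, g, node, path)
    best_g = {start: 0}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return g, path                          # optimal if h is admissible
        for successor, step_cost in graph.get(node, []):
            new_g = g + step_cost
            if new_g < best_g.get(successor, float('inf')):
                best_g[successor] = new_g
                heapq.heappush(open_list,
                               (new_g + h[successor], new_g, successor, path + [successor]))
    return None                                     # OPEN list empty: failure

# Hypothetical weighted graph and admissible heuristic values h(n)
graph = {'S': [('A', 1), ('B', 4)], 'A': [('B', 2), ('C', 5), ('G', 12)],
         'B': [('C', 2)], 'C': [('G', 3)]}
h = {'S': 7, 'A': 6, 'B': 4, 'C': 2, 'G': 0}
print(a_star(graph, h, 'S', 'G'))                   # (8, ['S', 'A', 'B', 'C', 'G'])
```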
Advantages:
o The A* search algorithm often performs better than other search algorithms.
o A* search algorithm is optimal and complete.
Disadvantages:
o It does not always produce the shortest path, as it is mostly based on heuristics and
approximation.
o A* search algorithm has some complexity issues.
o The main drawback of A* is memory requirement as it keeps all generated nodes
in the memory, so it is not practical for various large-scale problems.
Example:
In this example, we will traverse the given graph using the A* algorithm. The heuristic
value of all states is given in the below table so we will calculate the f(n) of each state
using the formula f(n)= g(n) + h(n), where g(n) is the cost to reach any node from start
state.
Here we will use OPEN and CLOSED list.
Solution:
o The A* algorithm returns the path which is found first, and it does not search all
remaining paths.
o The efficiency of the A* algorithm depends on the quality of the heuristic.
o The A* algorithm expands all nodes which satisfy the condition f(n) ≤ C*, where
C* is the cost of the optimal solution.
o Admissible: the first condition required for optimality is that h(n) should be an
admissible heuristic for A* tree search. An admissible heuristic is optimistic in
nature, i.e. it never overestimates the true cost.
o Consistency: the second condition, consistency, is required only for A* graph search.
If the heuristic function is admissible, then A* tree search will always find the least-cost
path.
Recursive Best-First Search (RBFS) example (route finding in Romania):
Unwind the recursion and store the best f-value for the current best leaf Pitesti.
Unwind the recursion and store the best f-value for the current best leaf Fagaras.
The best node is now Rimnicu Vilcea (again). Call RBFS for the new best node:
– the subtree is expanded again;
– the best alternative subtree is now through Timisoara.
The solution is found because 447 > 417.
SMA* (Simplified Memory-Bounded A*):
This is like A*, but when memory is full we delete the worst node (largest f-value).
Like RBFS, we remember the best descendant in the branch we delete.
If there is a tie (equal f-values) we delete the oldest nodes first.
SMA* finds the optimal reachable solution given the memory constraint.
Time can still be exponential.
Figure: progress of SMA*. Each node is labeled with its current f-cost (f = g + h); values
in parentheses show the value of the best forgotten descendant.
Hill Climbing Algorithm:
o In this algorithm, we do not need to maintain and handle a search tree or graph,
as it only keeps a single current state.
o Generate and Test variant: Hill climbing is a variant of the Generate and Test
method. The Generate and Test method produces feedback which helps to decide
which direction to move in the search space.
o Greedy approach: The hill-climbing search moves in the direction which
optimizes the cost.
o No backtracking: It does not backtrack the search space, as it does not
remember the previous states.
On the Y-axis we take the function, which can be an objective function or a cost
function, and on the X-axis we take the state space. If the function on the Y-axis is a cost
function, then the goal of the search is to find the global minimum (or a local minimum).
If the function on the Y-axis is an objective function, then the goal of the search is to find
the global maximum (or a local maximum).
Local Maximum: Local maximum is a state which is better than its neighbor states, but
there is also another state which is higher than it.
Global Maximum: Global maximum is the best possible state of state space landscape.
It has the highest value of objective function.
Flat local maximum: It is a flat space in the landscape where all the neighbor states of
current states have the same value.
1. Simple Hill Climbing:
Simple hill climbing is the simplest way to implement a hill climbing algorithm. It
evaluates only one neighbour node state at a time and selects the first one which
optimizes the current cost, setting it as the current state. It checks only one successor
state, and if that successor is better than the current state, it moves; otherwise it stays in
the same state. The algorithm works as follows:
o Step 1: Evaluate the initial state, if it is goal state then return success and Stop.
o Step 2: Loop Until a solution is found or there is no new operator left to apply.
o Step 3: Select and apply an operator to the current state.
o Step 4: Check new state:
a. If it is the goal state, then return success and quit.
b. Else, if it is better than the current state, then assign the new state as the current state.
c. Else, if it is not better than the current state, then return to Step 2.
Step 5: Exit.
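A minimal Python sketch of simple hill climbing on a toy one-dimensional landscape; the objective function and the neighbour generator are assumptions chosen only to illustrate the loop above.

```python
import random

def simple_hill_climbing(objective, neighbours, state, max_steps=1000):
    """Move to the first neighbour that improves the objective; stop when none does."""
    for _ in range(max_steps):
        improved = False
        for candidate in neighbours(state):
            if objective(candidate) > objective(state):
                state = candidate          # accept the first better successor
                improved = True
                break
        if not improved:
            return state                   # local (or global) maximum reached
    return state

# Hypothetical one-dimensional landscape: maximise f(x) = -(x - 3)^2 over integers
objective = lambda x: -(x - 3) ** 2
neighbours = lambda x: [x - 1, x + 1]
print(simple_hill_climbing(objective, neighbours, random.randint(-10, 10)))   # 3
```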
2. Steepest-Ascent Hill Climbing:
Steepest-ascent hill climbing examines all the neighbours of the current state and selects
the best successor. Its algorithm is:
o Step 1: Evaluate the initial state; if it is the goal state then return success and stop,
else make the current state the initial state.
o Step 2: Loop until a solution is found or the current state does not change.
a. Let SUCC be a state such that any successor of the current state will be better
than it.
b. For each operator that applies to the current state:
Step 3: Exit.
3. Stochastic Hill Climbing:
Stochastic hill climbing does not examine all of its neighbours before moving. Rather, this
search algorithm selects one neighbour node at random and decides whether to accept it
as the current state or to examine another state.
Problems in Hill Climbing:
1. Local Maximum: A local maximum is a peak state in the landscape which is better
than each of its neighbouring states, but there is another state in the landscape which is
higher than it.
Solution: The backtracking technique can be a solution to the local maximum problem.
Maintain a list of promising paths so that the algorithm can backtrack the search space
and explore other paths as well.
2. Plateau: A plateau is a flat area of the search space in which all the neighbour states
of the current state have the same value; because of this, the algorithm cannot find the
best direction in which to move. A hill-climbing search might get lost in the plateau area.
Solution: The solution to a plateau is to take big steps (or very small steps) while
searching. Randomly select a state far away from the current state, so that it is possible
for the algorithm to reach a non-plateau region.
3. Ridges: A ridge is a special form of the local maximum. It has an area which is higher
than its surrounding areas, but itself has a slope, and cannot be reached in a single
move.
Simulated Annealing:
A hill-climbing algorithm which never makes a move towards a lower value is guaranteed
to be incomplete, because it can get stuck on a local maximum. If the algorithm instead
applies a pure random walk, moving to a randomly chosen successor, it may be complete
but is not efficient. Simulated annealing is an algorithm which yields both efficiency and
completeness: it picks a random move and always accepts it if it improves the situation;
otherwise it accepts the move with a probability that decreases as the move gets worse
and as the "temperature" is lowered over time.
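A minimal Python sketch of simulated annealing, assuming a toy landscape with one local and one global maximum; the temperature schedule (t0, cooling) and the landscape itself are illustrative choices, not prescribed values.

```python
import math, random

def simulated_annealing(objective, neighbours, state, t0=10.0, cooling=0.99, steps=2000):
    """Accept worse moves with probability exp(delta/T); T shrinks over time."""
    temperature = t0
    best = state
    for _ in range(steps):
        candidate = random.choice(neighbours(state))
        delta = objective(candidate) - objective(state)
        if delta > 0 or random.random() < math.exp(delta / temperature):
            state = candidate              # downhill moves are allowed, mostly early on
        if objective(state) > objective(best):
            best = state
        temperature = max(temperature * cooling, 1e-6)
    return best

# Hypothetical landscape: local maximum at x = -4, global maximum at x = 5
objective = lambda x: -abs(x - 5) if x >= 0 else -abs(x + 4) - 2
neighbours = lambda x: [x - 1, x + 1]
print(simulated_annealing(objective, neighbours, -8))   # typically 5 (stochastic)
```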
Adversarial search is a kind of search in which we examine the problems that arise when
we try to plan ahead in a world where other agents are planning against us.
o The search strategies discussed so far are associated with a single agent that aims
to find a solution, often expressed as a sequence of actions.
o However, there may be situations where more than one agent is searching for a
solution in the same search space; this usually occurs in game playing.
o An environment with more than one agent is termed a multi-agent environment,
in which each agent is an opponent of the other and they play against each other.
Each agent needs to consider the actions of the other agent and the effect of those
actions on its own performance.
o So, searches in which two or more players with conflicting goals are trying
to explore the same search space for the solution are called adversarial
searches, often known as games.
o Games are modelled as a search problem together with a heuristic evaluation
function; these are the two main factors which help to model and solve games in AI.
Types of Games in AI
Games in AI can be classified along two dimensions: whether they are deterministic or
involve chance moves, and whether the players have perfect or imperfect information.
Chess, for example, is a deterministic game of perfect information, while poker involves
both chance moves and imperfect information.
Zero-Sum Game
o Zero-sum games are adversarial games in which one player's gain is exactly
balanced by the other player's loss, so the total payoff is constant (zero).
o The zero-sum game involves embedded thinking, in which one agent or player is
trying to figure out:
o what to do,
o how to decide the move,
o that it needs to think about its opponent as well,
o and that the opponent is also thinking about what to do.
Each of the players is trying to find out the response of his opponent to their actions.
This requires embedded thinking or backward reasoning to solve the game problems in
AI.
Problem Formulation
The following figure is showing part of the game-tree for tic-tac-toe game. Following are
some key points of the game:
o From the initial state, MAX has 9 possible moves, as he starts first. MAX places x
and MIN places o, and both players play alternately until we reach a leaf node
where one player has three in a row or all squares are filled.
o Both players compute the minimax value of each node, which is the best
achievable utility against an optimal adversary.
o Suppose both players know tic-tac-toe well and play optimally. Each player does
his best to prevent the other from winning; MIN acts against MAX in the game.
o So in the game tree we have alternating layers of MAX and MIN, and each layer is
called a ply. MAX places x, then MIN places o to prevent MAX from winning, and
the game continues until a terminal node is reached.
o In the end either MIN wins, MAX wins, or it is a draw. This game tree is the whole
search space of possibilities when MIN and MAX play tic-tac-toe, taking turns
alternately.
o The aim is to find the optimal strategy for MAX to win the game.
o It follows the approach of depth-first search.
o In the game tree, the optimal leaf node could appear at any depth of the tree.
o The minimax values are propagated up the tree from the terminal nodes as they
are discovered.
Min-Max Algorithm
o In this algorithm two players play the game; one is called MAX and the other is
called MIN.
o The two players compete, as the opponent gets the minimum benefit while the
player itself gets the maximum benefit.
o Both players of the game are opponents of each other: MAX will select the
maximized value and MIN will select the minimized value.
o The minimax algorithm performs a depth-first search for the exploration of the
complete game tree.
o The minimax algorithm proceeds all the way down to the terminal nodes of the
tree, then backs values up the tree as the recursion unwinds.
Step 1: In the first step, the algorithm generates the entire game tree and applies
the utility function to get the utility values for the terminal states. In the below
tree diagram, let A be the initial state of the tree. Suppose the maximizer takes the
first turn, with worst-case initial value -∞, and the minimizer takes the next turn,
with worst-case initial value +∞.
Step 2: Now we first find the utility value for the maximizer. Its initial value is -∞, so
we compare each terminal value with the maximizer's initial value and determine the
higher node values; it finds the maximum among them all.
Step 3: In the next step it is the minimizer's turn, so it compares all node values
with +∞ and determines the third-layer node values.
Step 4: Now it is the maximizer's turn, and it again chooses the maximum of all node
values to find the value of the root node. In this game tree there are only 4 layers, so
we reach the root node immediately, but in real games there will be more than 4
layers.
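A minimal Python sketch of the minimax recursion on a small hypothetical game tree with known terminal utilities; the tree and values below are assumptions, not the exact figure used in the steps above.

```python
def minimax(node, maximizing, tree, values):
    """Return the minimax value of a node by exploring the game tree depth-first."""
    if node in values:                     # terminal node: utility is known
        return values[node]
    child_values = [minimax(c, not maximizing, tree, values) for c in tree[node]]
    return max(child_values) if maximizing else min(child_values)

# Hypothetical 2-ply game tree with terminal utilities at the leaves
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G']}
values = {'D': 3, 'E': 5, 'F': 2, 'G': 9}
print(minimax('A', True, tree, values))    # max(min(3, 5), min(2, 9)) = 3
```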
The main drawback of the minimax algorithm is that it gets really slow for complex
games such as chess, Go, etc. These games have a huge branching factor, and the
player has many choices to decide between. This limitation of the minimax algorithm can
be mitigated by alpha-beta pruning.
o In the minimax search algorithm, the number of game states it has to examine
is exponential in the depth of the tree.
o We cannot eliminate the exponent, but we can effectively cut it in half. There is a
technique by which we can compute the correct minimax decision without
checking each node of the game tree; this technique is called pruning.
o It involves two threshold parameters, alpha and beta, for future expansion, so it
is called alpha-beta pruning. It is also called the alpha-beta algorithm.
o Alpha-beta pruning can be applied at any depth of a tree, and sometimes it prunes
not only the tree leaves but also entire sub-trees.
o The two parameters are defined as:
o Alpha: The best (highest-value) choice we have found so far at any
point along the path of Maximizer. The initial value of alpha is -∞.
o Beta: The best (lowest-value) choice we have found so far at any
point along the path of Minimizer. The initial value of beta is +∞.
o Alpha-beta pruning applied to a standard minimax algorithm returns the same
move as the standard algorithm, but it removes the nodes which do not affect the
final decision and only make the algorithm slow. Pruning these nodes makes the
algorithm faster.
Working of alpha-beta pruning:
Step 1: The MAX player starts the first move from node A, where α = -∞ and β = +∞;
these values of alpha and beta are passed down to node B, and node B passes the same
values to its child D.
Step 2: At node D the value of α will be calculated, as it is MAX's turn. The value of α is
compared first with 2 and then with 3, and max(2, 3) = 3 will be the value of α at node D;
the node value will also be 3.
Step 3: Now the algorithm backtracks to node B, where the value of β will change, as this
is MIN's turn. Now β = +∞ is compared with the available subsequent node value, i.e.
min(+∞, 3) = 3; hence at node B, α = -∞ and β = 3.
In the next step, the algorithm traverses the next successor of node B, which is node E,
and the values α = -∞ and β = 3 are also passed down.
Step 4: At node E, MAX will take its turn, and the value of alpha will change. The current
value of alpha is compared with 5, so max(-∞, 5) = 5; hence at node E, α = 5 and β = 3.
Since α ≥ β, the right successor of E will be pruned and the algorithm will not traverse
it; the value at node E will be 5.
Step 5: In the next step the algorithm again backtracks the tree, from node B to node A.
At node A the value of alpha is changed to the maximum available value, 3, as
max(-∞, 3) = 3, and β = +∞. These two values are now passed to the right successor of A,
which is node C.
At node C, α = 3 and β = +∞, and the same values are passed on to node F.
Step 6: At node F, the value of α is compared first with the left child, 0 (max(3, 0) = 3),
and then with the right child, 1 (max(3, 1) = 3); α remains 3, but the node value of F
becomes 1.
Step 7: Node F returns the node value 1 to node C. At C, α = 3 and β = +∞; here the value
of beta will be changed, as it is compared with 1, so min(+∞, 1) = 1. Now at C, α = 3 and
β = 1, which again satisfies the condition α ≥ β, so the next child of C, which is G, will be
pruned, and the algorithm will not compute the entire sub-tree of G.
Step 8: C now returns the value 1 to A. The best value for A is max(3, 1) = 3.
The following is the final game tree, showing the nodes which were computed and the
nodes which were never computed (pruned). The optimal value for the maximizer is 3
for this example.
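A minimal Python sketch of minimax with alpha-beta pruning. The hypothetical tree below loosely follows the A/B/C and D/E/F/G labels of the worked example (the values of the pruned leaves are assumed), and it reproduces the same optimal value of 3.

```python
import math

def alphabeta(node, maximizing, tree, values, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning: stop exploring a branch once alpha >= beta."""
    if node in values:
        return values[node]
    if maximizing:
        best = -math.inf
        for child in tree[node]:
            best = max(best, alphabeta(child, False, tree, values, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break                      # beta cut-off: MIN will never allow this
        return best
    best = math.inf
    for child in tree[node]:
        best = min(best, alphabeta(child, True, tree, values, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:
            break                          # alpha cut-off: MAX will never allow this
    return best

# Hypothetical game tree loosely following the A-B-C / D-E-F-G example above
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
        'D': ['d1', 'd2'], 'E': ['e1', 'e2'], 'F': ['f1', 'f2'], 'G': ['g1', 'g2']}
values = {'d1': 2, 'd2': 3, 'e1': 5, 'e2': 9, 'f1': 0, 'f2': 1, 'g1': 7, 'g2': 5}
print(alphabeta('A', True, tree, values))   # 3, with E's right child and all of G pruned
```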
• What is a CSP?
– Finite set of variables X1, X2, …, Xn
– Nonempty domain of possible values for each variable
D1, D2, …, Dn
– Finite set of constraints C1, C2, …, Cm
• Each constraint Ci limits the values that variables can take,
• e.g., X1 ≠ X2
– Each constraint Ci is a pair <scope, relation>
• Scope = Tuple of variables that participate in the constraint.
• Relation = List of allowed combinations of variable values.
In a constraint satisfaction problem, a domain is the set of values that a variable is
allowed to take, subject to the restrictions particular to the task. Variables, domains,
and constraints together make up a constraint satisfaction problem in its entirety. Each
constraint is a pair <scope, rel>: the scope is a tuple of the variables that participate in
the constraint, and rel is a relation that lists the combinations of values those variables
may take in order to satisfy the restrictions of the problem.
Components of a CSP
For a constraint satisfaction problem (CSP), the following components must be defined:
o State space: a state is defined by assigning values to some or all of the variables.
o An assignment that does not violate any constraint is called a consistent
assignment; an assignment in which every variable has a value is a complete
assignment.
o A partial assignment is one that gives values to only some of the variables;
assignments of this nature are referred to as incomplete assignments.
Basically, there are three different categories of constraints with respect to the
variables:
o Unary constraints are the simplest kind of constraint, because they restrict the
value of a single variable.
o Binary constraints relate the values of two variables.
o Global constraints involve an arbitrary number of variables.
• Example: map colouring of Australia
• Variables: WA, NT, Q, NSW, V, SA, T
• Domains: Di = {red, green, blue}
• Constraints: adjacent regions must have different colours, e.g. WA ≠ NT
• One solution: {WA=red, NT=green, Q=red, NSW=green, V=red, SA=blue, T=green}
• Constraint graph: nodes are the variables, arcs are the constraints
Variety of constraints
– Unary constraints involve a single variable, e.g. SA ≠ green
– Binary constraints involve pairs of variables, e.g. SA ≠ WA
– Preferences (soft constraints), e.g. "red is better than green", can often be
represented by a cost for each variable assignment
• Incremental formulation (CSP as a standard search problem):
– Initial state: the empty assignment { }
– Successor function: assign a value to an unassigned variable that does not
conflict with the current assignment
– Goal test: the current assignment is complete
– Path cost: constant cost for every step (not really relevant)
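A minimal Python sketch of backtracking search for the map-colouring CSP above; the `backtracking_search` function and the way domains and neighbours are encoded are illustrative assumptions.

```python
def backtracking_search(variables, domains, neighbours, assignment=None):
    """Assign variables one at a time, backtracking when a constraint (X != neighbour) fails."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return assignment                      # complete, consistent assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        consistent = all(assignment.get(n) != value for n in neighbours[var])
        if consistent:
            result = backtracking_search(variables, domains, neighbours,
                                         {**assignment, var: value})
            if result is not None:
                return result
    return None                                # no value works: backtrack

# Map colouring of Australia: adjacent regions must get different colours
variables = ['WA', 'NT', 'SA', 'Q', 'NSW', 'V', 'T']
domains = {v: ['red', 'green', 'blue'] for v in variables}
neighbours = {'WA': ['NT', 'SA'], 'NT': ['WA', 'SA', 'Q'],
              'SA': ['WA', 'NT', 'Q', 'NSW', 'V'], 'Q': ['NT', 'SA', 'NSW'],
              'NSW': ['Q', 'SA', 'V'], 'V': ['SA', 'NSW'], 'T': []}
print(backtracking_search(variables, domains, neighbours))
```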
Problem: You are given two jugs, a 4-gallon one and a 3-gallon one. Neither has any
measuring marks on it. There is a pump that can be used to fill the jugs with water. How
can you get exactly 2 gallons of water into the 4-gallon jug?
Solution:
The state space for this problem can be described as the set of ordered pairs of
integers (x,y)
where
X represents the quantity of water in the 4-gallon jug, X = 0, 1, 2, 3, 4
Y represents the quantity of water in the 3-gallon jug, Y = 0, 1, 2, 3
Start State: (0,0)
Goal State: (2,0)
Generate production rules for the water jug problem
Production Rules:
Rule 1: (X,Y | X<4) → (4,Y)  {Fill the 4-gallon jug}
Rule 2: (X,Y | Y<3) → (X,3)  {Fill the 3-gallon jug}
Rule 3: (X,Y | X>0) → (0,Y)  {Empty the 4-gallon jug}
Rule 4: (X,Y | Y>0) → (X,0)  {Empty the 3-gallon jug}
Rule 5: (X,Y | X+Y>=4 ∧ Y>0) → (4, Y-(4-X))  {Pour water from the 3-gallon jug into the 4-gallon jug until the 4-gallon jug is full}
Rule 6: (X,Y | X+Y>=3 ∧ X>0) → (X-(3-Y), 3)  {Pour water from the 4-gallon jug into the 3-gallon jug until the 3-gallon jug is full}
Rule 7: (X,Y | X+Y<=4 ∧ Y>0) → (X+Y, 0)  {Pour all water from the 3-gallon jug into the 4-gallon jug}
Rule 8: (X,Y | X+Y<=3 ∧ X>0) → (0, X+Y)  {Pour all water from the 4-gallon jug into the 3-gallon jug}
Rule 9: (0,2) → (2,0)  {Pour the 2 gallons of water from the 3-gallon jug into the 4-gallon jug}
Initialization:
Start State: (0,0)
Apply Rule 2: (X,Y | Y<3) → (X,3)  {Fill the 3-gallon jug}
Now the state is (0,3).
Iteration 1:
Current State: (0,3)
Apply Rule 7: (X,Y | X+Y<=4 ∧ Y>0) → (X+Y, 0)  {Pour all water from the 3-gallon jug into the 4-gallon jug}
Now the state is (3,0)
Iteration 2:
Current State : (3,0)
Apply Rule 2: (X,Y | Y<3) → (X,3)  {Fill the 3-gallon jug}
Now the state is (3,3)
Iteration 3:
Current State:(3,3)
Apply Rule 5: (X,Y | X+Y>=4 ∧ Y>0) → (4, Y-(4-X))  {Pour water from the 3-gallon jug into the 4-gallon jug until the 4-gallon jug is full}
Now the state is (4,2)
Iteration 4:
Current State : (4,2)
Apply Rule 3: (X,Y | X>0) → (0,Y)  {Empty the 4-gallon jug}
Now the state is (0,2)
Iteration 5:
Current State : (0,2)
Apply Rule 9: (0,2) → (2,0)  {Pour the 2 gallons of water from the 3-gallon jug into the 4-gallon jug}
Now the state is (2,0)
Goal Achieved.
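A minimal Python sketch that searches the water jug state space with BFS, using successor moves equivalent to the production rules above; the function name and encoding are illustrative.

```python
from collections import deque

def water_jug_bfs(capacity4=4, capacity3=3, goal=(2, 0)):
    """Breadth-first search over (x, y) states generated by the production rules above."""
    def successors(x, y):
        return {
            (capacity4, y), (x, capacity3),                        # Rules 1-2: fill a jug
            (0, y), (x, 0),                                        # Rules 3-4: empty a jug
            (min(x + y, capacity4), max(0, x + y - capacity4)),    # Rules 5, 7: pour 3 -> 4
            (max(0, x + y - capacity3), min(x + y, capacity3)),    # Rules 6, 8: pour 4 -> 3
        }
    frontier = deque([[(0, 0)]])
    visited = {(0, 0)}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for state in successors(*path[-1]):
            if state not in visited:
                visited.add(state)
                frontier.append(path + [state])
    return None

print(water_jug_bfs())   # [(0,0), (0,3), (3,0), (3,3), (4,2), (0,2), (2,0)], as traced above
```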
CRYPTARITHMETIC PROBLEM: SEND + MORE = MONEY
Each letter must be assigned a distinct digit so that the addition holds. To convert the
puzzle into numeric form, we first write the words one below the other:
SEND
MORE
-------------
MONEY
These alphabets then are replaced by numbers such that all the constraints are satisfied.
So initially we have all blank spaces.
We first look at the most significant digit of the last word, which is 'M' in the word
'MONEY'. It is a letter generated purely by a carry, and the carry generated can only
be 1. So we have M = 1.
Now, we have S+M=O in the second column from the left side. Here M=1. Therefore, we
have, S+1=O. So, we need a number for S such that it generates a carry when added
with 1. And such a number is 9. Therefore, we have S=9 and O=0.
Now, in the next column from the same side we have E + O = N. Here O = 0, which would
mean E = N; that is not possible, since each letter must stand for a different digit. This
means a carry was generated by the lower place digits, so we have:
1 + E = N ----------(i)
In the column to its right we have N + R = E, which likewise cannot hold directly; a carry
must be generated into the previous column, so N + R = E + 10 ----------(ii)
So, for satisfying both equations (i) and (ii), we get E = 5 and N = 6.
Now, R should be 9, but 9 is already assigned to S, So, R=8 and we have 1 as a carry
which is generated from the lower place digits.
Now, we have D+5=Y and this should generate a carry. Therefore, D should be greater
than 4. As 5, 6, 8 and 9 are already assigned, we have D=7 and therefore Y=2.
  9 5 6 7
+ 1 0 8 5
-------------
1 0 6 5 2
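A minimal brute-force Python sketch that confirms the assignment derived above by trying digit permutations; it is written only for SEND + MORE = MONEY and is not an efficient general cryptarithmetic solver.

```python
from itertools import permutations

def solve_send_more_money():
    """Brute-force search: try digit assignments that satisfy SEND + MORE = MONEY."""
    letters = 'SENDMORY'                       # the 8 distinct letters in the puzzle
    for digits in permutations(range(10), len(letters)):
        a = dict(zip(letters, digits))
        if a['S'] == 0 or a['M'] == 0:         # leading letters cannot be zero
            continue
        send  = 1000 * a['S'] + 100 * a['E'] + 10 * a['N'] + a['D']
        more  = 1000 * a['M'] + 100 * a['O'] + 10 * a['R'] + a['E']
        money = 10000 * a['M'] + 1000 * a['O'] + 100 * a['N'] + 10 * a['E'] + a['Y']
        if send + more == money:
            return send, more, money
    return None

print(solve_send_more_money())   # (9567, 1085, 10652)
```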
On one bank of a river are three missionaries and three cannibals. There is one boat
available that can hold up to two people and that they would like to use to cross the river.
If the cannibals ever outnumber the missionaries on either of the river’s banks, the
missionaries will get eaten.
How can the boat be used to safely carry all the missionaries and cannibals across the
river?
First let us consider that both the missionaries (M) and cannibals(C) are on the same
side of the river.
Here (B) shows the position of the boat after each action is performed. Following this
sequence of moves, all the missionaries and cannibals cross the river safely.
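A minimal Python sketch that finds a safe crossing sequence by BFS over states (missionaries on the start bank, cannibals on the start bank, boat side); the state encoding and function name are illustrative assumptions.

```python
from collections import deque

def missionaries_and_cannibals():
    """BFS over states (missionaries_left, cannibals_left, boat_side); boat carries 1 or 2."""
    def safe(m, c):
        # Missionaries are never outnumbered on either bank (or are absent from that bank)
        return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)
    start, goal = (3, 3, 1), (0, 0, 0)         # boat side: 1 = start bank, 0 = far bank
    moves = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]
    frontier, visited = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        m, c, b = path[-1]
        if (m, c, b) == goal:
            return path
        for dm, dc in moves:
            nm, nc = (m - dm, c - dc) if b == 1 else (m + dm, c + dc)
            state = (nm, nc, 1 - b)
            if 0 <= nm <= 3 and 0 <= nc <= 3 and safe(nm, nc) and state not in visited:
                visited.add(state)
                frontier.append(path + [state])
    return None

print(len(missionaries_and_cannibals()) - 1)   # 11 crossings
```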
UNIT- II
PROBABILISTIC REASONING
Agents in the real world need to handle uncertainty, whether due to partial
observability, nondeterminism, or adversaries. An agent may never know for sure what
state it is in now or where it will end up after a sequence of actions.
– One option is to use a simple but incorrect theory of the world, one which does not
take uncertainty into account but will work most of the time.
• For example, a diagnosis rule such as "Toothache ⇒ Cavity" is wrong, and in order to
make it true we would have to add an almost unlimited list of possible causes.
• Trying to use first-order logic to cope with a domain like medical diagnosis fails
for three main reasons: laziness, theoretical ignorance, and practical ignorance.
• Actually, the connection between toothaches and cavities is just not a logical
consequence in any direction.
• In judgmental domains (medical, law, design...) the agent’s knowledge can at best
provide a degree of belief in the relevant sentences.
• The main tool for dealing with degrees of belief is probability theory, which
assigns to each sentence a numerical degree of belief between 0 and 1.
• Degrees of belief can be based on statistical data.
• A probability of 0.8 does not mean “80% true”, but rather an 80% degree of
belief that something is true.
• In probability theory, a sentence such as “The probability that the patient has a
cavity is 0.8” is about the agent’s belief, not directly about the world.
• These beliefs depend on the percepts that the agent has received to date.
• For example:
• Before looking at the card, the agent might assign a probability of 1/52 to
its being the ace of spades.
Following are some leading causes of uncertainty to occur in the real world.
• Experimental Errors
• Equipment fault
• Temperature variation
• Climate change.
Probabilistic Reasoning
In probabilistic reasoning, there are two ways to solve problems with uncertain
knowledge:
• Bayes' rule
• Bayesian Statistics
Probability
• Sample space: The collection of all possible events is called sample space.
• Random variables: Random variables are used to represent the events and
objects in the real world.
Conditional probability:
• Conditional probability is the probability of an event occurring given that another
event has already happened.
• Let's suppose we want to calculate the probability of event A when event B has
already occurred, "the probability of A under the conditions of B". It can be written
as:
P(A|B) = P(A ∧ B) / P(B), where P(A ∧ B) is the joint probability of A and B.
• We can find the probability of the complement of an event by using the formula:
P(¬A) = 1 − P(A), i.e. P(¬A) + P(A) = 1.
Example
• In a class, 70% of the students like English and 40% of the students like both
English and mathematics. What percentage of the students who like English also
like mathematics?
Solution:
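A worked computation for this example, using the conditional probability formula above (assuming "likes English" and "likes English and mathematics" correspond to the stated 70% and 40%):

```latex
P(\text{Maths}\mid\text{English})
  = \frac{P(\text{English} \wedge \text{Maths})}{P(\text{English})}
  = \frac{0.40}{0.70} \approx 0.57 = 57\%
```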
Baye’s Theorem
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
Example: If cancer corresponds to one's age then by using Bayes' theorem, we can
determine the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of
event A with known event B:
P(A|B) = [ P(B|A) · P(A) ] / P(B)
It shows the simple relationship between joint and conditional probabilities. Here,
P(B|A) is called the likelihood: assuming that the hypothesis A is true, it is the probability
of the evidence B.
P(A) is called the prior probability: the probability of the hypothesis before considering
the evidence. P(B) is the marginal probability of the evidence, and P(A|B) is the posterior
probability.
Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and
P(A). This is very useful in cases where we have good estimates of these three terms
and want to determine the fourth one. If we want to perceive the effect of some
unknown cause and compute that cause, then Bayes' rule becomes:
P(cause | effect) = [ P(effect | cause) · P(cause) ] / P(effect)
Question: what is the probability that a patient has diseases meningitis with a stiff
neck?
Given Data:
A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it
occurs 80% of the time. He is also aware of some more facts, which are given as follows:
Let a be the proposition that the patient has a stiff neck and b the proposition that the
patient has meningitis, so we can calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a)= .02
Hence, we can assume that 1 patient out of 750 patients has meningitis disease
with a stiff neck
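Substituting the given values into Bayes' rule reproduces the stated result:

```latex
P(b\mid a) = \frac{P(a\mid b)\,P(b)}{P(a)}
           = \frac{0.8 \times \tfrac{1}{30000}}{0.02}
           \approx 0.00133 \approx \frac{1}{750}
```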
o It is used to calculate the next step of the robot when the already executed step is
given.
A Bayesian belief network is a key computing technology for dealing with probabilistic
events and for solving problems that involve uncertainty. We can define a Bayesian
network as a probabilistic graphical model which represents a set of variables and their
conditional dependencies using a directed acyclic graph.
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from
a probability distribution, and also use probability theory for prediction and anomaly
detection.
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and
it consists of two parts:
o a directed acyclic graph, and
o a table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems
under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where each node
represents a random variable and each arc represents a direct (causal) relationship
between variables. The Bayesian network has mainly two components:
o the causal component, and
o the actual numbers (conditional probabilities).
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations
of x1, x2, x3, ..., xn are known as the joint probability distribution.
The joint probability P[x1, x2, x3, ..., xn] can be written in terms of conditional
probabilities as:
P[x1, x2, ..., xn] = P[x1 | x2, ..., xn] · P[x2 | x3, ..., xn] · ... · P[xn-1 | xn] · P[xn]
In general, for each variable Xi in a Bayesian network, we can write the equation as:
P(Xi | X1, ..., Xi-1) = P(Xi | Parents(Xi))
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably responds at detecting a burglary but also responds for minor earthquakes.
Harry has two neighbors David and Sophia, who have taken a responsibility to inform
Harry at work when they hear the alarm. David always calls Harry when he hears the
alarm, but sometimes he got confused with the phone ringing and calls at that time too.
On the other hand, Sophia likes to listen to high music, so sometimes she misses to hear
the alarm. Here we would like to compute the probability of Burglary Alarm.
Problem:
Calculate the probability that alarm has sounded, but there is neither a burglary, nor an
earthquake occurred, and David and Sophia both called the Harry.
Solution:
o The Bayesian network for the above problem is given below. The network
structure shows that Burglary and Earthquake are the parent nodes of the Alarm
node and directly affect the probability of the alarm going off, while David's and
Sophia's calls depend on the alarm probability.
o The network represents that David and Sophia do not directly perceive the
burglary, do not notice minor earthquakes, and do not confer with each other
before calling.
o The conditional distributions for each node are given as conditional probabilities
table or CPT.
o Each row in the CPT must be sum to 1 because all the entries in the table
represent an exhaustive set of cases for the variable.
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement as the probability P[D, S, A, B, E], and we can rewrite it using the joint probability distribution:
P[D, S, A, B, E] = P(D | A) · P(S | A) · P(A | B, E) · P(B) · P(E)
Let's take the observed probabilities for the Burglary and Earthquake components:
P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
The conditional probability that David calls depends only on the probability of its parent node, Alarm.
The conditional probability that Sophia calls likewise depends only on its parent node, Alarm.
From the formula for the joint distribution, the problem statement can be written as the probability:
P(D, S, A, ¬B, ¬E) = P(D | A) · P(S | A) · P(A | ¬B, ¬E) · P(¬B) · P(¬E) = 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
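The factorization above can be evaluated directly in Python. The CPT entries below are the values commonly used for this textbook example (P(B) = 0.002, P(E) = 0.001, P(A|¬B,¬E) = 0.001, P(D|A) = 0.91, P(S|A) = 0.75); only P(E = False) = 0.999 is stated explicitly in the text, so treat the remaining numbers as assumed values that happen to reproduce the quoted result.

```python
# Joint probability P(D, S, A, ~B, ~E) for the burglar-alarm network,
# using the factorization P(D|A) * P(S|A) * P(A|~B,~E) * P(~B) * P(~E).

# Assumed CPT values for this classic example (see lead-in note).
p_b = 0.002                      # P(Burglary = True)
p_e = 0.001                      # P(Earthquake = True)
p_a_given_not_b_not_e = 0.001    # P(Alarm = True | ~B, ~E)
p_d_given_a = 0.91               # P(David calls | Alarm = True)
p_s_given_a = 0.75               # P(Sophia calls | Alarm = True)

joint = (p_d_given_a * p_s_given_a * p_a_given_not_b_not_e
         * (1 - p_b) * (1 - p_e))
print(f"P(D, S, A, ~B, ~E) = {joint:.8f}")   # ~0.00068045
```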
There are two ways to understand the semantics of a Bayesian network:
o as a representation of the joint probability distribution (helpful for understanding how to construct the network), and
o as an encoding of a collection of conditional independence statements (helpful for designing inference procedures).
Causal reasoning is the process of understanding the relationships between causes and
effects. It is the way that we, as humans, make sense of the world around us and draw
conclusions based on our observations. In a similar vein, causal AI uses algorithms and
models to identify and analyse causal relationships in data, allowing it to make
predictions and decisions based on these relationships.
Suppose now that you have some data on educational outcomes Y, school expenditures
X, and parent involvement C. The unit of observation is, say, a school district. The
educational outcome data might come from standardized testing. Parent involvement
might be the records of what fraction of parents attend their student’s quarterly teacher
conferences. You, the modeler, work for the national government. You’ve been asked to
figure out what will be the effect on educational outcomes of an intervention where the
national government will give additional funding to schools.
For the purpose of anticipating the impact of a change in X on Y, either of two models might be appropriate: Y ~ X or Y ~ X + C.
The above graph illustrates another simple yet typical Bayesian network. In contrast to the statistical relationships in the non-causal example, this graph describes the causal relationships among the seasons of the year (X1), whether it is raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5).
Here, the absence of a direct link between X1 and X5, for example, captures our
understanding that there is no direct influence of season on slipperiness. The influence
is mediated by the wetness of the pavement (if freezing were a possibility, a direct link
could be added).
Perhaps the most important aspect of Bayesian networks is that they are direct
representations of the world, not of reasoning processes.
The arrows in the diagram represent real causal connections and not the flow of
information during reasoning (as in rule-based systems and neural networks).
Reasoning processes can operate on Bayesian networks by propagating information in
any direction.
For example, if the sprinkler is on, then the pavement is probably wet (prediction,
simulation). If someone slips on the pavement, that will also provide evidence that it is
wet (abduction, reasoning to a probable cause, or diagnosis).
On the other hand, if we see that the pavement is wet, that will make it more likely that
the sprinkler is on or that it is raining (abduction); but if we then observe that the
sprinkler is on, that will reduce the likelihood that it is raining (explaining away).
It is the latter form of reasoning, explaining away, that is especially difficult to model in
rule-based systems and neural networks in a natural way because it seems to require
the propagation of information in two directions.
Causal Reasoning
For example, what if I turn the Sprinkler on instead of just observing that it is turned
on? What effect does that have on the Season, or on the connection between Wet and
Slippery?
A causal network, intuitively speaking, is a Bayesian network with the added property
that the parents of each node are its direct causes.
In such a network, the result of an intervention is obvious: the Sprinkler node is set to X3 = on and the causal link between the Season X1 and the Sprinkler X3 is removed. All other causal links and conditional probabilities remain intact, so the new model is:
P(x1, x2, x4, x5 | do(X3 = on)) = P(x1) · P(x2 | x1) · P(x4 | x2, X3 = on) · P(x5 | x4)
This differs from observing that X3=on, which would result in a new model
that included the term P(X3=on|x1). This mirrors the difference between
seeing and doing: after observing that the Sprinkler is on, we wish to infer
that the Season is dry, that it probably did not rain, and so on. An arbitrary
decision to turn on the Sprinkler should not result in any such beliefs.
Causal networks are more properly defined, then, as Bayesian networks in which the
correct probability model—after intervening to fix any node’s value—is given simply by
deleting links from the node’s parents. For example, Fire → Smoke is a causal network,
whereas Smoke → Fire is not, even though both networks are equally capable of
representing any Joint Probability Distribution (JPD) of the two variables.
Interventions on any node translate into such simple, corresponding local changes in the model. This, in turn, allows causal networks to be used very naturally for prediction by an agent that is considering various courses of action.
In pure Bayesian approaches, Bayesian networks are designed from expert knowledge and include hyperparameter nodes. Data (usually scarce) is used as pieces of evidence for incrementally updating the distributions of the hyperparameters.
UNIT- II
PROBABILISTIC REASONING
Machine Learning is the field of study that gives computers the capability to learn without being
explicitly programmed.
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E. (Mitchell 1997)
This means: given a task T, a performance measure P, and some experience E with the task, the goal is to generalize the experience in a way that allows us to improve performance on the task.
In traditional programming, the logic is coded by hand; but in Machine Learning, a model is built from the data, and that model is the logic. ML programs have two distinct phases:
Training: Input and the expected output are used to train and test various models, and
select the most suitable model.
Inference: The model is applied to the input to compute results. These results are
wrong sometimes. A mechanism is built into the application to gather user feedback on
such occasions.
This feedback is added to the training data, and this is how a model learns.
Let's take the problem of detecting email spam and compare both methods.
Traditional programs detect spam by checking an email against a fixed set of heuristic rules. For example:
Does the email contain FREE, weight loss, or lottery several times?
In contrast, an ML program works as follows:
Prepare a data set: a large number of emails labeled manually as spam or not-spam.
Train a model on this labeled data set.
During inference, apply the model to decide whether to keep an email in the inbox or move it to the spam folder.
If the user moves an email from inbox to spam or vice versa, add this feedback to the training data.
As you can notice, traditional programs are deterministic, but ML programs are probabilistic. Both make mistakes. But the traditional program will require constant manual effort to update the rules, while the ML program will learn from new data when retrained.
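A minimal sketch of the contrast described above: a hand-written heuristic rule versus a learned classifier. The tiny dataset and keyword list are invented for illustration, and the scikit-learn pipeline shown is just one of many possible model choices.

```python
# Contrast: fixed heuristic rule vs. a model learned from labeled data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def rule_based_is_spam(email: str) -> bool:
    """Traditional approach: deterministic keyword rules."""
    keywords = ("free", "lottery", "weight loss")
    return sum(email.lower().count(k) for k in keywords) >= 2

# ML approach: learn from a (tiny, illustrative) labeled dataset.
emails = ["win a free lottery prize now", "meeting agenda for monday",
          "free free weight loss pills", "project status report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(rule_based_is_spam("claim your free lottery ticket"))   # True
print(model.predict(["free lottery ticket inside"])[0])       # likely 1 (spam)
```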
Applications of ML
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. The popular use case of image recognition
and face detection is, Automatic friend tagging suggestion:
Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo
with our Facebook friends, then we automatically get a tagging suggestion with name, and the
technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice," it comes under speech recognition,
and it's a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also known as
"Speech to text", or "Computer speech recognition." At present, machine learning algorithms
are widely used by various applications of speech recognition. Google assistant, Siri, Cortana,
and Alexa are using speech recognition technology to follow the voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make this app better. It takes information from the user and sends it back to its database to improve the performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests products as per the customer's interest.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us in finding information using our voice instructions. These assistants can help us in various ways just through our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and stealing money in the middle of a transaction. To detect this, a Feed Forward Neural Network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Each genuine transaction follows a specific pattern, which changes for a fraudulent transaction; hence the network detects it and makes our online transactions more secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so for this, machine learning's Long Short-Term Memory (LSTM) neural network is used for the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the data sources and collect the data needed for the problem.
In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output: the more data we have, the more accurate the prediction will be.
This step includes the following tasks:
o Identify various data sources
o Collect data
o Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step
where we put our data into a suitable place and prepare it to use in our machine learning
training.
In this step, first, we put all data together, and then randomize the ordering of data.
o Data exploration:
It is used to understand the nature of data that we have to work with. We need to
understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find
Correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to address quality issues.
The data we have collected is not always useful, as some of it may not be relevant. In real-world applications, collected data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves selecting the analytical techniques, building models, and reviewing the results.
5. Train Model
Now the next step is to train the model, in this step we train our model to improve its
performance for better outcome of the problem.
We use datasets to train the model using various machine learning algorithms. Training a model
is required so that it can understand the various patterns, rules, and, features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test the model.
In this step, we check for the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirement of
project or problem.
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.
If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the project,
we will check whether it is improving its performance using available data or not. The
deployment phase is similar to making the final report for a project.
Based on the methods and way of learning, machine learning is divided into mainly four types, which are:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning
1. Supervised Machine Learning
In supervised learning, the machine is trained on a labelled dataset in which each input is paired with the correct output. Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we provide training to the machine to understand the images, using features such as the shape and size of the tail of cats and dogs, the shape of the eyes, colour, and height (dogs are taller, cats are smaller), etc. After completion of training, we input the picture of a cat and ask the machine to identify the object and predict the output. Now the machine is well trained, so it will check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it is a cat. So, it will put it in the Cat category. This is how the machine identifies objects in Supervised Learning.
The main goal of the supervised learning technique is to map the input variable(x) with the
output variable(y). Some real-world applications of supervised learning are Risk Assessment,
Fraud Detection, Spam filtering, etc.
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The classification
algorithms predict the categories present in the dataset. Some real-world examples of
classification algorithms are Spam Detection, Email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems in which there is a linear
relationship between input and output variables. These are used to predict continuous output
variables, such as market trends, weather prediction, etc.
Some popular regression algorithms are Simple Linear Regression, Multivariate Regression, Decision Tree Regression, Lasso Regression, and Random Forest Regression.
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
o It may predict the wrong output if the test data is different from the training data.
Some common applications of Supervised Learning are given below:
o Image Segmentation:
Supervised learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. It is
done by using medical images and past labelled data with labels for disease conditions.
With such a process, the machine can identify a disease for the new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for identifying
fraud transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.
2. Unsupervised Machine Learning
Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. This means that, in unsupervised machine learning, the machine is trained using an unlabeled dataset, and the machine predicts the output without any supervision.
In unsupervised learning, the models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely: suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.
So the machine will discover its own patterns and differences, such as colour difference and shape difference, and predict the output when it is tested with the test dataset.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a
way to group the objects into a cluster such that the objects with the most similarities remain in
one group and have fewer or no similarities with the objects of other groups. An example of the
clustering algorithm is grouping the customers by their purchasing behaviour.
2) Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm is to
find the dependency of one data item on another data item and map those variables accordingly
so that it can generate maximum profit. This algorithm is mainly applied in Market Basket
analysis, Web usage mining, continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
Advantages:
o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output beforehand.
o Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised
and Unsupervised machine learning. It represents the intermediate ground between Supervised
(With Labelled training data) and Unsupervised learning (with no labelled training data)
algorithms and uses the combination of labelled and unlabeled datasets during the training
period.
Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, it mostly consists of unlabeled data. Labels are costly to obtain, so for corporate purposes a dataset may have only a few of them. This setting is different from both supervised and unsupervised learning, which are defined by the presence or absence of labels.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced. The main aim of semi-supervised learning is
to effectively use all the available data, rather than only labelled data like in supervised learning.
Initially, similar data is clustered along with an unsupervised learning algorithm, and further, it
helps to label the unlabeled data into labelled data. It is because labelled data is a comparatively
more expensive acquisition than unlabeled data.
We can imagine these algorithms with an example. Supervised learning is where a student is
under the supervision of an instructor at home and college. Further, if that student is self-
analysing the same concept without any help from the instructor, it comes under unsupervised
learning. Under semi-supervised learning, the student has to revise himself after analyzing the
same concept under the guidance of an instructor at college.
Advantages:
o It is highly efficient.
Disadvantages:
o Accuracy is low.
4. Reinforcement Learning
In reinforcement learning, there is no labelled data like supervised learning, and agents learn
from their experiences only.
The reinforcement learning process is similar to how a human being learns; for example, a child learns various things through experiences in his day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such as Game
theory, Operation Research, Information theory, multi-agent systems.
o Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve super-human performance. Some popular systems that use RL algorithms are AlphaGo and AlphaGo Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL to automatically learn to allocate and schedule computer resources among waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement
learning. There are different industries that have their vision of building intelligent
robots using AI and Machine learning technology.
o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the
help of Reinforcement Learning by Salesforce company.
Advantages
o The learning model of RL is similar to the learning of human beings; hence most
accurate results can be found.
Disadvantage
o Too much reinforcement learning can lead to an overload of states which can weaken
the results.
The curse of dimensionality limits reinforcement learning for real physical systems.
TOPIC 13. LINEAR REGRESSION MODELS: LEAST SQUARES, SINGLE & MULTIPLE
VARIABLES
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
y= a0+a1x+ ε
Here,
The values for x and y variables are training datasets for Linear Regression model
representation.
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression
o Multiple Linear Regression
A linear line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
o Positive linear relationship (the dependent variable increases as the independent variable increases)
o Negative linear relationship (the dependent variable decreases as the independent variable increases)
When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
Different values of the weights or line coefficients (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use a cost function.
Cost function-
o Different values of the weights or line coefficients (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps
the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) Σ (yi − (a0 + a1·xi))²
Where N is the total number of observations, yi is the actual value of the i-th observation, and a0 + a1·xi is the predicted value.
Residuals: The distance between an actual value and the corresponding predicted value is called the residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It starts with randomly selected coefficient values and then iteratively updates them to reach the minimum of the cost function.
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various candidate models is called optimization. It can be assessed by the method below:
o R-squared method: R-squared is a statistical measure of the goodness of fit; a high R-squared value indicates that the model explains the observations well.
Below are some important assumptions of Linear Regression. These are formal checks made while building a Linear Regression model to ensure the best possible result from the given dataset.
o Linear relationship between the features and the target:
Linear regression assumes a linear relationship between the dependent and independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables; with multicollinearity it can be difficult to determine which predictor is actually affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.
o No autocorrelations:
The linear regression model assumes no autocorrelation in error terms. If there will be
any correlation in the error term, then it will drastically reduce the accuracy of the
model. Autocorrelation usually occurs if there is a dependency between residual errors.
Simple Linear Regression is a type of Regression algorithms that models the relationship
between a dependent variable and a single independent variable. The relationship shown by a
Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple
Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous or
categorical values.
Simple Linear Regression has two main objectives:
o Model the relationship between the two variables, such as the relationship between income and expenditure, or experience and salary.
o Forecast new observations.
The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε
Where,
a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible)
Here we are taking a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goals of this problem are:
o to find out whether there is any correlation between these two variables, and
o to find the best fit line for the dataset (a minimal sketch is given below).
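As a sketch of the salary-versus-experience problem just described, the snippet below fits a simple linear regression with scikit-learn; the small experience/salary values are made-up illustrative data, not the dataset referred to in the text.

```python
# Simple Linear Regression: salary (dependent) vs. experience (independent).
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up illustrative data: years of experience and salary (in thousands).
experience = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
salary = np.array([35, 42, 50, 55, 63, 70])

model = LinearRegression().fit(experience, salary)
print("intercept a0:", model.intercept_)   # intercept of the regression line
print("slope a1:", model.coef_[0])         # change in salary per year of experience
print("prediction for 7 years:", model.predict([[7.0]])[0])
```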
Multiple Linear Regression (MLR) models the linear relationship between a single dependent continuous variable and more than one independent variable.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
o For MLR, the dependent or target variable(Y) must be the continuous/real, but the
predictor or independent variable may be of continuous or categorical form.
o Each feature variable must model the linear relationship with the dependent variable.
MLR equation:
Y = b0 + b1·x1 + b2·x2 + b3·x3 + ... + bn·xn + ε
Where,
Y = Output/Response variable
b0, b1, b2, ..., bn = Coefficients of the model
x1, x2, ..., xn = Independent/predictor variables
ε = Random error
Assumptions for Multiple Linear Regression:
o A linear relationship should exist between the target and predictor variables.
ML Polynomial Regression
o It is also called the special case of Multiple Linear Regression in ML. Because we add
some polynomial terms to the Multiple Linear regression equation to convert it into
Polynomial Regression.
o It makes use of a linear regression model to fit the complicated and non-linear functions
and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."
o So for such cases, where data points are arranged in a non-linear fashion, we need
the Polynomial Regression model. We can understand it in a better way using the
below comparison diagram of the linear dataset and non-linear dataset.
Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn
Polynomial Regression equation: y = b0 + b1x + b2x² + b3x³ + .... + bnxⁿ
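A minimal sketch of the idea that polynomial regression is linear regression on transformed features: PolynomialFeatures expands x into [1, x, x², ...] and an ordinary linear model is then fit on those features. The quadratic data below is synthetic, chosen only to make the non-linear pattern obvious.

```python
# Polynomial Regression = Linear Regression on polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Synthetic non-linear data: y roughly follows a quadratic in x.
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + x.ravel() + np.random.normal(0, 0.2, 30)

# Degree-2 polynomial features, then an ordinary linear model.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

print("R^2 on training data:", poly_model.score(x, y))
print("prediction at x = 2.5:", poly_model.predict([[2.5]])[0])
```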
To demonstrate the relationship between two variables, linear regression fits a straight line (a linear equation) to observed data. One variable is treated as explanatory, while the other is treated as dependent. For instance, a modeler might want to relate the weights of individuals to their heights using a linear regression model.
In Bayesian linear regression, the mean of one parameter is characterized by a weighted sum of other variables. This type of conditional modeling aims to determine the prior distribution of the regressors as well as other variables describing the distribution of the regressand.
The normal linear model, where the distribution of Y given X is Gaussian, is the most basic and popular variant. The posterior can be determined analytically for this model, and the particular set of prior probability distributions that makes this possible is known as the conjugate priors. With more arbitrarily chosen priors, the posteriors generally have to be approximated.
When the dataset has too few or poorly dispersed data, Bayesian Regression might be quite
helpful. In contrast to conventional regression techniques, where the output is only derived
from a single number of each attribute, a Bayesian Regression model's output is derived from a
probability distribution.
The result, "y," is produced by a normal distribution (where the variance and mean are
normalized). The goal of the Bayesian Regression Model is to identify the 'posterior' distribution
again for model parameters rather than the model parameters themselves. The model
parameters will be expected to follow a distribution in addition to the output y.
Posterior: the probability that a hypothesis H holds given the observed evidence E, i.e., P(H | E).
Prior: the probability of the hypothesis H before the evidence is observed, i.e., P(H).
These are related by Bayes' theorem: P(A | B) = P(B | A) · P(A) / P(B), where P(A) is the probability that event A occurs and P(A | B) is the probability that A occurs given that event B has already occurred. P(B), the probability of event B, cannot be zero because B has already occurred.
According to the above formula, we obtain a posterior distribution for the model parameters that is proportional to the likelihood of the data multiplied by the prior distribution of the parameters. This is unlike Ordinary Least Squares (OLS), where we obtain only a single point estimate of the parameters.
As more data points are collected, the value of the likelihood rises and eventually dominates the prior. In the case of an unlimited number of data points, the parameter values converge to the values obtained by OLS. Consequently, we start our regression process with an initial estimate (the prior).
As we include additional data points, the accuracy of our model improves. Therefore, to make a Bayesian Ridge Regression model accurate, a considerable amount of training data is required.
Some of the real-life applications of Bayesian Linear Regression are given below:
Using Priors: Consider a scenario in which your supermarkets carry a new product and we want to predict its initial Christmas sales. For the new product's Christmas effect, we may simply use the average of comparable items as a prior.
Additionally, once we obtain data from the new item's initial Christmas sales, the prior is immediately updated. As a result, the forecast for the next Christmas is influenced by both the prior and the new item's data.
Regularize Priors: With the season, day of the week, trend, holidays, and a tonne of
promotion indicators, our model is severely over-parameterized. Therefore
regularization is crucial to keep the forecasts in check.
Bayesian regression is particularly well-suited for online learning (data arriving in a stream), as opposed to batch learning, where we know the complete dataset before we begin training the model. This is because Bayesian regression does not require storing the full dataset: the posterior after each update serves as the prior for the next.
The Bayesian technique has been successfully applied and is quite strong
mathematically. Therefore, using this requires no additional prior knowledge of the
dataset.
The Bayesian strategy is not worthwhile if there is a large amount of data available for the dataset, since the regular (frequentist) approach then does the task more efficiently.
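scikit-learn's BayesianRidge gives a convenient way to experiment with the behaviour described above (a prior that is gradually overwhelmed by data). The synthetic data here is illustrative only; the point of the sketch is that the model returns both a prediction and an uncertainty estimate.

```python
# Bayesian linear regression with scikit-learn's BayesianRidge.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))                  # single illustrative feature
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 1.0, 40)    # noisy linear target

model = BayesianRidge()
model.fit(X, y)

# Unlike OLS, we can ask for a predictive standard deviation as well.
mean, std = model.predict([[5.0]], return_std=True)
print("posterior mean coefficient:", model.coef_[0])
print("prediction at x=5:", mean[0], "+/-", std[0])
```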
In linear regression, the model aims to find the best-fit regression line to predict the value of y based on the given input value (x). While training, the model calculates the cost function, which measures the Root Mean Squared Error (RMSE) between the predicted value (pred) and the true value (y).
By the time the model achieves the minimum cost function, it will have the best θ1 and θ2 values. Using these finally updated values of θ1 and θ2 in the hypothesis equation, the model predicts the value of y in the best manner it can.
Therefore, the question arises: how do the θ1 and θ2 values get updated?
Linear Regression Cost Function:
We graph cost function as a function of parameter estimates i.e. parameter range of our
hypothesis function and the cost resulting from selecting a particular set of parameters. We
move downward towards pits in the graph, to find the minimum value. The way to do this is
taking derivative of cost function as explained in the above figure. Gradient Descent step-downs
the cost function in the direction of the steepest descent. The size of each step is determined by
parameter α known as Learning Rate.
In the Gradient Descent algorithm, one can infer two points:
o If the slope of the cost function is positive, the parameter value decreases after the update; if the slope is negative, it increases. Either way, each step moves the parameters toward the minimum.
o For linear regression, the cost function graph is always convex in shape, so gradient descent converges to the global minimum.
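A from-scratch sketch of the update rule described above for simple linear regression: both parameters are nudged in the direction of steepest descent of the MSE, scaled by the learning rate α. The data and hyperparameters are illustrative.

```python
# Batch gradient descent for simple linear regression (y ≈ a0 + a1 * x).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 50)    # illustrative noisy data

a0, a1 = 0.0, 0.0          # initial coefficients
alpha = 0.05               # learning rate
for _ in range(2000):
    pred = a0 + a1 * x
    error = pred - y
    # Gradients of the MSE cost with respect to a0 and a1.
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= alpha * grad_a0
    a1 -= alpha * grad_a1

print(f"learned intercept a0 ≈ {a0:.2f}, slope a1 ≈ {a1:.2f}")  # close to 2 and 3
```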
Gradient Descent is a popular optimization algorithm for linear regression models that involves
iteratively adjusting the model parameters to minimize the cost function. Here are some
advantages and disadvantages of using Gradient Descent for linear regression:
Advantages:
Flexibility: Gradient Descent can be used with various cost functions and can handle
non-linear regression problems.
Scalability: Gradient Descent is scalable to large datasets since it updates the parameters
for each training example one at a time.
Convergence: Gradient Descent can converge to the global minimum of the cost function,
provided that the learning rate is set appropriately.
Disadvantages:
Sensitivity to Learning Rate: The choice of learning rate can be critical in Gradient
Descent since using a high learning rate can cause the algorithm to overshoot the
minimum, while a low learning rate can make the algorithm converge slowly.
Slow Convergence: Gradient Descent may require more iterations to converge to the
minimum since it updates the parameters for each training example one at a time.
Local Minima: Gradient Descent can get stuck in local minima if the cost function has
multiple local minima.
Noisy updates: The updates in Gradient Descent are noisy and have a high variance,
which can make the optimization process less stable and lead to oscillations around the
minimum.
Overall, Gradient Descent is a useful optimization algorithm for linear regression, but it has
some limitations and requires careful tuning of the learning rate to ensure convergence.
A linear classifier is a model that decides how to categorize a data point into a discrete class based on a linear combination of its explanatory variables. As an example, combining details about a dog such as weight, height, colour, and other features would be used by a model to decide its species. The effectiveness of these models lies in their ability to find the mathematical combination of features that groups data points together when they have the same class and separates them when they have different classes, providing us with clear boundaries for classification.
If each instance belongs to one and only one class, then our input data can be divided into
decision regions separated by decision boundaries.
The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are similar
to each other and dissimilar to other classes
The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: An eager learner builds a classification model from the training dataset before receiving the test dataset. It takes more time in learning and less time in prediction.
Example: Decision Trees, Naïve Bayes, ANN
Classification algorithms can be further divided into mainly two categories:
o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
  o Random Forest Classification
To evaluate a classification model, the following methods are used:
1. Log Loss or Cross-Entropy Loss:
o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o A lower log loss represents higher accuracy of the model.
o For binary classification, the log loss for a single example can be calculated as:
  −(y·log(p) + (1 − y)·log(1 − p))
  where y is the actual class label (0 or 1) and p is the predicted probability of the positive class.
2. Confusion Matrix:
o The confusion matrix summarizes the prediction results, giving the total number of correct predictions and incorrect predictions. The matrix looks like the table below:
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area
Under the Curve.
o To visualize the performance of the multi-class classification model, we use the AUC-
ROC Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis
and FPR(False Positive Rate) on X-axis.
Classification algorithms can be used in different places. Below are some popular use
cases of Classification Algorithms:
Linear Regression
Linear Regression is a statistical approach that predicts the result of a response variable by
combining numerous influencing factors. It attempts to represent the linear connection between
features (independent variables) and the target (dependent variables). The cost function
enables us to find the best possible values for the model parameters.
Logistic Regression
Logistic regression is an extension of linear regression. The sigmoid function first transforms
the linear regression output between 0 and 1. After that, a predefined threshold helps to
determine the probability of the output values. The values higher than the threshold value tend
towards having a probability of 1, whereas values lower than the threshold value tend towards
having a probability of 0. A separate article dives deeper into the mathematics behind the
Logistic Regression Model.
LOGISTIC REGRESSION
Problem Formulation
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1, it
gives the probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify the observations using different types
of data and can easily determine the most effective variables used for the
classification. The below image is showing the logistic function:
The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of a straight line can be written as:
  y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 − y):
  y / (1 − y); which is 0 for y = 0 and infinity for y = 1
o But we need a range between −∞ and +∞, so taking the logarithm of the equation, it becomes:
  log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression (the log-odds as a linear function of the inputs).
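A short sketch of the function just derived: the linear combination is passed through the sigmoid so the output lands between 0 and 1, and a threshold (commonly 0.5) turns that probability into a class label. All coefficient values here are illustrative.

```python
# Logistic regression prediction: sigmoid of a linear combination.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients: b0 (intercept) and b1, b2 (feature weights).
b0, b = -1.5, np.array([0.8, 0.6])

x = np.array([2.0, 1.0])              # one input example with two features
z = b0 + b @ x                        # linear part: b0 + b1*x1 + b2*x2
p = sigmoid(z)                        # probability of the positive class

label = int(p >= 0.5)                 # threshold at 0.5
print(f"linear score z = {z:.2f}, probability = {p:.3f}, predicted class = {label}")
```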
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
Generative models are considered a class of statistical models that can generate new data
instances. These models are used in unsupervised machine learning as a means to perform tasks
such as
• Probability and Likelihood estimation,
• Modeling data points
• To describe the phenomenon in data,
• To distinguish between classes based on these probabilities.
Since these models often rely on the Bayes theorem to find the joint probability, generative models can tackle more complex tasks than analogous discriminative models.
These models use the concept of joint probability and create instances where a given feature (x) or input and the desired output or label (y) exist simultaneously.
Training a generative classifier involves estimating the joint distribution P(X, Y), typically via P(X | Y) and P(Y), from which the conditional P(Y | X) needed for classification can be derived using Bayes' theorem.
Discriminative models draw boundaries in the data space, while generative models try to
model how data is placed throughout the space.
A generative model explains how the data was generated, while a discriminative model
focuses on predicting the labels of the data.
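To make the generative/discriminative contrast concrete, the sketch below fits a generative classifier (Gaussian Naïve Bayes, which models P(x|y) and P(y)) and a discriminative one (logistic regression, which models P(y|x) directly) on the same synthetic data; both the data and the choice of models are illustrative.

```python
# Generative (Gaussian Naive Bayes) vs. discriminative (Logistic Regression).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_tr, y_tr)               # models P(x | y) and P(y)
discriminative = LogisticRegression().fit(X_tr, y_tr)   # models P(y | x) directly

print("Naive Bayes accuracy:        ", generative.score(X_te, y_te))
print("Logistic Regression accuracy:", discriminative.score(X_te, y_te))
```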
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, so if we want a
model that can accurately identify whether it is a cat or dog, so such a model can
be created by using the SVM algorithm.
We will first train our model with lots of images of cats and dogs so that it can
learn about different features of cats and dogs, and then we test it with this strange
creature. So as support vector creates a decision boundary between these two data
(cat and dog) and choose extreme cases (support vectors), it will see the extreme
case of cat and dog.
On the basis of the support vectors, it will classify it as a cat. Consider the below
diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane:
There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM, and its dimension depends on the number of features in the dataset.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect
the position of the hyperplane are termed as Support Vector. Since these vectors
support the hyperplane, hence called a Support vector.
Working of SVM
Linear SVM: Suppose we have a dataset with two classes plotted in a 2-D space. Since it is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane.
SVM algorithm finds the closest point of the lines from both the classes. These points are
called support vectors.
The distance between the vectors and the hyperplane is called as margin. And the goal of
SVM is to maximize this margin. The hyperplane with maximum margin is called
the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space becomes as shown in the image below.
Since we are now in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space (with z = 1), the boundary becomes a circle of radius 1 around the origin. In this way, SVM handles non-linear data by mapping it to a higher-dimensional space.
Advantages of SVM
Effective in high-dimensional cases.
It is memory efficient, as it uses only a subset of the training points (the support vectors) in the decision function.
Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.
Cons
If the number of features is a lot bigger than the number of data points,
avoiding over-fitting when choosing kernel functions and regularization term is
crucial.
SVMs don't directly provide probability estimates. Those are calculated using
an expensive five-fold cross-validation.
Works best on small sample sets because of its high training time.
What does a hard margin mean?
The hard margin means that the SVM is very rigid in classification and tries to
perform extremely well in the training set, covering all available data points.
This works well when the deviation is insignificant, but can lead to overfitting, in
which case we would need to switch to a soft margin.
The idea behind the soft margin is based on a simple assumption: allow the support
vector classifier to make some mistakes while still keeping the margin as large
as possible.
If we want a soft margin, we have to decide how far to go in ignoring some of the outliers while still getting good classification results. The solution is to introduce a penalty for margin violations (misclassified or margin-crossing points), regulated by the C parameter (as it is called in many frameworks).
We use an SVM with hard margin when data is evidently separable. However, we may
choose the soft margin when we have to disregard some of the outliers to improve the
model’s overall performance.
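In most libraries, the soft-margin trade-off is exposed through the C parameter: a large C approximates a hard margin (few violations tolerated), while a small C yields a softer margin. The sketch below compares the two on synthetic data; the data and parameter values are illustrative.

```python
# Hard-ish vs. soft margin SVM via the C parameter.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters (illustrative data).
X, y = make_blobs(n_samples=120, centers=2, cluster_std=1.6, random_state=7)

hard_ish = SVC(kernel="linear", C=1000.0).fit(X, y)  # large C ~ hard margin
soft = SVC(kernel="linear", C=0.1).fit(X, y)         # small C = soft margin

# A softer margin typically keeps more support vectors near the boundary.
print("support vectors (C=1000):", hard_ish.n_support_.sum())
print("support vectors (C=0.1): ", soft.n_support_.sum())
```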
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree.
This algorithm compares the value of the root attribute with the corresponding attribute of the record (the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further.
It continues the process until it reaches the leaf node of the tree. The complete
process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; call the final node a leaf node.
Eg:
Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The
next decision node further gets split into one decision node (Cab facility) and one leaf
node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer). Consider the below diagram:
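Complementing the job-offer example above, here is a minimal scikit-learn sketch. The three features (salary, distance, cab facility) and the labels are made up purely for illustration; a real application would need proper feature engineering and a larger dataset.

```python
# Decision tree on a toy "accept the job offer?" dataset (illustrative values).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary in lakhs, distance from office in km, cab facility (1/0)]
X = [[12, 5, 1], [6, 25, 0], [15, 30, 1], [7, 4, 0], [14, 20, 0], [5, 15, 1]]
y = [1, 0, 1, 1, 0, 0]   # 1 = accept offer, 0 = decline (made-up labels)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["salary", "distance", "cab"]))
print("prediction for [13, 10, 1]:", tree.predict([[13, 10, 1]])[0])
```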
Attribute Selection Measures (ASM): while implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. Two popular ASM techniques are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation
of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
o The original information needed for the classification of a tuple in dataset D is given by:
Info(D) = E(S) = − Σ p_i log2(p_i)
o Where p_i is the probability that the tuple belongs to class C_i. The information is encoded in bits, therefore log to the base 2 is used. E(S) represents the average amount of information required to find out the class label of a tuple in dataset D; this quantity is also called Entropy.
o The information still required for an exact classification after partitioning D by attribute X into partitions D_1 … D_v is given by the formula:
Info_X(D) = Σ ( |D_j| / |D| ) × Info(D_j)
o Where |D_j| / |D| acts as the weight of the j-th partition. This represents the information needed to classify dataset D after partitioning by X.
o Information gain is the difference between the original and the expected information that is required to classify the tuples of dataset D:
Gain(X) = Info(D) − Info_X(D)
o Gain(X) is the reduction in the information required that is obtained by knowing the value of X. The attribute with the highest information gain is chosen as the "best" splitting attribute.
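To make the formulas concrete, here is a small Python sketch (an addition, using only the standard library) that computes Info(D) and Gain(X) for a play-cricket style dataset:

from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes appearing in `labels`."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """Gain(X) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over the partitions induced by X."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    expected_info = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - expected_info

# Example: the 'Wind' attribute of a play-cricket style dataset (9 Yes, 5 No)
play = ["Yes"] * 9 + ["No"] * 5
wind = ["Weak"] * 6 + ["Strong"] * 3 + ["Weak"] * 2 + ["Strong"] * 3
print(round(entropy(play), 3))                 # about 0.94
print(round(information_gain(play, wind), 3))  # about 0.048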
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create these binary splits.
o The Gini index measures the impurity of the training tuples of dataset D as:
Gini(D) = 1 − Σ p_i²
o Where p_i is the probability that a tuple belongs to class C_i. The Gini index for a binary split of dataset D by attribute A into partitions D_1 and D_2 is given by:
Gini_A(D) = ( |D_1| / |D| ) Gini(D_1) + ( |D_2| / |D| ) Gini(D_2)
o The reduction in impurity is given by the difference between the Gini index of the original dataset D and the Gini index after partitioning by attribute A:
ΔGini(A) = Gini(D) − Gini_A(D)
o The attribute with the maximum reduction in impurity (equivalently, the smallest Gini_A(D)) is selected as the best attribute for splitting.
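A similar sketch (an addition, standard library only) for the Gini calculations of a binary split:

from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_binary_split(left_labels, right_labels):
    """Gini_A(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    total = len(left_labels) + len(right_labels)
    return (len(left_labels) / total) * gini(left_labels) + \
           (len(right_labels) / total) * gini(right_labels)

labels = ["Yes"] * 9 + ["No"] * 5
left   = ["Yes"] * 6 + ["No"] * 2    # e.g. Wind = Weak
right  = ["Yes"] * 3 + ["No"] * 3    # e.g. Wind = Strong
print(round(gini(labels), 3))                                    # impurity before the split
print(round(gini(labels) - gini_binary_split(left, right), 3))   # reduction in impurity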
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all
the important features of the dataset. Therefore, a technique that decreases the size of
the learning tree without reducing accuracy is known as Pruning. There are mainly
two types of tree pruning technology used: Cost Complexity Pruning, Reduced
Error Pruning.
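For instance, cost complexity pruning is available in scikit-learn through the ccp_alpha parameter. The sketch below is an assumed illustration (load_breast_cancer is only a stand-in dataset) comparing an unpruned tree with a pruned one:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ccp_alpha > 0 penalises tree size; larger values prune more aggressively.
full_tree   = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("unpruned leaves:", full_tree.get_n_leaves(), "accuracy:", full_tree.score(X_test, y_test))
print("pruned leaves:  ", pruned_tree.get_n_leaves(), "accuracy:", pruned_tree.score(X_test, y_test))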
Algorithm BuildDT
Input: D : Training data set
Output: T : Decision tree
Steps
1. If all tuples in D belong to the same class Cj
   Add a leaf node labeled as Cj
   Return // Termination condition
2. Select an attribute Ai (so that it is not selected twice in the same branch)
3. Partition D = { D1, D2, …, Dp } based on the p different values of Ai in D
4. For each Dk ∈ D
   Create a node and add an edge between D and Dk labeled with the value of Ai in Dk
5. For each Dk ∈ D
   BuildDT(Dk) // Recursive call
6. Stop
The BuildDT algorithm must provide a method for expressing an attribute test condition and the corresponding outcomes for different attribute types.
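The following compact Python rendering is a sketch of the BuildDT pseudocode above (it is not the textbook's code, and it simply picks the first unused attribute instead of choosing one via an Attribute Selection Measure):

def build_dt(rows, labels, attributes):
    if len(set(labels)) == 1:                 # all tuples belong to the same class Cj
        return labels[0]                      # leaf node labeled Cj (termination condition)
    if not attributes:                        # no attribute left to test
        return max(set(labels), key=labels.count)
    ai = attributes[0]                        # a real implementation would pick Ai via an ASM
    remaining = attributes[1:]                # so Ai is not selected twice in the same branch
    node = {}
    for v in set(row[ai] for row in rows):    # partition D into D1..Dp by Ai's values
        sub_rows   = [r for r, l in zip(rows, labels) if r[ai] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[ai] == v]
        node[(ai, v)] = build_dt(sub_rows, sub_labels, remaining)   # recursive call
    return node

rows   = [{"Wind": "Weak", "Humidity": "High"},
          {"Wind": "Strong", "Humidity": "High"},
          {"Wind": "Weak", "Humidity": "Normal"}]
labels = ["Yes", "No", "Yes"]
print(build_dt(rows, labels, ["Wind", "Humidity"]))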
Case: Binary attribute
This is the simplest case of node splitting
The test condition for a binary attribute generates only two outcomes
The image below shows the decision tree for the Titanic dataset to predict
whether the passenger will survive or not.
CART
The CART model, i.e. Classification And Regression Trees, is a decision tree algorithm for building models. A decision tree model in which the target values have a discrete nature is called a classification model.
A discrete value comes from a finite or countably infinite set of values, for example an age group or a size category. Models in which the target values are continuous (usually represented as floating-point numbers) are called regression models. These two kinds of models together are called CART.
#7) The above partitioning steps are followed recursively to form a decision tree for
the training dataset tuples.
#8) The partitioning stops only when either all the partitions are made or when the remaining tuples cannot be partitioned further.
#9) The complexity of the algorithm is described by n * |D| * log |D| where n is the
number of attributes in training dataset D and |D| is the number of tuples.
What Is Greedy Recursive Binary Splitting?
In the binary splitting method, the tuples are split and each split cost function is
calculated. The lowest cost split is selected. The splitting method is binary which is
formed as 2 branches. It is recursive in nature as the same method (calculating the
cost) is used for splitting the other tuples of the dataset.
This algorithm is called greedy because it focuses only on the current node: it lowers the cost at that node, while the other nodes are ignored.
The most popular methods of selecting the attribute are information gain, Gini
index.
#1) Information Gain
This method is the main method that is used to build decision trees. It reduces the
information that is required to classify the tuples. It reduces the number of tests that
are needed to classify the given tuple. The attribute with the highest information gain
is selected.
Growing a tree keeps increasing the depth of tests and thereby reduces the training error. This results in very complex trees and leads to overfitting. Overfitting reduces the predictive power of the decision tree. The approaches to avoid overfitting of the trees include prepruning and postpruning.
Tree pruning is the method to reduce the unwanted branches of the tree. This will
reduce the complexity of the tree and help in effective predictive analysis. It reduces
the overfitting as it removes the unimportant branches from the trees.
#1) Prepruning: In this approach, tree construction is halted early; a node is not split further if the goodness measure of the split falls below a threshold, and the node is turned into a leaf.
#2) Postpruning: This method removes the outlier branches from a fully grown tree. The unwanted branches are removed and replaced by a leaf node denoting the most frequent class label. This technique requires more computation than prepruning; however, it is more reliable.
The pruned trees are more precise and compact when compared to unpruned trees
but they carry a disadvantage of replication and repetition.
Repetition occurs when the same attribute is tested again and again along a branch of
a tree. Replication occurs when the duplicate subtrees are present within the tree.
These issues can be solved by multivariate splits.
(Training data: 14 days described by the attributes Day, Outlook, Temperature, Humidity and Wind, with the class label Play cricket; 9 of the examples are "Yes" and 5 are "No".)
The entropy of the whole dataset is:
H(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94
If entropy is zero, all members belong to the same class; if entropy is one, half of the tuples belong to one class and the other half to the other class. A value of 0.94 indicates a fairly mixed distribution.
Find the attribute which gives the maximum information gain.
For example "Wind": it takes two values, Strong and Weak, therefore x = {Strong, Weak}.
Find H(x) and P(x) for x = Weak and x = Strong. H(S) has already been calculated above.
Weak = 8
Strong = 6
For "weak" wind, 6 examples say "Yes" to play cricket and 2 say "No", so the entropy is:
H(S_weak) = −(6/8) log2(6/8) − (2/8) log2(2/8) = 0.811
For "strong" wind, 3 said "No" to play cricket and 3 said "Yes":
H(S_strong) = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1.0
This shows perfect randomness, as half the items belong to one class and the remaining half to the other. The information gain for Wind is therefore:
Gain(S, Wind) = H(S) − (8/14)·H(S_weak) − (6/14)·H(S_strong) = 0.94 − 0.463 − 0.429 ≈ 0.048
The attribute outlook has the highest information gain of 0.246, thus it is chosen as
root.
Outlook has 3 values: Sunny, Overcast and Rain. For Overcast, Play cricket is always "Yes", so that branch ends in a leaf node, "Yes". For the other values, "Sunny" and "Rain", the process is repeated.
For the Sunny branch, the information gain for Humidity is highest, therefore it is chosen as the next node.
Similarly, entropy is calculated for the Rain branch, where Wind gives the highest information gain.
The decision tree would look like below:
When a dataset with unknown class labels is fed into the model, then it will
automatically assign the class label to it. This method of applying probability to
predict outcomes is called predictive modeling.
Other Examples
RANDOM FORESTS
Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of these predictions, it predicts the final output.
Algorithm
Step-1: Select K random data points from the training set.
Step-2: Build a decision tree on each selected subset of data points.
Step-3: Choose the number N of decision trees to build and repeat Steps 1 and 2.
Step-4: For a new data point, take the prediction of every tree and assign the point to the category that wins the majority vote.
Characteristics
• It takes less training time as compared to other algorithms.
• It predicts output with high accuracy, even for the large dataset it runs efficiently.
• It can also maintain accuracy when a large proportion of data is missing.
Advantages of Random Forest
• Random Forest is capable of performing both Classification and Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
• Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
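As an illustration (an addition, assuming scikit-learn and its built-in Iris dataset), a random forest of 100 trees can be trained and queried as follows:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # 100 decision trees
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("majority-vote prediction for first test sample:", forest.predict(X_test[:1]))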
UNIT- IV
ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING
Ensemble learning is one of the most powerful machine learning techniques that use the
combined output of two or more models/weak learners and solve a particular
computational intelligence problem. E.g., a Random Forest algorithm is an ensemble of
various decision trees combined.
An ensemble model is a machine learning model that combines the predictions from two or more models.
Hard Voting: In hard voting, the predicted output class is the class with the majority of votes, i.e. the class predicted most often by the individual classifiers. Suppose three classifiers predict the output classes (A, A, B); the majority predicted A, hence A will be the final prediction.
Soft Voting: In soft voting, the output class is the prediction based on the average of
probability given to that class. Suppose given some input to three models, the prediction
probability for class A = (0.30, 0.47, 0.53) and B = (0.20, 0.32, 0.40). So the average for
class A is 0.4333 and B is 0.3067, the winner is clearly class A because it had the highest
probability averaged by each classifier
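The hard and soft voting schemes described above can be sketched with scikit-learn's VotingClassifier; the base models and the make_classification dataset below are assumptions chosen only for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("nb", GaussianNB()),
              ("dt", DecisionTreeClassifier(random_state=0))]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)   # majority of predicted classes
soft = VotingClassifier(estimators, voting="soft").fit(X, y)   # average of predicted probabilities

print(hard.predict(X[:1]), soft.predict(X[:1]))
print(soft.predict_proba(X[:1]))   # the averaged class probabilities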
Ensemble learning is primarily used to improve model performance, such as classification, prediction, function approximation, etc. In simple words, ensemble learning combines several models so that the combined prediction is more reliable than that of any single model.
There are many ways to ensemble models in machine learning, such as Bagging,
Boosting, and stacking.
Stacking is one of the most popular ensemble machine learning techniques used to
predict multiple nodes to build a new model and improve model performance.
Stacking enables us to train multiple models to solve similar problems, and based on
their combined output, it builds a new model with improved performance.
1. Bagging
Bagging is a method of ensemble modeling, which is primarily used to solve supervised
machine learning problems. It is generally completed in two steps as follows:
Bootstrapping:
It is a random sampling method that is used to derive samples from the data using the
replacement procedure.
In this method, first, random data samples are fed to the primary model, and then a
base learning algorithm is run on the samples to complete the learning process.
Aggregation:
This is a step that involves the process of combining the output of all base models and,
based on their output, predicting an aggregate result with greater accuracy and
reduced variance.
Example: In the Random Forest method, predictions from multiple decision trees are ensembled in parallel.
Further, in regression problems, the average of these predictions is used to get the final output, whereas in classification problems the majority-voted class is selected as the predicted class.
The feature offering the best split out of the lot is used to split the nodes
The tree is grown, so you have the best root nodes
The above steps are repeated n times. It aggregates the output of individual decision
trees to give the best prediction
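A hedged sketch of bagging with scikit-learn's BaggingClassifier (the Iris dataset is only a stand-in; the default base learner is a decision tree):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

bagging = BaggingClassifier(
    n_estimators=25,      # number of bootstrap samples / base models
    bootstrap=True,       # sample the training data with replacement (bootstrapping)
    random_state=0,
)
bagging.fit(X, y)                      # each base tree is fit on its own bootstrap sample
print(bagging.predict(X[:3]))          # aggregated (majority-vote) predictions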
2. Boosting
Boosting is an ensemble method that enables each member to learn from the
preceding member's mistakes and make better predictions for the future.
Unlike the bagging method, in boosting, all base learners (weak) are arranged in a
sequential format so that they can learn from the mistakes of their preceding learner.
Hence, in this way, all weak learners get turned into strong learners and make a better
predictive model with significantly improved performance.
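A sketch of boosting using AdaBoost from scikit-learn (the dataset and parameter values are assumptions for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default weak learner is a decision stump; each new stump focuses on the
# examples the previous ones got wrong.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("test accuracy:", boosted.score(X_test, y_test))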
3. Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning.
Various weak learners are ensembled in a parallel manner in such a way that, by combining them with a meta-learner, we can obtain better predictions for the future.
This ensemble technique works by feeding the combined predictions of multiple weak learners to a meta-learner so that a better output prediction model can be achieved.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to best combine the input predictions to make a better output prediction.
Stacking is also known as stacked generalization and is an extended form of the Model Averaging Ensemble technique, in which the sub-models participate according to their performance weights to build a new model with better predictions. This new model is stacked on top of the others; this is the reason why it is named stacking.
Original data: This data is divided into n-folds and is also considered test data or
training data.
Base models: These models are also referred to as level-0 models. These models use
training data and provide compiled predictions (level-0) as an output.
Level-0 Predictions: Each base model is triggered on some training data and provides
different predictions, which are known as level-0 predictions.
Meta Model: The architecture of the stacking model consists of one meta-model,
which helps to best combine the predictions of the base models. The meta-model is
also known as the level-1 model.
Level-1 Prediction: The meta-model learns how to best combine the predictions of
the base models and is trained on different predictions made by individual base
models, i.e., data not used to train the base models are fed to the meta-model,
predictions are made, and these predictions, along with the expected outputs, provide
the input and output pairs of the training dataset used to fit the meta-model.
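The level-0 / level-1 structure described above can be sketched with scikit-learn's StackingClassifier; the particular base models and meta-model below are assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

base_models = [("dt", DecisionTreeClassifier(random_state=0)),    # level-0 models
               ("knn", KNeighborsClassifier())]

stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),   # level-1 meta-model
    cv=5,   # out-of-fold predictions are used to train the meta-model
)
stack.fit(X, y)
print(stack.predict(X[:3]))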
K-MEANS CLUSTERING
K-Means clustering is an unsupervised learning algorithm that groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, and the points within a group have similar properties.
It allows us to cluster the data into different groups and a convenient way to discover
the categories of groups in the unlabeled dataset on its own without the need for any
training.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the
data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, which means reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise the clustering is finished and the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
We need to choose some random k points or centroid to form the cluster. These points
can be either the points from the dataset or any other point.
So, here we are selecting the below two points as k points, which are not the part of
our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate
the distance between two points. So, we will draw a median between both the
centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:
Next, we will reassign each datapoint to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like below image:
From the above image, we can see, one yellow point is on the left side of the line, and
two blue points are right to the line. So, these three points will be assigned to new
centroids.
As reassignment has taken place, we again go to Step-4, which is finding new centroids or K-points. We repeat the process by finding the center of gravity of the clusters to obtain the new centroids.
o As we have the new centroids, we again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image; there are no dissimilar data points on either side of the
line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters
will be as shown in the below image:
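The whole procedure can be reproduced with scikit-learn's KMeans; the (M1, M2) points below are hypothetical stand-ins for the scatter plot in the example:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (M1, M2) points forming two visible groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster labels:", kmeans.labels_)            # which cluster each point belongs to
print("final centroids:\n", kmeans.cluster_centers_)
print("new point goes to cluster:", kmeans.predict([[2.0, 2.0]]))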
Applications
Customer Segmentation: K-means clustering in machine learning permits marketers
to enhance their customer base, work on a target base, and segment customers based
on purchase patterns, interests, or activity monitoring. Segmentation helps companies
target specific clusters/groups of customers for particular campaigns.
Document Classification: Cluster documents in multiple categories based on tags,
topics, and content. K-means clustering in machine learning is a suitable algorithm for
this purpose. The initial processing of the documents is needed to represent each
document as a vector and uses term frequency for identifying commonly used terms
that help classify the document. The document vectors are then clustered to help
identify similarities in document groups.
Delivery store optimization: K-means clustering in machine learning helps to optimize the process of goods delivery using trucks and drones, for example by finding the optimal number of launch locations.
Insurance fraud detection: Machine learning is critical in fraud detection and has
numerous applications in automobile, healthcare, and insurance fraud detection.
K-Means clustering in machine learning can also be used for performing image
segmentation by trying to group similar pixels in the image together and creating
clusters.
One more way to categorize Machine Learning systems is by how they generalize.
Generalization — usually refers to a ML model’s ability to perform well on new unseen data
rather than just the data that it was trained on.
Most Machine Learning tasks are about making predictions. This means that given a number
of training examples, the system needs to be able to make good “predictions for” / “generalize
to” examples it has never seen before. Having a good performance measure on the training
data is good, but insufficient; the true goal is to perform well on new instances.
There are two main approaches to generalization: instance-based learning and model-
based learning
1. Instance-based learning:
(sometimes called memory-based learning) is a family of learning algorithms that, instead of
performing explicit generalization, compares new problem instances with instances seen in
training, which have been stored in memory.
Instance-based learning systems, also known as lazy learning systems, store the entire
training dataset in memory and when a new instance is to be classified, it compares the
new instance with the stored instances and returns the most similar one. These systems
do not build a model using the training dataset.
2. Model-based learning:
Machine learning models that are parameterized with a certain number of
parameters that do not change as the size of training data changes.
If you don’t assume any distribution with a fixed number of parameters over your data,
for example, in k-nearest neighbor, or in a decision tree, where the number of
parameters grows with the size of the training data, then you are not model-based, or
nonparametric
Model-based learning systems are also known as eager learning systems, where the
model learns the training data. These systems build a machine learning model using
the entire training dataset, which is built by analyzing the training data and
identifying patterns and relationships. After that, the model can be used to make
predictions on new data.
Which is better?
In general, it’s better to choose model-based learning when the goal is to make
predictions on unseen data, and when there are enough computational resources
available. And it’s better to choose instance-based learning when the goal is to make
predictions on new data that are similar to the training instances, and when there are
limited computational resources available.
So, both instance-based and model-based learning have their own advantages and
disadvantages, and the choice between them depends on the specific problem and the
available resources.
K-NEAREST NEIGHBOUR (KNN)
KNN is a lazy learner algorithm: it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
Firstly, we will choose the number of neighbors, so we will choose the k=5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. It can be calculated as:
There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
Large values for K smooth out the decision, but a value that is too large may include points from other classes and make the boundaries less distinct.
The computation cost is high because of calculating the distance between the data
points for all the training samples.
We have data from the questionnaires survey and objective testing with two attributes (acid
durability and strength) to classify whether a special paper tissue is good or not. Here are the
training samples
Now, the factory produces a new paper tissue that passes the lab tests with X1 = 3 and X2 = 7. Without another expensive survey, find the classification of this new tissue.
1. Determine K, e.g. K = 3.
2. Calculate the distance between the query instance and all the training samples. Here the query instance is (3, 7).
For example, for the training sample X1 = 7, X2 = 4: squared distance = (7−3)² + (4−7)² = 25, rank 4, so it is not included among the 3 nearest neighbours.
3. Use the simple majority of the categories of the nearest neighbours to predict the class of the query instance. Here Y = Good (2 votes).
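The same exercise can be checked with scikit-learn's KNeighborsClassifier. Note that the four training rows below are an assumed reconstruction of the commonly used version of this example; only the (7, 4) row appears explicitly above.

from sklearn.neighbors import KNeighborsClassifier

X_train = [[7, 7], [7, 4], [3, 4], [1, 4]]       # (acid durability, strength)
y_train = ["Bad", "Bad", "Good", "Good"]

knn = KNeighborsClassifier(n_neighbors=3)         # K = 3, Euclidean distance by default
knn.fit(X_train, y_train)

query = [[3, 7]]                                  # the new tissue: X1 = 3, X2 = 7
print(knn.predict(query))                         # -> ['Good'] (2 of the 3 neighbours are Good)
print(knn.kneighbors(query))                      # distances and indices of the 3 nearest rows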
Eg 2
Suppose we have height, weight and T-shirt size of some customers and we need to
predict the T-shirt size of a new customer given only height and weight information we
have. Data including height, weight and T-shirt size information is shown. New customer
named 'Monica' has height 161cm and weight 61kg. Find the T-shirt size.
Eg:3
The table above represents our data set. We have two columns — Brightness and Saturation.
Each row in the table has a class of either Red or Blue.
We have a new entry but it doesn't have a class yet. To know its class, we have to calculate the
distance from the new entry to other entries in the data set using the Euclidean distance
formula.
Where:
Distance #1
For the first row, d1:
We now know the distance from the new data entry to the first entry in the table. Let's update
the table.
Distance #2
For the second row, d2:
Table will look like after all the distances have been calculated:
Since we chose 5 as the value of K, we'll only consider the first five rows. That is:
As you can see above, the majority class within the 5 nearest neighbors to the new entry is
Red. Therefore, we'll classify the new entry as Red.
Choosing a very low value will most likely lead to inaccurate predictions.
The commonly used value of K is 5.
Always use an odd number as the value of K.
Gaussian mixture models (GMMs) are a type of machine learning algorithm. They are used
to classify data into different categories based on the probability distribution.
Gaussian mixture models can be used in many different areas, including finance,
marketing and so much more!
Gaussian mixture models (GMM) are a probabilistic concept used to model real-world data
sets. GMMs are a generalization of Gaussian distributions and can be used to represent
any data set that can be clustered into multiple Gaussian distributions.
The Gaussian mixture model is a probabilistic model that assumes all the data points
are generated from a mix of Gaussian distributions with unknown parameters.
A Gaussian mixture model can be used for clustering, which is the task of grouping a set
of data points into clusters.
GMMs can be used to find clusters in data sets where the clusters may not be clearly
defined. Additionally, GMMs can be used to estimate the probability that a new data point
belongs to each cluster.
Gaussian mixture models are also relatively robust to outliers, meaning that they can
still yield accurate results even if there are some data points that do not fit neatly into any
of the clusters. This makes GMMs a flexible and powerful tool for clustering data.
It can be understood as a probabilistic model where Gaussian distributions are assumed
for each group and they have means and covariances which define their parameters.
GMM consists of two parts – mean vectors (μ) & covariance matrices (Σ). A Gaussian
distribution is defined as a continuous probability distribution that takes on a bell-shaped
curve.
Another name for Gaussian distribution is the normal distribution. Here is a picture of
Gaussian mixture models:
GMM has many applications, such as density estimation, clustering, and image
segmentation.
For density estimation, GMM can be used to estimate the probability density function of a
set of data points.
For clustering, GMM can be used to group together data points that come from the same
Gaussian distribution. And for image segmentation, GMM can be used to partition an
image into different regions.
Gaussian mixture models can be used for a variety of use cases, including identifying
customer segments, detecting fraudulent activity, and clustering images.
In each of these examples, the Gaussian mixture model is able to identify clusters in the
data that may not be immediately obvious.
As a result, Gaussian mixture models are a powerful tool for data analysis and should be
considered for any clustering task.
The following are three different steps to using gaussian mixture models:
Determining a covariance matrix that defines how each Gaussian is related to one another.
The more similar two Gaussians are, the closer their means will be and vice versa if they
are far away from each other in terms of similarity.
A gaussian mixture model can have a covariance matrix that is diagonal or symmetric.
Determining the number of Gaussians in each group defines how many clusters there are.
Selecting the hyperparameters which define how to optimally separate data using
gaussian mixture models as well as deciding on whether or not each gaussian’s covariance
matrix is diagonal or symmetric.
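A sketch of these steps with scikit-learn's GaussianMixture (the two synthetic clusters below are assumptions for illustration):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical Gaussian clusters in 2-D
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[4, 4], scale=1.0, size=(100, 2))])

# Choose the number of Gaussians and the covariance type, then fit
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print("means:\n", gmm.means_)                         # one mean vector per Gaussian
print("weights:", gmm.weights_)                       # mixing proportions
print("soft membership of one point:", gmm.predict_proba([[2.0, 2.0]]))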
Application areas
There are many different real-world problems that can be solved with gaussian mixture
models. Gaussian mixture models are very useful when there are large datasets and it is
difficult to find clusters. This is where Gaussian mixture models help. It is able to find
clusters of Gaussians more efficiently than other clustering algorithms such as k-means.
Here are some real-world problems which can be solved using Gaussian mixture models:
Finding patterns in medical datasets: GMMs can be used for segmenting images into
multiple categories based on their content or finding specific patterns in medical datasets.
They can be used to find clusters of patients with similar symptoms, identify disease
subtypes, and even predict outcomes. In one recent study, a Gaussian mixture model was
used to analyze a dataset of over 700,000 patient records. The model was able to identify
previously unknown patterns in the data, which could lead to better treatment for patients
with cancer.
Modeling natural phenomena: GMM can be used to model natural phenomena where it
has been found that noise follows Gaussian distributions. This model of probabilistic
modeling relies on the assumption that there exists some underlying continuum of
unobserved entities or attributes and that each member is associated with measurements
taken at equidistant points in multiple observation sessions.
Customer behavior analysis: GMMs can be used for performing customer behavior
analysis in marketing to make predictions about future purchases based on historical data.
Stock price prediction: Another area Gaussian mixture models are used is in finance
where they can be applied to a stock’s price time series. GMMs can be used to detect
changepoints in time series data and help find turning points of stock prices or other
market movements that are otherwise difficult to spot due to volatility and noise.
Gene expression data analysis: Gaussian mixture models can be used for gene
expression data analysis. In particular, GMMs can be used to detect differentially
expressed genes between two conditions and identify which genes might contribute
toward a certain phenotype or disease state.
What are the differences between Gaussian mixture models and other types of
clustering algorithms such as K-means?
Here are some of the key differences between Gaussian mixture models and the K-means
algorithm used for clustering:
A Gaussian mixture model is a type of clustering algorithm that assumes that the data
point is generated from a mixture of Gaussian distributions with unknown parameters.
The goal of the algorithm is to estimate the parameters of the Gaussian distributions, as
well as the proportion of data points that come from each distribution. In contrast, K-
means is a clustering algorithm that does not make any assumptions about the underlying
distribution of the data points. Instead, it simply partitions the data points into K clusters,
where each cluster is defined by its centroid.
While Gaussian mixture models are more flexible, they can be more difficult to train than
K-means. K-means is typically faster to converge and so may be preferred in cases where
the runtime is an important consideration.
In general, K-means will be faster and more accurate when the data set is large and the
clusters are well-separated. Gaussian mixture models will be more accurate when the data
set is small or the clusters are not well-separated.
Gaussian mixture models take into account the variance of the data, whereas K-means
does not.
Gaussian mixture models are more flexible in terms of the shape of the clusters, whereas
K-means is limited to spherical clusters.
Gaussian mixture models can handle missing data, whereas K-means cannot. This
difference can make Gaussian mixture models more effective in certain applications, such
as data with a lot of noise or data that is not well-defined.
UNIT- V
NEURAL NETWORKS
INTRODUCTION
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain.
An Artificial neural network is usually a computational network based on biological neural
networks that construct the structure of the human brain.
Just as the human brain has neurons interconnected with one another, artificial neural networks also have neurons that are linked to each other in the various layers of the network. These neurons are known as nodes.
The typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell
nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
An Artificial Neural Network, in the field of Artificial Intelligence, attempts to mimic the network of neurons that makes up the human brain so that computers can understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.
There are around 100 billion neurons in the human brain. Each neuron has an association point somewhere in the range of 1,000 to 100,000.
In the human brain, data is stored in such a manner as to be distributed, and we can
extract more than one piece of this data when necessary from our memory parallelly.
We can say that the human brain is made up of incredibly amazing parallel processors.
We can understand the artificial neural network with an example, consider an example of
a digital logic gate that takes an input and gives an output. "OR" gate, which takes two
inputs.
If one or both the inputs are "On," then we get "On" in output. If both the inputs are "Off,"
then we get "Off" in output. Here the output depends upon input. Our brain does not
perform the same task.
The outputs to inputs relationship keep changing because of the neurons in our brain,
which are "learning."
Biological Neural Network → Artificial Neural Network
Dendrites → Inputs
Cell nucleus → Nodes
Synapse → Weights
Axon → Output
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the calculations to
find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results
in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.
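A tiny numpy sketch (an addition) of this computation, with a sigmoid assumed as the transfer function:

import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(weights, inputs) + bias       # weighted sum of the inputs plus a bias
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid transfer function (an assumed choice)

x = np.array([0.5, 0.3, 0.2])                # inputs
w = np.array([0.4, 0.7, -0.2])               # weights
print(neuron(x, w, bias=0.1))                # the neuron's output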
Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per their structure. Therefore, realizing such a network depends on suitable hardware.
Artificial Neural Network can be best represented as a weighted directed graph, where
the artificial neurons form the nodes.
The association between the neurons outputs and neuron inputs can be viewed as the
directed edges with weights.
The Artificial Neural Network receives the input signal from the external source in the
form of a pattern and image in the form of a vector.
These inputs are then mathematically assigned by the notations x(n) for every n number
of inputs.
Afterward, each of the input is multiplied by its corresponding weights ( these weights
are the details utilized by the artificial neural networks to solve a specific problem ).
TOPIC 24.
PERCEPTRON, MULTILAYER PERCEPTRON, ACTIVATION FUNCTIONS
24.1. PERCEPTRON
A Perceptron is an Artificial Neuron. It is the simplest possible Neural Network. Neural
Networks are the building blocks of Machine Learning.Perceptron is a type of artificial neural
network, which is a fundamental concept in machine learning. The basic components of a
perceptron are:
Input Layer: The input layer consists of one or more input neurons, which receive input
signals from the external world or from other layers of the neural network.
Weights: Each input neuron is associated with a weight, which represents the strength of the
connection between the input neuron and the output neuron.
Bias: A bias term is added to the input layer to provide the perceptron with additional
flexibility in modeling complex patterns in the input data.
Activation Function: The activation function determines the output of the perceptron based
on the weighted sum of the inputs and the bias term. Common activation functions used in
perceptrons include the step function, sigmoid function, and ReLU function.
Output: The output of the perceptron is a single binary value, either 0 or 1, which indicates
the class or category to which the input data belongs.
Perceptron
The original Perceptron was designed to take a number of binary inputs, and produce one
binary output (0 or 1).
The idea was to use different weights to represent the importance of each input, and that
the sum of the values should be greater than a threshold value before making a decision
like yes or no (true or false) (0 or 1).
∑wi*xi + b
Perceptron Example
Imagine a perceptron (in your brain). The perceptron tries to decide if you should go to a
concert. Is the artist good? Is the weather good? What weights should these facts have?
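A numeric sketch of this example (the weights and threshold below are assumed values, not taken from these notes):

inputs  = [1, 0]                 # Is the artist good? yes (1); Is the weather good? no (0)
weights = [0.7, 0.6]             # assumed importance of each fact
threshold = 0.5

weighted_sum = sum(w * x for w, x in zip(weights, inputs))   # sum(wi * xi)
decision = 1 if weighted_sum > threshold else 0              # fire only above the threshold
print(weighted_sum, "-> go to the concert" if decision else "-> stay home")   # 0.7 > 0.5 -> go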
Types of Perceptron:
Single layer:
Single layer perceptron can learn only linearly separable patterns.
This is one of the easiest Artificial neural networks (ANN) types.
A single-layered perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model.
The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm does not have any prior recorded data, so it begins with randomly allocated values for the weight parameters.
Further, it sums up all the weighted inputs. If the total sum is more than a pre-determined threshold value, the model gets activated and shows the output value as +1.
If the outcome matches the pre-determined threshold value, the performance of the model is considered satisfactory, and no change is made to the weights.
However, this model shows discrepancies when multiple weighted input values are fed into it. Hence, to obtain the desired output and minimize errors, the weights must be adjusted.
Multilayer: A multilayer perceptron has two or more layers and therefore greater processing power.
Like a single-layer perceptron model, a multi-layer perceptron model has the same basic structure but a greater number of hidden layers. It is trained in two stages:
o Forward Stage: Activation starts from the input layer in the forward stage and terminates at the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement; the error between the actual and the desired output is propagated backward, starting at the output layer and ending at the input layer.
A multi-layer perceptron is a feed-forward network with several layers in which the activation function does not remain linear as in a single-layer perceptron model; instead, non-linear activation functions such as sigmoid, TanH or ReLU are used.
A multi-layer perceptron model has greater processing power and can process linear and non-
linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT,
XNOR, NOR
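For example, XOR is not linearly separable, so a single-layer perceptron cannot learn it, but a multi-layer perceptron can. A sketch with scikit-learn's MLPClassifier (the hyperparameters below are assumptions):

from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                       # XOR truth table

mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))   # typically [0 1 1 0]; another random_state may be needed in rare cases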
Characteristics of Perceptron
The perceptron model has the following characteristics.
Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
In Perceptron, the weight coefficient is automatically learned.
Initially, weights are multiplied with input features, and the decision is made whether the
neuron is fired or not.
The activation function applies a step rule to check whether the weight function is greater
than zero.
The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
If the added sum of all input values is more than the threshold value, it must have an output
signal; otherwise, no output will be shown.
The output of a perceptron can only be a binary number (0 or 1) due to the hard limit
transfer function.
Perceptron can only be used to classify the linearly separable sets of input vectors. If input
vectors are non-linear, it is not easy to classify them properly.
Neurons in neural networks operate in accordance with weight, bias, and their
corresponding activation functions.
Based on the mistake, the values of the neurons inside a neural network would be
modified. This process is known as back-propagation.
Back-propagation is made possible by activation functions since they provide the
gradients and error required to change the biases and weights.
Linear Function
Equation: A linear function's equation, y = x, is the equation of a straight line. If all the layers are linear in nature, the final activation of the last layer is nothing more than a linear function of the input to the first layer, regardless of how many layers we have. The range is -inf to +inf.
Uses: The linear activation function is applied only at the output layer.
If we differentiate a linear function, the result is a constant that no longer depends on the input "x"; the gradient is the same everywhere, so the algorithm cannot learn anything new from it.
A good example of a regression problem is determining the cost of a house. We can use linear
activation at the output layer since the price of a house may have any huge or little value. The
neural network's hidden layers must perform some sort of non-linear function even in this
circumstance.
Non-linear Activation Functions: non-linear functions allow the network to adapt to a wide variety of input data and to model complex relationships between inputs and outputs.
Sigmoid Function
It is a function that is graphed in an "S" shape. The sigmoid function is used when the model is predicting probability.
Application
The sigmoid function's ability to transform any real number to one between 0 and 1 is
advantageous in data science and many other fields such as:
• In deep learning, as a non-linear activation function within the neurons of artificial neural networks, to allow the network to learn non-linear relationships in the data
• In binary classification, also called logistic regression, the sigmoid function is used to
predict the probability of a binary variable
For example, during backpropagation in deep learning, the gradient of a sigmoid activation
function is used to update the weights & biases of a neural network. If these gradients are
tiny, the updates to the weights & biases are tiny and the network will not learn.
Alternatively, other non-linear functions such as the Rectified Linear Unit (ReLu) are used, which
do not show these flaws.
Mathematical function
We typically denote the sigmoid function by the Greek letter σ (sigma) and define it as
σ(x) = 1 / (1 + e^(−x))
Where
x is the input to the sigmoid function
e is Euler's number (e ≈ 2.718)
Tanh Function
• The activation that consistently outperforms the sigmoid function is the hyperbolic tangent (tanh) function. It is actually a mathematically shifted and rescaled version of the sigmoid function; the two are comparable and derivable from one another.
Uses: Since its values range from -1 to 1, the mean of the hidden-layer activations of a neural network will be 0 or very near to it. This helps to centre the data by keeping the mean close to 0, which greatly facilitates learning for the following layer.
Equation:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Tanh is very similar to the sigmoid/logistic activation function and even has the same S-shape, with the difference that its output range is -1 to 1. In tanh, the larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.
ReLU Function
ReLU (Rectified Linear Unit) outputs the input directly when it is positive and zero otherwise: ReLU(x) = max(0, x). However, the problem is that all negative values instantly become zero, which reduces the model's capacity to effectively fit or learn from the data: any negative input to a ReLU activation function immediately becomes zero, which affects how negative values are mapped.
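A small numpy sketch (an addition) of the three activation functions discussed above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # output in (0, 1)

def tanh(x):
    return np.tanh(x)                        # output in (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)                # negative inputs become 0

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))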
TOPIC 25.
NETWORK TRAINING, GRADIENT DESCENT OPTIMIZATION
• It is thus possible to compare the network's calculated values for the output nodes to the known "correct" target values, and calculate an error term for each node (the "Delta" rule).
• These error terms are then used to adjust the weights in the hidden layers so that, hopefully, the next
time around the output values will be closer to the "correct" values.
• A key feature of neural networks is an iterative learning process in which data cases
(rows) are presented to the network one at a time, and the weights associated with the input
values are adjusted each time.
• After all cases are presented, the process often starts over again. During this learning phase,
the network learns by adjusting the weights so as to be able to predict the correct class
label of input samples. Neural network learning is also referred to as "connectionist
learning," due to connections between the units.
• Advantages of neural networks include their high tolerance to noisy data, as well as their
ability to classify patterns on which they have not been trained. The most popular neural
network algorithm is back-propagation algorithm proposed in the 1980's.
• Once a network has been structured for a particular application, that network is ready
to be trained. To start this process, the initial weights are chosen randomly. Then the
training, or learning, begins.
• The network processes the records in the training data one at a time, using the weights
and functions in the hidden layers, then compares the resulting outputs against the
desired outputs.
• Errors are then propagated back through the system, causing the system to adjust the
weights for application to the next record to be processed.
• This process occurs over and over as the weights are continually tweaked. During the
training of a network the same set of data is processed many times as the connection weights
are continually refined.
As a consequence, the Input Layer is linked to the Hidden Layers, which are then linked to the Output Layer.
• When the neural network gives out the incorrect output, this leads to an output error. This
error is the difference between the actual and predicted outputs. A cost function
measures this error.
• The cost function (J) indicates how accurately the model performs. It tells us how far-off
our predicted output values are from our actual values.
• It is also known as the error. Because the cost function quantifies the error, we aim to
minimize the cost function.
• What we want is to reduce the output error. Since the weights affect the error, we will need
to readjust the weights. We have to adjust the weights such that we have a combination
of weights that minimizes the cost function.
Network Topology
A network topology is the arrangement of a network along with its nodes and connecting lines. According
to the topology, ANN can be classified as the following kinds −
Feedforward Network
• It is a non-recurrent network having processing units/nodes in layers, and all the nodes in a layer are connected with the nodes of the previous layer, so the signal flows in only one direction from input to output. It may be divided into the following types −
Single layer feedforward network − The concept is of feedforward ANN having only one weighted
layer. In other words, we can say the input layer is fully connected to the output layer.
Multilayer feedforward network − The concept is of feedforward ANN having more than one weighted layer. As this network has one or more layers between the input and the output layer, these are called hidden layers.
Feedback Network
As the name suggests, a feedback network has feedback paths, which means the signal can flow in both
directions using loops. This makes it a non-linear dynamic system, which changes continuously until it
reaches a state of equilibrium. It may be divided into the following types −
Recurrent networks − They are feedback networks with closed loops. Following are the two types of
recurrent networks.
1. Fully recurrent network − It is the simplest neural network architecture because all nodes are
connected to all other nodes and each node works as both input and output.
2. Jordan network − It is a closed-loop network in which the output goes to the input again as feedback, as shown in the following diagram.
UNIT-5 V- Page 19 of 29
KCE-CSE –AI&ML 2023
Unsupervised Learning
• As the name suggests, this type of learning is done without the supervision of a teacher. This
learning process is independent.
• During the training of ANN under unsupervised learning, the input vectors of similar type are
combined to form clusters. When a new input pattern is applied, then the neural network gives an
output response indicating the class to which the input pattern belongs.
• There is no feedback from the environment as to what should be the desired output and if it is
correct or incorrect
Reinforcement Learning
• During the training of network under reinforcement learning, the network receives some
feedback from the environment
During training, gradient descent moves toward the minimum of the cost function in steps whose size affects both speed and accuracy. The step size is controlled by a parameter alpha (α). A small α means a small step size, and a large α means a large step size. If the step sizes are too large, we could miss the minimum point completely, which can yield inaccurate results. If the step size is too small, the optimization process could take too much time and waste computational power.
• The step size is evaluated and updated according to the behavior of the cost function.
• The higher the gradient of the cost function, the steeper the slope and the faster a
model can learn (high learning rate).
• A high learning rate results in a higher step value, and a lower learning rate results in a
lower step value.
• If the gradient of the cost function is zero, the model stops learning.
Descending the Cost Function
• Navigating the cost function consists of adjusting the weights. The weights are adjusted using the following formula:
new weight = old weight − (learning rate × gradient)
That is, to obtain the new weight, we use the gradient, the learning rate, and the current (initial) weight.
Adjusting the weights consists of multiple iterations. We take a new step down for each iteration
and calculate a new weight. Using the initial weight and the gradient and learning rate, we can
determine the subsequent weights.
Let’s consider a graphical example of this:
Gradient descent algorithm stops when a local minimum of the loss surface is reached
GD does not guarantee reaching a global minimum
However, empirical evidence suggests that GD works well for NNs
TOPIC 26.
Gradient, in plain terms means slope or slant of a surface. So gradient descent literally means
descending a slope to reach the lowest point on that surface
In the above graph, the lowest point on the parabola occurs at x = 1. The objective of gradient
descent algorithm is to find the value of “x” such that “y” is minimum. “y” here is termed as the
objective function that the gradient descent algorithm operates upon, to descend to the lowest
point.
Gradient descent is an iterative algorithm, that starts from a random point on a function and
travels down its slope in steps until it reaches the lowest point of that function.”
1. Find the slope of the objective function with respect to each parameter/feature. In other
words, compute the gradient of the function.
2. Pick a random initial value for the parameters. (To clarify, in the parabola example,
differentiate “y” with respect to “x”. If we had more features like x1, x2 etc., we take the
partial derivative of “y” with respect to each of the features.)
3. Update the gradient function by plugging in the parameter values.
4. Calculate the step sizes for each feature as : step size = gradient * learning rate.
5. Calculate the new parameters as : new params = old params -step size
6. Repeat steps 3 to 5 until gradient is almost 0.
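A sketch of these steps (an addition) for the parabola y = (x − 1)², whose lowest point is at x = 1:

def gradient(x):
    return 2 * (x - 1)            # slope of y = (x - 1)^2 with respect to x

x = 5.0                           # a random initial value for the parameter
learning_rate = 0.1
for _ in range(100):              # repeat until the gradient is almost 0
    step_size = gradient(x) * learning_rate   # step size = gradient * learning rate
    x = x - step_size                          # new params = old params - step size
    if abs(gradient(x)) < 1e-6:
        break

print(round(x, 4))                # approximately 1.0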
“Stochastic”, in plain terms, means “random”. SGD randomly picks one data point from the whole data set at each iteration to reduce the computations enormously.
It is also common to sample a small number of data points instead of just one point at each step
and that is called “mini-batch” gradient descent. Mini-batch tries to strike a balance between the
goodness of gradient descent and speed of SGD.
• For a single training example, Backpropagation algorithm calculates the gradient of the error
function. Backpropagation can be written as a function of the neural network.
Backpropagation algorithms are a set of methods used to efficiently train artificial neural
networks following a gradient descent approach which exploits the chain rule.
• The main features of Backpropagation are the iterative, recursive and efficient method through which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.
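A compact numpy sketch (an addition, a minimal illustration rather than any textbook's exact algorithm) of backpropagation for a 2-4-1 network with sigmoid activations, trained on XOR:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass (chain rule); error term = (out - y) * sigmoid'(z)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient descent updates
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # typically approaches [0, 1, 1, 0]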
A shallow neural network has only one hidden layer between the input and output layers, while
a deep neural network has multiple hidden layers. Deeper networks perform better than shallow
networks. But only up to some limit: after a certain number of layers, the performance of deeper
networks plateaus
TOPIC 27.
UNIT SATURATION (AKA THE VANISHING GRADIENT PROBLEM) – RELU, HYPERPARAMETER
TUNING, BATCH NORMALIZATION, REGULARIZATION, DROPOUT
In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients).
They result in very small or very large updates of the parameters.
Solutions: changing the learning rate, ReLU activations, regularization, and LSTM units in RNNs.
REGULARIZATION