

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS3491 ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

(AU REG.2021)

E-MATERIAL
(VERSION-1)

PREPARED BY
Ms.K.ABHIRAMI


CS3491 ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING (4 CREDITS)

SYLLABUS

UNIT I PROBLEM SOLVING 9

Introduction to AI - AI Applications - Problem solving agents – search algorithms –


uninformed search strategies – Heuristic search strategies – Local search and
optimization problems – adversarial search – constraint satisfaction problems (CSP)

UNIT II PROBABILISTIC REASONING 9

Acting under uncertainty – Bayesian inference – naïve bayes models. Probabilistic


reasoning – Bayesian networks – exact inference in BN – approximate inference in BN –
causal networks.

UNIT III SUPERVISED LEARNING 9

Introduction to machine learning – Linear Regression Models: Least squares, single &
multiple variables, Bayesian linear regression, gradient descent, Linear Classification
Models: Discriminant function – Probabilistic discriminative model - Logistic regression,
Probabilistic generative model – Naive Bayes, Maximum margin classifier – Support
vector machine, Decision Tree, Random forests

UNIT IV ENSEMBLE TECHNIQUES AND UNSUPERVISED LEARNING 9

Combining multiple learners: Model combination schemes, Voting, Ensemble Learning -


bagging, boosting, stacking, Unsupervised learning: K-means, Instance Based Learning:
KNN, Gaussian mixture models and Expectation maximization

UNIT V NEURAL NETWORKS 9

Perceptron - Multilayer perceptron, activation functions, network training – gradient


descent optimization – stochastic gradient descent, error backpropagation, from
shallow networks to deep networks –Unit saturation (aka the vanishing gradient
problem) – ReLU, hyperparameter tuning, batch normalization, regularization, dropout.


UNIT- I

PROBLEM SOLVING

S.NO   TOPICS
1      Introduction to AI, AI Applications
2      Problem solving agents
3      Search algorithms, Uninformed search strategies
4      Heuristic search strategies
5      Local search and optimization problems
6      Adversarial search
7      Constraint Satisfaction Problems (CSP)


UNIT-I

PROBLEM SOLVING

TOPIC 1: INTRODUCTION

Artificial Intelligence is a branch of computer science by which we can create
intelligent machines that can behave like humans, think like humans, and make
decisions.

Artificial Intelligence exists when a machine has human-like skills such as
learning, reasoning, and solving problems.

Think like humans        Think rationally
Act like humans          Act rationally

What is involved in Intelligence?


 Ability to interact with the real world
o to perceive, understand, and act
o e.g., speech recognition, understanding, and synthesis
o e.g., image understanding
o e.g., ability to take actions, have an effect
 Reasoning and Planning
o modeling the external world, given input
o solving new problems, planning, and making decisions
o ability to deal with unexpected problems, uncertainties
 Learning and Adaptation
o we are continuously learning and adapting
o our internal models are always being “updated”
 e.g., a baby learning to categorize and recognize animals
Objective of AI

1. Replicate human intelligence


2. Solve Knowledge-intensive tasks
3. An intelligent connection of perception and action
4. Building a machine which can perform tasks that require human intelligence
such as:
o Proving a theorem
o Playing chess
o Planning a surgical operation


o Driving a car in traffic


5. Creating systems which can exhibit intelligent behavior, learn new things by
themselves, demonstrate, explain, and advise their users.

Following are some main advantages of Artificial Intelligence:

o High accuracy with fewer errors: AI machines or systems are less error-prone and
highly accurate because they take decisions based on prior experience and
information.
o High speed: AI systems can make decisions very quickly; because of this, AI
systems can beat a human champion in a game of chess.
o High reliability: AI machines are highly reliable and can perform the same
action multiple times with high accuracy.
o Useful for risky areas: AI machines can be helpful in situations such as defusing
a bomb or exploring the ocean floor, where employing a human can be risky.
o Digital assistance: AI can be very useful for providing digital assistance to users;
for example, AI technology is currently used by various e-commerce websites to
show products matching customer requirements.
o Useful as a public utility: AI can be very useful for public utilities, such as self-
driving cars which can make our journeys safer and hassle-free, facial recognition
for security purposes, natural language processing to communicate with humans
in human language, etc.

Disadvantages of Artificial Intelligence

Every technology has some disadvantages, and the same goes for Artificial Intelligence.
Despite being such an advantageous technology, it still has some disadvantages which we
need to keep in mind while creating an AI system. Following are the disadvantages of AI:

o High cost: The hardware and software requirements of AI are very costly, and AI
systems require a lot of maintenance to meet current-world requirements.
o Can't think out of the box: Even though we are making smarter machines with AI,
they still cannot work outside the box, as a robot will only do the work for
which it is trained or programmed.
o No feelings and emotions: An AI machine can be an outstanding performer, but it
does not have feelings, so it cannot form any kind of emotional attachment with
humans, and it may sometimes be harmful to users if proper care is not taken.
o Increased dependency on machines: With the advance of technology, people are
becoming more dependent on devices and hence are losing some of their mental
capabilities.


o No original creativity: Humans are creative and can imagine new ideas, but AI
machines cannot match this power of human intelligence and cannot be creative
and imaginative.

Comparison of AI with Human Intelligence

AI VS HUMAN INTELLIGENCE

Artificial Intelligence                      Human Intelligence
Created by human intelligence                Created by divine intelligence
Processes information faster                 Processes information more slowly
Highly objective                             May be subjective
More accurate                                May be less accurate
Uses about 2 watts                           Uses about 25 watts
Cannot adapt well to changes                 Can easily adapt to changes
Cannot multitask well                        Can easily multitask
Optimization                                 Innovation
Below-average social skills                  Excellent social skills
Still working toward self-awareness          Good self-awareness


TOPIC 2: APPLICATIONS

Following are some sectors which have the application of Artificial Intelligence:

1. AI in Astronomy

o Artificial Intelligence can be very useful for solving complex problems about the
universe. AI technology can be helpful for understanding the universe, such as how
it works, its origin, etc.

2. AI in Healthcare

o In the last five to ten years, AI has become more advantageous for the healthcare
industry and is going to have a significant impact on it.
o Healthcare industries are applying AI to make better and faster diagnoses than
humans. AI can help doctors with diagnoses and can warn when patients are
worsening so that medical help can reach the patient before hospitalization.

3. AI in Gaming

o AI can be used for gaming purposes. AI machines can play strategic games
like chess, where the machine needs to think about a large number of possible
positions.

4. AI in Finance

o The AI and finance industries are a great match for each other. The finance
industry is implementing automation, chatbots, adaptive intelligence, algorithmic
trading, and machine learning into financial processes.

5. AI in Data Security

o The security of data is crucial for every company, and cyber-attacks are growing
very rapidly in the digital world. AI can be used to make data more safe and
secure. Examples such as the AEG bot and the AI2 platform are used to detect
software bugs and cyber-attacks more effectively.

6. AI in Social Media

o Social media sites such as Facebook, Twitter, and Snapchat contain billions of
user profiles, which need to be stored and managed in a very efficient way. AI
can organize and manage massive amounts of data, and it can analyze lots of data
to identify the latest trends, hashtags, and the requirements of different users.

7. AI in Travel & Transport


o AI is in high demand in the travel industry. AI is capable of doing
various travel-related work, from making travel arrangements to
suggesting hotels, flights, and the best routes to customers. Travel companies
are using AI-powered chatbots which can interact with customers in a human-like
way for better and faster responses.

8. AI in Automotive Industry

o Some automotive companies are using AI to provide virtual assistants to their users
for better performance. For example, Tesla has introduced TeslaBot, an intelligent
virtual assistant.
o Various companies are currently working on developing self-driving cars which
can make journeys safer and more secure.

9. AI in Robotics:

o Artificial Intelligence has a remarkable role in robotics. Usually, general robots
are programmed to perform some repetitive task, but with the
help of AI, we can create intelligent robots which can perform tasks using their
own experience without being pre-programmed.
o Humanoid robots are the best examples of AI in robotics; recently, the intelligent
humanoid robots Erica and Sophia have been developed, which can talk
and behave like humans.

10. AI in Entertainment

o We are currently using some AI based applications in our daily life with some
entertainment services such as Netflix or Amazon. With the help of ML/AI
algorithms, these services show the recommendations for programs or shows.

11. AI in Agriculture

o Agriculture is an area which requires various resources, labor, money, and time
for the best results. Nowadays agriculture is becoming digital, and AI is emerging in
this field. Agriculture is applying AI for agricultural robotics, soil and crop
monitoring, and predictive analysis. AI in agriculture can be very helpful for farmers.

12. AI in E-commerce

o AI is providing a competitive edge to the e-commerce industry, and it is in
increasing demand in e-commerce businesses. AI helps shoppers
discover associated products with their preferred size, color, or even brand.

13. AI in education:


o AI can automate grading so that tutors have more time to teach. An AI chatbot
can communicate with students as a teaching assistant.
o In the future, AI can work as a personal virtual tutor for students, easily
accessible at any time and any place.

2.1.APPLICATION IN DETAIL

AI IN AGRICULTURE

 AI helps the agriculture sector cope with different challenges such as climate change,
population growth, employment issues in this field, and food safety.
 Today's agriculture system has reached a different level due to AI. Artificial
Intelligence has improved crop production and real-time monitoring, harvesting,
processing and marketing.
 Different hi-tech computer-based systems are designed to determine various
important parameters such as weed detection, yield detection, crop quality, and
many more.
 1. Weather & price forecasting: It is difficult for farmers to take
the right decisions about harvesting, sowing seeds, and soil preparation
due to climate change. But with the help of AI weather forecasting,
farmers can have information on weather analysis, and accordingly
they can plan the type of crop to grow, the seeds to sow, and when to
harvest the crop. With price forecasting, farmers can get a better
idea about the price of crops for the next few weeks, which can help
them get maximum profit.
 2. Health monitoring of crops:
 The quality of a crop largely depends on the type of soil and the nutrition
of the soil. But with the increasing rate of deforestation, soil
quality is degrading day by day, and it is hard to determine.
 To resolve this issue, AI has come up with new applications to
identify deficiencies in the soil, including plant pests and diseases.
With the help of such an application, farmers get an idea of which
fertilizer to use to improve harvest quality. In such apps, AI's
image recognition technology is used: farmers can capture
images of plants and get information about their quality.
 3. Agriculture robotics:
 Robotics is widely used in different sectors, mainly in
manufacturing, to perform complex tasks. Nowadays, different AI
companies are developing robots to be employed in the agriculture
sector. These AI robots are developed in such a way that they can
perform multiple tasks in farming.
 AI robots are also trained in checking the quality of crops, detecting and
controlling weeds, and harvesting the crop faster than a human.
 4. Intelligent spraying
 With AI sensors, weeds can be detected easily, along with the
affected areas. On finding such areas, herbicides can be precisely
sprayed, reducing the use of herbicides and saving time and crops.
There are different AI companies building robots with AI and
computer vision which can precisely spray on weeds. The use of AI
sprayers can greatly reduce the amount of chemicals used on
fields, and hence improves the quality of crops and also saves money.
 5. Disease diagnosis
 With AI predictions, farmers can get knowledge of diseases easily.
With this, they can easily diagnose diseases with a proper strategy and
on time, saving the plants and the farmer's time. To do this,
images of plants are first pre-processed using computer vision
technology. This ensures that plant images are properly divided into
the diseased and non-diseased parts. After detection, the diseased
part is cropped and sent to the labs for further diagnosis. This
technique also helps in the detection of pests, nutrient deficiencies,
and much more.
 6. Precision farming
 Precision farming is all about "the right place, the right time, and the right
products". The precision farming technique is a more accurate and
controlled way of farming that can replace the labour-intensive,
repetitive parts of the work. One example of precision farming is the
identification of stress levels in plants. This can be obtained using
high-resolution images and different sensor data on plants. The data
obtained from sensors is then fed to a machine learning model as
input for stress recognition.

2.2. AI IN BANKING

Almost every industry, including banking and finance, has been significantly disrupted
by artificial intelligence. This industry is now more customer-centric and
technologically relevant thanks to the use of AI inside banking applications and services.

AI-based technologies can help bankers reduce expenses by enhancing efficiency and by
making judgements based on information that would be incomprehensible to a human
operator. Additionally, clever algorithms can quickly detect incorrect or fraudulent
information.


1.Cybersecurity and fraud detection

Large numbers of online payments happen every day as consumers use
applications linked to their accounts to pay bills, withdraw money, deposit checks, and
do much more. As a result, every financial system must step up its efforts in
cybersecurity and fraud detection.

This is where artificial intelligence in finance enters the picture. Artificial intelligence can
help banks eliminate risks, track system flaws, and enhance the security
of online financial transactions. AI and machine learning can quickly spot potential
fraud and notify both consumers and banks.

2.Chatbots

Unquestionably, chatbots represent some of the best instances of how artificial
intelligence is used in banking. Once deployed, they can work around the clock,
in contrast to human employees with fixed working hours.

They also keep learning about a specific customer's usage patterns, which
helps them understand user expectations effectively.

Banks can guarantee that they remain accessible to their consumers 24 hours a day
by introducing bots within existing banking apps. Additionally, chatbots can provide
focused customer attention and make appropriate financial service and product
recommendations by understanding consumer behaviour.

3.Loan and credit decisions

In order to make better, safer, and more profitable loan and credit choices, banks are
trying to implement AI-based solutions. Presently, most banks still judge a
person's or business's reliability only on the basis of credit history, credit scores, and
customer references.

However, one cannot ignore the fact that these credit-reporting systems frequently contain
inaccuracies, omit parts of a customer's real history, and misclassify creditworthiness.

An AI-based loan and credit system can analyse the existing behavioural patterns of
consumers with little payment history to assess their trustworthiness. Additionally, this
technology alerts banks to certain behaviours that may raise the likelihood of
default. In short, these innovations are significantly altering the way that
consumer borrowing will be conducted in the future.

4. Tracking market trends


Thanks to artificial intelligence, financial institutions can analyse huge amounts of data
and forecast the latest movements in the economy, commodities, and equities.
Modern machine learning methods offer financial suggestions and assist in
evaluating market sentiment.

AI for banking additionally recommends when and how to buy equities and issues alerts
when there is a potential risk.

5. Data collection and analysis

Every day, financial and banking institutions record millions of transactions. The
volume of data generated is so vast that it becomes challenging for staff to collect
and register it.

AI-based approaches can aid in effective data collection and analysis in such
circumstances, improving the overall user experience. The
data may also be used to detect fraud or make credit decisions.

6. Customer experience

Artificial intelligence integration improves convenience as well as the
customer experience in banking and finance operations. AI technology speeds up the
recording of Know Your Customer (KYC) data and removes mistakes. Furthermore,
timely releases of new products and financial offers become possible.

TOPIC 3: PROBLEM SOLVING AGENTS

 An agent is anything that can be viewed as perceiving its environment through


sensors and acting upon that environment through actuators.
 A human agent has eyes, ears, and other organs for sensors and hands, legs, vocal
tract, and so on for actuators. A robotic agent might have cameras and infrared
range finders for sensors and various motors for actuators.
 A software agent receives file contents, network packets, and human input
(keyboard/mouse/touchscreen/voice) as sensory inputs and acts on the
environment by writing files, sending network packets, and displaying
information or generating sounds.

An AI system is composed of an agent and its environment. The agents act in their
environment. The environment may contain other agents.


Agent = Architecture + Agent Program


Architecture = the machinery that an agent executes on.
Agent Program = an implementation of an agent function.

What are Agent and Environment?


• An agent is anything that can perceive its environment through sensors and acts
upon that environment through effectors.
• A human agent has sensory organs such as eyes, ears, nose, tongue and skin
parallel to the sensors, and other organs such as hands, legs, mouth, for effectors.
• A robotic agent has cameras and infrared range finders as sensors, and
various motors and actuators as effectors.
• A software agent has encoded bit strings as its programs and actions.

Specifying the Agent Environment

PEAS (Performance, Environment, Actuators, Sensors)


3.1.Types of AI Agents

Agents can be grouped into five classes based on their degree of perceived intelligence
and capability. All these agents can improve their performance and generate better
actions over time. These are given below:

o Simple Reflex Agent


o Model-based reflex agent
o Goal-based agents
o Utility-based agent
o Learning agent

1. Simple Reflex agent:

o The Simple reflex agents are the simplest agents. These agents take decisions on
the basis of the current percepts and ignore the rest of the percept history.

o These agents only succeed in the fully observable environment.


o The Simple reflex agent does not consider any part of the percept history during
its decision and action process.

o The Simple reflex agent works on the condition-action rule, which means it maps the
current state directly to an action. For example, a Room Cleaner agent cleans only
if there is dirt in the room (a minimal sketch follows the list below).

o Problems with the simple reflex agent design approach:

o They have very limited intelligence.

o They do not have knowledge of non-perceptual parts of the current state.

o The rule set is mostly too big to generate and to store.

o They are not adaptive to changes in the environment.
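
The following is a minimal Python sketch of the Room Cleaner example as a
condition-action agent. It assumes a hypothetical two-location vacuum world with
locations 'A' and 'B' and percepts of the form (location, status); the action names
are illustrative, not part of the syllabus.

```python
def simple_reflex_vacuum_agent(percept):
    """Condition-action rule for a hypothetical two-location vacuum world.

    The agent looks only at the current percept (location, status) and
    ignores the percept history, as described above.
    """
    location, status = percept
    if status == 'Dirty':
        return 'Suck'                                  # rule: if there is dirt, clean it
    return 'Right' if location == 'A' else 'Left'      # otherwise move to the other room

# Example: simple_reflex_vacuum_agent(('A', 'Dirty')) returns 'Suck'
```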

3.2. Model-based reflex agent

o The Model-based agent can work in a partially observable environment, and


track the situation.

o A model-based agent has two important factors:

o Model: It is knowledge about "how things happen in the world," so it is


called a Model-based agent.


o Internal State: It is a representation of the current state based on percept


history.

o These agents have a model, which is "knowledge of the world", and they perform
actions based on this model.

o Updating the agent state requires information about:

a. How the world evolves

b. How the agent's action affects the world.

3.3. Goal-based agents

o Knowledge of the current state of the environment is not always sufficient for an
agent to decide what to do.

o The agent needs to know its goal which describes desirable situations.

o Goal-based agents expand the capabilities of the model-based agent by having


the "goal" information.

o They choose an action, so that they can achieve the goal.


o These agents may have to consider a long sequence of possible actions before
deciding whether the goal is achieved or not. Such consideration of different
scenarios is called searching and planning, which makes an agent proactive.

3.4. Utility-based agents

o These agents are similar to the goal-based agent but provide an extra component
of utility measurement which makes them different by providing a measure of
success at a given state.

o Utility-based agents act based not only on goals but also on the best way to achieve
those goals.

o The Utility-based agent is useful when there are multiple possible alternatives,
and an agent has to choose in order to perform the best action.

o The utility function maps each state to a real number to check how efficiently
each action achieves the goals.


3.5. Learning Agents

o A learning agent in AI is the type of agent which can learn from its past
experiences, or it has learning capabilities.

o It starts to act with basic knowledge and then able to act and adapt automatically
through learning.

o A learning agent has mainly four conceptual components, which are:

a. Learning element: It is responsible for making improvements by
learning from the environment.

b. Critic: The learning element takes feedback from the critic, which describes
how well the agent is doing with respect to a fixed performance standard.

c. Performance element: It is responsible for selecting external action

d. Problem generator: This component is responsible for suggesting


actions that will lead to new and informative experiences.

Hence, learning agents are able to learn, analyze performance, and look for new ways to
improve the performance.


TOPIC 4: SEARCH ALGORITHMS

In Artificial Intelligence, search techniques are universal problem-solving
methods. Rational agents or problem-solving agents in AI mostly use these search
strategies or algorithms to solve a specific problem and provide the best result.
Problem-solving agents are goal-based agents and use an atomic representation.

Search Algorithm Terminologies:

Search: Searching is a step by step procedure to solve a search-problem in a given


search space. A search problem can have three main factors:

o Search Space: Search space represents a set of possible solutions, which


a system may have.
o Start State: It is a state from where agent begins the search.
o Goal test: It is a function which observes the current state and returns
whether the goal state is achieved or not.

Search tree: A tree representation of search problem is called Search tree. The root of
the search tree is the root node which is corresponding to the initial state.

Actions: It gives the description of all the available actions to the agent.

Transition model: A description of what each action does; it can be represented as a
transition model.


Path Cost: It is a function which assigns a numeric cost to each path.

Solution: It is an action sequence which leads from the start node to the goal node.
Optimal Solution: A solution is optimal if it has the lowest cost among all solutions.

Properties of Search Algorithms:

Following are the four essential properties of search algorithms to compare the
efficiency of these algorithms:

 Completeness: A search algorithm is said to be complete if it is guaranteed to
return a solution whenever at least one solution exists for any random input.
 Optimality: If the solution found by an algorithm is guaranteed to be the best
solution (lowest path cost) among all other solutions, then it is
said to be an optimal solution.
 Time Complexity: Time complexity is a measure of the time an algorithm takes to
complete its task.
 Space Complexity: It is the maximum storage space required at any point during
the search, expressed in terms of the complexity of the problem.

Types of search algorithms

Based on the search problems we can classify the search algorithms into uninformed
(Blind search) search and informed search (Heuristic search) algorithms.


Uninformed/Blind Search:

 The uninformed search does not contain any domain knowledge such as
closeness, the location of the goal.
 It operates in a brute-force way as it only includes information about how to
traverse the tree and how to identify leaf and goal nodes.
 Uninformed search operates on the search tree without any information
about the search space, such as initial-state operators or a test for the
goal, so it is also called blind search.
 It examines each node of the tree until it achieves the goal node.

It can be divided into five main types:

o Breadth-first search
o Uniform cost search
o Depth-first search
o Iterative deepening depth-first search
o Bidirectional Search

Informed Search

 Informed search algorithms use domain knowledge. In an informed search,


problem information is available which can guide the search.
 Informed search strategies can find a solution more efficiently than an
uninformed search strategy. Informed search is also called a Heuristic search.
 A heuristic is a technique which might not always find the best solution but is
guaranteed to find a good solution in reasonable time.

Informed search can solve much more complex problems which could not be solved in
another way.

An example of a problem tackled with informed search algorithms is the traveling
salesman problem.


1. Greedy Search
2. A* Search

TOPIC 5: UNINFORMED SEARCH STRATEGIES

Breadth-first Search:

o Breadth-first search is the most common search strategy for traversing a tree or
graph. This algorithm searches breadthwise in a tree or graph, so it is called
breadth-first search.


o BFS algorithm starts searching from the root node of the tree and expands all
successor node at the current level before moving to nodes of next level.
o The breadth-first search algorithm is an example of a general-graph search
algorithm.
o Breadth-first search is implemented using a FIFO queue data structure.

Advantages:
o BFS will provide a solution if any solution exists.
o If there are more than one solutions for a given problem, then BFS will provide
the minimal solution which requires the least number of steps.

Disadvantages:
o It requires lots of memory since each level of the tree must be saved into
memory to expand the next level.
o BFS needs lots of time if the solution is far away from the root node.

Example:

In the below tree structure, we have shown the traversing of the tree using BFS
algorithm from the root node S to goal node K. BFS search algorithm traverse in layers,
so it will follow the path which is shown by the dotted arrow, and the traversed path
will be:

S ---> A ---> B ---> C ---> D ---> G ---> H ---> E ---> F ---> I ---> K


Time Complexity: The time complexity of the BFS algorithm is given by the number of
nodes traversed in BFS until the shallowest goal node, where d = depth of the shallowest
solution and b = branching factor (number of successors at each node):

T(b) = 1 + b + b^2 + b^3 + ... + b^d = O(b^d)

Space Complexity: The space complexity of the BFS algorithm is given by the memory size
of the frontier, which is O(b^d).

Completeness: BFS is complete, which means if the shallowest goal node is at some
finite depth, then BFS will find a solution.

Optimality: BFS is optimal if path cost is a non-decreasing function of the depth of the
node.
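
As a concrete illustration of the FIFO-queue behaviour described above, here is a
minimal Python sketch of breadth-first search over a graph given as an adjacency-list
dictionary. The graph, node names, and function name are illustrative assumptions,
not part of the syllabus.

```python
from collections import deque

def breadth_first_search(graph, start, goal):
    """Minimal BFS sketch: expands nodes level by level using a FIFO queue.

    `graph` maps a node to a list of successor nodes; returns the path from
    start to goal as a list, or None if no path exists.
    """
    frontier = deque([start])        # FIFO queue of nodes to expand
    parents = {start: None}          # also acts as the visited set
    while frontier:
        node = frontier.popleft()
        if node == goal:             # goal test
            path = []
            while node is not None:  # reconstruct the path via parent links
                path.append(node)
                node = parents[node]
            return list(reversed(path))
        for child in graph.get(node, []):
            if child not in parents:
                parents[child] = node
                frontier.append(child)
    return None

# Example with a small hypothetical graph:
# graph = {'S': ['A', 'B'], 'A': ['C', 'D'], 'B': ['E', 'F'], 'E': ['K']}
# breadth_first_search(graph, 'S', 'K')  ->  ['S', 'B', 'E', 'K']
```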

2. Depth-first Search

o Depth-first search is a recursive algorithm for traversing a tree or graph data
structure.
o It is called the depth-first search because it starts from the root node and follows
each path to its greatest depth node before moving to the next path.
o DFS uses a stack data structure for its implementation.
o The process of the DFS algorithm is similar to the BFS algorithm.

Note: Backtracking is an algorithm technique for finding all possible solutions using
recursion.

Advantage:
o DFS requires very little memory as it only needs to store a stack of the nodes on
the path from the root node to the current node.
o It takes less time to reach the goal node than the BFS algorithm (if it traverses
the right path).

Disadvantage:
o There is the possibility that many states keep re-occurring, and there is no
guarantee of finding the solution.
o The DFS algorithm goes deep down into the search space, and sometimes it may get
stuck in an infinite loop.

Example:

In the below search tree, we have shown the flow of depth-first search, and it will follow
the order as:
Root node--->Left node ----> right node.


It will start searching from root node S, and traverse A, then B, then D and E, after
traversing E, it will backtrack the tree as E has no other successor and still goal node is
not found. After backtracking it will traverse node C and then G, and here it will
terminate as it found goal node.

Completeness: DFS search algorithm is complete within finite state space as it will
expand every node within a limited search tree.

Time Complexity: The time complexity of DFS is equivalent to the number of nodes
traversed by the algorithm. It is given by:

T(b) = 1 + b + b^2 + ... + b^m = O(b^m)

where m = maximum depth of any node, which can be much larger than d (the shallowest
solution depth).

Space Complexity: The DFS algorithm needs to store only a single path from the root node,
hence the space complexity of DFS is equivalent to the size of the fringe set, which is
O(bm).

Optimal: The DFS algorithm is non-optimal, as it may take a large number of
steps or a high-cost path to reach the goal node.
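
A minimal recursive sketch of depth-first search, under the same assumptions as the
BFS sketch above (adjacency-list dictionary, illustrative names):

```python
def depth_first_search(graph, start, goal, visited=None):
    """Minimal recursive DFS sketch: follows each path to its greatest depth
    before backtracking. Returns a path as a list, or None.

    The visited set prevents re-expanding states on cyclic graphs.
    """
    if visited is None:
        visited = set()
    visited.add(start)
    if start == goal:
        return [start]
    for child in graph.get(start, []):
        if child not in visited:
            sub_path = depth_first_search(graph, child, goal, visited)
            if sub_path is not None:
                return [start] + sub_path     # prepend current node and unwind
    return None                               # dead end: backtrack
```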

3. Depth-Limited Search Algorithm:

A depth-limited search algorithm is similar to depth-first search with a predetermined
limit. Depth-limited search can solve the drawback of the infinite path in depth-first
search. In this algorithm, a node at the depth limit is treated as if it has no further
successor nodes.

Depth-limited search can be terminated with two Conditions of failure:

o Standard failure value: It indicates that the problem does not have any solution.
o Cutoff failure value: It indicates that there is no solution for the problem within the
given depth limit.

Advantages:

Depth-limited search is Memory efficient.

Disadvantages:

o Depth-limited search also has a disadvantage of incompleteness.


o It may not be optimal if the problem has more than one solution.

Example:

Completeness: The DLS algorithm is complete if the solution is above the depth limit.
Time Complexity: The time complexity of the DLS algorithm is O(b^ℓ).
Space Complexity: The space complexity of the DLS algorithm is O(b×ℓ).
Optimal: Depth-limited search can be viewed as a special case of DFS, and it is also not
optimal, even if ℓ > d.
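
A minimal sketch of depth-limited search, returning either a path, the special value
'cutoff' (the cutoff failure value), or None (the standard failure value); representing the
two failure outcomes this way is an assumption made for illustration:

```python
def depth_limited_search(graph, node, goal, limit):
    """Minimal DLS sketch: DFS that treats nodes at the depth limit as
    having no successors. Returns a path, 'cutoff', or None (failure)."""
    if node == goal:
        return [node]
    if limit == 0:
        return 'cutoff'                       # cutoff failure value
    cutoff_occurred = False
    for child in graph.get(node, []):
        result = depth_limited_search(graph, child, goal, limit - 1)
        if result == 'cutoff':
            cutoff_occurred = True
        elif result is not None:
            return [node] + result
    return 'cutoff' if cutoff_occurred else None   # standard failure value
```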


4. Uniform-cost Search Algorithm:

 Uniform-cost search is a searching algorithm used for traversing a weighted tree


or graph. This algorithm comes into play when a different cost is available for
each edge.
 The primary goal of the uniform-cost search is to find a path to the goal node
which has the lowest cumulative cost.
 Uniform-cost search expands nodes according to their path costs from the root
node.
 It can be used to solve any graph/tree where the optimal cost is in demand.
 A uniform-cost search algorithm is implemented by the priority queue. It gives
maximum priority to the lowest cumulative cost.
 Uniform cost search is equivalent to BFS algorithm if the path cost of all edges is
the same.

Advantages:
o Uniform cost search is optimal because at every state the path with the least cost
is chosen.
Disadvantages:
o It does not care about the number of steps involved in the search and is only
concerned with path cost. Because of this, the algorithm may get stuck in an
infinite loop.
Example:

Completeness:


Uniform-cost search is complete: if there is a solution, UCS will find it.

Time Complexity:

Let C* be the cost of the optimal solution, and let ε be the minimum cost of each step
toward the goal. Then the number of steps is at most 1 + ⌊C*/ε⌋ (the +1 is because we
start from depth 0 and go up to ⌊C*/ε⌋).

Hence, the worst-case time complexity of uniform-cost search is O(b^(1 + ⌊C*/ε⌋)).

Space Complexity:

By the same logic, the worst-case space complexity of uniform-cost search is
O(b^(1 + ⌊C*/ε⌋)).

Optimal:

Uniform-cost search is always optimal as it only selects a path with the lowest path cost.
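
A minimal sketch of uniform-cost search with a priority queue, assuming the graph
maps each node to a list of (neighbour, step_cost) pairs (the names and structure are
illustrative):

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Minimal UCS sketch: always expands the frontier node with the lowest
    cumulative path cost. Returns (cost, path) or None."""
    frontier = [(0, start, [start])]      # priority queue keyed by path cost
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path             # goal test on expansion keeps UCS optimal
        if node in explored:
            continue
        explored.add(node)
        for child, step_cost in graph.get(node, []):
            if child not in explored:
                heapq.heappush(frontier, (cost + step_cost, child, path + [child]))
    return None
```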

5. Iterative deepening depth-first Search:

 The iterative deepening algorithm is a combination of DFS and BFS algorithms.


This search algorithm finds out the best depth limit and does it by gradually
increasing the limit until a goal is found.
 This algorithm performs depth-first search up to a certain "depth limit", and it
keeps increasing the depth limit after each iteration until the goal node is found.
 This Search algorithm combines the benefits of Breadth-first search's fast search
and depth-first search's memory efficiency.
 The iterative deepening algorithm is a useful uninformed search when the search
space is large and the depth of the goal node is unknown.

Advantages:
o It combines the benefits of BFS and DFS search algorithm in terms of fast search
and memory efficiency.
Disadvantages:
o The main drawback of IDDFS is that it repeats all the work of the previous phase.

Example:

Following tree structure is showing the iterative deepening depth-first search. IDDFS
algorithm performs various iterations until it does not find the goal node. The iteration
performed by the algorithm is given as:


1'st Iteration-----> A
2'nd Iteration----> A, B, C
3'rd Iteration------>A, B, D, E, C, F, G
4'th Iteration------>A, B, D, H, I, E, C, F, K, G
In the fourth iteration, the algorithm will find the goal node.

Completeness:
This algorithm is complete if the branching factor is finite.
Time Complexity:
Let b be the branching factor and d the depth of the shallowest goal; then the worst-case
time complexity is O(b^d).
Space Complexity:
The space complexity of IDDFS is O(bd).
Optimal:
The IDDFS algorithm is optimal if the path cost is a non-decreasing function of the depth
of the node.
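
A self-contained sketch of iterative deepening, which simply repeats a depth-limited
DFS with an increasing limit; the max_depth bound is an illustrative assumption to keep
the loop finite.

```python
def iterative_deepening_search(graph, start, goal, max_depth=50):
    """Minimal IDDFS sketch: run depth-limited DFS with limits 0, 1, 2, ...
    until the goal is found or the whole (finite) space is exhausted."""
    def dls(node, limit):
        if node == goal:
            return [node]
        if limit == 0:
            return 'cutoff'
        cutoff = False
        for child in graph.get(node, []):
            result = dls(child, limit - 1)
            if result == 'cutoff':
                cutoff = True
            elif result is not None:
                return [node] + result
        return 'cutoff' if cutoff else None

    for limit in range(max_depth + 1):
        result = dls(start, limit)
        if result != 'cutoff':
            return result        # either a path, or None if no solution exists
    return None
```
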
6. Bidirectional Search Algorithm:

 The bidirectional search algorithm runs two simultaneous searches, one from the
initial state, called the forward search, and the other from the goal node, called the
backward search, to find the goal node.
 Bidirectional search replaces a single search graph with two smaller subgraphs,
in which one starts the search from the initial vertex and the other starts from the
goal vertex. The search stops when these two graphs intersect each other.

Bidirectional search can use search techniques such as BFS, DFS, DLS, etc.


Advantages:
o Bidirectional search is fast.
o Bidirectional search requires less memory
Disadvantages:
o Implementation of the bidirectional search tree is difficult.

o In bidirectional search, one should know the goal state in advance.

Example:

In the below search tree, bidirectional search algorithm is applied. This algorithm
divides one graph/tree into two sub-graphs. It starts traversing from node 1 in the
forward direction and starts from goal node 16 in the backward direction.

The algorithm terminates at node 9 where two searches meet.

Completeness: Bidirectional search is complete if we use BFS in both searches.

Time Complexity: The time complexity of bidirectional search using BFS is O(b^(d/2)).

Space Complexity: The space complexity of bidirectional search is O(b^(d/2)).

Optimal: Bidirectional search is Optimal.
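
A minimal sketch of bidirectional search using BFS from both ends; it assumes an
undirected adjacency-list graph (so the same dictionary can be used for the backward
search) and, for brevity, returns only the node where the two frontiers meet rather than
the full path:

```python
from collections import deque

def bidirectional_search(graph, start, goal):
    """Minimal bidirectional BFS sketch: expands one node from the forward
    frontier and one from the backward frontier per iteration, and stops as
    soon as the two searches intersect."""
    if start == goal:
        return start
    frontier_f, frontier_b = deque([start]), deque([goal])
    visited_f, visited_b = {start}, {goal}
    while frontier_f and frontier_b:
        for frontier, visited, other_visited in (
            (frontier_f, visited_f, visited_b),   # forward step
            (frontier_b, visited_b, visited_f),   # backward step
        ):
            node = frontier.popleft()
            for child in graph.get(node, []):
                if child in other_visited:
                    return child                  # the two searches meet here
                if child not in visited:
                    visited.add(child)
                    frontier.append(child)
    return None
```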


TOPIC 6: HEURISTIC SEARCH STRATEGIES

 An informed search algorithm uses knowledge such as how far we
are from the goal, the path cost, how to reach the goal node, etc. This knowledge helps
agents explore less of the search space and find the goal node more efficiently.
 The informed search algorithm is more useful for large search spaces. Informed
search algorithms use the idea of a heuristic, so informed search is also called
heuristic search.

Heuristics function:

 A heuristic is a function which is used in informed search; it finds the most
promising path.
 It takes the current state of the agent as its input and produces an estimate of
how close the agent is to the goal.
 The heuristic method might not always give the best solution, but it is
guaranteed to find a good solution in reasonable time.
 A heuristic function estimates how close a state is to the goal. It is represented by
h(n), and it estimates the cost of an optimal path between the given state and the
goal state. The value of the heuristic function is always non-negative.

The admissibility of the heuristic function is given as:

h(n) <= h*(n)

Here h(n) is the heuristic cost and h*(n) is the actual (true) cost. Hence the heuristic cost
should be less than or equal to the actual cost.

Pure heuristic search is the simplest form of heuristic search algorithms. It expands
nodes based on their heuristic value h(n). It maintains two lists, OPEN and CLOSED list.
In the CLOSED list, it places those nodes which have already been expanded, and in the
OPEN list, it places nodes which have not yet been expanded.

On each iteration, the node n with the lowest heuristic value is expanded; all its
successors are generated and n is placed in the closed list. The algorithm continues
until a goal state is found.

In the informed search we will discuss two main algorithms which are given below:

o Best First Search Algorithm(Greedy search)


o A* Search Algorithm

6.1.Best-first Search Algorithm (Greedy Search):

Greedy best-first search algorithm always selects the path which appears best at that
moment. It is the combination of depth-first search and breadth-first search algorithms.


It uses a heuristic function to guide the search. Best-first search allows us to take the
advantages of both algorithms. With the help of best-first search, at each step, we can
choose the most promising node. In the best-first search algorithm, we expand the node
which is closest to the goal node, where the closeness is estimated by the heuristic
function, i.e.

f(n) = h(n)

where h(n) = estimated cost from node n to the goal.

The greedy best-first algorithm is implemented using a priority queue.

Best first search algorithm:

o Step 1: Place the starting node into the OPEN list.


o Step 2: If the OPEN list is empty, Stop and return failure.
o Step 3: Remove the node n from the OPEN list which has the lowest value of
h(n), and place it in the CLOSED list.
o Step 4: Expand the node n, and generate the successors of node n.
o Step 5: Check each successor of node n, and find whether any node is a goal node
or not. If any successor node is goal node, then return success and terminate the
search, else proceed to Step 6.
o Step 6: For each successor node, the algorithm computes the evaluation function f(n)
and then checks whether the node is already in the OPEN or CLOSED list. If the node
is in neither list, then add it to the OPEN list.
o Step 7: Return to Step 2.

Advantages:
o Best first search can switch between BFS and DFS by gaining the advantages of
both the algorithms.
o This algorithm is more efficient than BFS and DFS algorithms.

Disadvantages:
o It can behave as an unguided depth-first search in the worst case scenario.
o It can get stuck in a loop as DFS.
o This algorithm is not optimal.

Example:

Consider the below search problem, and we will traverse it using greedy best-first
search. At each iteration, each node is expanded using evaluation function f(n)=h(n) ,
which is given in the below table.


In this search example, we are using two lists which are OPEN and CLOSED Lists.
Following are the iteration for traversing the above example.

Expand the nodes of S and put in the CLOSED list

Initialization: Open [A, B], Closed [S]

Iteration 1: Open [A], Closed [S, B]

Iteration 2: Open [E, F, A], Closed [S, B]


: Open [E, A], Closed [S, B, F]


Iteration 3: Open [I, G, E, A], Closed [S, B, F]


: Open [I, E, A], Closed [S, B, F, G]

Hence the final solution path will be: S----> B----->F----> G

Time Complexity: The worst-case time complexity of greedy best-first search is O(b^m).

Space Complexity: The worst-case space complexity of greedy best-first search is
O(b^m), where m is the maximum depth of the search space.

Complete: Greedy best-first search is also incomplete, even if the given state space is
finite.

Optimal: Greedy best first search algorithm is not optimal.
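
A minimal sketch of greedy best-first search with explicit OPEN (priority queue) and
CLOSED lists, assuming `h` is a dictionary of heuristic values h(n) supplied by the
caller (an illustrative interface, not prescribed by the syllabus):

```python
import heapq

def greedy_best_first_search(graph, h, start, goal):
    """Minimal greedy best-first sketch: nodes are expanded in order of the
    heuristic value alone, i.e. f(n) = h(n)."""
    frontier = [(h[start], start, [start])]    # OPEN list as a priority queue
    closed = set()                             # CLOSED list
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in closed:
            continue
        closed.add(node)
        for child in graph.get(node, []):
            if child not in closed:
                heapq.heappush(frontier, (h[child], child, path + [child]))
    return None
```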

6.2. A* Search Algorithm:

 A* search is the most commonly known form of best-first search. It uses the heuristic
function h(n) and the cost to reach node n from the start state, g(n).
 It combines features of UCS and greedy best-first search, which lets it solve the
problem efficiently. The A* search algorithm finds the shortest path through the search
space using the heuristic function.
 This search algorithm expands a smaller search tree and provides optimal results faster.
The A* algorithm is similar to UCS except that it uses g(n) + h(n) instead of g(n).
 In the A* search algorithm, we use the search heuristic as well as the cost to reach the
node. Hence we can combine both costs as follows, and this sum is called the fitness
number:

f(n) = g(n) + h(n)

At each point in the search space, only the node with the lowest value of f(n) is
expanded, and the algorithm terminates when the goal node is found.

Algorithm of A* search:

Step1: Place the starting node in the OPEN list.

Step 2: Check if the OPEN list is empty or not, if the list is empty then return failure and
stops.


Step 3: Select the node from the OPEN list which has the smallest value of evaluation
function (g+h), if node n is goal node then return success and stop, otherwise

Step 4: Expand node n and generate all of its successors, and put n into the closed list.
For each successor n', check whether n' is already in the OPEN or CLOSED list, if not
then compute evaluation function for n' and place into Open list.

Step 5: Else, if node n' is already in OPEN or CLOSED, then it should be attached to the
back pointer which reflects the lowest g(n') value.

Step 6: Return to Step 2.

Advantages:
o The A* search algorithm performs better than other search algorithms.
o The A* search algorithm is optimal and complete.

o This algorithm can solve very complex problems.

Disadvantages:
o It does not always produce the shortest path, as it is mostly based on heuristics and
approximation.
o A* search algorithm has some complexity issues.
o The main drawback of A* is memory requirement as it keeps all generated nodes
in the memory, so it is not practical for various large-scale problems.

Example:
In this example, we will traverse the given graph using the A* algorithm. The heuristic
value of all states is given in the below table so we will calculate the f(n) of each state
using the formula f(n)= g(n) + h(n), where g(n) is the cost to reach any node from start
state.
Here we will use OPEN and CLOSED list.


Solution:

Initialization: {(S, 5)}


Iteration1: {(S--> A, 4), (S-->G, 10)}
Iteration2: {(S--> A-->C, 4), (S--> A-->B, 7), (S-->G, 10)}
Iteration3: {(S--> A-->C--->G, 6), (S--> A-->C--->D, 11), (S--> A-->B, 7), (S-->G, 10)}
Iteration 4 will give the final result: S--->A--->C--->G, which provides the optimal path
with cost 6.
Points to remember:

o A* algorithm returns the path which occurred first, and it does not search for all
remaining paths.
o The efficiency of A* algorithm depends on the quality of heuristic.
o The A* algorithm expands all nodes which satisfy the condition f(n) < C*, where C* is
the cost of the optimal solution.

Complete: The A* algorithm is complete as long as:

o The branching factor is finite.

o Every action has a fixed, positive cost.

Optimal: The A* search algorithm is optimal if it follows these two conditions:

o Admissible: the first condition required for optimality is that h(n) should be an
admissible heuristic for A* tree search. An admissible heuristic is optimistic in
nature.
o Consistency: the second required condition, consistency, is needed only for A* graph
search.

If the heuristic function is admissible, then A* tree search will always find the least cost
path.


Time Complexity: The time complexity of the A* search algorithm depends on the heuristic
function, and the number of nodes expanded is exponential in the depth of the solution d.
So the time complexity is O(b^d), where b is the branching factor.

Space Complexity: The space complexity of the A* search algorithm is O(b^d).
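
A minimal sketch of A*, reusing the weighted-graph convention from the UCS sketch
(node -> list of (neighbour, step_cost) pairs) and a heuristic dictionary `h` assumed to
be admissible; nodes are ordered by f(n) = g(n) + h(n):

```python
import heapq

def a_star_search(graph, h, start, goal):
    """Minimal A* sketch: expands nodes in order of f(n) = g(n) + h(n).

    Returns (cost, path) to the goal, or None if no path exists.
    """
    frontier = [(h[start], 0, start, [start])]   # entries are (f, g, node, path)
    best_g = {start: 0}                          # cheapest g(n) found so far
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for child, step_cost in graph.get(node, []):
            new_g = g + step_cost
            if new_g < best_g.get(child, float('inf')):
                best_g[child] = new_g            # keep the pointer with the lowest g
                heapq.heappush(frontier,
                               (new_g + h[child], new_g, child, path + [child]))
    return None
```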

6.3.Memory Bounded Heuristic Search

Some solutions to A* space problems (maintain completeness and optimality)


– Iterative-deepening A* (IDA*)
– Here cutoff information is the f-cost (g+h) instead of depth
– Recursive best-first search(RBFS)
– Recursive algorithm that attempts to mimic standard best-first
search with linear space.
– (simple) Memory-bounded A* ((S)MA*)
– Drop the worst-leaf node when memory is full

Recursive best-first search(RBFS)

 Keeps track of the f-value of the best alternative path available.

 If the current f-value exceeds this alternative f-value, then backtrack to the
alternative path.
 Upon backtracking, change the f-value to the best f-value of its children.
 Re-expansion of this subtree is thus still possible later.

 The path up to Rimnicu Vilcea is already expanded.

 Above each node, the f-limit for every recursive call is shown.
 Below each node: f(n).
 The path is followed until Pitesti, which has an f-value worse than the f-limit.


Unwind the recursion and store the best f-value for the current best leaf, Pitesti:

result, f[best] ← RBFS(problem, best, min(f_limit, alternative))

best is now Fagaras. Call RBFS for the new best.

– best value is now 450

Unwind the recursion and store the best f-value for the current best leaf, Fagaras:

result, f[best] ← RBFS(problem, best, min(f_limit, alternative))


best is now Rimnicu Vilcea (again). Call RBFS for the new best.
– The subtree is expanded again.
– The best alternative subtree is now through Timisoara.
The solution is found because 447 > 417.

Simple Memory Bounded A* (SMA*)

 This is like A*, but when memory is full we delete the worst node (largest f-
value).
 Like RBFS, we remember the best descendant in the branch we delete.
 If there is a tie (equal f-values) we delete the oldest nodes first.
 SMA* finds the optimal reachable solution given the memory
constraint.
 Time can still be exponential.

(Figure: progress of SMA* with a memory limit of three nodes, so the maximal search
depth is 3. Each node is labeled with its current f-cost, where f = g + h; values in
parentheses show the value of the best forgotten descendant, i.e. the best estimated
solution so far for that node.)

TOPIC 7: LOCAL SEARCH AND OPTIMIZATION PROBLEMS

o Hill climbing algorithm is a local search algorithm which continuously moves in


the direction of increasing elevation/value to find the peak of the mountain or
best solution to the problem. It terminates when it reaches a peak value where
no neighbor has a higher value.
o Hill climbing algorithm is a technique which is used for optimizing the
mathematical problems. One of the widely discussed examples of Hill climbing
algorithm is Traveling-salesman Problem in which we need to minimize the
distance traveled by the salesman.
o It is also called greedy local search as it only looks to its good immediate
neighbor state and not beyond that.
o A node of hill climbing algorithm has two components which are state and value.
o Hill Climbing is mostly used when a good heuristic is available.


o In this algorithm, we don't need to maintain and handle the search tree or graph
as it only keeps a single current state.

Features of Hill Climbing:

Following are some main features of Hill Climbing Algorithm:

o Generate and Test variant: Hill climbing is a variant of the Generate and Test
method. The Generate and Test method produces feedback which helps to decide
which direction to move in the search space.
o Greedy approach: Hill-climbing algorithm search moves in the direction which
optimizes the cost.
o No backtracking: It does not backtrack the search space, as it does not
remember the previous states.

State-space Diagram for Hill Climbing:

The state-space landscape is a graphical representation of the hill-climbing
algorithm, showing a graph between the various states of the algorithm and the objective
function/cost.

On Y-axis we have taken the function which can be an objective function or cost
function, and state-space on the x-axis. If the function on Y-axis is cost then, the goal of
search is to find the global minimum and local minimum. If the function of Y-axis is
Objective function, then the goal of the search is to find the global maximum and local
maximum.

Different regions in the state space landscape:


Local Maximum: Local maximum is a state which is better than its neighbor states, but
there is also another state which is higher than it.

Global Maximum: Global maximum is the best possible state of state space landscape.
It has the highest value of objective function.

Current state: It is a state in a landscape diagram where an agent is currently present.

Flat local maximum: It is a flat space in the landscape where all the neighbor states of
current states have the same value.

Shoulder: It is a plateau region which has an uphill edge.

Types of Hill Climbing Algorithm:

o Simple hill Climbing:


o Steepest-Ascent hill-climbing:
o Stochastic hill Climbing:

7.1. Simple Hill Climbing:

Simple hill climbing is the simplest way to implement a hill climbing algorithm. It
evaluates one neighbor node state at a time and selects the first one that improves the
current cost, setting it as the current state. It checks only one successor state, and if that
successor is better than the current state, it moves there; otherwise it stays in the same
state. This algorithm has the following features:

o Less time consuming


o Less optimal solution and the solution is not guaranteed

Algorithm for Simple Hill Climbing:

o Step 1: Evaluate the initial state, if it is goal state then return success and Stop.
o Step 2: Loop Until a solution is found or there is no new operator left to apply.
o Step 3: Select and apply an operator to the current state.
o Step 4: Check new state:
a. If it is goal state, then return success and quit.
b. Else if it is better than the current state then assign new state as a current state.
c. Else if not better than the current state, then return to step2.
Step 5: Exit.
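
A minimal sketch of the steps above, assuming the caller supplies a `neighbours(state)`
function yielding successor states and a `value(state)` objective function (both are
hypothetical placeholders, not part of the syllabus):

```python
def simple_hill_climbing(initial_state, neighbours, value):
    """Minimal simple hill climbing sketch: accept the first successor that
    is better than the current state; stop when none improves."""
    current = initial_state
    while True:
        improved = False
        for candidate in neighbours(current):
            if value(candidate) > value(current):
                current = candidate        # first improving successor wins
                improved = True
                break                      # do not examine further neighbours
        if not improved:
            return current                 # no operator improves: local maximum
```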

7.2. Steepest-Ascent hill climbing:


The steepest-ascent algorithm is a variation of the simple hill climbing algorithm. This
algorithm examines all the neighboring nodes of the current state and selects the
neighbor node which is closest to the goal state. This algorithm consumes more time as
it searches for multiple neighbors.

Algorithm for Steepest-Ascent hill climbing:

o Step 1: Evaluate the initial state, if it is goal state then return success and stop,
else make current state as initial state.
o Step 2: Loop until a solution is found or the current state does not change.
a. Let SUCC be a state such that any successor of the current state will be better
than it (i.e. initialize SUCC so that any successor can replace it).
b. For each operator that applies to the current state:

 Apply the new operator and generate a new state.


 Evaluate the new state.
 If it is goal state, then return it and quit, else compare it to the SUCC.
 If it is better than SUCC, then set new state as SUCC.
 If the SUCC is better than the current state, then set current state to SUCC.

Step 3: Exit.
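
A minimal sketch of steepest-ascent hill climbing under the same assumed
`neighbours` / `value` interface as the previous sketch; here all neighbours are
evaluated and the best one (SUCC) is adopted only if it improves on the current state:

```python
def steepest_ascent_hill_climbing(initial_state, neighbours, value):
    """Minimal steepest-ascent sketch: examine every neighbour, keep the best
    one as SUCC, and move there only if it beats the current state."""
    current = initial_state
    while True:
        succ = max(neighbours(current), key=value, default=None)
        if succ is None or value(succ) <= value(current):
            return current                 # no neighbour improves: stop
        current = succ                     # move to the best neighbour
```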

7.3. Stochastic hill climbing:

Stochastic hill climbing does not examine all of its neighbors before moving. Instead, this
search algorithm selects one neighbor node at random and decides whether to adopt it
as the current state or examine another state.
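
A minimal sketch of the stochastic variant under the same assumed interface, with
`neighbours(state)` returning a list and `max_steps` an illustrative stopping bound:

```python
import random

def stochastic_hill_climbing(initial_state, neighbours, value, max_steps=1000):
    """Minimal stochastic hill climbing sketch: pick a random neighbour each
    step and move there only if it improves the objective value."""
    current = initial_state
    for _ in range(max_steps):
        options = neighbours(current)
        if not options:
            break
        candidate = random.choice(options)
        if value(candidate) > value(current):
            current = candidate
    return current
```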

Problems in Hill Climbing Algorithm:

1. Local Maximum: A local maximum is a peak state in the landscape which is better
than each of its neighboring states, but there is another state also present which is
higher than the local maximum.

Solution: The backtracking technique can be a solution to the local maximum problem in the state space landscape. Maintain a list of promising paths so that the algorithm can backtrack in the search space and explore other paths as well.


2. Plateau: A plateau is a flat area of the search space in which all the neighbor states of the current state have the same value; because of this, the algorithm cannot find a best direction to move, and a hill-climbing search may get lost in the plateau area.

Solution: Take big steps (or very small steps) while searching. For example, randomly select a state that is far away from the current state, so that the algorithm may land in a non-plateau region.

3. Ridges: A ridge is a special form of the local maximum. It has an area which is higher
than its surrounding areas, but itself has a slope, and cannot be reached in a single
move.

Solution: By using bidirectional search, or by moving in different directions, we can overcome this problem.

Simulated Annealing:

A hill-climbing algorithm which never makes a move towards a lower value is guaranteed to be incomplete, because it can get stuck on a local maximum. If the algorithm instead performs a random walk, moving to a randomly chosen successor, it may be complete but it is not efficient. Simulated annealing is an algorithm which yields both efficiency and completeness.

In metallurgy, annealing is the process of heating a metal or glass to a high temperature and then cooling it gradually, which allows the material to settle into a low-energy crystalline state. The same idea is used in simulated annealing: the algorithm picks a random move instead of the best move. If the random move improves the state, it is always accepted. Otherwise, the move is accepted only with some probability less than 1, so the algorithm can occasionally move downhill, escape a local maximum, and then choose another path.
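
The following is a minimal Python sketch of simulated annealing, assuming a maximization problem; the temperature schedule (T0, cooling, min_T) and the toy objective are illustrative choices, not values from these notes.

import math
import random

def simulated_annealing(initial_state, neighbors, value,
                        T0=10.0, cooling=0.95, min_T=1e-3):
    """Always accept an improving random move; accept a worsening one with
    probability exp(delta / T), where T is gradually lowered (cooling)."""
    current = initial_state
    T = T0
    while T > min_T:
        candidate = random.choice(neighbors(current))
        delta = value(candidate) - value(current)
        if delta > 0 or random.random() < math.exp(delta / T):
            current = candidate
        T *= cooling                     # gradual cooling schedule
    return current

f = lambda x: -(x - 3) ** 2
step = lambda x: [x - 1, x + 1]
print(simulated_annealing(-8, step, f))  # usually ends at or near the global maximum x = 3

A worse move (delta < 0) is accepted with probability e^(delta/T), which is close to 1 at high temperature and approaches 0 as the system cools, so the search gradually turns into plain hill climbing.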

TOPIC 8: ADVERSARIAL SEARCH

Adversarial search is a search, where we examine the problem which arises when
we try to plan ahead of the world and other agents are planning against us.

o The search strategies studied so far are associated with a single agent that aims to find a solution, often expressed as a sequence of actions.
o But, there might be some situations where more than one agent is searching for
the solution in the same search space, and this situation usually occurs in game
playing.
o The environment with more than one agent is termed as multi-agent
environment, in which each agent is an opponent of other agent and playing
against each other. Each agent needs to consider the action of other agent and
effect of that action on their performance.
o So, Searches in which two or more players with conflicting goals are trying
to explore the same search space for the solution, are called adversarial
searches, often known as Games.
o Games are modeled as a Search problem and heuristic evaluation function, and
these are the two main factors which help to model and solve games in AI.

Types of Games in AI

                        Deterministic                       Chance Moves

Perfect information     Chess, Checkers, Go, Othello        Backgammon, Monopoly

Imperfect information   Battleships, blind tic-tac-toe      Bridge, poker, scrabble,
                                                            nuclear war


o Perfect information: A game with the perfect information is that in which


agents can look into the complete board. Agents have all the information about
the game, and they can see each other moves also. Examples are Chess, Checkers,
Go, etc.
o Imperfect information: If agents do not have all the information about the game and are not aware of what is going on, such games are called games with imperfect information, e.g., Battleship, blind tic-tac-toe, Bridge, etc.
o Deterministic games: Deterministic games are those games which follow a
strict pattern and set of rules for the games, and there is no randomness
associated with them. Examples are chess, Checkers, Go, tic-tac-toe, etc.
o Non-deterministic games: Non-deterministic are those games which have
various unpredictable events and has a factor of chance or luck. This factor of
chance or luck is introduced by either dice or cards. These are random, and each
action response is not fixed. Such games are also called as stochastic games.
Example: Backgammon, Monopoly, Poker, etc.

Zero-Sum Game

o Zero-sum games are adversarial search which involves pure competition.


o In Zero-sum game each agent's gain or loss of utility is exactly balanced by the
losses or gains of utility of another agent.
o One player of the game try to maximize one single value, while other player tries
to minimize it.
o Each move by one player in the game is called a ply.
o Chess and tic-tac-toe are examples of a Zero-sum game.

The Zero-sum game involved embedded thinking in which one agent or player is
trying to figure out:

o What to do.
o How to decide the move
o Needs to think about his opponent as well
o The opponent also thinks what to do

Each of the players is trying to find out the response of his opponent to their actions.
This requires embedded thinking or backward reasoning to solve the game problems in
AI.

Problem Formulation

o Initial state: It specifies how the game is set up at the start.


o Player(s): It specifies which player has the move in a state.


o Action(s): It returns the set of legal moves in state space.
o Result(s, a): It is the transition model, which specifies the result of moves in the
state space.
o Terminal-Test(s): Terminal test is true if the game is over, else it is false at any
case. The state where the game ends is called terminal states.
o Utility(s, p): A utility function gives the final numeric value for a game that ends
in terminal states s for player p. It is also called payoff function. For Chess, the
outcomes are a win, loss, or draw and its payoff values are +1, 0, ½. And for tic-
tac-toe, utility values are +1, -1, and 0.

Example: Tic-Tac-Toe game tree:

The following figure is showing part of the game-tree for tic-tac-toe game. Following are
some key points of the game:

o There are two players MAX and MIN.


o Players have an alternate turn and start with MAX.
o MAX maximizes the result of the game tree
o MIN minimizes the result.


o From the initial state, MAX has 9 possible moves as he starts first. MAX places x and MIN places o, and both players play alternately until we reach a leaf node where one player has three in a row or all squares are filled.
o Both players compute for each node the minimax value, which is the best achievable utility against an optimal adversary.
o Suppose both players are well aware of tic-tac-toe and play the best possible game. Each player does his best to prevent the other from winning. MIN is acting against MAX in the game.
o So in the game tree, we have a layer of MAX and a layer of MIN, and each layer is called a ply. MAX places x, then MIN puts o to prevent MAX from winning, and this game continues until a terminal node is reached.


o In this either MIN wins, MAX wins, or it's a draw. This game-tree is the whole
search space of possibilities that MIN and MAX are playing tic-tac-toe and taking
turns alternately.

Hence adversarial Search for the minimax procedure works as follows:

o It aims to find the optimal strategy for MAX to win the game.
o It follows the approach of Depth-first search.
o In the game tree, the optimal leaf node could appear at any depth of the tree.
o It propagates the minimax values up the tree from the terminal nodes.

Min-Max Algorithm

o In this algorithm two players play the game; one is called MAX and the other is called MIN.
o Each player tries to ensure that the opponent receives the minimum benefit while they themselves get the maximum benefit.
o Both Players of the game are opponent of each other, where MAX will select the
maximized value and MIN will select the minimized value.
o The minimax algorithm performs a depth-first search algorithm for the
exploration of the complete game tree.
o The minimax algorithm proceeds all the way down to the terminal node of the
tree, then backtrack the tree as the recursion.

Step-1: In the first step, the algorithm generates the entire game-tree and apply
the utility function to get the utility values for the terminal states. In the below
tree diagram, let's take A is the initial state of the tree. Suppose maximizer takes
first turn which has worst-case initial value =- infinity, and minimizer will take
next turn which has worst-case initial value = +infinity.


Step 2: Now, first we find the utilities value for the Maximizer, its initial value is -∞, so
we will compare each value in terminal state with initial value of Maximizer and
determines the higher nodes values. It will find the maximum among the all.

o For node D: max(-1, -∞) => max(-1, 4) = 4


o For Node E max(2, -∞) => max(2, 6)= 6
o For Node F max(-3, -∞) => max(-3,-5) = -3
o For node G max(0, -∞) = max(0, 7) = 7

Step 3: In the next step, it's a turn for minimizer, so it will compare all nodes
value with +∞, and will find the 3rd layer node values.


o For node B= min(4,6) = 4


o For node C= min (-3, 7) = -3

Step 4: Now it's a turn for Maximizer, and it will again choose the maximum of all nodes
value and find the maximum value for the root node. In this game tree, there are only 4
layers, hence we reach immediately to the root node, but in real games, there will be
more than 4 layers.

o For node A max(4, -3)= 4
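
A compact recursive Python sketch of the minimax procedure is given below, applied to the same four-layer tree as the worked example (leaf utilities -1, 4, 2, 6, -3, -5, 0, 7). Representing the game tree as nested lists is an assumption made for brevity.

def minimax(node, maximizing):
    """Depth-first minimax over a game tree given as nested lists;
    a leaf is a number (utility), an internal node is a list of children."""
    if not isinstance(node, list):                  # terminal state: return its utility
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Game tree of the worked example: A (MAX) -> B, C (MIN) -> D..G (MAX) -> leaves.
tree = [[[-1, 4], [2, 6]],      # B = min(D, E) = min(4, 6) = 4
        [[-3, -5], [0, 7]]]     # C = min(F, G) = min(-3, 7) = -3
print(minimax(tree, maximizing=True))   # -> 4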

Properties of Mini-Max algorithm:


o Complete- The Min-Max algorithm is complete. It will definitely find a solution (if one exists) in a finite search tree.
o Optimal- The Min-Max algorithm is optimal if both opponents are playing optimally.
o Time complexity- As it performs DFS for the game-tree, the time complexity of the Min-Max algorithm is O(b^m), where b is the branching factor of the game-tree and m is the maximum depth of the tree.
o Space Complexity- The space complexity of the Mini-Max algorithm is similar to DFS, which is O(bm).

Limitation of the minimax Algorithm:

The main drawback of the minimax algorithm is that it gets really slow for complex games such as Chess, Go, etc. These types of games have a huge branching factor, and the player has lots of choices to decide among. This limitation of the minimax algorithm can be improved by alpha-beta pruning.

o In the minimax search algorithm, the number of game states it has to examine
are exponential in depth of the tree.
o We cannot eliminate the exponent, but we can effectively cut it in half. There is a technique by which we can compute the correct minimax decision without checking every node of the game tree, and this technique is called pruning.
o It involves two threshold parameters, Alpha and Beta, for future expansion, so it is called alpha-beta pruning. It is also called the Alpha-Beta Algorithm.
o Alpha-beta pruning can be applied at any depth of a tree, and sometimes it prunes not only the tree leaves but also entire sub-trees.
o The two-parameter can be defined as:
o Alpha: The best (highest-value) choice we have found so far at any
point along the path of Maximizer. The initial value of alpha is -∞.
o Beta: The best (lowest-value) choice we have found so far at any
point along the path of Minimizer. The initial value of beta is +∞.

o The Alpha-beta pruning to a standard minimax algorithm returns the same move
as the standard algorithm does, but it removes all the nodes which are not really
affecting the final decision but making algorithm slow. Hence by pruning these
nodes, it makes the algorithm fast.

Key points about alpha-beta pruning:


o The Max player will only update the value of alpha.


o The Min player will only update the value of beta.
o While backtracking the tree, the node values will be passed to upper nodes
instead of values of alpha and beta.

Working of Alpha-Beta Pruning:


o Let's take an example of two-player search tree to understand the working of
Alpha-beta pruning
o Step 1: At the first step, the Max player will start the first move from node A where α = -∞ and β = +∞. These values of alpha and beta are passed down to node B, where again α = -∞ and β = +∞, and node B passes the same values to its child D.

Step 2: At node D, the value of α will be calculated, as it is Max's turn. The value of α is compared first with 2 and then with 3, so max(2, 3) = 3 becomes the value of α at node D, and the node value will also be 3.

Step 3: Now algorithm backtrack to node B, where the value of β will change as this is a
turn of Min, Now β= +∞, will compare with the available subsequent nodes value, i.e.
min (∞, 3) = 3, hence at node B now α= -∞, and β= 3.


In the next step, algorithm traverse the next successor of Node B which is node E, and
the values of α= -∞, and β= 3 will also be passed.

Step 4: At node E, Max will take its turn, and the value of alpha will change. The current
value of alpha will be compared with 5, so max (-∞, 5) = 5, hence at node E α= 5 and β=
3, where α>=β, so the right successor of E will be pruned, and algorithm will not
traverse it, and the value at node E will be 5.

Step 5: At the next step, the algorithm again backtracks the tree, from node B to node A. At node A, the value of alpha is changed to the maximum available value, 3, as max(-∞, 3) = 3, and β = +∞. These two values are now passed on to the right successor of A, which is node C.

At node C, α=3 and β= +∞, and the same values will be passed on to node F.


Step 6: At node F, again the value of α will be compared with left child which is 0, and
max(3,0)= 3, and then compared with right child which is 1, and max(3,1)= 3 still α
remains 3, but the node value of F will become 1.

Step 7: Node F returns the node value 1 to node C, at C α= 3 and β= +∞, here the value
of beta will be changed, it will compare with 1 so min (∞, 1) = 1. Now at C, α=3 and β= 1,
and again it satisfies the condition α>=β, so the next child of C which is G will be pruned,
and the algorithm will not compute the entire sub-tree G.

Step 8: C now returns the value 1 to A, and the best value for A is max(3, 1) = 3. The final game tree shows the nodes which were computed and the nodes which were never computed (pruned). Hence the optimal value for the maximizer is 3 for this example.
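
A Python sketch of minimax with alpha-beta pruning is shown below, run on the tree of the walkthrough above. The values written for the pruned leaves (9, 7, 8) are arbitrary placeholders: pruning means they are never examined, so they do not affect the result.

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Minimax with alpha-beta pruning on a nested-list game tree (leaves are utilities)."""
    if not isinstance(node, list):
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)         # only the MAX player updates alpha
            if alpha >= beta:                # remaining children cannot change the decision
                break
        return best
    else:
        best = float("inf")
        for child in node:
            best = min(best, alphabeta(child, True, alpha, beta))
            beta = min(beta, best)           # only the MIN player updates beta
            if alpha >= beta:
                break
        return best

# Tree of the worked example: A (MAX) -> B, C (MIN) -> D, E, F, G (MAX) -> leaves.
tree = [[[2, 3], [5, 9]],        # B = min(D, E); E's second leaf is pruned
        [[0, 1], [7, 8]]]        # C = min(F, G); the whole of G is pruned
print(alphabeta(tree, maximizing=True))   # -> 3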

TOPIC 9: CONSTRAINT SATISFACTION PROBLEMS (CSP)

• What is a CSP?
– Finite set of variables X1, X2, …, Xn
– Nonempty domain of possible values for each variable
D1, D2, …, Dn
– Finite set of constraints C1, C2, …, Cm
• Each constraint Ci limits the values that variables can take,
• e.g., X1 ≠ X2
– Each constraint Ci is a pair <scope, relation>
• Scope = Tuple of variables that participate in the constraint.
• Relation = List of allowed combinations of variable values.

May be an explicit list of allowed combinations.

May be an abstract relation allowing membership testing and listing.

A constraint satisfaction problem is therefore defined by three components:

o X: a set of variables, {X1, ..., Xn}.

o D: a set of domains, {D1, ..., Dn}, one for each variable. Every variable has its own domain.

o C: a set of constraints that the assignments to the variables must satisfy.

Each domain Di is the set of allowable values for the variable Xi. Each constraint Ci is a pair <scope, rel>, where scope is a tuple of variables that participate in the constraint and rel is a relation that lists the combinations of values those variables are allowed to take in order to satisfy the constraint.

States and assignments

A constraint satisfaction problem (CSP) requires:

o a state space, and

o the notion of a solution.

A state in a CSP is defined by an assignment of values to some or all of the variables, such as

X1 = v1, X2 = v2, etc.

There are three ways to classify an assignment of values to variables:

1. Consistent or Legal Assignment: an assignment is called consistent or legal if it does not violate any constraint.

2. Complete Assignment: an assignment in which every variable has a value assigned to it. A complete and consistent assignment is a solution of the CSP.

3. Partial Assignment: an assignment that gives values to only some of the variables. Such assignments are called incomplete assignments.

Types of Constraints in CSP

Basically, there are three different categories of constraints on the variables:

o Unary constraints: the simplest kind of constraint, since they restrict the value of a single variable.

o Binary constraints: constraints that relate two variables; for example, the value of a variable X2 must lie between the values of X1 and X3.

o Global constraints: constraints that involve an arbitrary number of variables.

Map Coloring Problem

• Variables: WA, NT, Q, NSW, V, SA, T

• Domains: Di = {red, green, blue}

• Constraints: adjacent regions must have different colors,

• e.g. WA ≠ NT

• Solutions are assignments satisfying all constraints, e.g.

{WA=red, NT=green, Q=red, NSW=green, V=red, SA=blue, T=green} (a backtracking sketch for this CSP is given after the constraint-graph bullets below)

• Constraint graph:

• nodes are variables

• arcs are binary constraints
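
A small Python backtracking sketch for this map-colouring CSP follows. The adjacency table encodes the constraint graph described above; the variable ordering and colour ordering are naive (no MRV or least-constraining-value heuristics), so the solution printed may differ from the sample assignment above while still satisfying every constraint.

VARIABLES = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
DOMAIN = ["red", "green", "blue"]
NEIGHBOURS = {                       # arcs of the constraint graph (binary constraints)
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q":  ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"],
    "V":  ["SA", "NSW"],
    "T":  [],
}

def consistent(var, colour, assignment):
    """A value is consistent if no already-assigned neighbour has the same colour."""
    return all(assignment.get(n) != colour for n in NEIGHBOURS[var])

def backtrack(assignment):
    if len(assignment) == len(VARIABLES):                      # goal test: complete assignment
        return assignment
    var = next(v for v in VARIABLES if v not in assignment)    # pick an unassigned variable
    for colour in DOMAIN:
        if consistent(var, colour, assignment):
            result = backtrack({**assignment, var: colour})
            if result is not None:
                return result
    return None                                                # failure: backtrack in the caller

print(backtrack({}))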

Variety of constraints


• Unary constraints involve a single variable.

– e.g. SA ≠ green

• Binary constraints involve pairs of variables.

– e.g. SA ≠ WA

• Higher-order constraints involve 3 or more variables.

– e.g. Professors A, B, and C cannot be on a committee together

– Can always be represented by multiple binary constraints

• Preferences (soft constraints)

– e.g. red is better than green; often represented by a cost for each variable assignment

– combination of optimization with CSPs

• A CSP can easily be expressed as a standard search problem.

• Incremental formulation

– Initial State: the empty assignment {}

– Successor function: Assign a value to an unassigned variable provided that


it does not violate a constraint

– Goal test: the current assignment is complete

(by construction it is consistent)


– Path cost: constant cost for every step (not really relevant)

CONSTRAINT SATISFACTION PROBLEM - WATER JUG PROBLEM

Problem: You are given two jugs, a 4-gallon one and a 3-gallon one. Neither has any measuring mark on it. There is a pump that can be used to fill the jugs with water. How can you get exactly 2 gallons of water into the 4-gallon jug?

Solution:
The state space for this problem can be described as the set of ordered pairs of
integers (x,y)
Where,
X represents the quantity of water in the 4-gallon jug X= 0,1,2,3,4
Y represents the quantity of water in 3-gallon jug Y=0,1,2,3
Start State: (0,0)
Goal State: (2,0)
Generate production rules for the water jug problem

Production Rules:

Rule 1: (X,Y | X<4)          → (4,Y)          {Fill 4-gallon jug}
Rule 2: (X,Y | Y<3)          → (X,3)          {Fill 3-gallon jug}
Rule 3: (X,Y | X>0)          → (0,Y)          {Empty 4-gallon jug}
Rule 4: (X,Y | Y>0)          → (X,0)          {Empty 3-gallon jug}
Rule 5: (X,Y | X+Y>=4 ^ Y>0) → (4, Y-(4-X))   {Pour water from 3-gallon jug into 4-gallon jug until 4-gallon jug is full}
Rule 6: (X,Y | X+Y>=3 ^ X>0) → (X-(3-Y), 3)   {Pour water from 4-gallon jug into 3-gallon jug until 3-gallon jug is full}
Rule 7: (X,Y | X+Y<=4 ^ Y>0) → (X+Y, 0)       {Pour all water from 3-gallon jug into 4-gallon jug}
Rule 8: (X,Y | X+Y<=3 ^ X>0) → (0, X+Y)       {Pour all water from 4-gallon jug into 3-gallon jug}
Rule 9: (0,2)                → (2,0)          {Pour 2 gallons of water from 3-gallon jug into 4-gallon jug}

Initialization:
Start State: (0,0)
Apply Rule 2: (X,Y | Y<3) → (X,3)   {Fill 3-gallon jug}
Now the state is (0,3)

Iteration 1:
Current State: (0,3)
Apply Rule 7: (X,Y | X+Y<=4 ^ Y>0) → (X+Y,0)   {Pour all water from 3-gallon jug into 4-gallon jug}
Now the state is (3,0)

Iteration 2:
Current State: (3,0)
Apply Rule 2: (X,Y | Y<3) → (X,3)   {Fill 3-gallon jug}
Now the state is (3,3)

Iteration 3:
Current State: (3,3)
Apply Rule 5: (X,Y | X+Y>=4 ^ Y>0) → (4, Y-(4-X))   {Pour water from 3-gallon jug into 4-gallon jug until 4-gallon jug is full}
Now the state is (4,2)

Iteration 4:
Current State: (4,2)
Apply Rule 3: (X,Y | X>0) → (0,Y)   {Empty 4-gallon jug}
Now the state is (0,2)

Iteration 5:
Current State: (0,2)
Apply Rule 9: (0,2) → (2,0)   {Pour 2 gallons of water from 3-gallon jug into 4-gallon jug}
Now the state is (2,0)

Goal Achieved.
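
The same production rules can be explored mechanically. The Python sketch below runs a breadth-first search over the (x, y) states, where the successor set corresponds to rules 1-8 above (rule 9 is the special case of pouring the 3-gallon jug into the empty 4-gallon jug); the function and variable names are illustrative.

from collections import deque

def water_jug(goal=(2, 0)):
    """BFS over (x, y) states, where x is the 4-gallon jug and y is the 3-gallon jug."""
    def successors(x, y):
        return {
            (4, y), (x, 3),                       # fill either jug
            (0, y), (x, 0),                       # empty either jug
            (min(4, x + y), max(0, x + y - 4)),   # pour 3-gallon jug into 4-gallon jug
            (max(0, x + y - 3), min(3, x + y)),   # pour 4-gallon jug into 3-gallon jug
        }

    frontier = deque([[(0, 0)]])                  # queue of paths, starting from (0, 0)
    visited = {(0, 0)}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for state in successors(*path[-1]):
            if state not in visited:
                visited.add(state)
                frontier.append(path + [state])

print(water_jug())
# e.g. [(0, 0), (0, 3), (3, 0), (3, 3), (4, 2), (0, 2), (2, 0)] -- the trace above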


The Crypt-Arithmetic problem in Artificial Intelligence is a type of encryption problem in which a written message in alphabetical form, which is easily readable and understandable, is converted into a numeric form which is neither easily readable nor understandable. In simpler words, the crypt-arithmetic problem deals with converting a message from readable plain text into non-readable ciphertext. The constraints which this problem follows during the conversion are as follows:

1. A number 0-9 is assigned to a particular alphabet.


2. Each different alphabet has a unique number.
3. All the same, alphabets have the same numbers.
4. The numbers should satisfy all the operations that any normal number does.

Let us take an example of the message: SEND MORE MONEY.

Here, to convert it into numeric form, we first split each word separately and represent
it as follows:

SEND
MORE
-------------
MONEY

These alphabets then are replaced by numbers such that all the constraints are satisfied.
So initially we have all blank spaces.


We first look for the MSB in the last word, which is 'M' in the word 'MONEY' here. It is the letter generated by a carry, and a carry can only be 1. So, we have M=1.

Now, we have S+M=O in the second column from the left side. Here M=1. Therefore, we
have, S+1=O. So, we need a number for S such that it generates a carry when added
with 1. And such a number is 9. Therefore, we have S=9 and O=0.

Now, in the next column from the same side we have E+O=N. Here O=0, which means E+0=N, i.e., E=N, which is not possible. This means a carry was generated by the lower place digits. So we have:

1+E=N ----------(i)

Next alphabets that we have are N+R=E -------(ii)

So, for satisfying both equations (i) and (ii), we get E=5 and N=6.

Now, R should be 9, but 9 is already assigned to S, So, R=8 and we have 1 as a carry
which is generated from the lower place digits.

Now, we have D+5=Y and this should generate a carry. Therefore, D should be greater
than 4. As 5, 6, 8 and 9 are already assigned, we have D=7 and therefore Y=2.

Therefore, the solution to the given Crypt-Arithmetic problem is:

S=9; E=5; N=6; D=7; M=1; O=0; R=8; Y=2

Which can be shown in layout form as:

9 5 6 7
1 0 8 5
-------------
1 0 6 5 2
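
The same assignment can be found mechanically by treating the puzzle as a constraint satisfaction problem and brute-forcing the digit assignment, as in the Python sketch below; the all-different constraint is enforced by using permutations of the digits 0-9 (the search may take a few seconds).

from itertools import permutations

def solve_send_more_money():
    """Find digits for SEND + MORE = MONEY with S and M non-zero."""
    letters = "SENDMORY"                          # the 8 distinct letters of the puzzle
    for digits in permutations(range(10), len(letters)):
        a = dict(zip(letters, digits))
        if a["S"] == 0 or a["M"] == 0:            # leading letters cannot be zero
            continue
        send  = 1000*a["S"] + 100*a["E"] + 10*a["N"] + a["D"]
        more  = 1000*a["M"] + 100*a["O"] + 10*a["R"] + a["E"]
        money = 10000*a["M"] + 1000*a["O"] + 100*a["N"] + 10*a["E"] + a["Y"]
        if send + more == money:
            return a

print(solve_send_more_money())
# {'S': 9, 'E': 5, 'N': 6, 'D': 7, 'M': 1, 'O': 0, 'R': 8, 'Y': 2}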


Missionaries and Cannibals Problem

On one bank of a river are three missionaries and three cannibals. There is one boat
available that can hold up to two people and that they would like to use to cross the river.
If the cannibals ever outnumber the missionaries on either of the river’s banks, the
missionaries will get eaten.

How can the boat be used to safely carry all the missionaries and cannibals across the
river?

First let us consider that both the missionaries (M) and cannibals(C) are on the same
side of the river.

In the following, each state is written as the left bank followed by the right bank. Initially the positions are : 0M , 0C and 3M , 3C (B)

Now let’s send 2 Cannibals to left of bank : 0M , 2C (B) and 3M , 1C

Send one cannibal from left to right : 0M , 1C and 3M , 2C (B)

Now send the 2 remaining Cannibals to left : 0M , 3C (B) and 3M , 0C


Send 1 cannibal to the right : 0M , 2C and 3M , 1C (B)

Now send 2 missionaries to the left : 2M , 2C (B) and 1M . 1C

Send 1 missionary and 1 cannibal to right : 1M , 1C and 2M , 2C (B)

Send 2 missionaries to left : 3M , 1C (B) and 0M , 2C

Send 1 cannibal to right : 3M , 0C and 0M , 3C (B)

Send 2 cannibals to left : 3M , 2C (B) and 0M , 1C

Send 1 cannibal to right : 3M , 1C and 0M , 2C (B)

Send 2 cannibals to left : 3M , 3C (B) and 0M , 0C

Here (B) shows the position of the boat after the action is performed. Therefore all the
missionaries and cannibals have crossed the river safely.
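
The same solution can be generated by a breadth-first search over states (m, c, b), where m and c are the missionaries and cannibals on the starting bank and b indicates whether the boat is there. The Python encoding below is one possible formulation, not the only one.

from collections import deque

def missionaries_and_cannibals():
    """BFS from (3, 3, boat) to (0, 0, no boat) without ever letting
    cannibals outnumber missionaries on either bank."""
    def safe(m, c):
        return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

    moves = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]   # the boat carries one or two people
    start, goal = (3, 3, 1), (0, 0, 0)
    frontier, visited = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        m, c, b = path[-1]
        if (m, c, b) == goal:
            return path
        for dm, dc in moves:
            # b == 1 means the boat is on the starting bank, so people leave it
            nm, nc = (m - dm, c - dc) if b else (m + dm, c + dc)
            state = (nm, nc, 1 - b)
            if 0 <= nm <= 3 and 0 <= nc <= 3 and safe(nm, nc) and state not in visited:
                visited.add(state)
                frontier.append(path + [state])

print(missionaries_and_cannibals())   # a shortest solution: 11 crossings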


UNIT- II

PROBABILISTIC REASONING

S.NO   TOPICS                                                              PAGE NO.

8      Acting under uncertainty                                            2-5

9      Bayesian inference, Naïve Bayes models                              5-11

10     Probabilistic reasoning, Bayesian networks, exact inference in BN   11-18

11     Approximate inference in BN                                         18-25

12     Causal networks                                                     25-29

TOPIC 8 : ACTING UNDER UNCERTAINTY

Agents in the real world need to handle uncertainty, whether due to partial
observability, nondeterminism, or adversaries. An agent may never know for sure what
state it is in now or where it will end up after a sequence of actions.

• In practice, programs have to act under uncertainty:

– using a simple but incorrect theory of the world, which does not take into
account uncertainty and will work most of the time

– handling uncertain knowledge and utility (tradeoff between accuracy and


usefulness) in a rational way

• The right thing to do (the rational decision) depends on:

– the relative importance of various goals

– the likelihood that, and the degree to which, they will be achieved

Handling Uncertain Knowledge

• Example of rule for dental diagnosis using first-order logic:

∀p Symptom(p, Toothache) ⇒ Disease(p, Cavity)

• This rule is wrong, and in order to make it true we have to add an almost unlimited list of possible causes:

∀p Symptom(p, Toothache) ⇒ Disease(p, Cavity) ∨ Disease(p, GumDisease) ∨ Disease(p, Abscess) …

• Trying to use first-order logic to cope with a domain like medical diagnosis fails
for three main reasons:

• Laziness. It is too much work to list the complete set of


antecedents or consequents needed to ensure an exceptionless
rule and too hard to use such rules.

• Theoretical ignorance. Medical science has no complete theory for


the domain.

• Practical ignorance. Even if we know all the rules, we might be


uncertain about a particular patient because not all the necessary
tests have been or can be run.

• Actually, the connection between toothaches and cavities is just not a logical
consequence in any direction.


• In judgmental domains (medical, law, design...) the agent’s knowledge can at best
provide a degree of belief in the relevant sentences.

• The main tool for dealing with degrees of belief is probability theory, which
assigns to each sentence a numerical degree of belief between 0 and 1.

• The belief could be derived from:

• statistical data

• 80% of the toothache patients have had cavities

• some general rules

• some combination of evidence sources

• Assigning a probability of 0 to a given sentence corresponds to an unequivocal


belief that the sentence is false.

Assigning a probability of 1 corresponds to an unequivocal belief that the


sentence is true

• A degree of belief is different from a degree of truth.

• A probability of 0.8 does not mean “80% true”, but rather an 80% degree of
belief that something is true.

• In logic, a sentence such as “The patient has a cavity” is true or false.

• In probability theory, a sentence such as “The probability that the patient has a
cavity is 0.8” is about the agent’s belief, not directly about the world.

• These beliefs depend on the percepts that the agent has received to date.

• These percepts constitute the evidence on which probability assertions are


based

• For example:

• An agent draws a card from a shuffled pack.

• Before looking at the card, the agent might assign a probability of 1/52 to
its being the ace of spades.

• After looking at the card, an appropriate probability for the same


proposition would be 0 or 1.

Following are some leading causes of uncertainty to occur in the real world.

• Information occurred from unreliable sources.


• Experimental Errors

• Equipment fault

• Temperature variation

• Climate change.

Probabilistic Reasoning

Probabilistic reasoning is a way of knowledge representation where we apply the


concept of probability to indicate the uncertainty in knowledge.

In probabilistic reasoning, we combine probability theory with logic to handle the


uncertainty

Need of probabilistic reasoning in AI:

• When there are unpredictable outcomes.

• When specifications or possibilities of predicates becomes too large to handle.

• When an unknown error occurs during an experiment.

In probabilistic reasoning, there are two ways to solve problems with uncertain
knowledge:

• Bayes' rule

• Bayesian Statistics

Probability

0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.

P(A) = 0, indicates total uncertainty in an event A.

P(A) =1, indicates total certainty in an event A.

Probabilistic Reasoning Terminologies

• Sample space: The collection of all possible events is called sample space.

• Random variables: Random variables are used to represent the events and
objects in the real world.

• Prior probability: The prior probability of an event is probability computed


before observing new information.


• Posterior Probability: The probability that is calculated after all evidence or


information has taken into account. It is a combination of prior probability and
new information.

• Conditional probability is the probability of an event occurring when another event has already happened.

• Suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A under the conditions of B". It can be written as:

P(A|B) = P(A⋀B) / P(B)

Where P(A⋀B) = Joint probability of A and B

P(B) = Marginal probability of B.

• We can find the probability of the complement of an event by using the formula below:

• P(¬A) = 1 − P(A), where P(¬A) is the probability of event A not happening.

• Equivalently, P(¬A) + P(A) = 1.

Example

• In a class, there are 70% of the students who like English and 40% of the
students who likes English and mathematics, and then what is the percent of
students those who like English also like mathematics?

Solution:

• Let A be the event that a student likes Mathematics and B be the event that a student likes English. Then P(B) = 0.7 and P(A⋀B) = 0.4.

• P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 ≈ 0.57

• Hence, about 57% of the students who like English also like Mathematics.

TOPIC 9: Bayesian inference, Naïve Bayes models.

Baye’s Theorem

Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.


In probability theory, it relates the conditional probability and marginal probabilities of


two random events.

It is a way to calculate the value of P(B|A) with the knowledge of P(A|B)

Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.

Example: If cancer corresponds to one's age then by using Bayes' theorem, we can
determine the probability of cancer more accurately with the help of age.

Bayes' theorem can be derived using the product rule and the conditional probability of event A with known event B:

As per the product rule, P(A⋀B) = P(A|B) P(B), and similarly P(A⋀B) = P(B|A) P(A). Equating the two expressions gives Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)

This equation is the basis of most modern AI systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here,

P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypothesis A given that evidence B has been observed.

P(B|A) is called the likelihood: assuming the hypothesis is true, it is the probability of observing the evidence.

P(A) is called the prior probability, the probability of the hypothesis before considering the evidence.


P(B) is called marginal probability, pure probability of an evidence.

Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is very useful in cases where we have a good estimate of these three terms and want to determine the fourth one. Suppose we want to perceive the effect of some unknown cause and want to compute that cause; then Bayes' rule becomes:

P(cause|effect) = P(effect|cause) P(cause) / P(effect)

Question: what is the probability that a patient has diseases meningitis with a stiff
neck?

Given Data:

A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it
occurs 80% of the time. He is also aware of some more facts, which are given as follows:

The Known probability that a patient has meningitis disease is 1/30,000.

The Known probability that a patient has a stiff neck is 2%.

Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has meningitis. Then we can calculate the following:

P(a|b) = 0.8

P(b) = 1/30000

P(a) = 0.02

Applying Bayes' rule: P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.0013

Hence, we can assume that roughly 1 patient out of 750 patients with a stiff neck has meningitis.
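
The calculation above is a single application of Bayes' rule, for example:

# Bayes' rule applied to the stiff-neck example above.
p_a_given_b = 0.8            # P(stiff neck | meningitis)
p_b = 1 / 30000              # P(meningitis)
p_a = 0.02                   # P(stiff neck)

p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)           # ~0.00133, i.e. about 1 patient in 750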

Application of Bayes' theorem in Artificial intelligence:

Following are some applications of Bayes' theorem:

o It is used to calculate the next step of the robot when the already executed step is
given.

o Bayes' theorem is helpful in weather forecasting.

o It can solve the Monty Hall problem.


TOPIC 10: Probabilistic reasoning, Bayesian networks, exact inference in BN

Bayesian belief network is key computer technology for dealing with probabilistic
events and to solve a problem which has uncertainty. We can define a Bayesian network
as:

"A Bayesian network is a probabilistic graphical model which represents a set of


variables and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model.

Bayesian networks are probabilistic, because these networks are built from
a probability distribution, and also use probability theory for prediction and anomaly
detection.

Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.

Bayesian Network can be used for building models from data and experts opinions, and
it consists of two parts:

o Directed Acyclic Graph

o Table of conditional probabilities.

The generalized form of Bayesian network that represents and solve decision problems
under uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:

o Each node corresponds to the random variables, and a variable can


be continuous or discrete.


o Arc or directed arrows represent the causal relationship or conditional


probabilities between random variables. These directed links or arrows connect
the pair of nodes in the graph.
These links represent that one node directly influences the other node; if there is no directed link, it means that the nodes are independent of each other

o In the above diagram, A, B, C, and D are random variables represented by


the nodes of the network graph.

o If we are considering node B, which is connected with node A by a


directed arrow, then node A is called the parent of Node B.

o Node C is independent of node A.

The Bayesian network has mainly two components:

o Causal Component

o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.

Bayesian network is based on Joint probability distribution and conditional probability.


So let's first understand the joint probability distribution:

Joint probability distribution:

If we have variables x1, x2, x3,....., xn, then the probabilities of a different combination of
x1, x2, x3.. xn, are known as Joint probability distribution.

P[x1, x2, x3,....., xn], it can be written as the following way in terms of the joint probability
distribution.

= P[x1| x2, x3,....., xn]P[x2, x3,....., xn]

= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].

In general for each variable Xi, we can write the equation as:

P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably responds at detecting a burglary but also responds for minor earthquakes.
Harry has two neighbors David and Sophia, who have taken a responsibility to inform
Harry at work when they hear the alarm. David always calls Harry when he hears the
alarm, but sometimes he got confused with the phone ringing and calls at that time too.
On the other hand, Sophia likes to listen to high music, so sometimes she misses to hear
the alarm. Here we would like to compute the probability of Burglary Alarm.


Problem:

Calculate the probability that alarm has sounded, but there is neither a burglary, nor an
earthquake occurred, and David and Sophia both called the Harry.

Solution:

o The Bayesian network for the above problem is given below. The network
structure is showing that burglary and earthquake is the parent node of the
alarm and directly affecting the probability of alarm's going off, but David and
Sophia's calls depend on alarm probability.

o The network represents that David and Sophia do not directly perceive the burglary and do not notice the minor earthquake, and they do not confer before calling.

o The conditional distributions for each node are given as conditional probabilities
table or CPT.

o Each row in the CPT must sum to 1 because all the entries in the row represent an exhaustive set of cases for the variable.

o In a CPT, a boolean variable with k boolean parents contains 2^k probabilities. Hence, if there are two parents, then the CPT will contain 4 probability values

List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of problem statement in the form of probability: P[D, S, A, B, E],
can rewrite the above probability statement using joint probability distribution:


P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]

=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]

= P [D| A]. P [ S| A, B, E]. P[ A, B, E]

= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]

= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]

Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True)= 0.001, which is the probability of a minor earthquake

P(E= False)= 0.999, Which is the probability that an earthquake not occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The Conditional probability of Alarm A depends on Burglar and earthquake:


B E P(A= True) P(A= False)

True True 0.94 0.06

True False 0.95 0.05

False True 0.31 0.69

False False 0.001 0.999

Conditional probability table for David Calls:

The Conditional probability of David that he will call depends on the probability
of Alarm.

A P(D= True) P(D= False)

True 0.91 0.09

False 0.05 0.95

Conditional probability table for Sophia Calls:

The Conditional probability of Sophia that she calls is depending on its Parent
Node "Alarm."

A P(S= True) P(S= False)

True 0.75 0.25

False 0.02 0.98

From the formula of joint distribution, we can write the problem statement in the form
of probability distribution:

P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).


= 0.75* 0.91* 0.001* 0.998*0.999

= 0.00068045.

Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
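
The Python sketch below reproduces this calculation directly from the conditional probability tables listed above; the dictionary-based representation of the CPTs is just one convenient encoding.

# CPTs of the burglary network, indexed by the values of the parent variables.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}    # P(A=True | B, E)
P_D = {True: 0.91, False: 0.05}                        # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                        # P(S=True | A)

def joint(d, s, a, b, e):
    """P(D=d, S=s, A=a, B=b, E=e) from the chain-rule factorisation of the network."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

print(joint(d=True, s=True, a=True, b=False, e=False))   # ~0.00068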

The semantics of Bayesian Network:

There are two ways to understand the semantics of the Bayesian network, which is
given below:

1. To understand the network as the representation of the Joint probability distribution.

It is helpful to understand how to construct the network.

2. To understand the network as an encoding of a collection of conditional


independence statements.

It is helpful in designing inference procedure.

Exact inference by enumeration
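
As a minimal illustration of inference by enumeration, the sketch below answers the query P(Burglary | David calls, Sophia calls) for the burglary network by summing the joint distribution over the hidden variables Alarm and Earthquake and then normalising. It assumes the joint() function and CPT dictionaries from the previous sketch are in scope.

from itertools import product

def enumeration_ask_burglary(d=True, s=True):
    """P(Burglary | D=d, S=s) by enumeration: sum out A and E, then normalise."""
    unnormalised = {}
    for b in (True, False):
        unnormalised[b] = sum(joint(d, s, a, b, e)
                              for a, e in product((True, False), repeat=2))
    alpha = 1 / sum(unnormalised.values())                 # normalisation constant
    return {b: alpha * p for b, p in unnormalised.items()}

print(enumeration_ask_burglary())   # posterior distribution over Burglary given both calls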


Exact inference by variable elimination


TOPIC 11: Approximate inference in BN

Approximate inference by stochastic simulation


Approximate inference by Markov chain Monte Carlo


TOPIC 12: Causal networks.

Causal AI refers to the use of AI to make decisions and predictions based on


cause-and-effect relationships rather than just correlational relationships.

Causal reasoning is the process of understanding the relationships between causes and
effects. It is the way that we, as humans, make sense of the world around us and draw
conclusions based on our observations. In a similar vein, causal AI uses algorithms and
models to identify and analyse causal relationships in data, allowing it to make
predictions and decisions based on these relationships.


To illustrate, imagine a simplistic world where educational outcomes Y are related to


school expenditures X as well as to the parents’ involvement in their children’s
education.

Suppose now that you have some data on educational outcomes Y, school expenditures
X, and parent involvement C. The unit of observation is, say, a school district. The
educational outcome data might come from standardized testing. Parent involvement
might be the records of what fraction of parents attend their student’s quarterly teacher
conferences. You, the modeler, work for the national government. You’ve been asked to
figure out what will be the effect on educational outcomes of an intervention where the
national government will give additional funding to schools.

For the purpose of anticipating the impact of a change in X on Y, either of two models
might be appropriate: either Y ~ X or Y ~ X + C.


The above graph illustrates another simple yet typical Bayesian network. In contrast to the statistical relationships in the non-causal example, this graph describes the causal relationships among the seasons of the year (X1), whether it is raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5).

Here, the absence of a direct link between X1 and X5, for example, captures our
understanding that there is no direct influence of season on slipperiness. The influence
is mediated by the wetness of the pavement (if freezing were a possibility, a direct link
could be added).

Perhaps the most important aspect of Bayesian networks is that they are direct
representations of the world, not of reasoning processes.

The arrows in the diagram represent real causal connections and not the flow of
information during reasoning (as in rule-based systems and neural networks).
Reasoning processes can operate on Bayesian networks by propagating information in
any direction.

For example, if the sprinkler is on, then the pavement is probably wet (prediction,
simulation). If someone slips on the pavement, that will also provide evidence that it is
wet (abduction, reasoning to a probable cause, or diagnosis).

On the other hand, if we see that the pavement is wet, that will make it more likely that
the sprinkler is on or that it is raining (abduction); but if we then observe that the
sprinkler is on, that will reduce the likelihood that it is raining (explaining away).

It is the latter form of reasoning, explaining away, that is especially difficult to model in
rule-based systems and neural networks in a natural way because it seems to require
the propagation of information in two directions.


Causal Reasoning

Most probabilistic models, including general Bayesian networks, describe a Joint


Probability Distribution (JPD) over possible observed events but say nothing about
what will happen if a certain intervention occurs.

For example, what if I turn the Sprinkler on instead of just observing that it is turned
on? What effect does that have on the Season, or on the connection between Wet and
Slippery?

A causal network, intuitively speaking, is a Bayesian network with the added property
that the parents of each node are its direct causes.

In such a network, the result of an intervention is obvious: the Sprinkler node is set to X3 = on and the causal link between the Season X1 and the Sprinkler X3 is removed. All other causal links and conditional probabilities remain intact, so the new model is:

P(x1, x2, x4, x5) = P(x1) P(x2 | x1) P(x4 | x2, X3 = on) P(x5 | x4)

This differs from observing that X3=on, which would result in a new model
that included the term P(X3=on|x1). This mirrors the difference between
seeing and doing: after observing that the Sprinkler is on, we wish to infer
that the Season is dry, that it probably did not rain, and so on. An arbitrary
decision to turn on the Sprinkler should not result in any such beliefs.

Causal networks are more properly defined, then, as Bayesian networks in which the
correct probability model—after intervening to fix any node’s value—is given simply by
deleting links from the node’s parents. For example, Fire → Smoke is a causal network,
whereas Smoke → Fire is not, even though both networks are equally capable of
representing any Joint Probability Distribution (JPD) of the two variables.

Causal networks model the environment as a collection of stable component


mechanisms. These mechanisms may be reconfigured locally by interventions, with


corresponding local changes in the model. This, in turn, allows causal networks to be
used very naturally for prediction by an agent that is considering various courses of
action.

Learning Bayesian Network Parameters

Given a qualitative Bayesian network structure, the conditional probability


tables, P(xi|pai), are typically estimated with the maximum-likelihood approach from
the observed frequencies in the dataset associated with the network.

In pure Bayesian approaches, Bayesian networks are designed from expert knowledge
and include hyperparameter nodes. Data (usually scarce) is used as pieces of evidence
for incrementally updating the distributions of the hyperparameters


UNIT- III

SUPERVISED LEARNING

S.NO   TOPICS

12     Introduction to machine learning

13     Linear Regression Models: Least squares, single & multiple variables

14     Bayesian linear regression, gradient descent, Linear Classification Models: Discriminant function

15     Probabilistic discriminative model - Logistic regression

16     Probabilistic generative model - Naive Bayes

17     Maximum margin classifier - Support vector machine

18     Decision Tree, Random forests

TOPIC12. INTRODUCTION TO MACHINE LEARNING

Machine Learning is the field of study that gives computers the capability to learn without being
explicitly programmed.

A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E. (Mitchell 1997)

This means: Given a task T, a performance measure P, and some experience E with the task, the goal is to generalize the experience in a way that allows the program to improve its performance on the task.

In traditional programs, a developer designs logic or algorithms to solve a problem. The


program applies this logic to input and computes the output.

But in Machine Learning, a model is built from the data, and that model is the logic. ML
programs have two distinct phases:

 Training: Input and the expected output are used to train and test various models, and
select the most suitable model.

 Inference: The model is applied to the input to compute results. These results are
wrong sometimes. A mechanism is built into the application to gather user feedback on
such occasions.

This feedback is added to the training data, and this is how a model learns.


Let’s take the problem of detecting email spam and compare both methods.

Traditional programs detect spam by checking an email against a fixed set of heuristic rules. For
example:

 Does the email contain FREE, weight loss, or lottery several times?

 Did it come from known spammer domain/IP addresses?

As spammers change tactics, developers need to continuously update these rules.

In Machine Learning Solutions, an engineer will:

 Prepare a data set: a large number of emails labeled manually as spam or not-spam.

 Train, test, and tune models, and select the best.

 During inference, apply the model to decide whether to keep an email in the inbox or in
the spam folder.

 If the user moves an email from inbox to spam or vice versa, add this feedback to the
training data.

 Retrain the model to be up-to-date with the spam trends.

As you can notice traditional programs are deterministic, but ML programs are probabilistic.
Both make mistakes. But the traditional program will require constant manual effort in
updating the rules, while the ML program will learn from new data when retrained.

Applications of ML

1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. The popular use case of image recognition
and face detection is, Automatic friend tagging suggestion:

Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo
with our Facebook friends, then we automatically get a tagging suggestion with name, and the
technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in the picture.

2. Speech Recognition

While using Google, we get an option of "Search by voice," it comes under speech recognition,
and it's a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also known as
"Speech to text", or "Computer speech recognition." At present, machine learning algorithms


are widely used by various applications of speech recognition. Google assistant, Siri, Cortana,
and Alexa are using speech recognition technology to follow the voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two ways:

o Real-time location of the vehicle from the Google Maps app and sensors

o Average time has taken on past days at the same time.

Everyone who is using Google Map is helping this app to make it better. It takes information
from the user and sends back to its database to improve the performance.

4. Product recommendations:

Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for
some product on Amazon, then we started getting an advertisement for the same product while
internet surfing on the same browser and this is because of machine learning.

Google understands the user interest using various machine learning algorithms and suggests
the product as per customer interest.

As similar, when we use Netflix, we find some recommendations for entertainment series,
movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars. Machine learning
plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company
is working on self-driving car. It is using unsupervised learning method to train the car models
to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, and spam.
We always receive an important mail in our inbox with the important symbol and spam emails
in our spam box, and the technology behind this is Machine learning. Below are some spam
filters used by Gmail:

o Content Filter

o Header filter

o General blacklists filter

o Rules-based filters


o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google assistant, Alexa, Cortana, Siri. As
the name suggests, they help us in finding the information using our voice instruction. These
assistants can help us in various ways just by our voice instructions such as Play music, call
someone, Open an email, Scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part.

8. Online Fraud Detection:

Machine learning is making our online transaction safe and secure by detecting fraud
transaction. Whenever we perform some online transaction, there may be various ways that a
fraudulent transaction can take place such as fake accounts, fake ids, and steal money in the
middle of a transaction. So to detect this, Feed Forward Neural network helps us by checking
whether it is a genuine transaction or a fraud transaction.

For each genuine transaction, the output is converted into some hash values, and these values
become the input for the next round. For each genuine transaction, there is a specific pattern
which gets change for the fraud transaction hence, it detects it and makes our online
transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a
risk of up and downs in shares, so for this machine learning's long short term memory neural
network is used for the prediction of stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.

Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment


1. Gathering Data:

Data Gathering is the first step of the machine learning life cycle. The goal of this step is to
identify and collect all the data related to the problem.

In this step, we need to identify the different data sources, as data can be collected from various
sources such as files, databases, the internet, or mobile devices. It is one of the most important
steps of the life cycle. The quantity and quality of the collected data will determine the efficiency
of the output. The more data we have, the more accurate the prediction will be.

This step includes the below tasks:

o Identify various data sources

o Collect data

o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It will be
used in further steps.

2. Data preparation

After collecting the data, we need to prepare it for further steps. Data preparation is a step
where we put our data into a suitable place and prepare it to use in our machine learning
training.

In this step, first, we put all data together, and then randomize the ordering of data.

This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of data that we have to work with. We need to
understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find
Correlations, general trends, and outliers.

o Data pre-processing:
Now the next step is preprocessing of data for its analysis.

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a useable format. It is the
process of cleaning the data, selecting the variable to use, and transforming the data in a proper
format to make it more suitable for analysis in the next step. It is one of the most important
steps of the complete process. Cleaning of data is required to address the quality issues.

The data we have collected is not always useful as it is, since some of the data may not be
relevant. In real-world applications, collected data may have various issues, including:

o Missing Values
o Duplicate data


o Invalid data
o Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because they can negatively affect the
quality of the outcome.

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step involves:

o Selection of analytical techniques


o Building models
o Review the result
The aim of this step is to build a machine learning model to analyze the data using various
analytical techniques and review the outcome. It starts with determining the type of
problem, where we select machine learning techniques such
as Classification, Regression, Cluster analysis, Association, etc., then build the model using the
prepared data, and evaluate the model.

5. Train Model

The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a model
is required so that it can understand the various patterns, rules, and features.

6. Test Model

Once our machine learning model has been trained on a given dataset, then we test the model.
In this step, we check for the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirement of
project or problem.

7. Deployment

The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.

If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the project,
we will check whether it is improving its performance using available data or not. The
deployment phase is similar to making the final report for a project.


Based on the methods and way of learning, machine learning is divided into mainly four types,
which are:

1. Supervised Machine Learning

2. Unsupervised Machine Learning

3. Semi-Supervised Machine Learning

4. Reinforcement Learning

Supervised machine learning is based on supervision. It means that in the supervised learning
technique, we train the machines using the "labelled" dataset, and based on the training, the
machine predicts the output. Here, the labelled data specifies that some of the inputs are already
mapped to the output. We train the machine with the input and corresponding output, and then
we ask the machine to predict the output using the test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset of cat
and dog images. So, first, we will provide training to the machine to understand the images,
such as the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height
(dogs are taller, cats are smaller), etc. After completion of training, we input a picture of a cat
and ask the machine to identify the object and predict the output. Now, the machine is well
trained, so it will check all the features of the object, such as height, shape, colour, eyes, ears,
tail, etc., and find that it's a cat. So, it will put it in the Cat category. This is the process of how
the machine identifies objects in Supervised Learning.

The main goal of the supervised learning technique is to map the input variable(x) with the
output variable(y). Some real-world applications of supervised learning are Risk Assessment,
Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning

Supervised machine learning can be classified into two types of problems, which are given
below:

o Classification

o Regression

a) Classification

Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The classification
algorithms predict the categories present in the dataset. Some real-world examples of
classification algorithms are Spam Detection, Email filtering, etc.

Some popular classification algorithms are given below:

o Random Forest Algorithm

o Decision Tree Algorithm

o Logistic Regression Algorithm

o Support Vector Machine Algorithm

b) Regression

Regression algorithms are used to solve regression problems in which there is a linear
relationship between input and output variables. These are used to predict continuous output
variables, such as market trends, weather prediction, etc.

Some popular Regression algorithms are given below:


o Simple Linear Regression Algorithm

o Multivariate Regression Algorithm

o Decision Tree Algorithm

o Lasso Regression

Advantages and Disadvantages of Supervised Learning

Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea
about the classes of objects.

o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.

o It may predict the wrong output if the test data is different from the training data.

o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning

Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.

o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. It is
done by using medical images and past data labelled with disease conditions.
With such a process, the machine can identify a disease for new patients.

o Fraud Detection - Supervised Learning classification algorithms are used for identifying
fraud transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.

o Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.

o Speech Recognition - Supervised learning algorithms are also used in speech


recognition. The algorithm is trained with voice data, and various identifications can be
done using the same, such as voice-activated passwords, voice commands, etc.


Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output without
any supervision.

In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.

The main aim of the unsupervised learning algorithm is to group or categorize the unsorted
dataset according to the similarities, patterns, and differences. Machines are instructed to find
the hidden patterns from the input dataset.

Let's take an example to understand it more precisely; suppose there is a basket of fruit
images, and we input it into the machine learning model. The images are totally unknown to the
model, and the task of the machine is to find the patterns and categories of the objects.

So, now the machine will discover its patterns and differences, such as colour difference, shape
difference, and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning


Unsupervised Learning can be further classified into two types, which are given below:

o Clustering

o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the data. It is a
way to group the objects into a cluster such that the objects with the most similarities remain in
one group and have fewer or no similarities with the objects of other groups. An example of the
clustering algorithm is grouping the customers by their purchasing behaviour.

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm


o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
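To make the clustering idea above concrete, here is a small hedged sketch that groups made-up customers by two purchasing features using K-Means (the feature values and the choice of two clusters are assumptions chosen purely for illustration):

# Grouping toy "customers" by purchasing behaviour with K-Means.
# The feature values (annual spend, visits per month) are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [200, 2], [220, 3], [250, 2],      # low spenders
    [900, 10], [950, 12], [1000, 11],  # high spenders
])

# Ask K-Means for two clusters and fit it to the unlabeled data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # the two cluster centres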

2) Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm is to
find the dependency of one data item on another data item and map those variables accordingly
so that it can generate maximum profit. This algorithm is mainly applied in Market Basket
analysis, Web usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:

o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.

o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate as the dataset is not
labelled, and the algorithms are not trained with the exact output in advance.

o Working with unsupervised learning is more difficult as it works with an unlabelled
dataset that does not map to the output.


Applications of Unsupervised Learning

o Network Analysis: Unsupervised learning is used for identifying plagiarism and
copyright issues in document network analysis of text data for scholarly articles.

o Recommendation Systems: Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for different web applications and
e-commerce websites.

o Anomaly Detection: Anomaly detection is a popular application of unsupervised
learning, which can identify unusual data points within the dataset. It is used to discover
fraudulent transactions.

o Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract
particular information from the database. For example, extracting the information of each
user located at a particular location.

3. Semi-Supervised Learning

Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised
and Unsupervised machine learning. It represents the intermediate ground between Supervised
(With Labelled training data) and Unsupervised learning (with no labelled training data)
algorithms and uses the combination of labelled and unlabeled datasets during the training
period.

Although semi-supervised learning is the middle ground between supervised and unsupervised
learning and operates on data that contains a few labels, it mostly consists of unlabeled data.
Labels are costly to obtain, so for practical purposes only a few labels may be available. It
differs from supervised and unsupervised learning, which are based on the presence and
absence of labels respectively.

To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced. The main aim of semi-supervised learning is
to effectively use all the available data, rather than only labelled data like in supervised learning.
Initially, similar data is clustered along with an unsupervised learning algorithm, and further, it
helps to label the unlabeled data into labelled data. It is because labelled data is a comparatively
more expensive acquisition than unlabeled data.

We can imagine these algorithms with an example. Supervised learning is where a student is
under the supervision of an instructor at home and college. Further, if that student is self-
analysing the same concept without any help from the instructor, it comes under unsupervised
learning. Under semi-supervised learning, the student revises the concept on their own after
first analysing it under the guidance of an instructor at college.

Advantages and disadvantages of Semi-supervised Learning

Advantages:

o It is simple and easy to understand the algorithm.

o It is highly efficient.


o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.

Disadvantages:

o Iterations results may not be stable.

o We cannot apply these algorithms to network-level data.

o Accuracy is low.

4. Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a software
component) automatically explores its surroundings by hit and trial, taking actions, learning
from experience, and improving its performance. The agent gets rewarded for each good action
and punished for each bad action; hence the goal of a reinforcement learning agent is to
maximize the rewards.

In reinforcement learning, there is no labelled data like supervised learning, and agents learn
from their experiences only.

The reinforcement learning process is similar to a human being; for example, a child learns
various things by experiences in his day-to-day life. An example of reinforcement learning is to
play a game, where the Game is the environment, moves of an agent at each step define states,
and the goal of the agent is to get a high score. Agent receives feedback in terms of punishment
and rewards.

Due to its way of working, reinforcement learning is employed in different fields such as Game
theory, Operation Research, Information theory, multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In
MDP, the agent constantly interacts with the environment and performs actions; at each action,
the environment responds and generates a new state.

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing
the tendency that the required behaviour would occur again by adding something. It
enhances the strength of the behaviour of the agent and positively impacts it.

o Negative Reinforcement Learning: Negative reinforcement learning works exactly
opposite to positive RL. It increases the tendency that the specific behaviour would
occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning

o Video Games:
RL algorithms are very popular in gaming applications. They are used to attain super-human
performance. Some popular games that use RL algorithms are AlphaGO and AlphaGO
Zero.


o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how to use
RL in computer systems to automatically learn to allocate and schedule resources for
waiting jobs in order to minimize average job slowdown.

o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement
learning. There are different industries that have their vision of building intelligent
robots using AI and Machine learning technology.

o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the
help of Reinforcement Learning by Salesforce company.

Advantages and Disadvantages of Reinforcement Learning

Advantages

o It helps in solving complex real-world problems which are difficult to solve by
general techniques.

o The learning model of RL is similar to the learning of human beings; hence very
accurate results can be found.

o It helps in achieving long-term results.

Disadvantage

o RL algorithms are not preferred for simple problems.

o RL algorithms require huge data and computations.

o Too much reinforcement learning can lead to an overload of states which can weaken
the results.

o The curse of dimensionality limits reinforcement learning for real physical systems.

TOPIC 13. LINEAR REGRESSION MODELS: LEAST SQUARES, SINGLE & MULTIPLE
VARIABLES

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) and one or
more independent (x) variables, hence it is called linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according
to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables, which can be written as:


y= a0+a1x+ ε
Here,

Y= Dependent Variable (Target Variable)


X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.

o Multiple Linear regression:


If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.

Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable increases on
X-axis, then such a relationship is termed as a Positive linear relationship.


o Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases on
the X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, which means the
error between the predicted values and actual values should be minimized. The best fit line will
have the least error.

The different values for the weights or coefficients of the line (a0, a1) give different lines of
regression, so we need to calculate the best values for a0 and a1 to find the best fit line, and to
calculate this we use the cost function.

Cost function-

o The different values for weights or coefficient of lines (a0, a1) gives the different line of
regression, and the cost function is used to estimate the values of the coefficient for the
best fit line.

o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.

o We can use the cost function to find the accuracy of the mapping function, which maps
the input variable to the output variable. This mapping function is also known
as Hypothesis function.


For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average
of the squared errors between the predicted values and the actual values. For the above linear
equation, MSE can be calculated as:

MSE = (1/N) Σ (Yi − (a1xi + a0))²

Where,

N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value

Residuals: The distance between the actual value and the predicted value is called the residual.
If the observed points are far from the regression line, then the residual will be high, and so the
cost function will be high. If the scatter points are close to the regression line, then the residual
will be small and hence the cost function will be small.
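A minimal sketch of these ideas is given below: it computes the least-squares estimates of a0 and a1 in closed form and then evaluates the MSE cost on a tiny made-up dataset (the data values are assumptions for illustration only):

# Least-squares fit of y = a0 + a1*x and the MSE cost, using plain NumPy.
# The x/y values are a small invented dataset for illustration only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Closed-form least-squares estimates of the slope (a1) and intercept (a0).
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

# Mean Squared Error between predicted and actual values.
predictions = a0 + a1 * x
mse = np.mean((y - predictions) ** 2)

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, MSE = {mse:.4f}")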

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.

o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.

o It is done by a random selection of values of coefficient and then iteratively update the
values to reach the minimum cost function.

Model Performance:

The goodness of fit determines how well the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It is commonly
assessed using the R-squared method, which measures the strength of the relationship between
the model and the dependent variable.

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are some formal checks
while building a Linear Regression model, which ensures to get the best possible result from the
given dataset.

o Linear relationship between the features and target:


Linear regression assumes the linear relationship between the dependent and
independent variables.

o Small or no multicollinearity between the features:


Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and the
target variable. Or we can say, it is difficult to determine which predictor variable is
affecting the target variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.

o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.

o Normal distribution of error terms:


Linear regression assumes that the error term should follow the normal distribution
pattern. If error terms are not normally distributed, then confidence intervals will
become either too wide or too narrow, which may cause difficulties in finding
coefficients.
It can be checked using the q-q plot. If the plot shows a straight line without any
deviation, it means the error is normally distributed.

o No autocorrelations:
The linear regression model assumes no autocorrelation in error terms. If there will be
any correlation in the error term, then it will drastically reduce the accuracy of the
model. Autocorrelation usually occurs if there is a dependency between residual errors.

Simple Linear Regression in Machine Learning

Simple Linear Regression is a type of Regression algorithms that models the relationship
between a dependent variable and a single independent variable. The relationship shown by a
Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple
Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous or
categorical values.

Simple Linear regression algorithm has mainly two objectives:

o Model the relationship between the two variables. Such as the relationship between
Income and expenditure, experience and Salary, etc.

o Forecasting new observations, such as weather forecasting according to temperature,
revenue of a company according to the investments in a year, etc.

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y= a0+a1x+ ε
Where,


a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible)

Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and experience
(independent variable). The goals of this problem are listed below, followed by a small code sketch:

o We want to find out if there is any correlation between these two variables

o We will find the best fit line for the dataset.

o How the dependent variable is changing by changing the independent variable.
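As a hedged sketch of this problem, the code below fits a simple linear regression of salary on experience using scikit-learn; the experience and salary figures are invented for illustration, not a real dataset:

# Simple Linear Regression: predicting salary from years of experience.
# The experience and salary values below are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

experience = np.array([[1], [2], [3], [4], [5]])         # independent variable (x)
salary = np.array([30000, 35000, 41000, 45000, 50000])   # dependent variable (y)

model = LinearRegression()
model.fit(experience, salary)

print("Intercept (a0):", model.intercept_)
print("Slope (a1):", model.coef_[0])
print("Predicted salary for 6 years:", model.predict([[6]])[0])

The fitted slope answers the third goal directly: it tells us how much the salary changes, on average, for each additional year of experience.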

Multiple Linear Regression

o In Simple Linear Regression, a single Independent/Predictor (X) variable is used to
model the response variable (Y). But there may be various cases in which the response
variable is affected by more than one predictor variable; for such cases, the Multiple
Linear Regression algorithm is used.
o Moreover, Multiple Linear Regression is an extension of Simple Linear Regression as it
takes more than one predictor variable to predict the response variable. We can define
it as:
o Multiple Linear Regression is one of the important regression algorithms which
models the linear relationship between a single dependent continuous variable and
more than one independent variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

o For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be of continuous or categorical form.

o Each feature variable must model the linear relationship with the dependent variable.

o MLR tries to fit a regression line through a multidimensional space of data-points.

MLR equation:

In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple
predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression,
the same form is applied, and the multiple linear regression equation becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ........................ (a)


Where,

Y= Output/Response variable

b0, b1, b2, ..., bn = Coefficients of the model.

x1, x2, x3, x4, ... = Various independent/feature variables

Assumptions for Multiple Linear Regression:

o A linear relationship should exist between the Target and predictor variables.

o The regression residuals must be normally distributed.

o MLR assumes little or no multicollinearity (correlation between the independent


variable) in data.

Applications of Multiple Linear Regression:

There are mainly two applications of Multiple Linear Regression:

o Effectiveness of Independent variable on prediction:

o Predicting the impact of changes:
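As a hedged sketch of the CO2-emission example mentioned above, the code below fits a Multiple Linear Regression model on a small invented dataset with two predictors, engine size and number of cylinders (all numbers are assumptions for illustration only):

# Multiple Linear Regression: CO2 emission from engine size and cylinder count.
# All values below are invented for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: engine size (litres), number of cylinders.
X = np.array([
    [1.6, 4], [2.0, 4], [2.4, 4],
    [3.0, 6], [3.5, 6], [5.0, 8],
])
co2 = np.array([150, 170, 185, 210, 230, 290])  # grams per km

model = LinearRegression()
model.fit(X, co2)

print("Coefficients (b1, b2):", model.coef_)
print("Intercept (b0):", model.intercept_)
print("Predicted CO2 for a 2.5 L, 4-cylinder engine:", model.predict([[2.5, 4]])[0])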

ML Polynomial Regression

o Polynomial Regression is a regression algorithm that models the relationship between a
dependent (y) and independent variable (x) as an nth-degree polynomial. The Polynomial
Regression equation is given below:

y = b0 + b1x1 + b2x1² + b3x1³ + ... + bnx1ⁿ

o It is also called the special case of Multiple Linear Regression in ML. Because we add
some polynomial terms to the Multiple Linear regression equation to convert it into
Polynomial Regression.

o It is a linear model with some modification in order to increase the accuracy.

o The dataset used in Polynomial regression for training is of non-linear nature.

o It makes use of a linear regression model to fit the complicated and non-linear functions
and datasets.

o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."

Need for Polynomial Regression:

The need of Polynomial Regression in ML can be understood in the below points:


o If we apply a linear model on a linear dataset, then it provides us a good result, as we
have seen in Simple Linear Regression, but if we apply the same model without any
modification on a non-linear dataset, then it will produce a drastically worse output. As
a result, the loss function will increase, the error rate will be high, and accuracy will
decrease.

o So for such cases, where data points are arranged in a non-linear fashion, we need
the Polynomial Regression model. We can understand it in a better way by comparing
a linear dataset with a non-linear dataset.

Equation of the Polynomial Regression Model:

Simple Linear Regression equation:    y = b0 + b1x ....................................... (a)

Multiple Linear Regression equation:  y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn .......... (b)

Polynomial Regression equation:       y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ ........... (c)
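The hedged sketch below illustrates equation (c) in code: the original feature is expanded into polynomial features of degree 2 and then fitted with an ordinary linear model (the roughly quadratic toy data is an assumption for illustration):

# Polynomial Regression: expand features to degree 2, then fit a linear model.
# The toy data follows a roughly quadratic pattern and is invented for illustration.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2, 6, 14, 28, 45, 70])  # non-linear (roughly quadratic) target

# Convert the original feature x into polynomial features [1, x, x^2].
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)

# A plain linear model fitted on the polynomial features.
model = LinearRegression()
model.fit(x_poly, y)

print("Prediction at x = 7:", model.predict(poly.transform([[7]]))[0])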

TOPIC 14. BAYESIAN LINEAR REGRESSION, GRADIENT DESCENT, LINEAR CLASSIFICATION
MODELS: DISCRIMINANT FUNCTION

Bayesian Linear Regression

Introduction To Bayesian Linear Regression

To demonstrate the relationship between two variables, linear regression fits a straight-line
equation to observed data. One variable is considered explanatory (independent), while the
other is considered dependent. For instance, a modeller might want to relate the weights of
individuals to their heights using a linear regression model.

What Is Bayesian Linear Regression?

In Bayesian linear regression, the mean of one variable is characterized by a weighted sum of
other variables. This type of conditional modelling aims to determine the prior distribution of
the regression coefficients (as well as other parameters describing the distribution of the
regressand), and it eventually permits out-of-sample forecasting of the regressand conditional
on observed values of the regressors.

The normal linear model, in which the distribution of Y given X is Gaussian, is the most basic and
popular variant of this model. For this model, the posterior can be determined analytically for a
specific set of prior probabilities for the parameters, known as conjugate priors. With more
arbitrarily chosen priors, the posteriors generally have to be approximated.

When the dataset has too few or poorly dispersed data, Bayesian Regression might be quite
helpful. In contrast to conventional regression techniques, where the output is only derived
from a single number of each attribute, a Bayesian Regression model's output is derived from a
probability distribution.

The result, "y," is produced by a normal distribution (where the variance and mean are
normalized). The goal of the Bayesian Regression Model is to identify the 'posterior' distribution
again for model parameters rather than the model parameters themselves. The model
parameters will be expected to follow a distribution in addition to the output y.

The posterior expression is given below:

Posterior = (Likelihood * Prior)/Normalization

The expression parameters are explained below:

 Posterior: It is the probability that an event, such as H, will take place given the
occurrence of another event, such as E, i.e., P(H | E).

 Likelihood: It is the likelihood function, in which a marginalization parameter variable is
used.

 Prior: This refers to the probability of event H before observing the evidence E, i.e., P(H).

This is the same as Bayes' Theorem, which states the following -

P(A|B) = (P(B|A) P(A))/P(B)

Here, A and B are events. P(A) is the probability that event A will occur, while P(A|B) is the
probability that event A will occur given that event B has already occurred. P(B), the probability
of event B happening, cannot be zero because it has already occurred.

According to the aforementioned formula, the posterior distribution of the model parameters is
proportional to the likelihood of the data multiplied by the prior probability of the parameters,
unlike Ordinary Least Squares (OLS), where the parameters are estimated purely from the data.

The influence of the likelihood rises as more data points are collected, and it eventually
outweighs the prior. In the case of an unlimited number of data points, the parameter values
converge to the values obtained by OLS. Consequently, we start our regression method with an
initial estimate (the prior value).


As we begin to include additional data points, the accuracy of our model improves. Therefore, to
make a Bayesian Ridge Regression model accurate, a considerable amount of train data is
required.
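A minimal sketch of Bayesian linear regression is shown below using scikit-learn's BayesianRidge estimator, which learns a distribution over the coefficients rather than single point values (the toy dataset is an assumption for illustration):

# Bayesian linear regression sketch using scikit-learn's BayesianRidge.
# The small dataset below is invented for illustration only.
import numpy as np
from sklearn.linear_model import BayesianRidge

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.2, 6.8, 8.1])

model = BayesianRidge()
model.fit(X, y)

# return_std=True also returns the predictive standard deviation,
# reflecting the model's uncertainty about each prediction.
mean, std = model.predict([[9]], return_std=True)
print("Predicted mean:", mean[0], "predictive std:", std[0])

The predictive standard deviation is what distinguishes this output from ordinary least squares: the model reports not just a value but how confident it is in that value.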

Real-life Application Of Bayesian Linear Regression

Some of the real-life applications of Bayesian Linear Regression are given below:

 Using Priors: Consider a scenario in which your supermarkets carry a new product, and
we want to predict its initial Christmas sales. For the new product's Christmas effect, we
may merely use the average of comparable items as a prior.

Additionally, once we obtain data from the new item's initial Christmas sales, the prior is
immediately updated. As a result, the forecast for the next Christmas is influenced by both the
prior and the new item's data.

 Regularize Priors: With the season, day of the week, trend, holidays, and a tonne of
promotion indicators, our model is severely over-parameterized. Therefore
regularization is crucial to keep the forecasts in check.

Advantages Of Bayesian Regression

Some of the main advantages of Bayesian Regression are defined below:

 Extremely efficient when the dataset is tiny.

 Particularly well-suited for online learning (where data arrives sequentially) as opposed
to batch learning, where we know the complete dataset before we begin training the
model. This is because Bayesian Regression can be used without having to store the
entire dataset.

 The Bayesian technique has been successfully applied and is quite strong
mathematically. Therefore, using this requires no additional prior knowledge of the
dataset.

Let us now look at some disadvantages of Bayesian Regression.

Disadvantages Of Bayesian Regression

Some common disadvantages of using Bayesian Regression:

 The model's inference process can take some time.

 The Bayesian strategy is not worthwhile if there is a lot of data accessible for our
dataset, and the regular probability approach does the task more effectively.

Gradient Descent in Linear Regression

 In linear regression, the model targets to get the best-fit regression line to predict the
value of y based on the given input value (x). While training the model, the model
calculates the cost function, which measures the Root Mean Squared Error between the
predicted value (pred) and the true value (y).

 The model targets to minimize the cost function.


To minimize the cost function, the model needs to have the best values of θ1 and θ2.
Initially, the model selects θ1 and θ2 values randomly and then iteratively updates these values
in order to minimize the cost function until it reaches the minimum.

 By the time the model achieves the minimum cost function, it will have the best θ1 and
θ2 values. Using these finally updated values of θ1 and θ2 in the hypothesis equation of the
linear model, the model predicts the value of y in the best manner it can.
Therefore, the question arises – how do the θ1 and θ2 values get updated?
Linear Regression Cost Function:

We graph the cost function as a function of the parameter estimates, i.e., over the parameter
range of our hypothesis function, showing the cost resulting from selecting a particular set of
parameters. We move downward towards the pits in the graph to find the minimum value. The
way to do this is by taking the derivative of the cost function. Gradient Descent steps down the
cost function in the direction of the steepest descent. The size of each step is determined by the
parameter α, known as the Learning Rate.
In the Gradient Descent algorithm, one can infer two points:

 If slope is +ve : θj = θj – (+ve value). Hence value of θj decreases.

 If slope is -ve : θj = θj – (-ve value). Hence value of θj increases.


The choice of the correct learning rate is very important, as it ensures that Gradient Descent
converges in a reasonable time:

If we choose α to be very large, Gradient Descent can overshoot the minimum. It may fail to
converge or even diverge.

For linear regression, the cost function graph is always convex shaped.

Gradient Descent is a popular optimization algorithm for linear regression models that involves
iteratively adjusting the model parameters to minimize the cost function. Here are some
advantages and disadvantages of using Gradient Descent for linear regression:

Advantages:

 Flexibility: Gradient Descent can be used with various cost functions and can handle
non-linear regression problems.

 Scalability: Gradient Descent is scalable to large datasets since it updates the parameters
for each training example one at a time.

 Convergence: Gradient Descent can converge to the global minimum of the cost function,
provided that the learning rate is set appropriately.

Disadvantages:

 Sensitivity to Learning Rate: The choice of learning rate can be critical in Gradient
Descent since using a high learning rate can cause the algorithm to overshoot the
minimum, while a low learning rate can make the algorithm converge slowly.

 Slow Convergence: Gradient Descent may require more iterations to converge to the
minimum since it updates the parameters for each training example one at a time.

 Local Minima: Gradient Descent can get stuck in local minima if the cost function has
multiple local minima.


 Noisy updates: The updates in Gradient Descent are noisy and have a high variance,
which can make the optimization process less stable and lead to oscillations around the
minimum.

Overall, Gradient Descent is a useful optimization algorithm for linear regression, but it has
some limitations and requires careful tuning of the learning rate to ensure convergence.
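To make the procedure concrete, here is a hedged sketch of batch gradient descent for simple linear regression written in plain NumPy; the learning rate, iteration count, and toy data are assumptions chosen for illustration:

# Batch gradient descent for y = theta0 + theta1*x, minimizing the MSE cost.
# Data, learning rate, and iteration count are chosen purely for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 11.0])

theta0, theta1 = 0.0, 0.0   # initial parameter values
alpha = 0.01                # learning rate
n = len(x)

for _ in range(5000):
    predictions = theta0 + theta1 * x
    error = predictions - y
    # Gradients of the MSE cost with respect to each parameter.
    grad0 = (2 / n) * np.sum(error)
    grad1 = (2 / n) * np.sum(error * x)
    # Step in the direction of steepest descent.
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")

Raising alpha too far makes the updates overshoot and diverge, while a very small alpha needs many more iterations, which mirrors the advantages and disadvantages listed above.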

Linear Classification Model

A linear classifier is a model that makes a decision to categorize a set of data points into a
discrete class based on a linear combination of its explanatory variables. As an example,
combining details about a dog such as weight, height, colour and other features would be used
by a model to decide its species. The effectiveness of these models lies in their ability to find
this mathematical combination of features that groups data points together when they have
the same class and separates them when they have different classes, providing us with clear
boundaries for how to classify.

If each instance belongs to one and only one class, then our input data can be divided into
decision regions separated by decision boundaries.

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using an example. Suppose there are two
classes, Class A and Class B. These classes have features that are similar to each other and
dissimilar to the other class.

The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.


o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners: Lazy Learner firstly stores the training dataset and wait until it receives
the test dataset. In Lazy learner case, classification is done on the basis of the most
related data stored in the training dataset. It takes less time in training but more time for
predictions.
Example: K-NN algorithm, Case-based reasoning

2. Eager Learners: Eager Learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to Lazy Learners, Eager Learners take
more time in learning and less time in prediction. Example: Decision Trees, Naïve Bayes,
ANN.

Types of ML Classification Algorithms:

Classification algorithms can be further divided into mainly two categories:

o Linear Models

o Logistic Regression

o Support Vector Machines

o Non-linear Models

o K-Nearest Neighbours

o Kernel SVM

o Naïve Bayes

o Decision Tree Classification

o Random Forest Classification

Evaluating a Classification model:

Once our model is completed, it is necessary to evaluate its performance, whether it is a
Classification or Regression model. So for evaluating a Classification model, we have the
following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability
value between 0 and 1.

o For a good binary Classification model, the value of log loss should be near to 0.


o The value of log loss increases if the predicted value deviates from the actual value.

o The lower log loss represents the higher accuracy of the model.

o For Binary classification, cross-entropy can be calculated as:

−(y log(p) + (1 − y) log(1 − p))

Where y= Actual output, p= predicted output.
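As a small hedged sketch, the code below computes this cross-entropy loss by hand for a few made-up predictions and compares it with scikit-learn's log_loss helper (the labels and probabilities are assumptions for illustration):

# Computing binary cross-entropy (log loss) by hand and with scikit-learn.
# The labels and predicted probabilities are invented for illustration.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.1, 0.8, 0.6, 0.2])

# -(y*log(p) + (1-y)*log(1-p)), averaged over all samples.
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print("Manual log loss:", manual)
print("sklearn log loss:", log_loss(y_true, p_pred))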

2. Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the


performance of the model.

o It is also known as the error matrix.

o The matrix summarizes the prediction results, showing the total number of correct
predictions and incorrect predictions. The matrix looks like the below table:

                     Actual Positive     Actual Negative

Predicted Positive   True Positive       False Positive

Predicted Negative   False Negative      True Negative
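The hedged snippet below builds such a matrix from a few made-up actual and predicted labels using scikit-learn (the label sequences are assumptions for illustration). Note that scikit-learn places the actual classes on the rows and the predicted classes on the columns, whereas the table above places the predicted classes on the rows.

# Building a confusion matrix from toy actual vs. predicted labels.
# The label sequences are invented for illustration only.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes (for labels [0, 1]).
print(confusion_matrix(y_actual, y_predicted))
# [[3 1]   -> 3 true negatives, 1 false positive
#  [1 3]]  -> 1 false negative, 3 true positives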

3. AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area
Under the Curve.

o It is a graph that shows the performance of the classification model at different


thresholds.

o To visualize the performance of the multi-class classification model, we use the AUC-
ROC Curve.

o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis
and FPR(False Positive Rate) on X-axis.

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use
cases of Classification Algorithms:


o Email Spam Detection


o Speech Recognition
o Identifications of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.
A linear classifier makes a classification decision based on the value of a linear combination
of the characteristics. Imagine that the linear classifier merges into its weights all the
characteristics that define a particular class (like merging all samples of the class "cars"
together).
o This type of classifier works better when the problem is linearly separable.

Linear Regression

Linear Regression is a statistical approach that predicts the result of a response variable by
combining numerous influencing factors. It attempts to represent the linear connection between
features (independent variables) and the target (dependent variables). The cost function
enables us to find the best possible values for the model parameters.

Logistic Regression

Logistic regression is an extension of linear regression. The sigmoid function first transforms
the linear regression output to a value between 0 and 1. After that, a predefined threshold helps
to determine the probability of the output values. Values higher than the threshold tend
towards having a probability of 1, whereas values lower than the threshold tend towards
having a probability of 0. The mathematics behind the Logistic Regression model is covered in
the next topic.


TOPIC 15. PROBABILISTIC DISCRIMINATIVE MODEL - LOGISTIC

REGRESSION

• Discriminative model makes predictions on unseen data based on conditional probability
and can be used either for classification or regression problem statements.
• On the contrary, a Generative model focuses on the distribution of a dataset to return a
probability for a given example.

Model                  Generative              Discriminative

Goal                   Probability estimates   Classification rule
Performance measure    Likelihood              Misclassification rate
Mismatch problems      Outliers                Misclassifications

Discriminative Model - Problem Formulation

Suppose we are working on a classification problem where our task is to decide if
an email is spam or not spam based on the words present in a particular email. To
solve this problem, we model the
Labels: Y = y, and
Features: X = {x1, x2, ..., xn},
and a discriminative model estimates the conditional probability P(Y | X) directly.
The discriminative model refers to a class of models used in Statistical
Classification, mainly used for supervised machine learning. These types of models
are also known as conditional models since they learn the boundaries between
classes or labels in a dataset.
 Discriminative models (just as in the literal meaning) separate classes
instead of modeling the underlying data distribution and don't make any
assumptions about the data points.
 But these models are not capable of generating new data points.
Therefore, the ultimate objective of discriminative models is to separate one
class from another.
 If we have some outliers present in the dataset, discriminative models
work better compared to generative models i.e., discriminative models are
more robust to outliers. However, one major drawback of these models is
the misclassification problem, i.e., wrongly classifying a data point.
 With the help of training data, we estimate the parameters of P(Y|X)


 Examples of Discriminative Models


o Logistic regression
o Support vector machines(SVMs)
o Traditional neural networks
o Nearest neighbor
o Conditional Random Fields (CRFs)
o Decision Trees and Random Forest

o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1, it
gives the probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except for how they
are used. Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and
discrete datasets.


o Logistic Regression can be used to classify the observations using different types
of data and can easily determine the most effective variables used for the
classification. The below image is showing the logistic function:

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and a
value below the threshold values tends to 0.

Assumptions for Logistic Regression:


o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

   y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above
equation by (1 − y):

   y / (1 − y);  0 for y = 0, and infinity for y = 1

o But we need a range between −infinity and +infinity, so taking the logarithm of the
equation, it becomes:

   log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.
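A minimal hedged sketch of these ideas is given below: it defines the sigmoid function explicitly and then fits scikit-learn's LogisticRegression on a tiny invented dataset (the data, threshold, and variable names are assumptions for illustration):

# Logistic regression sketch: the sigmoid function plus a scikit-learn classifier.
# The hours-studied / pass-fail data below is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Probability of passing for a student who studied 4.5 hours,
# then the class label using the usual 0.5 threshold.
prob = model.predict_proba([[4.5]])[0, 1]
print("P(pass) =", prob, "-> class:", int(prob >= 0.5))
print("Sigmoid(0) =", sigmoid(0.0))  # 0.5, the midpoint of the S-curve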

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

Advantages of the Logistic Regression Algorithm

 Logistic regression performs better when the data is linearly separable


 It does not require too many computational resources as it’s highly interpretable
 There is no problem scaling the input features; it does not require tuning
 It is easy to implement and train a model using logistic regression
 It gives a measure of how relevant a predictor (coefficient size) is, and its direction
of association (positive or negative)

Applications of Logistic Regression
 Using the logistic regression algorithm, banks can predict whether a customer
would default on loans or not
 To predict the weather conditions of a certain place (sunny, windy, rainy, humid,
etc.)
 Ecommerce companies can identify buyers if they are likely to purchase a certain
product
 Companies can predict whether they will gain or lose money in the next quarter,
year, or month based on their current performance
 To classify objects based on their features and attributes

TOPIC 16. PROBABILISTIC GENERATIVE MODEL, NAIVE BAYES

Generative models are considered a class of statistical models that can generate new data
instances. These models are used in unsupervised machine learning as a means to perform tasks
such as
• Probability and Likelihood estimation,
• Modeling data points
• To describe the phenomenon in data,
• To distinguish between classes based on these probabilities.
Since these models often rely on Bayes' theorem to find the joint probability, generative
models can tackle a more complex task than analogous discriminative models.
These models use the concept of joint probability and create instances where a given feature (x)
or input and the desired output or label (y) exist simultaneously.
Training a generative classifier involves estimating P(X|Y) and P(Y), from which P(Y|X) can then
be obtained using Bayes' theorem.

Examples of Generative Models


Naïve Bayes
Bayesian networks
Markov random fields
Hidden Markov Models (HMMs)

 Discriminative models draw boundaries in the data space, while generative models try to
model how data is placed throughout the space.
 A generative model explains how the data was generated, while a discriminative model
focuses on predicting the labels of the data.
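As a hedged illustration of the generative idea, the sketch below fits a Gaussian Naïve Bayes classifier, which estimates the class prior P(Y) and the class-conditional distribution P(X|Y) and then applies Bayes' theorem to classify new points (the toy data is an assumption for illustration):

# Gaussian Naive Bayes: a simple generative classifier.
# It models P(X|Y) per class with Gaussians plus the class priors P(Y),
# then uses Bayes' theorem to obtain P(Y|X). Toy data invented for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],   # class 0
              [3.0, 3.9], [3.2, 4.1], [2.9, 4.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)

print("Class priors P(Y):", model.class_prior_)
print("Predicted class for [2.0, 3.0]:", model.predict([[2.0, 3.0]])[0])
print("Posterior P(Y|X):", model.predict_proba([[2.0, 3.0]]))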


TOPIC 17. MAXIMUM MARGIN CLASSIFIER – SUPPORT VECTOR MACHINE

 Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
 The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
 SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called as support vectors, and hence algorithm is termed
as Support Vector Machine. Consider the below diagram in which there are two
different categories that are classified using a decision boundary or hyperplane:

 SVM can be understood with the example that we have used in the KNN classifier.
 Suppose we see a strange cat that also has some features of dogs; if we want a
model that can accurately identify whether it is a cat or a dog, such a model can
be created by using the SVM algorithm.
 We will first train our model with lots of images of cats and dogs so that it can
learn about different features of cats and dogs, and then we test it with this strange
creature. So as support vector creates a decision boundary between these two data
(cat and dog) and choose extreme cases (support vectors), it will see the extreme
case of cat and dog.


 On the basis of the support vectors, it will classify it as a cat. Consider the below
diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means if
a dataset cannot be classified by using a straight line, then such data is termed non-
linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane:

 There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that
helps to classify the data points. This best boundary is known as the hyperplane of
SVM.
 The dimensions of the hyperplane depend on the features present in the
dataset, which means that if there are 2 features (as shown in the image), then the
hyperplane will be a straight line. And if there are 3 features, then the hyperplane
will be a 2-dimensional plane.
 We always create a hyperplane that has a maximum margin, which means the
maximum distance between the data points.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect
the position of the hyperplane are termed as Support Vector. Since these vectors
support the hyperplane, hence called a Support vector.

Working of SVM

 The working of the SVM algorithm can be understood by using an example.


Suppose we have a dataset that has two tags (green and blue), and the dataset has
two features x1 and x2. We want a classifier that can classify the pair(x1, x2) of
coordinates in either green or blue. Consider the below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the
below image:

 Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane.
 SVM algorithm finds the closest point of the lines from both the classes. These points are
called support vectors.
 The distance between the vectors and the hyperplane is called as margin. And the goal of
SVM is to maximize this margin. The hyperplane with maximum margin is called
the optimal hyperplane.
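
As an illustration (a minimal sketch with made-up points, assuming scikit-learn is available), a linear SVM can be fitted and its support vectors inspected as follows; the points returned in support_vectors_ are the extreme cases that define the maximum-margin hyperplane.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])  # toy 2-D points
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)       # the closest points, which define the margin
print(clf.coef_, clf.intercept_)  # w and b of the hyperplane w.x + b = 0
print(clf.predict([[3, 2]]))      # classify a new point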

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

Since we are now in 3-D space, it looks like a plane parallel to the x-axis. If we
convert it back to 2-D space with z = 1, then it will become:
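
To make the z = x² + y² mapping concrete, here is a small numpy sketch with made-up circular data (an assumption for illustration): the inner and outer classes cannot be separated by a straight line in (x, y), but the plane z = 4 separates them in the lifted space. A kernel SVM (for example with an RBF kernel) performs an analogous mapping implicitly.

import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 50)
inner = np.c_[np.cos(angles), np.sin(angles)]          # class 0, radius 1
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]  # class 1, radius 3

X = np.vstack([inner, outer])
z = X[:, 0] ** 2 + X[:, 1] ** 2     # the added third dimension z = x^2 + y^2

# In the lifted (x, y, z) space the plane z = 4 separates the two classes.
print((z[:50] < 4).all(), (z[50:] > 4).all())   # prints: True True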

Advantages of SVM
 Effective in high-dimensional cases.
 Memory efficient, as it uses only a subset of the training points (the support
vectors) in the decision function.
 Different kernel functions can be specified for the decision function, and it is
possible to specify custom kernels.
Disadvantages of SVM
 If the number of features is much larger than the number of data points,
avoiding over-fitting when choosing the kernel function and the regularization term is
crucial.
 SVMs don't directly provide probability estimates. Those are calculated using
an expensive five-fold cross-validation.
 Works best on small sample sets because of its high training time.
What does a hard margin mean?

The hard margin means that the SVM is very rigid in classification and tries to
perform extremely well in the training set, covering all available data points.
This works well when the deviation is insignificant, but can lead to overfitting, in
which case we would need to switch to a soft margin.

What does a soft margin mean?

The idea behind the soft margin is based on a simple assumption: allow the support
vector classifier to make some mistakes while still keeping the margin as large
as possible.

If we want a soft margin, we have to decide how to ignore some of the outliers while
still getting good classification results. The solution is to introduce penalties for
margin violations (misclassified or in-margin points), which are regulated by the
C parameter (as it is called in many frameworks).
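
A minimal sketch of the effect of C, assuming scikit-learn and made-up data with one outlier: a small C gives a soft margin that tolerates the outlier, while a very large C behaves almost like a hard margin and tries to fit every training point.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [8.0, 8.0], [9.0, 8.0], [8.0, 9.0],
              [2.0, 2.5]])                 # the last point is an outlier of class 1
y = np.array([0, 0, 0, 1, 1, 1, 1])

soft = SVC(kernel="linear", C=0.1).fit(X, y)      # wide margin, tolerates mistakes
rigid = SVC(kernel="linear", C=1000.0).fit(X, y)  # tries to fit every training point

print(len(soft.support_), len(rigid.support_))    # compare how many support vectors each uses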

Which margin to choose?

We use an SVM with hard margin when data is evidently separable. However, we may
choose the soft margin when we have to disregard some of the outliers to improve the
model’s overall performance.

TOPIC 18. DECISION TREE, RANDOM FORESTS

o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is

easy to understand.

o The logic behind the decision tree can be easily understood because it shows a tree-

like structure.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.

How does the Decision Tree algorithm Work?

 In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree.
 This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next
node.
 For the next node, the algorithm again compares the attribute value with the other
sub-nodes and move further.
 It continues the process until it reaches the leaf node of the tree. The complete
process can be better understood using the below algorithm:

 Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
 Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
 Step-3: Divide the S into subsets that contains possible values for the best
attributes.
 Step-4: Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset
created in step 3. Continue this process until a stage is reached where the nodes
cannot be classified further; such final nodes are called leaf nodes.

Eg:
Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The
next decision node further gets split into one decision node (Cab facility) and one leaf
node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer). Consider the below diagram:

Attribute Selection Measures


While implementing a Decision tree, the main issue arises that how to select the best
attribute for the root node and for sub-nodes. So, to solve such problems there is a
technique which is called as Attribute selection measure or ASM. By this
measurement, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation
of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.

o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
o The original information (entropy) needed for classification of a tuple in dataset D is given
by:

E(S) = - Σ (i = 1 to m) p_i log2(p_i)

o Where p_i is the probability that a tuple belongs to class C_i. The information is
encoded in bits, therefore log to the base 2 is used. E(S) represents the average
amount of information required to find out the class label of dataset D; this
quantity is also called entropy.
o The information still required for exact classification after partitioning on attribute X is
given by the formula:

E(S, X) = Σ (over values c of X) P(c) × E(c)

o Where P(c) = |D_c| / |D| is the weight of the partition with value c. This represents the
information needed to classify the dataset D after partitioning by X.

o Information gain is the difference between the original and the expected information
that is required to classify the tuples of dataset D:

Gain(X) = E(S) - E(S, X)

o Gain is the reduction in information that is required by knowing the value of X.
The attribute with the highest information gain is chosen as "best".
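
The formulas above can be checked with a short numpy sketch (made-up label counts that match the play-cricket example used later in this unit):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    expected = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        expected += len(subset) / len(labels) * entropy(subset)  # weighted E(S, X)
    return entropy(labels) - expected                            # Gain(X) = E(S) - E(S, X)

y = np.array(["yes"] * 9 + ["no"] * 5)                           # 9 yes / 5 no
wind = np.array(["weak"] * 6 + ["strong"] * 3 + ["weak"] * 2 + ["strong"] * 3)

print(round(entropy(y), 3))                  # about 0.94
print(round(information_gain(y, wind), 3))   # about 0.048 for this split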

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.

o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o The CART algorithm creates only binary splits and uses the Gini index to choose them.
o The Gini index measures the impurity of the training tuples of dataset D as:

Gini(D) = 1 - Σ (i = 1 to m) p_i²

o Where p_i is the probability that a tuple belongs to class C_i. The Gini index for a
binary split of dataset D by attribute A is given by:

Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

o Where D1 and D2 are the two partitions of D produced by the split on attribute A.

o The reduction in impurity is the difference between the Gini index of the original
dataset D and the Gini index after partitioning by attribute A. The attribute giving the
maximum reduction in impurity (equivalently, the minimum Gini index after the split)
is selected as the best attribute for splitting.
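
A matching numpy sketch for the Gini formulas (same assumed labels as in the entropy sketch above):

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_after_split(labels, feature_values):
    total = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        total += len(subset) / len(labels) * gini(subset)   # weighted Gini_A(D)
    return total

y = np.array(["yes"] * 9 + ["no"] * 5)
wind = np.array(["weak"] * 6 + ["strong"] * 3 + ["weak"] * 2 + ["strong"] * 3)

print(round(gini(y), 3))                               # impurity of D, about 0.459
print(round(gini(y) - gini_after_split(y, wind), 3))   # reduction in impurity for this split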

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all
the important features of the dataset. Therefore, a technique that decreases the size of
the learning tree without reducing accuracy is known as Pruning. There are mainly
two types of tree pruning technique used: Cost Complexity Pruning and Reduced
Error Pruning.

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.


o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Algorithm BuildDT
 Input: D : Training data set
 Output: T : Decision tree
Steps
1. If all tuples in D belong to the same class Cj
Add a leaf node labeled as Cj
Return // Termination condition
2. Select an attribute Ai (so that it is not selected twice in the same branch)
3. Partition D = { D1, D2, …, Dp} based on the p different values of Ai in D
4. For each Dk ∈ D
Create a node and add an edge between D and Dk with label as the Ai's attribute
value in Dk
5. For each Dk ∈ D
BuildDT(Dk) // Recursive call
6. Stop

The BuildDT algorithm must provide a method for expressing an attribute test condition
and the corresponding outcome for each attribute type.
 Case: Binary attribute
 This is the simplest case of node splitting
 The test condition for a binary attribute generates only two outcomes

 Case: Nominal attribute


 Since a nominal attribute can have many values, its test condition can be expressed in
two ways:
 A multi-way split
 A binary split
 Multi-way split: The outcome depends on the number of distinct values of the
corresponding attribute

 Binary splitting by grouping attribute values

 Case: Ordinal attribute


 It also can be expressed in two ways:
 A multi-way split
 A binary split
 Multi-way split: It is the same as in the case of a nominal attribute
 Binary splitting attribute values should be grouped maintaining the order property of
the attribute values

 Case: Numerical attribute


 For numeric attribute (with discrete or continuous values), a test condition can be
expressed as a comparison set
 Binary outcome: A > v or A ≤ v
 In this case, decision tree induction must consider all possible split positions
 Range query : vi ≤ A < vi+1 for i = 1, 2, …, q (if q number of ranges are chosen)
 Here, q should be decided a priori

Decision Tree Induction


Decision tree induction is the method of learning the decision trees from the training
set. The training set consists of attributes and class labels. Applications of decision
tree induction include astronomy, financial analysis, medical diagnosis,
manufacturing, and production.

The image below shows the decision tree for the Titanic dataset to predict
whether the passenger will survive or not.

CART
The CART model, i.e., Classification and Regression Trees, is a decision tree algorithm for
building models. A decision tree model where the target values have a discrete nature
is called a classification model.

A discrete value is a finite or countably infinite set of values, For Example, age, size,
etc. The models where the target values are represented by continuous values are
usually numbers that are called Regression Models. Continuous variables are floating-
point variables. These two models together are called CART.

CART uses the Gini index as its classification metric (splitting criterion).
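
As a minimal sketch (assuming scikit-learn and its built-in Iris data), a CART-style tree can be grown with the Gini criterion as follows; DecisionTreeRegressor would give the regression counterpart.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree))     # the learned binary splits, printed as text
print(tree.predict(X[:2]))   # predictions for two rows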

Decision Tree Induction for Machine Learning: ID3


#1) Initially, there are three parameters i.e. attribute list, attribute selection
method and data partition. The attribute list describes the attributes of the training
set tuples.
#2) The attribute selection method describes the method for selecting the best
attribute for discrimination among tuples. The methods used for attribute selection
can either be Information Gain or Gini Index.
#3) The structure of the tree (binary or non-binary) is decided by the attribute
selection method.
#4) When constructing a decision tree, it starts as a single node representing the
tuples.
#5) If the root node tuples represent different class labels, then it calls an attribute
selection method to split or partition the tuples. The step will lead to the formation of
branches and decision nodes.
#6) The splitting method will determine which attribute should be selected to
partition the data tuples. It also determines the branches to be grown from the node
according to the test outcome. The main motive of the splitting criteria is that the
partition at each branch of the decision tree should represent the same class label

An example of splitting attribute is shown below:

a. The partitioning above is discrete-valued.

b. The partitioning above is for continuous values.

#7) The above partitioning steps are followed recursively to form a decision tree for
the training dataset tuples.
#8) The partitioning stops only when either all the partitions are made or when the
remaining tuples cannot be partitioned further.
#9) The complexity of the algorithm is described by n * |D| * log |D| where n is the
number of attributes in training dataset D and |D| is the number of tuples.
What Is Greedy Recursive Binary Splitting?
In the binary splitting method, the tuples are split and each split cost function is
calculated. The lowest cost split is selected. The splitting method is binary which is
formed as 2 branches. It is recursive in nature as the same method (calculating the
cost) is used for splitting the other tuples of the dataset.

This algorithm is called greedy because it focuses only on the current node: it lowers
the cost at that node, while the other nodes are ignored.

How To Select Attributes For Creating A Tree?


Attribute selection measures are also called splitting rules to decide how the tuples
are going to split. The splitting criteria are used to best partition the dataset. These
measures provide a ranking to the attributes for partitioning the training tuples.

The most popular methods of selecting the attribute are information gain, Gini
index.
#1) Information Gain
This method is the main method that is used to build decision trees. It reduces the
information that is required to classify the tuples. It reduces the number of tests that
are needed to classify the given tuple. The attribute with the highest information gain
is selected.

#2) Gain Ratio


Information gain might sometimes result in partitioning that is useless for classification.
The gain ratio, however, splits the training data set into partitions and considers the
number of tuples of each outcome with respect to the total tuples. The attribute with
the maximum gain ratio is used as the splitting attribute.

#3) Gini Index

The Gini index, described above, measures the impurity of a partition; CART uses it to
choose binary splits.
Overfitting In Decision Trees
Overfitting happens when a decision tree tries to be as perfect as possible by

increasing the depth of tests and thereby reduces the error. This results in very
complex trees and leads to overfitting. Overfitting reduces the predictive nature of the
decision tree. The approaches to avoid overfitting of the trees include pre pruning and
post pruning.

What Is Tree Pruning?


Pruning is the method of removing the unused branches from the decision tree. Some
branches of the decision tree might represent outliers or noisy data.

Tree pruning is the method to reduce the unwanted branches of the tree. This will
reduce the complexity of the tree and help in effective predictive analysis. It reduces
the overfitting as it removes the unimportant branches from the trees.

There are two ways of pruning the tree:


#1) Prepruning: In this approach, the construction of the decision tree is stopped
early. It means it is decided not to further partition the branches. The last node
constructed becomes the leaf node and this leaf node may hold the most frequent
class among the tuples.
The attribute selection measures are used to find out the weightage of the split.
Threshold values are prescribed to decide which splits are regarded as useful. If
partitioning a node results in a split whose measure falls below the threshold, then the
process is halted.

#2) Postpruning: This method removes the outlier branches from a fully grown tree.
The unwanted branches are removed and replaced by a leaf node denoting the most
frequent class label. This technique requires more computation than prepruning,
however, it is more reliable.
The pruned trees are more precise and compact when compared to unpruned trees
but they carry a disadvantage of replication and repetition.

Repetition occurs when the same attribute is tested again and again along a branch of
a tree. Replication occurs when the duplicate subtrees are present within the tree.
These issues can be solved by multivariate splits.

Example of Decision Tree Algorithm


Constructing a Decision Tree
Let us take an example of a 14-day weather dataset with attributes outlook,
temperature, wind, and humidity. The outcome variable is whether to play cricket or not.
We will use the ID3 algorithm to build the decision tree.

Day   Outlook    Temperature   Humidity   Wind     Play cricket

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes


4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Step1: The first step will be to create a root node.


Step2: If all results are yes, then the leaf node “yes” will be returned else the leaf node
“no” will be returned.
Step3: Find out the Entropy of all observations and entropy with attribute “x” that is
E(S) and E(S, x).
Step4: Find out the information gain and select the attribute with high information
gain.
Step5: Repeat the above steps until all attributes are covered.
Calculation of Entropy:
The dataset has 9 "Yes" and 5 "No" tuples, so

E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94

If entropy is zero, it means that all members belong to the same class; if entropy is one,
it means that half of the tuples belong to one class and half to the other. A value of
0.94 means a fairly even distribution.


Find the information gain attribute which gives maximum information gain.

For Example “Wind”, it takes two values: Strong and Weak, therefore, x = {Strong,
Weak}.

Find out H(x) and P(x) for x = Weak and x = Strong. H(S) is already calculated above.

Weak = 8 tuples

Strong = 6 tuples

For "weak" wind, 6 of them say "Yes" to play cricket and 2 of them say "No". So the
entropy will be:

E(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) ≈ 0.811

For "strong" wind, 3 said "No" to play cricket and 3 said "Yes":

E(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0

This shows perfect randomness, as half the items belong to one class and the remaining
half belong to the other.

Calculate the information gain:

Gain(S, Wind) = E(S) - (8/14) E(S_weak) - (6/14) E(S_strong) ≈ 0.94 - 0.463 - 0.429 ≈ 0.048

Similarly, the information gain for the other attributes is:

Gain(S, Outlook) ≈ 0.246, Gain(S, Humidity) ≈ 0.151, Gain(S, Temperature) ≈ 0.029

The attribute Outlook has the highest information gain of 0.246, thus it is chosen as the
root.
Outlook has 3 values: Sunny, Overcast and Rain. Overcast with play cricket is always
"Yes", so it ends with a leaf node "yes". The other values, "Sunny" and "Rain", are expanded further.

Table for Outlook as "Sunny" will be:

Temperature   Humidity   Wind     Play cricket
Hot           High       Weak     No
Hot           High       Strong   No
Mild          High       Weak     No
Cool          Normal     Weak     Yes
Mild          Normal     Strong   Yes


Entropy for Outlook = "Sunny" is:

E(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.971

Information gain for the attributes with respect to Sunny is:

Gain(Sunny, Humidity) ≈ 0.971, Gain(Sunny, Temperature) ≈ 0.571, Gain(Sunny, Wind) ≈ 0.020

The information gain for Humidity is highest, therefore it is chosen as the next node.
Similarly, the entropy is calculated for Rain; there, Wind gives the highest information gain.
The decision tree would look like below:
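
As a cross-check of the hand calculation, here is a small pandas/numpy sketch (the values are typed in from the 14-row table above); it reproduces the information gains and confirms that Outlook becomes the root.

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Play":        ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

def entropy(s):
    p = s.value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

def gain(df, attr, target="Play"):
    g = entropy(df[target])
    for value, part in df.groupby(attr):
        g -= len(part) / len(df) * entropy(part[target])
    return g

for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(gain(data, attr), 3))
# Outlook has the largest gain (about 0.246), so it becomes the root node.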


What Is Predictive Modelling?


The classification models can be used to predict the outcomes of an unknown set of
attributes.

When a dataset with unknown class labels is fed into the model, then it will
automatically assign the class label to it. This method of applying probability to
predict outcomes is called predictive modeling.

The Below image shows an unpruned and pruned tree.

RANDOM FORESTS

"Random Forest is a classifier that contains a number of decision trees on various


subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset."

Instead of relying on a single decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of these predictions, outputs the
final result.

Algorithm

Step-1: Select random K data points from the training set.


Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the
new data points to the category that wins the majority votes.
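
A minimal sketch of the steps above, assuming scikit-learn and its Iris data: N trees are grown on bootstrap subsets and the final class is decided by majority vote.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # N = 100 trees
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # accuracy of the majority-vote predictions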

Characteristics
• It takes less training time as compared to other algorithms.
• It predicts output with high accuracy, even for the large dataset it runs efficiently.
• It can also maintain accuracy when a large proportion of data is missing.
Advantages of Random Forest
• Random Forest is capable of performing both Classification and Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest
• Although random forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.


UNIT- IV

ENSEMBLE LEARNING AND UNSUPERVISED LEARNING

S.NO  TOPICS

19  Combining multiple learners: Model combination schemes, Voting
20  Ensemble Learning - bagging, boosting, stacking
21  Unsupervised learning: K-means
22  Instance Based Learning: KNN
23  Gaussian mixture models and Expectation maximization

TOPIC 19. COMBINING MULTIPLE LEARNERS: MODEL COMBINATION SCHEMES, VOTING

Ensemble learning is one of the most powerful machine learning techniques that use the
combined output of two or more models/weak learners and solve a particular
computational intelligence problem. E.g., a Random Forest algorithm is an ensemble of
various decision trees combined.

An ensemble model is a machine learning model that combines the predictions from two or
more models.

 A Voting Classifier is a machine learning model that trains on an ensemble of numerous


models and predicts an output (class) based on their highest probability of chosen
class as the output.
 It simply aggregates the findings of each classifier passed into Voting Classifier and
predicts the output class based on the highest majority of voting.
 The idea is instead of creating separate dedicated models and finding the accuracy for
each them, we create a single model which trains by these models and predicts output
based on their combined majority of voting for each output class.

Voting Classifier supports two types of votings.

 Hard Voting: In hard voting, the predicted output class is a class with the highest
majority of votes i.e the class which had the highest probability of being predicted by each
of the classifiers. Suppose three classifiers predicted the output class(A, A, B), so here the
majority predicted A as output. Hence A will be the final prediction.
 Soft Voting: In soft voting, the output class is the prediction based on the average of
probability given to that class. Suppose given some input to three models, the prediction
probability for class A = (0.30, 0.47, 0.53) and B = (0.20, 0.32, 0.40). So the average for
class A is 0.4333 and B is 0.3067, the winner is clearly class A because it had the highest
probability averaged by each classifier
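
A minimal sketch of hard and soft voting, assuming scikit-learn and its Iris data (the member models are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
members = [("lr", LogisticRegression(max_iter=1000)),
           ("dt", DecisionTreeClassifier(random_state=0)),
           ("nb", GaussianNB())]

hard = VotingClassifier(estimators=members, voting="hard").fit(X, y)  # majority class
soft = VotingClassifier(estimators=members, voting="soft").fit(X, y)  # averaged probabilities

print(hard.predict(X[:3]))
print(soft.predict(X[:3]))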

TOPIC 20. ENSEMBLE LEARNING - BAGGING, BOOSTING, STACKING

Ensemble learning is primarily used to improve the model performance, such as classification,
prediction, function approximation, etc. In simple words, we can summarise the ensemble
learning as follows:

 There are many ways to ensemble models in machine learning, such as Bagging,
Boosting, and stacking.
 Stacking is one of the most popular ensemble machine learning techniques used to
predict multiple nodes to build a new model and improve model performance.
 Stacking enables us to train multiple models to solve similar problems, and based on
their combined output, it builds a new model with improved performance.

1. Bagging
Bagging is a method of ensemble modeling, which is primarily used to solve supervised
machine learning problems. It is generally completed in two steps as follows:

Bootstrapping:
 It is a random sampling method that is used to derive samples from the data using the
replacement procedure.
 In this method, first, random data samples are fed to the primary model, and then a
base learning algorithm is run on the samples to complete the learning process.
Aggregation:
 This is a step that involves the process of combining the output of all base models and,
based on their output, predicting an aggregate result with greater accuracy and
reduced variance.
 Example: In the Random Forest method, predictions from multiple decision trees are
ensembled parallelly.
 Further, in regression problems, we use an average of these predictions to get the final
output, whereas, in classification problems, the model is selected as the predicted
class.

Steps to Perform Bagging


 Consider there are n observations and m features in the training set. You need to select
a random sample from the training dataset without replacement
 A subset of m features is chosen randomly to create a model using sample observations

 The feature offering the best split out of the lot is used to split the nodes
 The tree is grown, so you have the best root nodes
 The above steps are repeated n times. It aggregates the output of individual decision
trees to give the best prediction

Advantages of Bagging in Machine Learning


 Bagging minimizes the overfitting of data
 It improves the model’s accuracy
 It deals with higher dimensional data efficiently
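
A minimal sketch of bagging, assuming scikit-learn and its Iris data: bootstrap samples are drawn, one base tree is fitted per sample, and their predictions are aggregated.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bag = BaggingClassifier(DecisionTreeClassifier(),   # base learner
                        n_estimators=25,            # number of bootstrap samples / models
                        bootstrap=True,
                        random_state=0)
bag.fit(X, y)

print(bag.predict(X[:5]))   # aggregated (majority-vote) predictions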

2. Boosting
 Boosting is an ensemble method that enables each member to learn from the
preceding member's mistakes and make better predictions for the future.
 Unlike the bagging method, in boosting, all base learners (weak) are arranged in a
sequential format so that they can learn from the mistakes of their preceding learner.
 Hence, in this way, all weak learners get turned into strong learners and make a better
predictive model with significantly improved performance.

The process for building one sample can be summarized as follows:

 Choose the size of the sample.


 While the size of the sample is less than the chosen size
 Randomly select an observation from the dataset
 Add it to the sample
 The bootstrap method can be used to estimate a quantity of a population. This is done
by repeatedly taking small samples, calculating the statistic, and taking the average of
the calculated statistics.

We can summarize this procedure as follows:

 Choose a number of bootstrap samples to perform


 Choose a sample size
 For each bootstrap sample
 Draw a sample with replacement with the chosen size
 Calculate the statistic on the sample
 Calculate the mean of the calculated sample statistics.
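
A small numpy sketch of the bootstrap procedure summarized above (the data values are made up): resample with replacement, compute the statistic on each sample, and average the results.

import numpy as np

rng = np.random.default_rng(0)
data = np.array([3.1, 2.4, 5.6, 4.8, 3.9, 4.4, 2.9, 5.1])

n_bootstrap, sample_size = 1000, len(data)
stats = []
for _ in range(n_bootstrap):
    sample = rng.choice(data, size=sample_size, replace=True)  # draw with replacement
    stats.append(sample.mean())                                # statistic of this sample

print(np.mean(stats))   # bootstrap estimate of the mean (close to data.mean())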

3. Stacking
 Stacking is one of the popular ensemble modeling techniques in machine learning.
 Various weak learners are ensembled in a parallel manner in such a way that by
combining them with Meta learners, we can predict better predictions for the future.
 This ensemble technique works by applying input of combined multiple weak learners'
predictions and Meta learners so that a better output prediction model can be
achieved.
 In stacking, an algorithm takes the outputs of sub-models as input and attempts to
learn how to best combine the input predictions to make a better output prediction.
 Stacking is also known as a stacked generalization and is an extended form of the
Model Averaging Ensemble technique in which all sub-models equally participate as
per their performance weights and build a new model with better predictions. This
new model is stacked up on top of the others; this is the reason why it is named
stacking.

 Original data: This data is divided into n-folds and is also considered test data or
training data.
 Base models: These models are also referred to as level-0 models. These models use
training data and provide compiled predictions (level-0) as an output.
 Level-0 Predictions: Each base model is triggered on some training data and provides
different predictions, which are known as level-0 predictions.
 Meta Model: The architecture of the stacking model consists of one meta-model,
which helps to best combine the predictions of the base models. The meta-model is
also known as the level-1 model.
 Level-1 Prediction: The meta-model learns how to best combine the predictions of
the base models and is trained on different predictions made by individual base
models, i.e., data not used to train the base models are fed to the meta-model,
predictions are made, and these predictions, along with the expected outputs, provide
the input and output pairs of the training dataset used to fit the meta-model.

Steps to implement Stacking models:


1. Split training data sets into n-folds using the RepeatedStratifiedKFold as this is the
most common approach to preparing training datasets for meta-models.
2. Now the base model is fitted with the first fold, which is n-1, and it will make
predictions for the nth folds.
3. The prediction made in the above step is added to the x1_train list.
4. Repeat steps 2 & 3 for the remaining n-1 folds, which gives an x1_train array of size n.
5. Now, the model is trained on all the n parts, which will make predictions for the
sample data.
6. Add this prediction to the y1_test list.
7. In the same way, we can find x2_train, y2_test, x3_train, and y3_test by using Model 2
and 3 for training, respectively, to get Level 2 predictions.
8. Now train the Meta model on level 1 prediction, where these predictions will be used
as features for the model.
Finally, Meta learners can now be used to make a prediction on test data in the
stacking model.
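
A minimal sketch of the stacking steps above, assuming scikit-learn and its Iris data: the level-0 models produce out-of-fold predictions (handled internally via cross-validation) that train the level-1 meta-model.

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

base_models = [("dt", DecisionTreeClassifier(random_state=0)), ("nb", GaussianNB())]
meta_model = LogisticRegression(max_iter=1000)

stack = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
stack.fit(X, y)             # level-0 predictions for the meta-model come from 5-fold CV

print(stack.predict(X[:5]))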

TOPIC 21. UNSUPERVISED LEARNING: K-MEANS


K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.

What is K-Means Algorithm?


 K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters,
and so on.
 It is an iterative algorithm that divides the unlabeled dataset into k different
clusters in such a way that each dataset belongs only one group that has similar
properties.
 It allows us to cluster the data into different groups and a convenient way to discover
the categories of groups in the unlabeled dataset on its own without the need for any
training.
 It is a centroid-based algorithm, where each cluster is associated with a centroid.
 The main aim of this algorithm is to minimize the sum of distances between the
data point and their corresponding clusters.
 The algorithm takes the unlabeled dataset as input, divides the dataset into k-
number of clusters, and repeats the process until it does not find the best
clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for K center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.
 Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined
K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each datapoint to the new closest
centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
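
Before walking through the visual plots, here is a minimal sketch of the same steps with scikit-learn (made-up points, K = 2):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.2], [1.5, 1.8], [1.1, 0.9],
              [7.8, 8.0], [8.2, 7.7], [7.5, 8.4]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)   # K must be chosen in advance
labels = km.fit_predict(X)                             # iterates the assign / re-center steps

print(labels)                # cluster assigned to each point
print(km.cluster_centers_)   # the two final centroids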

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:

 Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
 We need to choose some random k points or centroid to form the cluster. These points
can be either the points from the dataset or any other point.
 So, here we are selecting the below two points as k points, which are not the part of
our dataset. Consider the below image:

 Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate
the distance between two points. So, we will draw a median between both the
centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue
centroid, and the points to the right of the line are close to the yellow centroid. Let's color
them blue and yellow for clear visualization.

 As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:

Next, we will reassign each datapoint to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like below image:

From the above image, we can see, one yellow point is on the left side of the line, and
two blue points are right to the line. So, these three points will be assigned to new
centroids.

As reassignment has taken place, so we will again go to the step-4, which is finding
new centroids or K-points. We will repeat the process by finding the center of gravity

of centroids, so the new centroids will be as shown in the below image:

o As we got the new centroids so again will draw the median line and reassign the data
points. So, the image will be:

o We can see in the above image; there are no dissimilar data points on either side of the
line, which means our model is formed. Consider the below image:

As our model is ready, so we can now remove the assumed centroids, and the two final clusters
will be as shown in the below image:

K Means Clustering working

Refer handwritten notes for solved example.

Limitations of K-Means clustering


1. When the number of data points is small, the initial grouping determines the clusters
significantly.
2. The number of clusters, K, must be determined beforehand. A disadvantage is that the
algorithm does not yield the same result with each run, since the resulting clusters depend on
the initial random assignments.
3. We never know the real clusters; using the same data, if it is input in a different order it may
produce different clusters when the number of data points is small.
4. It is sensitive to the initial condition. A different initial condition may produce a different
clustering result, and the algorithm may be trapped in a local optimum.

Applications
 Customer Segmentation: K-means clustering in machine learning permits marketers
to enhance their customer base, work on a target base, and segment customers based
on purchase patterns, interests, or activity monitoring. Segmentation helps companies
target specific clusters/groups of customers for particular campaigns.
 Document Classification: Cluster documents in multiple categories based on tags,
topics, and content. K-means clustering in machine learning is a suitable algorithm for
this purpose. The initial processing of the documents is needed to represent each
document as a vector and uses term frequency for identifying commonly used terms
that help classify the document. The document vectors are then clustered to help
identify similarities in document groups.
 Delivery store optimization: K-means clustering in machine learning helps to
optimize the process of good delivery using truck drones. K-means clustering in
machine learning helps to find the optimal number of launch locations.
 Insurance fraud detection: Machine learning is critical in fraud detection and has
numerous applications in automobile, healthcare, and insurance fraud detection.
 K-Means clustering in machine learning can also be used for performing image
segmentation by trying to group similar pixels in the image together and creating
clusters.

TOPIC 22. INSTANCE BASED LEARNING: KNN

One more way to categorize Machine Learning systems is by how they generalize.
Generalization — usually refers to a ML model’s ability to perform well on new unseen data
rather than just the data that it was trained on.

Most Machine Learning tasks are about making predictions. This means that given a number
of training examples, the system needs to be able to make good “predictions for” / “generalize
to” examples it has never seen before. Having a good performance measure on the training
data is good, but insufficient; the true goal is to perform well on new instances.

There are two main approaches to generalization: instance-based learning and model-
based learning

1. Instance-based learning:
(sometimes called memory-based learning) is a family of learning algorithms that, instead of
performing explicit generalization, compares new problem instances with instances seen in
training, which have been stored in memory.

Instance-based learning systems, also known as lazy learning systems, store the entire
training dataset in memory and when a new instance is to be classified, it compares the
new instance with the stored instances and returns the most similar one. These systems
do not build a model using the training dataset.

Examples of instance-based learning algorithms


K-Nearest Neighbors – This algorithm classifies a data point based on the majority class of
its k nearest neighbors.
Locally Weighted Regression: This algorithm is used for regression problems and creates a
linear model around the test data point by giving more weight to the training instances that
are closer to the test instance.

2. Model-based learning:
 Machine learning models that are parameterized with a certain number of
parameters that do not change as the size of training data changes.

 If you don’t assume any distribution with a fixed number of parameters over your data,
for example, in k-nearest neighbor, or in a decision tree, where the number of
parameters grows with the size of the training data, then you are not model-based, or
nonparametric

 Model-based learning systems are also known as eager learning systems, where the
model learns the training data. These systems build a machine learning model using
the entire training dataset, which is built by analyzing the training data and
identifying patterns and relationships. After that, the model can be used to make
predictions on new data.

Examples of model-based learning algorithms


Linear Regression: This algorithm is used for predicting continuous variables and assumes
that the relationship between the input and output variables is linear.
Logistic Regression: This algorithm is used for predicting binary outcomes and is based on
the logistic function.
Decision Trees: This algorithm is used for both classification and regression problems and is
based on a tree-like structure where each internal node represents a feature, and each leaf
node represents a class label or a predicted value.

Which is better?
 In general, it’s better to choose model-based learning when the goal is to make
predictions on unseen data, and when there are enough computational resources
available. And it’s better to choose instance-based learning when the goal is to make
predictions on new data that are similar to the training instances, and when there are
limited computational resources available.

 Also, instance-based learning systems have a lower training time compared to


model-based learning systems, but a higher prediction time. Model-based learning
systems have a higher training time compared to instance-based learning systems, but
a lower prediction time.

 So, both instance-based and model-based learning have their own advantages and
disadvantages, and the choice between them depends on the specific problem and the
available resources.

KNN K-Nearest Neighbour

 K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on


Supervised Learning technique.
 K-NN algorithm assumes the similarity between the new case/data and available
cases and put the new case into the category that is most similar to the available
categories.
 K-NN algorithm stores all the available data and classifies a new data point based on
the similarity. This means when new data appears then it can be easily classified into a
well suite category by using K- NN algorithm.
 K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
 It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead, it stores the dataset and, at the time of classification,
performs an action on the dataset.
 KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
 Example: Suppose, we have an image of a creature that looks similar to cat and dog, but
we want to know either it is a cat or dog. So for this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the similar
features of the new data set to the cats and dogs images and based on the most similar
features it will put it in either cat or dog category.

Why do we need a K-NN Algorithm?


 Suppose there are two categories, i.e., Category A and Category B, and we have a new
data point x1, so this data point will lie in which of these categories.
 To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we
can easily identify the category or class of a particular dataset.
 Consider the below diagram

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors


Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each
category.
Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
Step-6: Our model is ready.
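
A minimal sketch of these steps, assuming scikit-learn and made-up points (K = 5): Euclidean distances are computed lazily at prediction time and the majority class among the five nearest neighbours is returned.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [6, 6], [7, 6], [6, 7], [7, 7]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)                    # a lazy learner: this mostly just stores the data

print(knn.predict([[2, 3]]))     # class decided by majority vote of the 5 neighbours
print(knn.kneighbors([[2, 3]]))  # distances and indices of those neighbours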

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

 Firstly, we will choose the number of neighbors, so we will choose the k=5.
 Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. It can be calculated as:

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm

There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
 A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
 Large values for K are more robust to noise, but they can blur the class boundaries and
increase the computation.

Advantages of KNN Algorithm:


 It is simple to implement.
 It is robust to the noisy training data
 It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


 Always needs to determine the value of K, which may be complex at times.
 The computation cost is high because of calculating the distance between the data
points for all the training samples.

We have data from the questionnaires survey and objective testing with two attributes (acid
durability and strength) to classify whether a special paper tissue is good or not. Here are the
training samples

Now the factory produces a new paper tissue that passes the lab test with X1 = 3 and X2 = 7. Without
another expensive survey, find the classification of this new tissue.

1. Determine K, e.g., K = 3.
2. Calculate the distance between the query instance and all the training samples. Here
the query instance is (3, 7).

X1 = Acid Durability (seconds)   X2 = Strength (kg/sq.m)   Squared distance to (3, 7)
7                                7                         (7-3)² + (7-7)² = 16
7                                4                         (7-3)² + (4-7)² = 25
3                                4                         (3-3)² + (4-7)² = 9
1                                4                         (1-3)² + (4-7)² = 13
3. Sort the distances and determine the nearest neighbors based on the k-th minimum distance.

X1   X2   Squared distance to (3, 7)   Rank (minimum distance)   Included in 3 nearest neighbors?
7    7    (7-3)² + (7-7)² = 16         3                         Yes
7    4    (7-3)² + (4-7)² = 25         4                         No
3    4    (3-3)² + (4-7)² = 9          1                         Yes
1    4    (1-3)² + (4-7)² = 13         2                         Yes

4. Gather the category Y of Nearest Neighbors

X1   X2   Squared distance to (3, 7)   Rank   Included in 3 nearest neighbors?   Y
7    7    (7-3)² + (7-7)² = 16         3      Yes                                Bad
7    4    (7-3)² + (4-7)² = 25         4      No                                 -
3    4    (3-3)² + (4-7)² = 9          1      Yes                                Good
1    4    (1-3)² + (4-7)² = 13         2      Yes                                Good

Use a simple majority of the categories of the nearest neighbors as the prediction for the query instance. Here Y = Good (2 votes out of 3).

Eg 2
Suppose we have height, weight and T-shirt size of some customers and we need to
predict the T-shirt size of a new customer given only height and weight information we
have. Data including height, weight and T-shirt size information is shown. New customer
named 'Monica' has height 161cm and weight 61kg. Find the T-shirt size.

Eg:3

The table above represents our data set. We have two columns — Brightness and Saturation.
Each row in the table has a class of either Red or Blue.

Let's assume the value of K is 5.

How to Calculate Euclidean Distance in the K-Nearest Neighbors Algorithm

We have a new entry but it doesn't have a class yet. To know its class, we have to calculate the
distance from the new entry to other entries in the data set using the Euclidean distance
formula.

Here's the formula: √(X₂-X₁)²+(Y₂-Y₁)²

Where:

X₂ = New entry's brightness (20).


X₁= Existing entry's brightness.
Y₂ = New entry's saturation (35).
Y₁ = Existing entry's saturation.
Let's do the calculation together. I'll calculate the first three.

Distance #1
For the first row, d1:

d1 = √(20 - 40)² + (35 - 20)²


= √400 + 225
= √625
= 25

We now know the distance from the new data entry to the first entry in the table. Let's update
the table.

Distance #2
For the second row, d2:

d2 = √(20 - 50)² + (35 - 50)²


= √900 + 225
= √1125
= 33.54
Here's the table with the updated distance:

The table will look like this after all the distances have been calculated:

Let's rearrange the distances in ascending order:

Since we chose 5 as the value of K, we'll only consider the first five rows. That is:

As you can see above, the majority class within the 5 nearest neighbors to the new entry is
Red. Therefore, we'll classify the new entry as Red.

Here's the updated table:

How to Choose the Value of K in the K-NN Algorithm

 Choosing a very low value will most likely lead to inaccurate predictions.
 The commonly used value of K is 5.
 Prefer an odd number as the value of K (for two-class problems this avoids ties).

TOPIC 23. GAUSSIAN MIXTURE MODELS AND EXPECTATION MAXIMIZATION

 Gaussian mixture models (GMMs) are a type of machine learning algorithm. They are used
to classify data into different categories based on the probability distribution.
 Gaussian mixture models can be used in many different areas, including finance,
marketing and so much more!

 Gaussian mixture models (GMM) are a probabilistic concept used to model real-world data
sets. GMMs are a generalization of Gaussian distributions and can be used to represent
any data set that can be clustered into multiple Gaussian distributions.
 The Gaussian mixture model is a probabilistic model that assumes all the data points
are generated from a mix of Gaussian distributions with unknown parameters.
 A Gaussian mixture model can be used for clustering, which is the task of grouping a set
of data points into clusters.
 GMMs can be used to find clusters in data sets where the clusters may not be clearly
defined. Additionally, GMMs can be used to estimate the probability that a new data point
belongs to each cluster.
 Gaussian mixture models are also relatively robust to outliers, meaning that they can
still yield accurate results even if there are some data points that do not fit neatly into any
of the clusters. This makes GMMs a flexible and powerful tool for clustering data.
 It can be understood as a probabilistic model where Gaussian distributions are assumed

for each group and they have means and covariances which define their parameters.
 GMM consists of two parts – mean vectors (μ) & covariance matrices (Σ). A Gaussian
distribution is defined as a continuous probability distribution that takes on a bell-shaped
curve.
 Another name for Gaussian distribution is the normal distribution. Here is a picture of
Gaussian mixture models:

 GMM has many applications, such as density estimation, clustering, and image
segmentation.
 For density estimation, GMM can be used to estimate the probability density function of a
set of data points.
 For clustering, GMM can be used to group together data points that come from the same
Gaussian distribution. And for image segmentation, GMM can be used to partition an
image into different regions.
 Gaussian mixture models can be used for a variety of use cases, including identifying
customer segments, detecting fraudulent activity, and clustering images.
 In each of these examples, the Gaussian mixture model is able to identify clusters in the
data that may not be immediately obvious.
 As a result, Gaussian mixture models are a powerful tool for data analysis and should be
considered for any clustering task.

What is the expectation-maximization (EM) method in relation to GMM?


 In Gaussian mixture models, an expectation-maximization method is a powerful tool for
estimating the parameters of a Gaussian mixture model (GMM). The expectation is
termed E and maximization is termed M.
 In the expectation (E) step, the current parameter estimates are used to compute, for each
data point, the responsibility (posterior probability) of every Gaussian component. In the
maximization (M) step, these responsibilities are used to re-estimate the parameters
(means, covariances, and mixing weights) of the components.
 The expectation-maximization method is a two-step iterative algorithm that alternates
between performing an expectation step, in which we compute expectations for each
data point using current parameter estimates and then maximize these to produce a
new gaussian, followed by a maximization step where we update our gaussian
means based on the maximum likelihood estimate.
 The EM method works by first initializing the parameters of the GMM, then iteratively
improving these estimates. At each iteration, the expectation step calculates the
expectation of the log-likelihood function with respect to the current parameters. This
expectation is then used to maximize the likelihood in the maximization step. The process
is then repeated until convergence.


 Here is a picture representing the two-step iterative aspect of the algorithm
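
In code, this two-step iterative loop is what scikit-learn's GaussianMixture runs internally. The minimal sketch below fits a two-component Gaussian mixture with EM; the synthetic data are purely illustrative.

```python
# Hypothetical illustration: fitting a 2-component GMM with EM via scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian clusters.
data = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                  rng.normal(5.0, 1.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(data)                      # runs the EM iterations until convergence

print(gmm.means_)                  # estimated mean vector of each Gaussian
print(gmm.weights_)                # estimated mixing proportions
print(gmm.predict_proba(data[:3])) # E-step style responsibilities for 3 points
```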

The following are three different steps to using gaussian mixture models:

 Determining a covariance matrix that defines how each Gaussian is related to the others. The more similar two Gaussians are, the closer their means will be; conversely, dissimilar Gaussians have means that are far apart.
 A Gaussian mixture model can have a covariance matrix that is diagonal or full (a full covariance matrix is symmetric).
 Determining the number of Gaussians in each group, which defines how many clusters there are.
 Selecting the hyperparameters which define how to optimally separate the data using Gaussian mixture models, as well as deciding whether each Gaussian's covariance matrix is diagonal or full.

Application areas
There are many different real-world problems that can be solved with gaussian mixture
models. Gaussian mixture models are very useful when there are large datasets and it is
difficult to find clusters. This is where Gaussian mixture models help. It is able to find
clusters of Gaussians more efficiently than other clustering algorithms such as k-means.

Here are some real-world problems which can be solved using Gaussian mixture models:

 Finding patterns in medical datasets: GMMs can be used for segmenting images into
multiple categories based on their content or finding specific patterns in medical datasets.
They can be used to find clusters of patients with similar symptoms, identify disease
subtypes, and even predict outcomes. In one recent study, a Gaussian mixture model was
used to analyze a dataset of over 700,000 patient records. The model was able to identify
previously unknown patterns in the data, which could lead to better treatment for patients
with cancer.
 Modeling natural phenomena: GMM can be used to model natural phenomena where it
has been found that noise follows Gaussian distributions. This model of probabilistic
modeling relies on the assumption that there exists some underlying continuum of
unobserved entities or attributes and that each member is associated with measurements
taken at equidistant points in multiple observation sessions.
 Customer behavior analysis: GMMs can be used for performing customer behavior
analysis in marketing to make predictions about future purchases based on historical data.
 Stock price prediction: Another area Gaussian mixture models are used is in finance
where they can be applied to a stock’s price time series. GMMs can be used to detect
changepoints in time series data and help find turning points of stock prices or other
market movements that are otherwise difficult to spot due to volatility and noise.
 Gene expression data analysis: Gaussian mixture models can be used for gene
expression data analysis. In particular, GMMs can be used to detect differentially
expressed genes between two conditions and identify which genes might contribute
toward a certain phenotype or disease state.

What are the differences between Gaussian mixture models and other types of
clustering algorithms such as K-means?
Here are some of the key differences between Gaussian mixture models and the K-means
algorithm used for clustering:
 A Gaussian mixture model is a type of clustering algorithm that assumes that the data
point is generated from a mixture of Gaussian distributions with unknown parameters.
The goal of the algorithm is to estimate the parameters of the Gaussian distributions, as
well as the proportion of data points that come from each distribution. In contrast, K-
means is a clustering algorithm that does not make any assumptions about the underlying
distribution of the data points. Instead, it simply partitions the data points into K clusters,
where each cluster is defined by its centroid.
 While Gaussian mixture models are more flexible, they can be more difficult to train than
K-means. K-means is typically faster to converge and so may be preferred in cases where
the runtime is an important consideration.
 In general, K-means will be faster and more accurate when the data set is large and the
clusters are well-separated. Gaussian mixture models will be more accurate when the data
set is small or the clusters are not well-separated.
 Gaussian mixture models take into account the variance of the data, whereas K-means
does not.
 Gaussian mixture models are more flexible in terms of the shape of the clusters, whereas
K-means is limited to spherical clusters.
 Gaussian mixture models can handle missing data, whereas K-means cannot. This
difference can make Gaussian mixture models more effective in certain applications, such
as data with a lot of noise or data that is not well-defined.
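
The minimal sketch below contrasts the two algorithms from the list above on the same synthetic data: K-means returns only hard cluster labels, while the Gaussian mixture model additionally returns soft membership probabilities. The data and seeds are illustrative.

```python
# Hypothetical comparison of K-means (hard assignments) and GMM (soft assignments).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(4, 2, size=(150, 2))])   # clusters with unequal variances

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
gmm = GaussianMixture(n_components=2, random_state=1).fit(X)

print(kmeans.labels_[:5])          # hard labels only
print(gmm.predict(X[:5]))          # hard labels from the GMM
print(gmm.predict_proba(X[:5]))    # soft probabilities - no K-means equivalent
```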


UNIT- V

NEURAL NETWORKS

S.NO    TOPICS

24      Perceptron, Multilayer perceptron, activation functions
25      Network training, gradient descent optimization
26      Stochastic gradient descent, error backpropagation, from shallow networks to deep networks
27      Unit saturation (aka the vanishing gradient problem) – ReLU, hyperparameter tuning, batch normalization, regularization, dropout


INTRODUCTION

 The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain.
 An Artificial neural network is usually a computational network based on biological neural
networks that construct the structure of the human brain.
 Similar to a human brain has neurons interconnected to each other, artificial neural networks also
have neurons that are linked to each other in various layers of the networks. These neurons are
known as nodes.
The typical diagram of Biological Neural Network.

The typical Artificial Neural Network looks something like the given figure.

Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell
nucleus represents Nodes, synapse represents Weights, and Axon represents Output.

An Artificial Neural Network, in the field of Artificial Intelligence, attempts to mimic the network of neurons that makes up the human brain, so that computers get an option to understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.

 There are around 1000 billion neurons in the human brain. Each neuron has somewhere between 1,000 and 100,000 association points (connections).
 In the human brain, data is stored in such a manner as to be distributed, and we can
extract more than one piece of this data when necessary from our memory parallelly.
 We can say that the human brain is made up of incredibly amazing parallel processors.

 We can understand the artificial neural network with an example, consider an example of
a digital logic gate that takes an input and gives an output. "OR" gate, which takes two
inputs.

 If one or both the inputs are "On," then we get "On" in output. If both the inputs are "Off,"
then we get "Off" in output. Here the output depends upon input. Our brain does not
perform the same task.

 The outputs to inputs relationship keep changing because of the neurons in our brain,
which are "learning."

Relationship between Biological neural network and artificial neural network:

Biological Neural Network Artificial Neural Network

Dendrites Inputs

Cell nucleus Nodes

Synapse Weights

Axon Output

The architecture of an artificial neural network:


A neural network consists of a large number of artificial neurons, termed units, which are arranged in a sequence of layers.

Artificial Neural Network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the programmer.


Hidden Layer:

The hidden layer presents in-between input and output layers. It performs all the calculations to
find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which finally results
in output that is conveyed using this layer.

The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.

The weighted total is then passed as an input to an activation function to produce the output. Activation functions decide whether a node should fire or not. Only the nodes that fire make it to the output layer. There are distinctive activation functions available that can be applied depending on the sort of task we are performing.

Advantages of Artificial Neural Network (ANN)


 Parallel processing capability:
Artificial neural networks can perform more than one task simultaneously.

 Storing data on the entire network:

Unlike traditional programming, where data is stored in a database, in an ANN the information is distributed across the whole network. The disappearance of a couple of pieces of data in one place doesn't prevent the network from working.

 Capability to work with incomplete knowledge:

After ANN training, the network may produce output even with inadequate data. The loss of performance here relies upon the significance of the missing data.

 Having a memory distribution:

For an ANN to be able to adapt, it is important to choose representative examples and to train the network towards the desired output by showing it these examples. The success of the network is directly proportional to the chosen instances; if the event cannot be shown to the network in all its aspects, it can produce false output.

Disadvantages of Artificial Neural Network:

 Assurance of proper network structure:



There is no particular guideline for determining the structure of artificial neural


networks. The appropriate network structure is accomplished through experience, trial,
and error.

 Unrecognized behavior of the network:


It is the most significant issue of ANN. When ANN produces a testing solution, it does not
provide insight concerning why and how. It decreases trust in the network.

 Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per their structure. Therefore, the realization of the network depends on the availability of suitable hardware.

 Difficulty of showing the issue to the network:


ANNs can work with numerical data. Problems must be converted into numerical values
before being introduced to ANN. The presentation mechanism to be resolved here will
directly impact the performance of the network. It relies on the user's abilities.

 The duration of the network is unknown:

Training is stopped when the network error is reduced to a specific value, and this value does not guarantee optimum results.

How do artificial neural networks work?

 Artificial Neural Network can be best represented as a weighted directed graph, where
the artificial neurons form the nodes.
 The association between the neurons outputs and neuron inputs can be viewed as the
directed edges with weights.
 The Artificial Neural Network receives the input signal from the external source in the
form of a pattern and image in the form of a vector.

 These inputs are then mathematically assigned by the notations x(n) for every n number
of inputs.
 Afterward, each of the input is multiplied by its corresponding weights ( these weights
are the details utilized by the artificial neural networks to solve a specific problem ).


 In general terms, these weights normally represent the strength of the


interconnection between neurons inside the artificial neural network. All the weighted inputs are summed inside the computing unit.
 If the weighted sum is equal to zero, then a bias is added to make the output non-zero, or otherwise to scale up the system's response. The bias behaves like an extra input fixed at 1 with its own weight.
 Here the total of weighted inputs can be in the range of 0 to positive infinity. Here, to keep
the response in the limits of the desired value, a certain maximum value is benchmarked,
and the total of weighted inputs is passed through the activation function.
 The activation function refers to the set of transfer functions used to achieve the
desired output.
 There are different kinds of activation functions, but they are primarily either linear or non-linear functions.
 Some of the commonly used sets of activation functions are the Binary, linear, and
Tan hyperbolic sigmoidal activation functions.

TOPIC 24.
PERCEPTRON, MULTILAYER PERCEPTRON, ACTIVATION FUNCTIONS

24.1. PERCEPTRON
A Perceptron is an Artificial Neuron. It is the simplest possible Neural Network. Neural Networks are the building blocks of Machine Learning. The Perceptron is a type of artificial neural network, which is a fundamental concept in machine learning. The basic components of a perceptron are:

 Input Layer: The input layer consists of one or more input neurons, which receive input
signals from the external world or from other layers of the neural network.
 Weights: Each input neuron is associated with a weight, which represents the strength of the
connection between the input neuron and the output neuron.
 Bias: A bias term is added to the input layer to provide the perceptron with additional
flexibility in modeling complex patterns in the input data.
 Activation Function: The activation function determines the output of the perceptron based
on the weighted sum of the inputs and the bias term. Common activation functions used in
perceptrons include the step function, sigmoid function, and ReLU function.
 Output: The output of the perceptron is a single binary value, either 0 or 1, which indicates
the class or category to which the input data belongs.


 Training Algorithm: The perceptron is typically trained using a supervised learning


algorithm such as the perceptron learning algorithm or backpropagation. During training, the
weights and biases of the perceptron are adjusted to minimize the error between the
predicted output and the true output for a given set of training examples.
 Overall, the perceptron is a simple yet powerful algorithm that can be used to perform binary
classification tasks and has paved the way for more complex neural networks used in deep
learning today.

Perceptron
 The original Perceptron was designed to take a number of binary inputs, and produce one
binary output (0 or 1).

 The idea was to use different weights to represent the importance of each input, and that
the sum of the values should be greater than a threshold value before making a decision
like yes or no (true or false) (0 or 1).

How does Perceptron work


 In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and Bias,
net sum, and an activation function.
 The perceptron model begins with the multiplication of all input values and their
weights, then adds these values together to create the weighted sum.
 Then this weighted sum is applied to the activation function 'f' to obtain the desired
output. This activation function is also known as the step function and is represented by 'f'.

∑wi*xi = x1*w1 + x2*w2 + … + xn*wn


Add a special term called bias 'b' to this weighted sum to improve the model's performance.

∑wi*xi + b


Perceptron Example
 Imagine a perceptron (in your brain). The perceptron tries to decide if you should go to a
concert. Is the artist good? Is the weather good? What weights should these facts have?

Criteria            Input          Weight

Artist is Good      x1 = 0 or 1    w1 = 0.7
Weather is Good     x2 = 0 or 1    w2 = 0.6
Friend will Come    x3 = 0 or 1    w3 = 0.5
Food is Served      x4 = 0 or 1    w4 = 0.3
Snacks are Served   x5 = 0 or 1    w5 = 0.4

The Perceptron Algorithm


Frank Rosenblatt suggested this algorithm:
 Set a threshold value
 Multiply all inputs with its weights
 Sum all the results
 Activate the output

1. Set a threshold value:


Threshold = 1.5

2. Multiply all inputs with its weights:


x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4

3. Sum all the results:


0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)

4. Activate the Output:


Return true if the sum > 1.5 ("Yes I will go to the Concert")
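
The four steps above can be written directly in code; the minimal sketch below reproduces the concert example with the same inputs, weights and threshold.

```python
# The Rosenblatt perceptron rule applied to the concert example above.
inputs = [1, 0, 1, 0, 1]                  # artist, weather, friend, food, snacks
weights = [0.7, 0.6, 0.5, 0.3, 0.4]
threshold = 1.5

# Steps 2-3: multiply each input by its weight and sum the results.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))   # 1.6

# Step 4: activate the output with a step function.
output = weighted_sum > threshold
print(weighted_sum, output)               # 1.6 True -> "Yes I will go to the Concert"
```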

Types of Perceptron:
Single layer:
 Single layer perceptron can learn only linearly separable patterns.
 This is one of the easiest Artificial neural networks (ANN) types.
 A single-layered perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model.
 The main objective of the single-layer perceptron model is to analyze the linearly
separable objects with binary outcomes.


 In a single layer perceptron model, the algorithm does not use recorded data, so it begins with randomly allocated values for the weight parameters.

 Further, it sums up all the weighted inputs. If the total sum of all inputs is more than a pre-determined threshold value, the model gets activated and shows the output value as +1.

 If the outcome is same as pre-determined or threshold value, then the performance of this
model is stated as satisfied, and weight demand does not change.

 However, this model consists of a few discrepancies triggered when multiple weight
inputs values are fed into the model. Hence, to find desired output and minimize errors,
some changes should be necessary for the weights input.

Multilayer: Multilayer perceptrons have two or more layers and hence a greater processing power.

 Like a single-layer perceptron model, a multi-layer perceptron model also has the same
model structure but has a greater number of hidden layers.

24.2. Multi-layer perceptron

 The multi-layer perceptron model is also known as the Backpropagation algorithm,


which executes in two stages as follows:

o Forward Stage: Activation functions start from the input layer in the forward stage
and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per
the model's requirement. In this stage, the error between actual output and demanded
originated backward on the output layer and ended on the input layer.

Hence, a multi-layer perceptron model can be considered as multiple artificial neural networks having various layers, in which the activation function does not remain linear, unlike a single-layer perceptron model. Instead of a linear function, activation functions such as sigmoid, TanH, ReLU, etc., can be used for deployment.

A multi-layer perceptron model has greater processing power and can process linear and non-
linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT,
XNOR, NOR
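
As an illustration of the extra power hidden layers provide, the sketch below computes XOR (which a single-layer perceptron cannot represent) with a two-layer perceptron whose weights are hand-picked purely for illustration and whose units use step activations.

```python
# XOR with a hand-crafted multi-layer perceptron (weights chosen for illustration).
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit: fires for OR
    h2 = step(-x1 - x2 + 1.5)       # hidden unit: fires for NAND
    return step(h1 + h2 - 1.5)      # output unit: fires for AND of the two

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))  # prints 0, 1, 1, 0 - the XOR truth table
```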

Advantages of Multi-Layer Perceptron:

 A multi-layered perceptron model can be used to solve complex non-linear problems.


 It works well with both small and large input data.
 It helps us to obtain quick predictions after the training.
 It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

 In Multi-layer perceptron, computations are difficult and time-consuming.


 In multi-layer Perceptron, it is difficult to predict how much the dependent variable affects
each independent variable.
 The model functioning depends on the quality of the training.

Characteristics of Perceptron
The perceptron model has the following characteristics.
 Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
 In Perceptron, the weight coefficient is automatically learned.
 Initially, weights are multiplied with input features, and the decision is made whether the
neuron is fired or not.
 The activation function applies a step rule to check whether the weight function is greater
than zero.
 The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
 If the added sum of all input values is more than the threshold value, it must have an output
signal; otherwise, no output will be shown.

Limitations of Perceptron Model


A perceptron model has limitations as follows:

 The output of a perceptron can only be a binary number (0 or 1) due to the hard limit
transfer function.
 Perceptron can only be used to classify the linearly separable sets of input vectors. If input
vectors are non-linear, it is not easy to classify them properly.

24.3. ACTIVATION FUNCTION


 In artificial neural networks, an activation function is one that outputs a smaller value
for tiny inputs and a higher value if its inputs are greater than a threshold.
 An activation function "fires" if the inputs are big enough; otherwise, nothing happens.
 An activation function, then, is a gate that verifies whether an incoming value is higher than a threshold value.

 Neurons in neural networks operate in accordance with weight, bias, and their
corresponding activation functions.
 Based on the mistake, the values of the neurons inside a neural network would be
modified. This process is known as back-propagation.
 Back-propagation is made possible by activation functions since they provide the
gradients and error required to change the biases and weights.


Categories of Activation function


Linear Activation Function: as can be observed, the function is linear. Therefore, no range restriction is applied to the function's output.

Linear Function
Equation: A linear function's equation is y = x, similar to the equation of a straight line.

If all the layers are linear in nature, the final activation function of the last layer is nothing more than a linear function of the input to the first layer, regardless of how many layers we have. Range: -inf to +inf.

Uses: The linear activation function is typically applied only at the output layer.

If we differentiate a linear function, the result no longer depends on the input "x"; the gradient becomes a constant, so our algorithm won't exhibit any novel behaviour during training.

A good example of a regression problem is determining the cost of a house. We can use linear
activation at the output layer since the price of a house may have any huge or little value. The
neural network's hidden layers must perform some sort of non-linear function even in this
circumstance.

Non-linear Activation Functions: these allow the network to adapt to input data of any complexity, rather than being limited to simple linear relationships.

Sigmoid Function
It is a function that is graphed in an "S" shape. The sigmoid function is used when the model is predicting a probability.

A = 1 / (1 + e^(-x)).

A sigmoid function is a mathematical function with a characteristic "S"-shaped curve or sigmoid


curve. It transforms any value in the domain (−∞,∞) to a number between 0 and 1.

Application
The sigmoid function's ability to transform any real number to one between 0 and 1 is
advantageous in data science and many other fields such as:
• In deep learning, as a non-linear activation function within neurons in artificial neural networks, to allow the network to learn non-linear relationships in the data
• In binary classification, also called logistic regression, the sigmoid function is used to
predict the probability of a binary variable

Issues with the sigmoid function


Although the sigmoid function is prevalent in the context of gradient descent, the gradient of the
sigmoid function is in some cases problematic. The gradient vanishes to zero for very low and
very high input values, making it hard for some models to improve.

For example, during backpropagation in deep learning, the gradient of a sigmoid activation
function is used to update the weights & biases of a neural network. If these gradients are
tiny, the updates to the weights & biases are tiny and the network will not learn.
Alternatively, other non-linear functions such as the Rectified Linear Unit (ReLu) are used, which
do not show these flaws.

Mathematical function
We typically denote the sigmoid function by the Greek letter σ (sigma) and define it as

σ(x) = 1 / (1 + e^(-x))

where
x is the input to the sigmoid function
e is Euler's number (e ≈ 2.718)


Tanh Function
• The activation that often outperforms the sigmoid function is the tangent hyperbolic (tanh) function. It is actually a sigmoid function that has been mathematically adjusted (shifted and scaled), and the two can be derived from one another.

Range of values: -1 to +1; non-linear in nature.

Uses: Since its values range from -1 to 1, the mean of a hidden layer's outputs will be 0 or very close to it. This helps to centre the data by bringing the mean close to 0, which greatly facilitates learning for the following layer.

Equation: A(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

ReLU (Rectified Linear Unit) Activation Function


The ReLU (rectified linear unit) function, A(x) = max(0, x), outputs the input directly if it is positive and 0 if it is negative. Currently, ReLU is the activation function that is employed the most globally, since practically all convolutional neural networks and deep learning systems employ it. The derivative and the function are both monotonic.

However, the problem is that all negative values instantly become zero, which reduces the model's capacity to effectively fit or learn from the data. This means that any negative input to a
ReLU activation function immediately becomes zero in the graph, which has an impact on the
final graph by improperly mapping the negative values.

Hyperbolic Tangent Function


It is bipolar in nature. It is a widely adopted activation function for a special type of neural network known as a Backpropagation Network. The hyperbolic tangent function is of the form

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Tanh function is very similar to the sigmoid/logistic activation function, and even has the same
S-shape with the difference in output range of -1 to 1. In Tanh, the larger the input (more
positive), the closer the output value will be to 1.0, whereas the smaller the input (more
negative), the closer the output will be to -1.0.
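
The activation functions discussed above can be summarised in a few lines of NumPy; this is only a sketch of the standard formulas.

```python
# Common activation functions (standard formulas, sketched with NumPy).
import numpy as np

def linear(x):
    return x                               # y = x, range (-inf, +inf)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                      # squashes values into (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)              # 0 for negatives, identity for positives

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```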

TOPIC 25.
NETWORK TRAINING, GRADIENT DESCENT OPTIMIZATION

25.1. NETWORK TRAINING


• In the training phase, the correct class for each record is known (this is termed supervised training),
and the output nodes can therefore be assigned "correct" values -- "1" for the node corresponding to
the correct class, and "0" for the others. (In practice it has been found better to use values of 0.9 and
0.1, respectively.)

• It is thus possible to compare the network's calculated values for the output nodes to these "correct"
values, and calculate an error term for each node (the "Delta" rule).

• These error terms are then used to adjust the weights in the hidden layers so that, hopefully, the next
time around the output values will be closer to the "correct" values.

The Iterative Learning Process

• A key feature of neural networks is an iterative learning process in which data cases
(rows) are presented to the network one at a time, and the weights associated with the input
values are adjusted each time.
• After all cases are presented, the process often starts over again. During this learning phase,
the network learns by adjusting the weights so as to be able to predict the correct class
label of input samples. Neural network learning is also referred to as "connectionist
learning," due to connections between the units.
• Advantages of neural networks include their high tolerance to noisy data, as well as their
ability to classify patterns on which they have not been trained. The most popular neural
network algorithm is back-propagation algorithm proposed in the 1980's.
• Once a network has been structured for a particular application, that network is ready
to be trained. To start this process, the initial weights are chosen randomly. Then the
training, or learning, begins.
• The network processes the records in the training data one at a time, using the weights
and functions in the hidden layers, then compares the resulting outputs against the
desired outputs.
• Errors are then propagated back through the system, causing the system to adjust the
weights for application to the next record to be processed.
• This process occurs over and over as the weights are continually tweaked. During the
training of a network the same set of data is processed many times as the connection weights
are continually refined.

Preparation to get an ANN for image classification training:


• Decide on the number of output classes (meaning the number of image classes – for
example two for cat vs dog)
• Draw as many computation units as the number of output classes (congrats, you have just created the Output Layer of the ANN)
• Add as many Hidden Layers as needed within the defined architecture (for instance vgg16 or any other popular architecture). Tip – Hidden Layers are just sets of neighbouring Compute Units; the units within a layer are not linked to each other.
• Stack those Hidden Layers to the Output Layer using Neural Connections
• It is important to understand that the Input Layer is basically a layer of data ingestion
• Add an Input Layer that is adapted to ingest your data (or you will adapt your data format
to the pre-defined architecture)
• Assemble many Artificial Neurons together in such a way that the output (axon) of a Neuron on a given Layer is one of the inputs of another Neuron on a subsequent Layer. As a consequence, the Input Layer is linked to the Hidden Layers, which are then linked to the Output Layer

• Backpropagation is a training algorithm used for training feedforward neural networks.


• It plays an important part in improving the predictions made by neural networks. This is
because backpropagation is able to improve the output of the neural network iteratively
• In a feedforward neural network, the input moves forward from the input layer to the
output layer. Backpropagation helps improve the neural network’s output. It does this by
propagating the error backward from the output layer to the input layer.

Feed Forward Networks


• A feedforward network consists of an input layer, one or more hidden layers, and an output
layer. The input layer receives the input into the neural network, and each input has a weight
attached to it.
• The weights associated with each input are numerical values. These weights are an indicator
of the importance of the input in predicting the final output. For example, an input associated
with a large weight will have a greater influence on the output than an input associated with
a small weight.
• When a neural network is first trained, it is first fed with input. Since the neural network isn’t
trained yet, we don’t know which weights to use for each input. And so, each input is
randomly assigned a weight. Since the weights are randomly assigned, the neural network
will likely make the wrong predictions. It will give out the incorrect output.


• When the neural network gives out the incorrect output, this leads to an output error. This
error is the difference between the actual and predicted outputs. A cost function
measures this error.
• The cost function (J) indicates how accurately the model performs. It tells us how far-off
our predicted output values are from our actual values.
• It is also known as the error. Because the cost function quantifies the error, we aim to
minimize the cost function.
• What we want is to reduce the output error. Since the weights affect the error, we will need
to readjust the weights. We have to adjust the weights such that we have a combination
of weights that minimizes the cost function.

• Essentially, backpropagation aims to calculate the negative gradient of the cost


function. This negative gradient is what helps in adjusting of the weights. It gives us an
idea of how we need to change the weights so that we can reduce the cost function.
• Backpropagation uses the chain rule to calculate the gradient of the cost function. The chain
rule involves taking the derivative. This involves calculating the partial derivative of each
parameter. These derivatives are calculated by differentiating with respect to one weight while treating the other(s) as constants. As a result of doing this, we will have a gradient.
• Since we have calculated the gradients, we will be able to adjust the weights

Processing of ANN depends upon the following three building blocks −


Network Topology
Adjustments of Weights or Learning
Activation Functions

Network Topology
A network topology is the arrangement of a network along with its nodes and connecting lines. According
to the topology, ANN can be classified as the following kinds −
Feedforward Network
• It is a non-recurrent network having processing units/nodes in layers and all the nodes in a layer
are connected with the nodes of the previous layers.


• The connection has different weights upon them. There is no feedback loop means the signal can
only flow in one direction, from input to output.

Single layer feedforward network − The concept is of feedforward ANN having only one weighted
layer. In other words, we can say the input layer is fully connected to the output layer.

Multilayer feedforward network − The concept is of feedforward ANN having more than one weighted
layer. As this network has one or more layers between the input and the output layer, it is called hidden
layers.

Feedback Network
As the name suggests, a feedback network has feedback paths, which means the signal can flow in both
directions using loops. This makes it a non-linear dynamic system, which changes continuously until it
reaches a state of equilibrium. It may be divided into the following types −

Recurrent networks − They are feedback networks with closed loops. Following are the two types of
recurrent networks.
1. Fully recurrent network − It is the simplest neural network architecture because all nodes are
connected to all other nodes and each node works as both input and output.

Jordan network − It is a closed-loop network in which the output goes back to the input again as feedback.

Adjustments of Weights or Learning


Learning, in artificial neural network, is the method of modifying the weights of connections between the
neurons of a specified network. Learning in ANN can be classified into three categories namely supervised
learning, unsupervised learning, and reinforcement learning.
Supervised Learning
As the name suggests, this type of learning is done under the supervision of a teacher. This learning process is dependent on labelled training data that provides the desired output for each input.

Unsupervised Learning
• As the name suggests, this type of learning is done without the supervision of a teacher. This
learning process is independent.
• During the training of ANN under unsupervised learning, the input vectors of similar type are
combined to form clusters. When a new input pattern is applied, then the neural network gives an
output response indicating the class to which the input pattern belongs.
• There is no feedback from the environment as to what should be the desired output and if it is
correct or incorrect

Reinforcement Learning
• During the training of network under reinforcement learning, the network receives some
feedback from the environment


25.2. GRADIENT DESCENT OPTIMIZATION

Feed forward network -Gradient Descent


• The weights are adjusted using a process called gradient descent.
• Gradient descent is an optimization algorithm that is used to find the weights that
minimize the cost function.
• Minimizing the cost function means getting to the minimum point of the cost function. So,
gradient descent aims to find a weight corresponding to the cost function’s minimum point.
• To find this weight, we must navigate down the cost function until we find its minimum
point
• To navigate the cost function, we need two things: the direction in which to navigate and
the size of the steps for navigating.
• The Direction
The direction for navigating the cost function is found using the gradient.
• The Gradient
To know in which direction to navigate, gradient descent uses backpropagation. More
specifically, it uses the gradients calculated through backpropagation. These gradients are
used for determining the direction to navigate to find the minimum point. Specifically, we aim
to find the negative gradient. This is because a negative gradient indicates a decreasing slope.
A decreasing slope means that moving downward will lead us to the minimum point. For
example:

• The Step Size


The step size for navigating the cost function is determined using the learning rate.
• Learning Rate
The learning rate is a tuning parameter that determines the step size at each iteration of
gradient descent. It determines the speed at which we move down the slope.
• The step size plays an important part in ensuring a balance between optimization time and
accuracy. The step size is measured by a parameter alpha (α). A small α means a small step
size, and a large α means a large step size. If the step sizes are too large, we could miss the
minimum point completely. This can yield inaccurate results. If the step size is too small, the
optimization process could take too much time. This will lead to a waste of computational
power.

• The step size is evaluated and updated according to the behavior of the cost function.
• The higher the gradient of the cost function, the steeper the slope and the faster a
model can learn (high learning rate).
• A high learning rate results in a higher step value, and a lower learning rate results in a
lower step value.
• If the gradient of the cost function is zero, the model stops learning.
Descending the Cost Function
• Navigating the cost function consists of adjusting the weights. The weights are adjusted using the following formula:

new weight = old weight − (learning rate × gradient)

To obtain the new weight, we use the gradient, the learning rate, and an initial weight.

Adjusting the weights consists of multiple iterations. We take a new step down for each iteration
and calculate a new weight. Using the initial weight and the gradient and learning rate, we can
determine the subsequent weights.
Let’s consider a graphical example of this:

From the graph of the cost function, we can see that:


• To start descending the cost function, we first initialize a random weight.
• Then, we take a step down and obtain a new weight using the gradient and learning
rate. With the gradient, we can know which direction to navigate. We can know the step
size for navigating the cost function using the learning rate.
• We are then able to obtain a new weight using the gradient descent formula.
• We repeat this process until we reach the minimum point of the cost function.
• Once we’ve reached the minimum point, we find the weights that correspond to the
minimum of the cost function.

Steps in the gradient descent algorithm:


1. Randomly initialize the model parameters, θ⁰
2. Compute the gradient of the loss function at the initial parameters θ⁰: ∇ℒ(θ⁰)
3. Update the parameters as: θ_new = θ⁰ − α∇ℒ(θ⁰), where α is the learning rate
4. Go to step 2 and repeat

Gradient descent algorithm stops when a local minimum of the loss surface is reached
GD does not guarantee reaching a global minimum
However, empirical evidence suggests that GD works well for NNs
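
The minimal sketch below follows these steps for a simple one-parameter loss ℒ(θ) = (θ − 3)², whose minimum is at θ = 3; the loss function and learning rate are illustrative choices.

```python
# Gradient descent on a toy loss L(theta) = (theta - 3)**2 (illustrative example).
def grad(theta):
    return 2.0 * (theta - 3.0)       # dL/dtheta

theta = 0.0                          # step 1: arbitrary initialization
alpha = 0.1                          # learning rate

for _ in range(100):                 # steps 2-4 repeated
    theta = theta - alpha * grad(theta)

print(theta)                         # converges towards 3.0
```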

Problems with Gradient Descent


Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get
stuck at saddle points

TOPIC 26.

STOCHASTIC GRADIENT DESCENT, ERROR BACKPROPAGATION, FROM SHALLOW NETWORKS TO


DEEP NETWORKS.

Gradient, in plain terms means slope or slant of a surface. So gradient descent literally means
descending a slope to reach the lowest point on that surface


Consider the graph of a parabola whose lowest point occurs at x = 1. The objective of the gradient descent algorithm is to find the value of "x" such that "y" is minimum. "y" here is termed the objective function that the gradient descent algorithm operates upon, to descend to the lowest point.

Gradient descent is an iterative algorithm, that starts from a random point on a function and
travels down its slope in steps until it reaches the lowest point of that function.”

The steps of the algorithm are

1. Find the slope of the objective function with respect to each parameter/feature. In other words, compute the gradient of the function. (To clarify, in the parabola example, differentiate "y" with respect to "x". If we had more features like x1, x2, etc., we would take the partial derivative of "y" with respect to each of the features.)
2. Pick a random initial value for the parameters.
3. Update the gradient function by plugging in the parameter values.
4. Calculate the step size for each feature as: step size = gradient * learning rate.
5. Calculate the new parameters as: new params = old params − step size
6. Repeat steps 3 to 5 until the gradient is almost 0.

"Stochastic", in plain terms, means "random". SGD randomly picks one data point from the whole data set at each iteration to reduce the computations enormously.

It is also common to sample a small number of data points instead of just one point at each step
and that is called “mini-batch” gradient descent. Mini-batch tries to strike a balance between the
goodness of gradient descent and speed of SGD.
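
A minimal mini-batch SGD sketch for a one-feature linear model y = w*x + b is shown below; the synthetic data, batch size and learning rate are illustrative choices.

```python
# Mini-batch stochastic gradient descent for y = w*x + b (synthetic, illustrative data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=200)   # true w = 2, b = 1

w, b = 0.0, 0.0
alpha, batch_size = 0.02, 16

for epoch in range(200):
    idx = rng.permutation(len(X))                  # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]      # a small random subset of points
        err = (w * X[batch] + b) - y[batch]
        w -= alpha * np.mean(err * X[batch])       # gradient of the MSE w.r.t. w
        b -= alpha * np.mean(err)                  # gradient of the MSE w.r.t. b

print(w, b)                                        # approaches the true values 2 and 1
```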

ERROR BACK PROPAGATION


• Backpropagation is one of the important concepts of a neural network. The task is to classify the data as well as possible, and for this the weights and biases must be updated. Just as gradient descent is used to optimize the parameters of a linear regression model, in a neural network the gradient descent algorithm is applied using Backpropagation.

• For a single training example, Backpropagation algorithm calculates the gradient of the error
function. Backpropagation can be written as a function of the neural network.
Backpropagation algorithms are a set of methods used to efficiently train artificial neural
networks following a gradient descent approach which exploits the chain rule.

• The main features of Backpropagation are the iterative, recursive and efficient method through which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time (see the sketch below).
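
The chain-rule computation described above can be sketched for a tiny one-hidden-layer network with sigmoid activations trained on XOR; the layer sizes, learning rate and iteration count are illustrative, and convergence may vary with the random initialization.

```python
# A minimal error-backpropagation sketch for a 2-4-1 network (illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])              # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
alpha = 0.5

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error from the output layer towards the input layer
    d_out = (out - y) * out * (1 - out)             # squared-error gradient * sigmoid'
    d_h = (d_out @ W2.T) * h * (1 - h)              # chain rule through the hidden layer
    # Gradient-descent updates of weights and biases
    W2 -= alpha * h.T @ d_out
    b2 -= alpha * d_out.sum(axis=0, keepdims=True)
    W1 -= alpha * X.T @ d_h
    b1 -= alpha * d_h.sum(axis=0, keepdims=True)

print(out.round(2))                                 # should move towards the XOR targets
```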


A shallow neural network has only one hidden layer between the input and output layers, while
a deep neural network has multiple hidden layers. Deeper networks perform better than shallow
networks. But only up to some limit: after a certain number of layers, the performance of deeper
networks plateaus

TOPIC 27.
UNIT SATURATION (AKA THE VANISHING GRADIENT PROBLEM) – RELU, HYPERPARAMETER
TUNING, BATCH NORMALIZATION, REGULARIZATION, DROPOUT

In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients). They result in very small or very large updates of the parameters.
Solutions: change the learning rate, use ReLU activations, regularization, or LSTM units in RNNs.
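
A quick numerical sketch of why unit saturation causes vanishing gradients: the sigmoid derivative is at most 0.25, so multiplying it across many layers shrinks the gradient towards zero, whereas the ReLU derivative is exactly 1 for positive inputs.

```python
# Illustrative comparison of gradient shrinkage through many layers.
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # maximum value 0.25 at z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0    # 1 for any positive input

layers = 20
print(sigmoid_grad(0.0) ** layers)  # 0.25**20 ~ 9.1e-13 -> vanishing gradient
print(relu_grad(1.0) ** layers)     # 1.0**20 = 1.0 -> gradient preserved
```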

REGULARIZATION

Weight decay (L2 regularization) adds a penalty term λ‖w‖² to the loss function.
Effect of the decay coefficient λ: a large weight decay coefficient means a large penalty for weights with large values, which keeps the weights small and helps reduce overfitting.
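
A minimal sketch of how the decay coefficient λ enters a gradient-descent update under standard L2 weight decay; the weight values, gradient and constants are illustrative.

```python
# L2 regularization (weight decay) added to a gradient-descent update (illustrative).
import numpy as np

w = np.array([3.0, -2.0, 0.5])             # current weights
grad_loss = np.array([0.1, -0.2, 0.05])    # gradient of the data loss
alpha, lam = 0.1, 0.01                     # learning rate and decay coefficient

# Regularized loss: L(w) + lam * ||w||^2, so its gradient gains a 2*lam*w term.
w_new = w - alpha * (grad_loss + 2 * lam * w)
print(w_new)                               # large weights are pulled towards zero
```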


HYPERPARAMETER TUNING


ILLUSTRATION OF 5-FOLD CROSS VALIDATION
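
As a concrete sketch of the idea behind the 5-fold cross-validation illustration, the code below tunes one hyperparameter (the number of neighbours K of a K-NN classifier) with scikit-learn's GridSearchCV; the dataset and candidate values are illustrative choices.

```python
# Hypothetical hyperparameter tuning with 5-fold cross-validation (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}                    # candidate values of K
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)  # 5-fold CV per candidate
search.fit(X, y)

print(search.best_params_)       # the K with the best mean validation accuracy
print(search.best_score_)        # its cross-validated score
```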
