AI - Spring2025 Final
Welcome to our course on Artificial Intelligence (AI)! This course is designed to explore the
core strategies and foundational concepts that power intelligent systems, solving complex,
real-world problems across various domains. From navigating routes to optimizing industrial
processes, and from playing strategic games to making predictive models, the use of AI is
pivotal in creating efficient and effective solutions.
These lecture notes have been prepared by Ipshita Bonhi Upoma (Lecturer, BRAC Uni-
versity) based on the book "Artificial Intelligence: A Modern Approach" by Peter Norvig
and Stuart Russell. The credit for the initial typesetting of these notes goes to Arifa Alam.
The purpose of these notes is to provide additional examples using toy problems to enhance
the understanding of how basic AI algorithms work and the types of problems they can solve.
This is a work in progress, and as it is the first time preparing such notes, many parts
need citations and rewriting. Several graphs, algorithms and images are directly borrowed
from Russell and Norvig’s book. The next draft will address these citations.
As this is the first draft, please inform us of any typos, mathematical errors, or other tech-
nical issues so they can be corrected in future versions.
Our Reference book (Artificial Intelligence: A Modern Approach 3rd Ed.): https://drive.
google.com/file/d/16pK--SRAKZzkVs8NcxxYCDFfxtvE0pmZ/view?usp=sharing
Links to all resources and forms, and deadlines for sections 13, 17, and 18 will be updated in this Google
sheet: https://docs.google.com/spreadsheets/d/10GerKB3PEfhvhGsyHHSCvIOJmBkGkY4uCtn0HPYJLHk/edit?usp=sharing
Marks Distribution
• Class Task 5%
• Lab 25%
• Mid 25%
• Final 30%
1. Classwork on the lecture will be given after each lecture, and attendance will be counted
based on the classwork.
2. Feel free to bring coffee or light snacks for yourself, but make sure not to cause any
distractions in the class.
3. If you are not enjoying the lecture, you are free to leave, but in no way should you do
anything that disturbs me or the other students.
7. Cheating in any form will not be tolerated and will result in a 100% penalty.
8. If bonus assignments are given, the bonus marks will be added after the completion of
all other assessments.
11. No grace marks will be given for any grade bump. Such requests will not be considered.
Note To Students 3
Contents 5
I Classical AI 11
1 Introduction: Past, present, future 13
1.1 Some of the earliest problems that were solved using Artificial Intelligence. . 13
1.2 Problems we are trying to solve these days . . . . . . . . . . . . . . . . . . . 15
2 Solving Problems with AI 17
2.1 Solving problems with artificial intelligence . . . . . . . . . . . . . . . . . . . 17
2.1.1 Searching Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Constraint Satisfaction Problems (CSP) . . . . . . . . . . . . . . . . 18
2.1.3 Machine Learning Techniques . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Some keywords you will hear every now and then . . . . . . . . . . . . . . . 19
2.3 Properties of Task Environment . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Types of Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.1 Learning Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Problem Formulation and choice of strategies . . . . . . . . . . . . . . . . . . 23
2.6 Steps of Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Examples of problem formulation . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Informed Search Algorithms 29
3.1 Heuristic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 Key Characteristics of Heuristics in Search Algorithms . . . . . . . . 30
3.1.2 Why do we use heuristics? . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Heuristic functions to solve different problems . . . . . . . . . . . . . . . . . 31
3.3 Greedy Best First Search- Finally an algorithm . . . . . . . . . . . . . . . . 35
3.3.1 Algorithm: Greedy Best-First Search . . . . . . . . . . . . . . . . . . 37
3.3.2 Key Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 A* Search algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 Algorithm: A* Search . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Condition on heuristics for A* to be optimal: . . . . . . . . . . . . . . . . . . 42
3.6 How to choose a better Heuristic: . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6.1 Why is a dominant heuristic more efficient? . . . . . . . . . . . . . . . . 49
4 Local Search 51
4.1 Local Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 The Need for Local Search . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1.2 Examples of problems that can be solved by local search: . . . . . . . 52
4.1.3 Some Constraint Satisfaction Problems can also be solved using local
search strategies: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Local Search Algorithms: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 State-Space and Objective Function: . . . . . . . . . . . . . . . . . . 54
4.3 Hill-Climbing Search: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Examples: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.2 Key Characteristic: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.3 Drawbacks: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.4 Remedies to problems of Hill-Climbing Search Algorithm: . . . . . . 60
4.3.5 Variants of Hill Climbing: . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Introduction to Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . 61
4.5 How does simulated annealing navigate the solution space: . . . . . . . . . . 63
4.5.1 Probability of Accepting a New State . . . . . . . . . . . . . . . . . . 63
4.5.2 Cooling Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.3 Random Selection of Neighbors . . . . . . . . . . . . . . . . . . . . . 64
4.5.4 Mathematical Convergence . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6 Example Problem: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6.1 Traveling Salesman Problem (TSP) . . . . . . . . . . . . . . . . . . . 64
4.7 Genetic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.7.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7.2 Explanation of the Pseudocode . . . . . . . . . . . . . . . . . . . . . 66
4.7.3 Evaluate Fitness: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.8 Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.8.1 Purpose of Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.8.2 How Mutation Works . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.8.3 Common Types of Mutation . . . . . . . . . . . . . . . . . . . . . . . 69
4.8.4 Mutation Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8.5 Diversity in Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . 70
4.8.6 Advantages and Applications . . . . . . . . . . . . . . . . . . . . . . 70
4.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.9.1 8-Queen Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.9.2 Traveling Salesman Problem (TSP) . . . . . . . . . . . . . . . . . . . 75
4.9.3 0/1 Knapsack Problem . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.9.4 Graph Coloring Problem: . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.9.5 Max-cut Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.10 Application of Genetic Algorithm in Machine Learning . . . . . . . . . . . . 85
4.10.1 Problem Setup: Feature Selection for Predictive Modeling . . . . . . 85
4.10.2 Genetic Algorithm Process for Feature Selection . . . . . . . . . . . . 85
4.10.3 Problem Setup: Hyperparameter Optimization for a Neural Network 86
4.10.4 Genetic Algorithm in AI Games: . . . . . . . . . . . . . . . . . . . . 88
4.10.5 Genetic algorithms in Finance . . . . . . . . . . . . . . . . . . . . . . 89
II Modern AI 103
1 Machine Learning Basics 105
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
1.2 Why do we need machine learning . . . . . . . . . . . . . . . . . . . . . . . . 105
1.3 Paradigms of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 106
1.3.1 Major Paradigms of Machine Learning . . . . . . . . . . . . . . . . . 107
1.4 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
1.5 Steps of Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
1.6 Types of Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 109
1.7 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
1.8 Hypothesis in Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . 110
1.9 Hypothesis Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
2 Probability theory 115
2.1 Probability Theory in AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.2 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.3 Basic Probability Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
2.4 Bayes’ Rule: Derivation and Examples . . . . . . . . . . . . . . . . . . . . . 119
2.4.1 Derivation of Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . 119
2.4.2 Significance of Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . 120
2.5 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.5.1 Notation for Discrete Random Variables . . . . . . . . . . . . . . . . 121
2.5.2 Properties of Discrete Random Variables . . . . . . . . . . . . . . . . 122
2.5.3 Basic Vector Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.5.4 Discrete Random Variable with Categorical Outcomes . . . . . . . . . 123
2.6 Data to Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.6.1 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.6.2 Marginal Probability Distribution . . . . . . . . . . . . . . . . . . . . 125
2.6.3 Joint Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
2.6.4 Joint Probability Distribution . . . . . . . . . . . . . . . . . . . . . . 125
2.7 Understanding Conditional Probability using Probability Distribution Table 125
2.7.1 Absolute independence . . . . . . . . . . . . . . . . . . . . . . . . . . 126
2.7.2 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . 126
2.7.3 Example: Checking Conditional Independence . . . . . . . . . . . . . 128
2.8 Law of Total Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Classical AI
CHAPTER 1
INTRODUCTION: PAST, PRESENT,
FUTURE
AI research quickly pushed forward, especially in Bayesian networks and Markov Decision
Processes, which allowed machines to reason under uncertainty. Games also became a testing
ground for AI. In 1951, Turing designed a theoretical chess program, followed soon by Arthur
Samuel’s checkers program, one of the earliest examples of a machine that could learn from
experience—a major step towards machine learning.
AI wasn’t just about games, though. In 1956, The Logic Theorist program, created by Allen
Newell, Herbert Simon, and Cliff Shaw, was designed to mimic human reasoning, even prov-
ing mathematical theorems from Principia Mathematica. This breakthrough demonstrated
AI’s ability to solve complex logical tasks. Meanwhile, machine translation experiments,
like the Georgetown project in 1954, successfully converted Russian sentences into English,
showcasing AI’s potential for language processing.
Even early speech recognition made strides in this decade, with Bell Labs developing Audrey,
a system that recognized spoken numbers. These advancements in the 1950s laid a strong
foundation for AI’s growth.
The 1970s saw AI face skepticism. Reports like the Lighthill Report highlighted AI’s limi-
tations, but this decade also saw graph theory grow in importance, allowing for knowledge
representation in early expert systems.
The 1980s brought significant advances with expert systems that made decisions using rule-
based logic, as well as the backpropagation algorithm, which greatly improved neural net-
works.
In the 1990s, AI reached new heights. IBM’s Deep Blue made history by defeating a world
chess champion, and statistical learning models like Support Vector Machines gained popu-
larity, leading to a data-driven approach in AI.
The 2000s introduced game theory for multi-agent systems, and deep learning brought re-
newed interest in neural networks for complex pattern recognition.
In the 2010s, AI achieved remarkable feats with IBM’s Watson and Google DeepMind’s Al-
phaGo, showcasing AI’s ability to understand language and excel in strategic games, demon-
strating the depth of AI’s problem-solving abilities.
Large Language Models (LLMs) like GPT and BERT are significant developments in natural
language processing. Starting with early neural network foundations in the 2000s, advance-
ments such as Word2Vec laid the groundwork for sophisticated word embeddings. The 2017
introduction of the Transformer architecture revolutionized NLP by efficiently handling long-
range dependencies in text.
In 2018, Google’s BERT and OpenAI’s GPT used the Transformer model to understand
and generate human-like text, with BERT improving context understanding through bidi-
rectional training, and GPT enhancing generative capabilities. Recent iterations like GPT-3
and GPT-4 have scaled up in size and performance, expanding their application range from
content generation to conversational AI.
Today, in the 2020s, AI is focusing on issues like fairness and bias and exploring the po-
tential of quantum computing to revolutionize the field even further.
Healthcare—
Disease Diagnosis: AI algorithms analyze medical imaging data to detect and
diagnose diseases early, such as cancer or neurological disorders.
Transportation—
Autonomous Vehicles: AI powers self-driving cars, aiming to reduce human error
in driving and increase road safety.
Finance—
Fraud Detection: AI systems analyze transaction patterns to identify and prevent
fraudulent activities in real time.
Retail—
Customer Personalization: AI enhances customer experience by providing person-
alized recommendations based on past purchases and browsing behaviors.
Education—
Adaptive Learning Platforms: AI tailors educational content to the learning styles
and pace of individual students, improving engagement and outcomes.
Legal—
Case Prediction: AI analyzes past legal cases to predict outcomes and provide
guidance on legal strategies.
Solving any problem first requires abstraction (problem formulation) so that the problem
can be tackled using an algorithmic solution. Based on that abstraction we choose a
suitable strategy to solve the problem.
Real-world AI challenges are rarely straightforward. They often need to be broken down
into smaller parts, with each part solved using a different strategy. For example, in creating
an autonomous vehicle, informed search may help us find a route to the destination, adversarial
search helps us predict other drivers’ actions, and machine learning helps the vehicle
understand road signs.
Thankfully in this course, we’ll focus on learning each strategy separately. This approach
lets us dive deep into each area without worrying about combining them.
Local Search
Why— Local search methods are crucial for tackling optimization problems where
finding an optimal solution might be too time-consuming. These methods, which in-
clude Simulated Annealing and Hill Climbing, are invaluable for tasks such as resource
allocation and scheduling where a near-optimal solution is often sufficient.
Adversarial search
Why—Adversarial search techniques are essential for environments where agents com-
pete against each other, such as in board games or market competitions. Understanding
strategies like Minimax and Alpha-Beta Pruning allows one to predict and counter op-
ponents’ moves effectively.
Decision Tree
Why—Decision trees are introduced due to their straightforward approach to solving
classification and regression problems. They split data into increasingly precise subsets
using simple decision rules, making them suitable for tasks from financial forecasting
to clinical decision support.
Gradient Descent
Why—The gradient descent algorithm is essential for optimizing machine learning
models, particularly in training deep neural networks. Its ability to minimize error
functions makes it indispensable for developing applications like voice recognition sys-
tems and personalized recommendation engines.
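To make this concrete, here is a minimal sketch of gradient descent minimizing a simple one-dimensional error function f(w) = (w − 3)^2; the learning rate, starting point, and iteration count are illustrative choices, not values from the notes.

    # Minimal gradient descent sketch (illustrative parameters).
    def f(w):                 # error function: minimized at w = 3
        return (w - 3) ** 2

    def df(w):                # derivative of the error function
        return 2 * (w - 3)

    w = 0.0                   # arbitrary starting point
    learning_rate = 0.1       # illustrative step size
    for step in range(100):
        w = w - learning_rate * df(w)   # move against the gradient

    print(w, f(w))            # w approaches 3, f(w) approaches 0

The same update rule, applied to the gradient of a loss function with respect to many weights, is what drives the training of neural networks mentioned above.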
2.2 Some keywords you will hear every now and then
Agent: In AI, an agent is an entity that perceives its environment through sensors
and acts upon that environment using actuators. It operates within a framework of
objectives, using its perceptions to make decisions that influence its actions.
Rational Agent: A rational agent acts to achieve the best outcome or, when there is
uncertainty, the best expected outcome. It is "rational" in the sense that it maximizes
its performance measure, based on its perceived data from the environment and the
knowledge it has.
Task Environment: In AI, the environment refers to everything external to the agent
that it interacts with or perceives to make decisions and achieve its goals. It includes
all factors, conditions, and entities that can influence or be influenced by the agent’s
actions.
Partially Observable: Here, the agent only has partial information about the
environment due to limitations in sensor capabilities or because some information
is inherently hidden. Agents must often infer or guess the missing information to
make decisions. For instance, in a poker game, players cannot see the cards of their
opponents.
Episodic: The agent’s experience is divided into distinct episodes, where the
action in each episode does not affect the next. Each episode consists of the agent
perceiving and then performing a single action. An example is image classification
tasks where each image is processed independently.
Sequential: Actions have long-term consequences, and thus the current choice
affects all future decisions. Agents need to consider the overall potential future
outcomes when deciding their actions. Navigation tasks where an agent (like a robot or
self-driving car) must continuously make decisions based on past movements exemplify
this.
Static: The environment does not change while the agent is deliberating. This
simplicity allows the agent time to make a decision without worrying about the
environment moving on. An example is a Sudoku puzzle, where the grid waits inertly
as the player strategizes.
Dynamic: The environment can change while the agent is considering its ac-
tions. Agents need to adapt quickly and consider the timing of actions. For example,
in automated trading systems where market conditions can change in the midst of
computations.
Discrete: Possible states, actions, and outcomes are limited to a set of dis-
tinct, clearly defined values. For example, a chess game has a finite number of possible
moves and positions.
Cooperative: Agents work together towards a common goal, which may in-
volve communication and shared tasks. For example, a collaborative robotics setting
where multiple robots work together to assemble a product.
Simple Reflex Agents
These agents act solely based on the current perception, ignoring the rest of the
perceptual history. They operate on a condition-action rule, meaning if a condition is
met, an action is taken.
Example: A room light that turns on when it detects motion. It does not re-
member past movements; it only responds if it detects motion currently.
Model-Based Reflex Agents
These agents maintain some sort of internal state that depends on the percept
history, allowing them to make decisions in partially observable environments. They
use a model of the world to decide their actions.
Goal-Based Agents
These agents act to achieve goals. They consider future actions and evaluate
them based on whether they lead to the achievement of a set goal.
Example: A navigation system in a car that plans routes not only based on the
current location but also on the destination the user wants to reach.
Utility-Based Agents:
These agents aim to maximize their own perceived happiness or satisfaction, ex-
pressed as a utility function. They choose actions based on which outcome provides
the greatest benefit according to this utility.
Example: An investment bot that decides to buy or sell stocks based on an al-
gorithm designed to maximize the expected return on investment, weighing various
financial indicators and market conditions.
A self-driving car is a learning agent that adapts and improves its driving decisions based on
accumulated driving data and experiences.
Performance Element: This part of the agent controls the car, making real-time driving
decisions such as steering, accelerating, and braking based on current traffic conditions and
sensor inputs.
Learning Element: It processes the data gathered from various sensors and feedback from
the performance element to improve the decision-making algorithms. For example, it learns
to recognize stop signs better or understand the nuances of merging into heavy traffic.
Critic: It evaluates the driving decisions made by the performance element. For instance, if
a particular maneuver led to a near-miss, the critic would flag this as suboptimal.
Problem Generator: This might simulate challenging driving conditions that are not fre-
quently encountered, such as slippery roads or unexpected obstacles, to prepare the car for
a wider range of scenarios.
Over time, by learning from both successes and failures, a self-driving car improves its ca-
pability to drive safely and efficiently in complex traffic environments, demonstrating how
learning agents adapt and enhance their performance based on experience.
Figure 2.1: Initial and Goal State of an 8-Puzzle Game (Russell and Norvig, Artificial Intel-
ligence: A Modern Approach)
Actions: List out all possible actions that can be taken from any given state.
Example: In an online booking system, actions could include selecting dates, choosing
a room type, and adding guest information.
In a navigation problem, for example, these actions could be the different paths or
turns one can take at an intersection.
For a sorting algorithm, actions might be the comparisons or swaps between elements.
◦ Problem Formulation:
• Goal: To find the quickest route from a starting point (origin) to a destination
(end point) while considering current traffic conditions.
• States: Each state represents a geographic location within the city’s road
network.
• Initial State: The specific starting location of the vehicle.
• Actions: From any given state (location), the actions available are the set of
all possible roads that can be taken next.
• Transition Model: Moving from one location to another via a chosen road
or intersection.
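As a rough illustration of this route-finding formulation, the sketch below encodes states as locations, actions as the outgoing roads, and the transition model as following a chosen road. The road names, locations, and distances are made-up placeholders, not data from the notes.

    # Hypothetical toy road network: states are locations, actions are outgoing roads.
    roads = {
        "Origin":      {"MainSt": ("JunctionA", 4), "RingRoad": ("JunctionB", 6)},
        "JunctionA":   {"Bridge": ("Destination", 7)},
        "JunctionB":   {"Highway": ("Destination", 3)},
        "Destination": {},
    }

    initial_state = "Origin"

    def actions(state):
        # All roads that can be taken from this location.
        return list(roads[state].keys())

    def result(state, action):
        # Transition model: taking a road moves us to the next location.
        next_state, _cost = roads[state][action]
        return next_state

    def goal_test(state):
        return state == "Destination"

    print(actions("Origin"), result("Origin", "MainSt"), goal_test("Destination"))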
◦ Problem Formulation:
• Goal: To optimize the power output while minimizing fuel usage and adhering
to safety regulations.
• States: Each state represents a specific configuration of the power plant’s
operational settings (e.g., temperature, pressure levels, valve positions).
• Initial State: The current operational settings of the plant.
• Actions: Adjustments to the operational settings such as increasing or de-
creasing temperature, adjusting pressure, and changing the mix of fuel used.
• Transition Model: Changing from one set of operational settings to another.
• Goal Test: A set of operational conditions that meet all efficiency, safety,
and regulatory requirements.
• Path Cost: Typically involves costs related to fuel consumption, wear and
tear on equipment, and potential safety risks. The cost function aims to
minimize these while maximizing output efficiency.
◦ Heuristic Used:
• Efficiency Metrics: Estimations of how changes in operational settings will
affect output efficiency and resource usage. This might include predictive
models based on past performance data.
◦ Problem Formulation:
• Goal: To assign time slots and rooms to university classes in a way that no
two classes that share students or instructors overlap, and all other constraints
are satisfied.
• States: Each state represents an assignment of classes to time slots and
rooms.
• Initial State: No courses are assigned to any time slots or rooms.
◦ Problem Formulation:
• Goal: To accurately diagnose diseases based on symptoms, patient history,
and test results.
• States: Each state represents a set of features associated with a patient,
including symptoms presented, medical history, demographic data, and results
from various medical tests.
• Initial State: The initial information gathered about the patient, which
includes all initial symptoms and available medical history.
• Actions: Actions are not typically modeled in decision trees, as they are used
for classification rather than for processes involving sequential decisions.
• Transition Model: Not applicable for decision trees since the process does
not involve moving between states.
• Goal Test: The diagnosis output by the decision tree, determining the most
likely disease or condition based on the input features.
• Path Cost: In decision trees, the cost is not typically measured in terms of
path, but accuracy, specificity, and sensitivity of the diagnosis can be consid-
ered as metrics for evaluating performance.
◦ Features Used:
• Symptoms: Patient-reported symptoms and observable signs.
• Test Results: Quantitative data from blood tests, imaging tests, etc.
Note: This lecture closely follows Chapter 3.6 (Heuristic Functions), 3.5 (Informed Search)
to 3.5.1 (Greedy Best First Search), 3.5.2 to 3.5.4 (A* Search), 3.6.1 (Effect of Heuristic
on accuracy and performance) of Russell and Norvig, Artificial Intelligence: A Modern Ap-
proach. The images are also borrowed from these chapters.
As computer science students, you are already familiar with various search algorithms such as
Breadth-First Search, Depth-First Search, and Dijkstra’s/Best-First Search. These strategies
fall under the category of Uninformed Search or Blind Search, which means they rely solely
on the information provided in the problem definition.
For example, consider a map of Romania where we want to travel from Arad to Bucharest.
The map indicates that Arad is connected to Zerind by 75 miles, Sibiu by 140 miles, and
Timisoara by 118 miles. Using a blind search strategy, the next action from Arad would be
chosen based solely on the distances to these connected cities. This approach can be slower
and less efficient as it may explore paths that are irrelevant to reaching the goal efficiently.
In this course, we will focus on informed search strategies, also known as heuristic search.
Informed Search uses additional information—referred to as heuristics—to make educated
guesses about the most promising direction to pursue in the search space. This approach
often results in faster and more efficient solutions because it avoids wasting time on less likely
paths. We will study Greedy Best-First Search and A* search extensively. But first, let’s
explore the concept of heuristics.
Estimation: A heuristic function estimates the cost to reach the goal from a current
node. This estimate does not need to be exact but should never overestimate.
Returning to the example of traveling to Bucharest from Arad: A heuristic function can
estimate the shortest distance from any city in Romania to the goal. For instance, we might
use the straight-line distance as a measure of the heuristic value for a city. The straight-
line distance from Arad to Bucharest is 366 miles, although the optimal path from Arad to
Bucharest actually spans 418 miles. Therefore, the heuristic value for Arad is 366 miles. For
each node (in this problem, city) in the state space (in this problem, the map of Romania) the
heuristic value will be their straight line distance from the goal state (in this case Bucharest).
Guidance: The heuristic guides the search process, helping the algorithm prioritize
which nodes to explore next based on which seem most promising—i.e., likely to lead
to the goal with the least cost.
Efficiency: By providing a way to estimate the distance to the goal, heuristics can
significantly speed up the search process, as they allow the algorithm to focus on more
promising paths and potentially disregard paths that are unlikely to be efficient.
8-Puzzle Game
The number of misplaced tiles (blank not included): For the figure above, all
eight tiles are out of position, so the start state has h1 = 8.
The sum of the distances of the tiles from their goal positions: Be-
cause tiles cannot move along diagonals, the distance is the sum of the horizontal and
vertical distances, sometimes called the city-block distance or Manhattan distance.
Figure 3.1: Initial and Goal State of an 8-Puzzle Game (Russell and Norvig, Artificial Intel-
ligence: A Modern Approach)
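Both heuristics can be computed directly from a board configuration. Below is a small sketch, assuming the board is stored as a tuple of 9 entries read row by row, with 0 for the blank; the start layout follows the standard Russell and Norvig example and may differ from the exact figure.

    # h1: number of misplaced tiles (blank excluded); h2: Manhattan (city-block) distance.
    def h1(state, goal):
        return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

    def h2(state, goal):
        dist = 0
        for tile in range(1, 9):                 # tiles 1..8, blank ignored
            i, j = state.index(tile), goal.index(tile)
            dist += abs(i // 3 - j // 3) + abs(i % 3 - j % 3)
        return dist

    goal  = (0, 1, 2, 3, 4, 5, 6, 7, 8)          # assumed goal layout
    start = (7, 2, 4, 5, 0, 6, 8, 3, 1)          # assumed start layout
    print(h1(start, goal), h2(start, goal))      # 8 misplaced tiles, Manhattan distance 18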
Travel Time: Estimating the time needed to reach the goal based on average
speeds and road types: t = d/v, where d is the distance and v is the average speed.
Traffic Patterns: Using historical or real-time traffic data to estimate the fastest
route. This could involve a weighting factor w based on traffic data, modifying the travel
time: t_adjusted = t × w.
Material Count: Sum of the values of all pieces. For example, in chess, pawns = 1,
knights/bishops = 3, rooks = 5, queen = 9.
Page Rank: Evaluating the number and quality of inbound links to estimate
the page’s importance.
Domain Authority: The reputation and reliability of the website hosting the
information. Often a proprietary metric, but generally a combination of factors like
link profile, site age, traffic, etc.
Let us see how this works for route-finding problems in Romania; we use the straight-
line-distance heuristic, which we will call hSLD . If the goal is Bucharest, we need to know
the straight-line distances to Bucharest, which are shown in the figure below. For example,
hSLD (Arad) = 366. Notice that the values of hSLD cannot be computed from the problem
description itself (that is, the ACTIONS and RESULT functions). Moreover, it takes a certain
amount of world knowledge to know that hSLD is correlated with actual road distances and
is, therefore, a useful heuristic.
The next figure shows the progress of a greedy best-first search using hSLD to find a path
from Arad to Bucharest. The first node to be expanded from Arad will be Sibiu because the
heuristic says it is closer to Bucharest than is either Zerind or Timisoara. The next node to
be expanded will be Fagaras because it is now closest according to the heuristic. Fagaras in
turn generates Bucharest, which is the goal. For this particular problem, greedy best-first
search using hSLD finds a solution without ever expanding a node that is not on the solution
path. The solution it found does not have optimal cost, however: the path via Sibiu and
Fagaras to Bucharest is 32 miles longer than the path through Rimnicu Vilcea and Pitesti.
This is why the algorithm is called “greedy”—on each iteration it tries to get as close to a
goal as it can, but greediness can lead to worse results than being careful.
Greedy best-first graph search is complete in finite state spaces, but not in infinite ones. The
worst-case time and space complexity is O(| V |). With a good heuristic function, however,
the complexity can be reduced substantially, on certain problems reaching O(bm).
Completeness and Optimality: Greedy Best-First Search does not guarantee that
the shortest path will be found, making it neither complete nor optimal. It can get
stuck in loops or dead ends if not careful with the management of the visited set.
Data Structures: The algorithm typically uses a priority queue for the frontier and
a set for the visited nodes. This setup helps in efficiently managing the nodes during
the search process.
Greedy Best-First Search is particularly useful when the path’s exact length is less
important than quickly finding a path that is reasonably close to the shortest possible.
It is well-suited for problems where a good heuristic is available.
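Before moving on to A*, here is a compact sketch of greedy best-first search consistent with the description above. It is not the book's pseudocode; the graph interface (a neighbors function) and the tuple-based priority queue entries are implementation assumptions.

    import heapq

    # Greedy best-first search: always expand the node with the smallest heuristic value.
    def greedy_best_first(start, goal, neighbors, heuristic):
        frontier = [(heuristic(start), start, [start])]   # priority queue keyed by h(n)
        visited = set()
        while frontier:
            _, node, path = heapq.heappop(frontier)
            if node == goal:
                return path
            if node in visited:
                continue
            visited.add(node)
            for nxt in neighbors(node):
                if nxt not in visited:
                    heapq.heappush(frontier, (heuristic(nxt), nxt, path + [nxt]))
        return None   # frontier exhausted without reaching the goal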
• Input:
– cost(current, neighbor): A function that returns the cost of moving from current
to neighbor.
– heuristic(node): A function that estimates the cost from node to the goal.
• Output:
– The path from the start node to the goal node, or None if no path exists.
• Procedure
– Initialize: Insert the start node into a priority queue (the frontier), keyed by its
estimated cost, and create an empty cameFrom map.
– Search: Repeatedly remove the node with the lowest priority from the queue. If it
is the goal, reconstruct and return the path; otherwise add its neighbors to the queue
with their updated costs.
□ If the priority queue is exhausted without reaching the goal, return None.
Path Reconstruction: The path is reconstructed from the cameFrom map, which records
where each node was reached from.
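One way this procedure might be realized in Python is sketched below, keeping the cost(current, neighbor) and heuristic(node) inputs and the cameFrom map described above; the remaining details (graph representation as a neighbors function, comparable node labels, tie-breaking) are assumptions of the sketch.

    import heapq

    def a_star(start, goal, neighbors, cost, heuristic):
        frontier = [(heuristic(start), start)]      # priority queue ordered by f = g + h
        came_from = {start: None}                   # cameFrom map for path reconstruction
        g = {start: 0}                              # best known path cost to each node
        while frontier:
            _, current = heapq.heappop(frontier)
            if current == goal:
                path = []                           # reconstruct by walking cameFrom back
                while current is not None:
                    path.append(current)
                    current = came_from[current]
                return list(reversed(path))
            for nxt in neighbors(current):
                new_g = g[current] + cost(current, nxt)
                if nxt not in g or new_g < g[nxt]:
                    g[nxt] = new_g
                    came_from[nxt] = current
                    heapq.heappush(frontier, (new_g + heuristic(nxt), nxt))
        return None                                 # queue exhausted: no path exists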
Notice that Bucharest first appears on the frontier at step (e), but isn’t selected for ex-
pansion as it isn’t the lowest-cost node at that moment, with a cost of 450 compared to
Pitesti’s lower cost of 417. The algorithm prioritizes exploring potentially cheaper routes,
such as through Pitesti, before settling on higher-cost paths. By step (f), a more cost-effective
path to Bucharest, costing 417, becomes available and is subsequently selected as the optimal
solution.
• Cost Focus: Prioritizes nodes with the lowest estimated total cost, f (n) = g(n)+h(n)
• Resource Efficiency: Avoids exploring more expensive paths when cheaper options
are available.
• Adapts Based on New Information: Adjusts path choices dynamically as new cost
information becomes available.
A* Algorithm is always complete, meaning that if a solution exists then the algorithm will
find the path.
However, the A* algorithm only returns optimal solutions when the heuristic has some specific
properties.
• Assume the optimal path cost is C∗ but A∗ returns a path with a cost greater than
C∗.
• There must be a node n on the optimal path that A∗ did not expand.
• If f (n), which is the estimated cost of the cheapest solution through n, were less than or
equal to C∗, then n would have been expanded.
• By definition, f (n) = g(n) + h(n). Since n is on the optimal path, g(n) = g∗(n), and by
admissibility h(n) ≤ cost(n to goal), so f (n) ≤ g∗(n) + cost(n to goal) = C∗.
• Therefore n would have been expanded before a more expensive path was returned,
contradicting the assumption. Hence A∗ with an admissible heuristic returns an optimal path.
– S to A = 2
– A to G = 3
– S to B = 1
– B to G = 10
(Graph: S connects to A with cost 2 and to B with cost 1; A connects to G with cost 3; B connects to G with cost 10.)
Heuristic, h(node), Estimates to the Goal, G:
• h(S) = 3, admissible.
• h(A) = 10, inadmissible (the optimal cost from A to G is only 3).
• h(B) = 2, admissible.
• h(G) = 0, admissible.
A* Algorithm Execution:
• Start at S:
f (S) = g(S) + h(S) = 0 + 3 = 3.
– For A:
g(A) = 2, so, f (A) = g(A) + h(A) = 2 + 10 = 12.
– For B:
g(B) = 1, so, f (B) = g(B) + h(B) = 1 + 2 = 3
• Node B selected from the queue for expansion because of lower f-value despite
leading to a higher cost path.
– For G via B: g(G) = 1 + 10 = 11, so f (G) = g(G) + h(G) = 11 + 0 = 11.
• Node G selected from the queue for expansion because of its lower f-value (11 < f (A) = 12),
so the algorithm returns the path S → B → G with cost 11, even though the optimal path
S → A → G costs only 5.
The inadmissibility of the heuristic causes the algorithm to prefer a sub-optimal path.
• Consistency ensures that the f-value (total estimated cost) of a node n calculated
as f (n) = g(n) + h(n) does not decrease as the algorithm progresses from the
start node to the goal. This is because for any node n and its successor n′ :
g(n′ ) = g(n) + c(n, n′ )
• Simplifying, we find:
f (n′ ) + h(n′ ) ≥ g(n) + h(n), i.e. f (n′ ) = g(n) + c(n, n′ ) + h(n′ ) ≥ g(n) + h(n) = f (n),
since consistency requires h(n) ≤ c(n, n′ ) + h(n′ ). Hence the f-values are non-decreasing
along any path.
• This prevents the algorithm from revisiting nodes unnecessarily, thereby ensuring
efficiency in path finding.
• Given the consistency condition, once the goal node g is reached and its f (g)
calculated, there can be no other path to g with a lower f-value that has not
already been considered.
• Since h(g) = 0 (by definition at the goal), f (g) = g(g), meaning that the path
cost g(g) represents the total minimal cost to reach the goal from the start node.
4. Optimality Guarantee:
• The search terminates when the goal is popped from the priority queue for expan-
sion, and due to the non-decreasing nature of f-values, this means that no other
path with a lower cost can exist that has not already been evaluated.
By adhering to these principles, A* search with a consistent heuristic not only finds a solution
but ensures it is the optimal one.
Example: Checking inconsistency
– S to A = 10
– S to B = 2
– B to A = 2
– A to G = 10
– B to G = 15
(Graph: S connects to A with cost 10 and to B with cost 2; B also connects to A with cost 2; A reaches G with cost 10 and B reaches G directly with cost 15.)
• Heuristic (h) Estimates to the Goal (G):
– h(S) = 12
– h(A) = 7
– h(B) = 10
– h(G) = 0
• Checking consistency for each node: First we find the optimal path to Goal from
each node.
For any given node n, the heuristic value h(n) is less than or equal to the optimal path-cost
from n to the goal, so the given heuristics are admissible. An inadmissible heuristic is
automatically inconsistent, but an admissible heuristic does not guarantee consistency.
• Next, from Node A, there is a direct path to the Goal node G with cost 10, which is the
optimal path from A. The only successor of A is G, so to be consistent we need
h(A) ≤ c(A, G) + h(G) = 10 + 0 = 10. Since h(A) = 7 ≤ 10, the condition holds at A.
• Next, from Node B, there is a direct path to the Goal node G with cost 15 and a path via
node A costing 12. Here G and A are the children of B, so to be consistent we need
h(B) ≤ c(B, G) + h(G) = 15 and h(B) ≤ c(B, A) + h(A) = 2 + 7 = 9. Since h(B) = 10 > 9,
the second condition fails, and the heuristic is therefore not consistent.
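These admissibility and consistency checks can be automated. The sketch below uses the edge costs and heuristic values listed above; the dictionary encoding of the graph is an assumption of the sketch.

    # Edge costs and heuristics from the example above.
    edges = {("S", "A"): 10, ("S", "B"): 2, ("B", "A"): 2, ("A", "G"): 10, ("B", "G"): 15}
    h = {"S": 12, "A": 7, "B": 10, "G": 0}

    def is_consistent(edges, h):
        # Consistency: h(n) <= c(n, n') + h(n') for every edge (n, n').
        return all(h[n] <= c + h[m] for (n, m), c in edges.items())

    print(is_consistent(edges, h))   # False: h(B) = 10 > c(B, A) + h(A) = 9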
1. Effective branching factor (EBF): If A* generates a total of N nodes to find a solution at
depth d, then the effective branching factor b∗ is the branching factor that a uniform tree of
depth d would need in order to contain N + 1 nodes:
N + 1 = 1 + b∗ + (b∗)^2 + (b∗)^3 + ... + (b∗)^d
where N is the total number of nodes generated in the search tree and d is the depth
of the shallowest solution.
The EBF gives an insight into the efficiency of the search process, influenced heav-
ily by the heuristic used:
• Lower EBF: A lower EBF suggests that the heuristic is effective, as it leads to
fewer nodes being expanded. This usually indicates a more directed and efficient
search.
• Higher EBF: A higher EBF suggests a less effective heuristic, as more nodes
are being generated, indicating a broader search, which is generally less efficient.
Calculating the EBF can help evaluate the practical performance of a heuristic. An
ideal heuristic would reduce the EBF to the minimum necessary to find the optimal
solution, indicating a highly efficient search strategy.
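Given N and d, the effective branching factor b∗ can be found numerically from the equation above. A simple bisection sketch is shown below; the sample values of N and d are illustrative, not taken from the notes.

    def effective_branching_factor(N, d, tol=1e-6):
        # Solve N + 1 = 1 + b + b^2 + ... + b^d for b by bisection.
        def total(b):
            return sum(b ** i for i in range(d + 1))
        lo, hi = 1.0, float(N + 1)          # b* lies between 1 and N + 1
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if total(mid) < N + 1:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    print(round(effective_branching_factor(N=52, d=5), 2))   # illustrative values, roughly 1.92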
In the figure above, Russell and Norvig generated random 8-puzzle problems and solved
them with an uninformed breadth-first search and with A* search using both h1 and h2,
reporting the average number of nodes generated and the corresponding effective branching
factor for each search strategy and for each solution length. The results suggest
that h2 is better than h1, and both are better than no heuristic at all.
2. Dominating heuristic: For two heuristic functions, h1 and h2 , we say that h2 domi-
nates h1 if for every node n in the search space, the following condition holds:
h2 (n) ≥ h1 (n), and there is at least one node n where h2 (n) > h1 (n). In other words,
h2 is the dominating heuristic.
• Possibility 1:
C∗ − g(n) ≤ h1 (n) ≤ h2 (n),
in which case, node n will not be expanded at all, whichever heuristic we use.
• Possibility 2:
h1 (n) ≤ h2 (n) ≤ C∗ − g(n),
in which case node n will be expanded under both heuristics.
• Possibility 3:
h1 (n) ≤ C∗ − g(n) ≤ h2 (n),
where node n will be expanded when h1 is used but not when h2 is used.
So, we see that using h1 will expand every node that is expanded when h2 is used, but it
may also end up expanding additional, unnecessary nodes beyond those expanded by the
dominant heuristic h2 .
You may want to go back to the last example in section 3.3 and notice that the number
of nodes generated by using the dominant heuristic is significantly less for the 8-puzzle
problem.
So, given that the heuristic is consistent and the computation time is not too long,
it is generally a better idea to use the higher-valued heuristic.
Note: This lecture closely follows Chapter 4 to 4.1.1 (Local Search and Hill Climbing), 4.1.2
(Simulated Annealing) and 4.1.4 (Evolutionary algorithm) of Russell and Norvig, Artificial
Intelligence: A Modern Approach.
On the other hand, some problems focus solely on generating a goal state or a good state,
regardless of the specific actions taken. For instance, in a simplified knapsack problem, the
solution involves selecting a set of items that maximizes reward without exceeding the weight
limit. The process does not concern itself with the order in which items are checked or se-
lected to achieve the maximum reward. Local search strategies can be used to solve such
problems. It is especially useful in problems with very large search spaces.
Local search algorithms offer a practical solution for tackling large and complex problems.
Unlike global search methods that attempt to cover the entire state space, local search begins
with an initial setup and gradually refines it. It makes incremental adjustments to the solu-
tion, exploring nearby possibilities through methods like Hill Climbing, Simulated Annealing,
and Genetic Algorithms. These techniques adjust parts of the current solution to systemat-
ically explore the immediate area, known as the "neighborhood," for better solutions. This
approach is particularly effective in environments with numerous potential solutions (local
optima), helping to find an optimal solution more efficiently.
• Local search also complements hybrid algorithms, combining its strengths with
other strategies for enhanced performance across various problems. For instance,
Genetic Algorithms use local search within their crossover and mutation phases to fine-
tune solutions.
In summary, local search provides a flexible and efficient approach for refining solutions
incrementally, making it a critical tool in modern computational problem-solving. These
strategies bridge the gap between the theoretical optimality of informed search and the
practical needs of real-world applications.
Traveling Salesman Problem (TSP): In the TSP, the goal is to find the shortest
possible route that visits a list of cities and returns to the origin city. A local search
algorithm like Simulated Annealing or 2-opt (a simple local search used in TSP) can
iteratively improve a given route by swapping the order of visits to reduce the overall
travel distance.
Knapsack Problem: As previously mentioned, the goal here is to maximize the value
of items placed in a knapsack without exceeding its weight capacity. Local search
techniques can be used to iteratively add or remove items from the knapsack to find a
combination that offers the highest value without breaching the weight limit.
Max Cut Problem: This is a problem in which the vertices of a graph need to be
divided into two disjoint subsets to maximize the number of edges between the subsets.
Local search can adjust the placement of vertices in subsets to try and maximize the
number of edges that cross between them.
Graph Coloring: This problem requires assigning colors to the vertices of a graph
so that no two adjacent vertices share the same color, using the minimum number of
colors. Local search can explore solutions by changing the colors of certain vertices to
reduce conflicts or the number of colors used.
Job Scheduling Problems: In job scheduling, the task is to assign jobs to resources
(like machines or workstations) in a way that minimizes the total time to complete all
jobs or maximizes throughput. Job scheduling can be viewed as a CSP when the task
is to assign start times to various jobs subject to constraints such as job dependencies
(certain jobs must be completed before others can start), resource availability (jobs
requiring the same resource cannot overlap), and deadlines. Local search can be used to
iteratively shift jobs between resources or reorder jobs to find a more efficient schedule.
Vehicle Routing Problem Similar to TSP but more complex, this problem involves
multiple vehicles and aims to optimize the delivery routes from a central depot to
various customers. This problem can also be modeled as a CSP, where constraints
might include vehicle capacity limits, delivery time windows, and the requirement that
each route must start and end at a depot. Local search can adjust routes by reassigning
customers to different vehicles or changing the order of stops to minimize total distance
or cost.
Local search algorithms are ideal for these and many other problems because they can provide
high-quality solutions efficiently, even when the search space is extremely large and complex.
They are particularly valuable when exact methods are computationally infeasible, and an
approximate solution is acceptable. Local search algorithms operate by searching from a
start state to neighboring states, without keeping track of the paths, nor the set of states
that have been reached. That means they are not systematic—they might never explore
a portion of the search space where a solution actually resides. However, they have
two key advantages:
1. They use very little memory, as they do not keep track of paths or of the set of states
that have been reached.
2. They can often find reasonable solutions in large or infinite state spaces for
which systematic algorithms are unsuitable.
The local search algorithms covered in this chapter are Hill-Climbing Search, Simulated
Annealing, and the Genetic Algorithm.
Maximization (Hill Climbing): Here, the goal is to find the highest point in the
landscape, akin to climbing to the peak of a hill. The algorithm iteratively moves to
higher elevations, seeking to locate the highest peak, which represents the optimal so-
lution.
Objective function: Minimize the total travel distance for a salesman need-
ing to visit multiple cities and return to the starting point.
Local Search Strategy: Start with a random route and iteratively improve
it by swapping the order of cities if it results in a shorter route (seeking lower
valleys in terms of distance).
Objective function: Minimize the number of pairs of queens that are attacking
each other either horizontally, vertically, or diagonally. The ideal goal is to reduce
this number to zero, indicating that no queens are threatening each other.
Local search strategy: Start with a random instance of the board and it-
eratively improve the placement of queens on the board.
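The 8-queens objective above can be computed directly from a board representation. The sketch below assumes a board encoded as a list of eight row positions, one per column; the sample boards are illustrative and not taken from the notes.

    from itertools import combinations

    def attacking_pairs(board):
        # board[i] is the row of the queen in column i.
        pairs = 0
        for i, j in combinations(range(len(board)), 2):
            same_row = board[i] == board[j]
            same_diag = abs(board[i] - board[j]) == abs(i - j)
            if same_row or same_diag:
                pairs += 1
        return pairs

    print(attacking_pairs([1, 5, 8, 6, 3, 7, 2, 4]))   # 0: a conflict-free placement
    print(attacking_pairs([1, 1, 1, 1, 1, 1, 1, 1]))   # 28: every pair attacks along the row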
4.3.1 Examples:
Simplified Knapsack Problem Setup:
Items (value, weight):
Item 1: Value = $10, Weight = 2 kg
Item 2: Value = $15, Weight = 3 kg
Item 3: Value = $7, Weight = 1 kg
Item 4: Value = $20, Weight = 4 kg
Item 5: Value = $8, Weight = 1 kg
Knapsack Capacity: 7 kg
Objective: Maximize the total value of the items in the knapsack such that the total
weight does not exceed 7 kg.
The knapsack can be represented as a string of 1s and 0s, where a ’1’ at the i-th position indicates
that the corresponding (i-th) item has been included, and a ’0’ means the item has not been
taken.
Generate Neighbors: Neighbors are generated by toggling the inclusion of each
item. For instance, if an item is not in the knapsack, consider adding it; if it is in the knap-
sack, consider removing it, or replace an item with another item not in the solution.
Evaluate and Select: Calculate the total value and weight for each neighbor. If a neighbor
exceeds the knapsack’s capacity, discard it.
Choose the neighbor with the highest value that does not exceed the weight capacity.
Iteration: Repeat the process of generating and evaluating neighbors from the current
solution.
If no neighbors have a higher value than the current solution, terminate the algorithm.
Termination: The algorithm stops when it finds a solution where no neighboring con-
figurations offer an improvement.
Step 1: Initial Solution: 11100 (Items 1, 2, and 3), Total Value = $32, Total Weight = 6 kg
Step 2: Generate Neighbors
Add Item 4: 11110, Total Value = $52, Total Weight = 10 kg
Add Item 5: 11101, Total Value = $40, Total Weight = 7 kg
Replace Item 2 with Item 4: 10110, Total Value = $37, Total Weight = 7 kg
... There can be other possible neighbors.
Step 3: Evaluate and Select
Discard Neighbor: 11110 as it exceeds knapsack weight limit.
Best neighbor: 11101 Total Value = 40, Total Weight = 7 kg
Step 4: Iteration (next steps)
Current configuration: 11101
Add Item 4: 11111, Total Value = $60, Total Weight = 11 kg
Replace items with Item 4: all such replacements will exceed the weight limit.
Evaluation:
All neighbors are discarded. No better valued neighbor.
Terminates as no better valued neighbor is found.
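A small hill-climbing sketch for this knapsack instance is given below. The neighbor generation (adding, removing, or swapping one item) mirrors the steps above, though the exact neighborhood and tie-breaking rules are implementation choices, not specified in the notes.

    values   = [10, 15, 7, 20, 8]      # item values from the example
    weights  = [2, 3, 1, 4, 1]         # item weights in kg
    CAPACITY = 7

    def score(sol):
        w = sum(wt for wt, bit in zip(weights, sol) if bit)
        v = sum(val for val, bit in zip(values, sol) if bit)
        return v if w <= CAPACITY else -1          # infeasible neighbors are discarded

    def neighbors(sol):
        out = []
        for i in range(len(sol)):                  # add or remove a single item
            flipped = list(sol)
            flipped[i] = 1 - flipped[i]
            out.append(flipped)
        for i in range(len(sol)):                  # swap an included item with an excluded one
            for j in range(len(sol)):
                if sol[i] == 1 and sol[j] == 0:
                    swapped = list(sol)
                    swapped[i], swapped[j] = 0, 1
                    out.append(swapped)
        return out

    current = [1, 1, 1, 0, 0]                      # Step 1: items 1-3, value $32, weight 6 kg
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):          # no better neighbor: local maximum reached
            break
        current = best
    print(current, score(current))                 # [1, 1, 1, 0, 1] with value $40, as in the worked example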
4.3.3 Drawbacks:
Hill climbing algorithms may encounter several challenges that can cause them to get stuck
before reaching the global maximum. These challenges include:
1. Local Maxima: A local maximum is a peak that is higher than each of its neighboring
states but is lower than the global maximum. Hill climbing algorithms that reach a
local maximum are drawn upward toward the peak but then have nowhere else to go
because all nearby moves lead to lower values.
Example:
8-queen problem:
Scenario: For an instance of the 8-queens problem, 13528647, there are 5 attacking pairs or, in
other words, 23 non-attacking pairs. This configuration is a local maximum because it’s
better than all immediate neighboring configurations, but it is not the global solution
since it’s not conflict-free.
Problem: The Hill Climbing algorithm would stop here because all single-move al-
ternatives lead to worse configurations, increasing the number of threats. Despite the
presence of better configurations (global maxima with zero threats), the algorithm gets
stuck.
2. Plateaus: A plateau is an area of the state-space landscape where the elevation (or the
value of the objective function) remains constant. Hill climbing can become stuck on
a plateau because there is no upward direction to follow. Plateaus can be particularly
challenging when they are flat local maxima, with no higher neighboring states to es-
cape to, or when they are shoulders, which are flat but eventually lead to higher areas.
On a plateau, the algorithm may wander aimlessly without making any progress.
Scenario: Imagine a large part of the chessboard setup where several queens are placed in
such a manner that moving any one of them doesn’t change the number of conflicts—it
remains constant. This flat area in the search landscape is a plateau.
For Example, the instance, 13572864 has 3 attacking pairs. Swapping the last two
queens might not immediately lead to an increase in non-attacking pairs, resulting in
a plateau where many configurations have an equal number of non-attacking pairs.
Problem: The Hill Climbing algorithm would find it difficult to detect any better
move since all look equally non-promising.
On a plateau, every move neither improves nor worsens the situation, causing Hill
Climbing to wander aimlessly without clear direction toward improvement. This lack
of gradient (change in the number of conflicts) can trap the algorithm in non-productive
cycles, preventing it from reaching configurations that might lead to the global maxi-
mum.
3. Ridges: Ridges are sequences of local maxima that make it very difficult for greedy
algorithms like hill climbing to navigate effectively. Because the algorithm typically
makes decisions based on immediate local gains, it struggles to cross over ridges that
require a temporary decrease in value to ultimately reach a higher peak.
These challenges highlight the limitations of hill climbing algorithms in exploring com-
plex landscapes with multiple peaks and flat areas, making them susceptible to getting
stuck without reaching the best possible solution.
Scenario: Consider a situation where moving from one configuration to another better
configuration involves moving through a worse one. For example, the instance 13131313
creates a ridge because small moves from this configuration typically result in fewer non-
attacking pairs, requiring several coordinated moves to escape this pattern, which hill
climbing does not facilitate well.
Problem: Hill Climbing may fail to navigate such transitions because it does not allow
for a temporary increase in the objective function (number of conflicts in this case).
Ridges in the landscape can make it very challenging for the algorithm to find a path
to the optimal solution since each step must immediately provide an improvement.
How it works: This method randomly chooses among uphill moves rather than always
selecting the steepest ascent. The probability of choosing a move can depend on the
steepness of the ascent.
Performance: It typically converges more slowly than the standard hill climbing
because it might not take the steepest path. However, it can sometimes find better so-
lutions in complex landscapes where the steepest ascent might lead straight to a local
maximum.
How it works: A variant of stochastic hill climbing, this strategy generates suc-
cessors randomly and moves to the first one that is better than the current state.
Advantages: This is particularly effective when a state has a vast number of succes-
sors, as it reduces the computational overhead of generating and evaluating all possible
moves.
How it works: This approach involves performing multiple hill climbing searches
from different randomly generated initial states. This process repeats until a satisfac-
tory solution is found.
The technique is inspired by the physical process of annealing in metallurgy, where met-
als are heated to a high temperature and then cooled according to a controlled schedule to
achieve a more stable crystal structure.
Simulated Annealing is an optimization technique that simulates the heating and gradual
cooling of materials to minimize defects and achieve a stable state with minimum energy.
It starts with a randomly generated initial solution and a high initial temperature, which
allows the acceptance of suboptimal solutions to explore the solution space widely.
The algorithm iterates by generating small modifications to the current solution, evaluat-
ing the cost changes, and probabilistically deciding whether to accept the new solution
based on the current temperature.
The process concludes once a stopping criterion, like a specified number of iterations, a
minimal temperature, or a quality threshold of the solution, is met.
Key Parameters:
• Stopping Temperature: Low enough to stop the process once the system is presumed
to have stabilized.
• ∆E = E(S ′ ) − E(S) is the change in the objective function (cost or energy) from the
current state S to the new state S ′ .
• If ∆E > 0 (meaning S′ is worse), the new solution is accepted with probability
P = e^(−∆E/T), which is less than 1. This probability decreases as ∆E increases or as T decreases.
T(k) = T0 · α^k
where:
- T0 is the initial temperature.
- α is a constant such that 0 < α < 1, often close to 1.
- k is the iteration index.
The choice of α and T0 influences the convergence of the algorithm. A slower cooling
(higher α) allows more thorough exploration of the solution space but takes longer to con-
verge.
2. Heating: Set an initial high temperature to allow significant exploration. For instance,
start with a temperature of 100.
4. Cooling: Reduce the temperature based on a cooling schedule, e.g., multiply the
temperature by 0.95 after each iteration.
5. Termination: Repeat the iteration process until the temperature is low enough or
a fixed number of iterations is reached. Assume the stopping condition is when the
temperature drops below 1 or after 1000 iterations.
Example Calculation
Assuming the first iteration starts with the initial route and the randomly generated neighbor
as described:
This process is repeated, with the temperature decreasing each time, until the termina-
tion conditions are met. The result should be a route that approaches the shortest possible
loop connecting all five cities.
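A compact simulated annealing sketch for a small TSP instance is shown below, using the parameters quoted above (initial temperature 100, cooling factor 0.95, stopping when the temperature drops below 1). The city coordinates and the swap-based neighbor move are made-up placeholders for illustration.

    import math, random

    cities = [(0, 0), (2, 5), (5, 2), (6, 6), (8, 3)]   # placeholder coordinates for 5 cities

    def tour_length(tour):
        return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
                   for i in range(len(tour)))

    def random_neighbor(tour):
        # Swap two randomly chosen cities to produce a neighboring route.
        i, j = random.sample(range(len(tour)), 2)
        new = list(tour)
        new[i], new[j] = new[j], new[i]
        return new

    current = list(range(len(cities)))                  # initial route: cities in given order
    T = 100.0                                           # initial temperature
    while T > 1.0:
        candidate = random_neighbor(current)
        delta = tour_length(candidate) - tour_length(current)
        # Accept improvements always; accept worse tours with probability e^(-delta/T).
        if delta < 0 or random.random() < math.exp(-delta / T):
            current = candidate
        T *= 0.95                                       # cooling schedule

    print(current, round(tour_length(current), 2))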
Genetic algorithms mimic the process of natural evolution, embodying the survival of the
fittest among possible solutions. The core idea is derived from the biological mechanisms of
reproduction, mutation, recombination, and selection. These biological concepts are trans-
lated into computational steps that help in finding optimal or near-optimal solutions to
problems across a wide range of disciplines including engineering, economics, and artificial
intelligence.
4.7.1 Algorithm
Selection:
This step involves choosing the fitter individuals to reproduce. Selection can be done in
various ways, such as Truncation Selection, tournament selection, roulette wheel selection,
or rank selection. In this course we are using truncation selection, where we select the fittest
3/4th of the population.
Truncation Selection
Description: Only the top-performing fraction of the population is selected to reproduce.
Procedure: Rank individuals by fitness, then select the top x% to become parents of the
next generation.
Pros and Cons: Very straightforward and ensures high-quality genetic material is passed on,
but can quickly reduce genetic diversity.
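A short sketch of truncation selection as described above (keep the fittest fraction of the population, here the top 3/4); the bit-string population and the count-the-ones fitness function are assumed placeholders.

    def truncation_selection(population, fitness, fraction=0.75):
        # Rank individuals by fitness and keep the top fraction as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        keep = max(1, int(len(ranked) * fraction))
        return ranked[:keep]

    # Example with a toy fitness: the number of 1s in a bit string.
    population = ["110011", "101010", "000001", "111111"]
    parents = truncation_selection(population, fitness=lambda s: s.count("1"))
    print(parents)   # the three fittest individuals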
Crossover (Recombination):
Pairs of individuals are crossed over at a randomly chosen point to produce offspring. The
crossover rate determines how often crossover will occur. Two common types of crossover
techniques are single-point crossover and two-point (or double-point) crossover.
• Single-Point Crossover:
In single-point crossover, a single crossover point is randomly selected on the parent
chromosomes. The genetic material (bits, characters, numbers, depending on the en-
coding of the solution) beyond that point in the chromosome is swapped between the
two parents. This results in two new offspring, each carrying some genetic material
from both parents.
Procedure:
Select a random point on the chromosome. The segments of the chromosomes after
this point are swapped between the two parents.
Example:
Suppose we have two binary strings:
Parent 1: 110011
Parent 2: 101010
Assuming the crossover point is after the third bit, the offspring would be:
Offspring 1: 110010 (first three bits from Parent 1, last three bits from Parent 2)
Offspring 2: 101011 (first three bits from Parent 2, last three bits from Parent 1)
• Two-Point Crossover:
Two-point crossover involves two points on the parent chromosomes, and the genetic
material located between these two points is swapped between the parents. This can
introduce more diversity compared to single-point crossover because it allows the central
segment of the chromosome to be exchanged, potentially combining more varied genetic
information from both parents.
Procedure:
Select two random points on the chromosome, ensuring that the first point is less than
the second point.
Swap the segments between these two points from one parent to the other.
Example:
Continuing with the same parent strings:
Parent 1: 110011
Parent 2: 101010
Let’s choose two crossover points, between the second and fifth bits. The offspring
produced would be:
Offspring 1: 111011 (first two bits from Parent 1, middle segment from Parent 2, last
bit from Parent 1)
Offspring 2: 100010 (first two bits from Parent 2, middle segment from Parent 1, last
bit from Parent 2)
A short code sketch of both crossover operators is given after this list.
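A minimal Python sketch of both operators, applied to the two parent strings above. The crossover positions are passed in by the caller; in a genetic algorithm they would normally be chosen at random.

def single_point_crossover(p1, p2, point):
    # Swap the tails of the two parents after the crossover point.
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def two_point_crossover(p1, p2, a, b):
    # Swap the middle segment (positions a..b-1) between the two parents.
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

p1, p2 = "110011", "101010"
print(single_point_crossover(p1, p2, 3))   # ('110010', '101011')
print(two_point_crossover(p1, p2, 2, 5))   # ('111011', '100010')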
4.8 Mutation
With a certain probability (mutation rate), mutations are introduced to the offspring to
maintain genetic diversity within the population.
The purpose of mutation is to maintain and introduce genetic diversity into the population
of candidate solutions, helping to prevent the algorithm from becoming too homogeneous
and getting stuck in local optima.
3. Explore New Areas: It enables the algorithm to explore new areas of the solution
space that may not be reachable through crossover alone.
How mutation is applied depends on how a chromosome is encoded: as a string of bits, characters, numbers, or other data structures, depending on the problem being solved. Common mutation operators include the following; a short code sketch of these operators is given after the list.
1. Bit Flip Mutation:
• Procedure: Each bit in a binary string has a small probability of being flipped
(0 changes to 1, and vice versa).
• Example: A binary string ‘110010‘ might mutate to ‘110011‘ if the last bit is
flipped.
2. Random Resetting:
• Procedure: A gene is replaced with a different random value from its allowed range. This is typically used for integer or categorical encodings.
• Example: In a string of integers [4, 12, 7, 1], the third element 7 might mutate
to 9.
3. Swap Mutation:
• Procedure: Two genes are selected and their positions are swapped. This is
often used in permutation-based encodings.
• Example: In an array [3, 7, 5, 8], swapping the second and fourth elements results
in [3, 8, 5, 7].
4. Scramble Mutation:
• Procedure: A subset of genes is chosen and their values are scrambled or shuffled
randomly.
• Example: In an array [3, 7, 5, 8, 2], scrambling the middle three elements might
result in [3, 5, 8, 7, 2].
5. Uniform Mutation:
• Procedure: Each gene has a fixed probability of being replaced with a uniformly
chosen value within a predefined range.
• Example: In an array of real numbers [0.5, 1.3, 0.9], the second element 1.3 might
mutate to 1.1.
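A short Python sketch of these mutation operators. The mutation probabilities and the indices used in the calls are arbitrary illustrative values.

import random

def bit_flip(bits, p=0.1):
    # Flip each bit independently with probability p.
    return ''.join(('1' if b == '0' else '0') if random.random() < p else b for b in bits)

def random_reset(genes, low, high, p=0.1):
    # Replace each gene with a random value in [low, high] with probability p.
    return [random.randint(low, high) if random.random() < p else g for g in genes]

def swap_mutation(genes):
    # Swap two randomly chosen positions (useful for permutation encodings).
    g = genes[:]
    i, j = random.sample(range(len(g)), 2)
    g[i], g[j] = g[j], g[i]
    return g

def scramble_mutation(genes, a, b):
    # Shuffle the segment between positions a and b (b excluded).
    g = genes[:]
    segment = g[a:b]
    random.shuffle(segment)
    g[a:b] = segment
    return g

print(bit_flip("110010"))
print(random_reset([4, 12, 7, 1], 1, 12))
print(swap_mutation([3, 7, 5, 8]))
print(scramble_mutation([3, 7, 5, 8, 2], 1, 4))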
Generation Update: The algorithm decides which individuals to keep for the next gener-
ation. This can be a mix of old individuals (elitism) and new offspring.
Return Best Solution: The best solution found during the evolution is returned.
Note: For exam problems, if you are asked to simulate, unless otherwise instructed, start with 4 chromosomes in the population, select the best 3 at each step, and perform crossover between the best chromosome and each of the other two selected chromosomes. A minimal sketch of this procedure is given below.
It is important for the initial population to be diverse. Otherwise, similar chromosomes will cross over and produce offspring with little change in their fitness. This will lead to quick convergence of the algorithm, and the chance of finding a solution with maximum fitness will be very low.
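A minimal Python sketch of this exam-style procedure. The fitness function here is only a stand-in (it counts 1-bits); for a real problem it would be replaced by the problem-specific fitness used in the examples below, such as the number of non-attacking queen pairs or the inverse route distance.

import random

def fitness(chrom):
    # Stand-in fitness for illustration only: the number of 1-bits.
    return chrom.count('1')

def crossover(p1, p2, point):
    # Single-point crossover: swap the tails after `point`.
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom):
    # Flip one randomly chosen bit.
    i = random.randrange(len(chrom))
    return chrom[:i] + ('1' if chrom[i] == '0' else '0') + chrom[i + 1:]

def genetic_algorithm(pop, generations=50):
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Truncation selection: keep the fittest 3 of the 4 chromosomes.
        selected = sorted(pop, key=fitness, reverse=True)[:3]
        best = max(best, selected[0], key=fitness)
        # Crossover between the best chromosome and each of the other two.
        point = random.randrange(1, len(selected[0]))
        c1, c2 = crossover(selected[0], selected[1], point)
        c3, c4 = crossover(selected[0], selected[2], point)
        # Mutate every offspring to keep the population diverse.
        pop = [mutate(c) for c in (c1, c2, c3, c4)]
    return max([best] + pop, key=fitness)

population = ['110010', '011001', '101100', '000111']
print(genetic_algorithm(population))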
4.9 Examples
4.9.1 8-Queen Problem
The 8-queens problem involves placing eight queens on an 8x8 chessboard so that no two
queens threaten each other. This means no two queens can share the same row, column, or
diagonal.
Chromosome Representation:
Each chromosome can be represented as a string or array of 8 integers, each between 1 and
8, representing the row position of the queen in each column represented by the index of the
string or array.
Fitness Function:
Fitness is calculated based on the number of non-attacking pairs of queens. The maximum
score for 8 queens is 28 (i.e., no two queens attack each other).
Calculating the number of attacking pairs: The 8-queen problem has the following
conditions.
Column-wise conflict: Due to this representation, no two queens can ever share a column, since each index of the array holds exactly one queen.
Row-wise conflict: The value at each index is the row in which that queen is placed. If the same value appears at multiple indices, the corresponding queens share a row.
Take for example the configuration [8, 7, 7, 3, 7, 1, 4, 4]. Here, the value 7 is repeated three times and the value 4 is repeated twice. This means there are 3 queens in the 7th row and 2 queens in the 4th row.
Counting the conflicts in the 7th row: 3 queens form 3(3 − 1)/2 = 3 pairs.
Counting the conflicts in the 4th row: 2 queens form 2(2 − 1)/2 = 1 pair.
Diagonal Conflicts: Two types of diagonals are formed on a square board; we call them the Major ("/" shaped) diagonal and the Minor ("\" shaped) diagonal.
Major diagonal conflict: Two queens Q1 and Q2 share the same major diagonal when abs(Q1[column] − Q1[row]) = abs(Q2[column] − Q2[row]).
Going back to the configuration [8, 7, 7, 3, 7, 1, 4, 4]: the queen in the 7th row, 2nd column is in the same diagonal as the queen in the 1st row, 6th column, and the queen in the 7th row, 3rd column is in conflict with the queen in the 4th row, 8th column. In other words, 2 queens share the 5th major diagonal (|7 − 2| = |1 − 6|), making 2(2 − 1)/2 = 1 conflicting pair, and 2 queens share the 4th major diagonal (|7 − 3| = |4 − 8|), making another conflicting pair.
So, there are two attacking pairs along the major diagonal.
Minor diagonal conflict: Two queens Q1 and Q2 share the same minor diagonal when Q1[column] + Q1[row] = Q2[column] + Q2[row].
Going back to the configuration [8, 7, 7, 3, 7, 1, 4, 4]: the queen in the 8th row, 1st column is in the same diagonal as the queen in the 7th row, 2nd column; the queen in the 3rd row, 4th column is in conflict with the queen in the 1st row, 6th column; and the queen in the 7th row, 5th column is in conflict with the queen in the 4th row, 8th column. In other words, 2 queens share the 9th minor diagonal (8 + 1 = 7 + 2), 2 queens share the 7th minor diagonal (3 + 4 = 1 + 6), and 2 queens share the 12th minor diagonal (7 + 5 = 4 + 8), each pair contributing 2(2 − 1)/2 = 1 conflicting pair.
So, there are three attacking pairs along the minor diagonal.
For the configuration [8, 7, 7, 3, 7, 1, 4, 4] of the 8-queen problem, the maximum possible number of non-attacking pairs is 28 and the total number of attacking pairs we calculated is 4 + 2 + 3 = 9. So the number of non-attacking pairs in this configuration is 28 − 9 = 19.
Iteration 1
Initialization of Population:
We randomly generate a small population of 4 individuals for simplicity. To keep track of the
individual with the best fitness so far we initially set best-so-far = null and best-fitness-so-far
= 0 and update when we find better individuals.
best-so-far = [];
best-fitness-so-far = 0;
max-fitness = 28;
Step 1: Population
Initial Population
Chromosome 1: 42736851
Chromosome 2: 27418536
Chromosome 3: 53172864
Chromosome 4: 71428536
Step 2: Fitness Evaluation
Chromosomes Fitness
Chromosome 1: 42736851 26
Chromosome 2: 27418536 24
Chromosome 3: 53172864 23
Chromosome 4: 71428536 24
As we have found a chromosome with a better fitness value than the best saved so far, we update:
best-so-far = [4 2 7 3 6 8 5 1];
best-fitness-so-far = 26;
However, the fitness is not the highest possible value of 28, so we continue to the next step.
Step 3: Selection
Select the top 3/4th of chromosomes based on their fitness. This results in selecting Chromosomes 1, 2 and 4 (Chromosome 3, which has the lowest fitness, is discarded). The selected chromosomes, renumbered, are:
Selected Chromosomes Fitness
Chromosome 1: 42736851 26
Chromosome 2: 27418536 24
Chromosome 3: 71428536 24
Step 5: Mutation
For each offspring, We randomly choose an index and change the value of the row to a number
chosen randomly between 1 to 8. As the number is chosen randomly it may happen that the
value remains same as we see for offspring 3 and 4.
2 7 4 1 | 8 5 3 6  →  2 7 4 1 6 8 5 1  →  2 7 4 1 6 8 5 3
4 2 7 | 3 6 8 5 1  →  4 2 7 2 8 5 3 6  →  4 2 7 2 8 5 3 6
7 1 4 | 2 8 5 3 6  →  7 1 4 3 6 8 5 1  →  7 1 4 3 6 8 5 1
New Population
Chromosome 1: 42738516
Chromosome 2: 27416853
Chromosome 3: 42718536
Chromosome 4: 27436851
Fitness Function:
The fitness of each chromosome is calculated as the inverse of the total route distance. The
shorter the route, the higher the fitness.
Fitness = 1 / Route Distance
Example Calculation: For the chromosome [A, B, C, D, E]:
Distance (A–B): 2
Distance (B–C): 6
Distance (C–D): 12
Distance (D–E): 15
Distance (E–A, to complete the loop): 7
Total Distance: 2 + 6 + 12 + 15 + 7 = 42
Fitness = 1 / Total Distance = 1/42
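A small Python sketch of this fitness computation, using only the five distances given above (the pairs needed for this particular route).

# Distances used in the example above.
dist = {('A', 'B'): 2, ('B', 'C'): 6, ('C', 'D'): 12, ('D', 'E'): 15, ('E', 'A'): 7}

def route_fitness(route):
    total = 0
    for i in range(len(route)):
        a, b = route[i], route[(i + 1) % len(route)]   # wrap around to close the loop
        total += dist.get((a, b), dist.get((b, a), 0))
    return 1 / total, total

print(route_fitness(['A', 'B', 'C', 'D', 'E']))   # (1/42 ≈ 0.0238, 42)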
Iteration 1
Initialization of Population:
We start with a population of four randomly generated routes, each a permutation of the five cities. We keep track of the best route found and its fitness.
best-so-far = [];
best-fitness-so-far = 0;
Initial Population
Chromosome 1: ABCDE
Chromosome 2: BEDCA
Chromosome 3: EDBAC
Chromosome 4: BDCEA
Step 3: Selection
Select top 3/4th based on fitness: Chromosome 4, Chromosome 1 and Chromosome 2. Chro-
mosome 4 has a higher fitness value than the best-fitness-so-far we have. So, we make
an update.
best-so-far = [B D C E A];
best-fitness-so-far = 1/39;
Selected Chromosomes Fitness
Chromosome 1: ABCDE 1/42
Chromosome 2: BEDCA 1/45
Chromosome 3: BDCEA 1/39
A B C | D E  →  A B C E A  →  A B C E D
B D C | E A  →  B E D C A  →  B E A C D   (swap mutation)
B E D | C A  →  B D C E A  →  B D E C A   (swap mutation)
Step 5: Mutation
Offspring 1 is mutated by randomly changing the second city from D to A. Offspring 2 is
mutated by changing the fifth city from A to D.
On the other hand, Offspring 3 and Offspring 4 are mutated by swapping cities. In Offspring
3, the third and fifth cities are swapped. In Offspring 4, the third and fourth cities are
swapped.
Note: It is better to use one specific type of mutation in your simulation. Always mention
what type of mutation you are using.
New Population
Chromosome 1: BACDE
Chromosome 2: ABCED
Chromosome 3: BEACD
Chromosome 4: BDECA
Step 7: Repeat
This process is repeated from step 2 over several generations to find the chromosome with
the highest fitness, representing the shortest possible route that visits each city exactly once
and returns to the starting point.
Constraints:
Maximum weight the knapsack can hold: 15 kg.
Items:
Step 1: Initialization
Generate four random chromosomes (each a bit string indicating which items are included), making up the initial population. We keep track of the best solution found and its fitness.
best-so-far = [];
best-fitness-so-far = 0;
Initial Population
Chromosome 1: 1101
Chromosome 2: 1010
Chromosome 3: 0110
Chromosome 4: 1001
Chromosome 1: 1101 53
Chromosome 2: 1010 46
Chromosome 3: 0110 30
Chromosome 4: 1001 39
Chromosomes whose total weight exceeds the knapsack capacity are treated as invalid and assigned a fitness of 0.
Step 3: Selection
Select the top 3/4th of chromosomes based on their value.
best-so-far = [1101];
best-fitness-so-far = 53;
Chromosomes Fitness
Chromosome 1: 1101 53
Chromosome 2: 1010 46
Chromosome 4: 1001 39
Step 4: Crossover
Perform crossover between the chromosome with the highest fitness value and each of the two other selected chromosomes, at the third bit.
1 0 | 1 0  →  1 0 1 1  →  1 0 1 0
1 1 | 0 1  →  1 0 0 1  →  1 0 0 0
1 0 | 0 1  →  1 1 0 1  →  1 1 0 0
Step 5: Mutation
Apply a mutation by flipping a random bit in each offspring.
New Population
Chromosome 1: 1110
Chromosome 2: 1010
Chromosome 3: 1000
Chromosome 4: 1100
Step 7: Repeat
This process is repeated over several generations to find the chromosome with the highest
fitness, representing the maximum value found.
Graph Description
Consider a simple graph with 5 vertices (A, B, C, D, E) and the following edges: AB, AC,
BD, CD, DE.
Initial Population
We randomly generate a small population of 4 solutions, where each chromosome assigns a color (1, 2, or 3) to the vertices A, B, C, D, E in order. We keep track of the best solution found and its fitness.
best-so-far = [];
best-fitness-so-far = 0;
Initial Population
Chromosome 1: 12312
Chromosome 2: 23123
Chromosome 3: 12132
Chromosome 4: 31233
Fitness Function
Fitness is determined by the number of properly colored edges (i.e., edges connecting vertices
of different colors). The maximum fitness for this graph is 5 (one for each edge).
Problem Setup
Consider a graph with 5 vertices (A, B, C, D, E) and the following edges with given weights:
AB = 3, AC = 2, BC = 1, BD = 4, CD = 2, DE = 3
The goal is to divide these vertices into two sets so as to maximize the total weight of the edges that have one endpoint in each set.
Fitness Function
Fitness is determined by the sum of the weights of the edges between the two sets. For a
chromosome, calculate the sum of weights for edges where one endpoint is ’0’ and the other
is ’1’.
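A small Python sketch of this fitness function using the edge weights above. The example chromosome is an arbitrary illustrative assignment of the vertices A–E to the two sets.

# Edge weights from the problem setup above.
edges = {('A', 'B'): 3, ('A', 'C'): 2, ('B', 'C'): 1,
         ('B', 'D'): 4, ('C', 'D'): 2, ('D', 'E'): 3}

def cut_fitness(chromosome):
    # chromosome[i] is '0' or '1' and gives the set of vertex 'ABCDE'[i];
    # fitness is the total weight of edges whose endpoints lie in different sets.
    side = dict(zip('ABCDE', chromosome))
    return sum(w for (u, v), w in edges.items() if side[u] != side[v])

print(cut_fitness('01100'))   # 3 + 2 + 4 + 2 = 11 for this illustrative chromosome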
Initial Population
Generate an initial population of chromosomes randomly, where each chromosome has a
different combination of features selected (1s) and not selected (0s).
Fitness Function:
The fitness of each chromosome (feature subset) is determined by the performance of a
predictive model trained using only the selected features. Common performance metrics
include accuracy, area under the ROC curve, or F1-score, depending on the problem specifics.
Optionally, the fitness function can also penalize the number of features to maintain model
simplicity.
Step 2: Evaluation
• For each chromosome, train a model using only the features selected by that chromo-
some. Evaluate the model’s performance on a validation set.
Step 3: Selection
• Select chromosomes for reproduction. Techniques like tournament selection or roulette
wheel selection can be used, where chromosomes with higher fitness have a higher
probability of being selected.
Step 4: Crossover
• Perform crossover between pairs of selected chromosomes to create offspring. A common
method is one-point or two-point crossover, where segments of parent chromosomes are
swapped to produce new feature subsets.
Step 5: Mutation
• Apply mutation to the offspring with a small probability. This could involve flipping
some bits from 0 to 1 or vice versa, thus adding or removing features from the subset.
Step 6: Replacement
• Form a new generation by replacing some of the less fit chromosomes in the population
with the new offspring. This could be a generational replacement or a steady-state
replacement (where only the worst are replaced).
Iteration
• Repeat the process for a number of generations or until a stopping criterion is met
(such as no improvement in fitness for a certain number of generations).
Initial Population:
Generate an initial population of chromosomes, each encoding a different combination of
these hyperparameters.
Fitness Function:
The fitness of each chromosome is evaluated based on the validation accuracy of the neural
network configured with the hyperparameters encoded by the chromosome. Optionally, the
fitness function can also include terms to penalize overfitting or excessively complex models.
Step 2: Evaluation
• For each chromosome, construct a neural network with the specified hyperparameters,
train it on the training data, and then evaluate it on a validation set. The validation
set accuracy serves as the fitness score.
Step 3: Selection
• Select chromosomes for reproduction based on their fitness. High-fitness chromosomes
have a higher probability of being selected. Techniques like tournament selection or
rank-based selection are commonly used.
Step 4: Crossover
• Perform crossover operations between selected pairs of chromosomes to create offspring.
Crossover can be one-point, two-point, or uniform (where each gene has an independent
probability of coming from either parent).
Step 5: Mutation
• Apply mutation to the offspring chromosomes at a low probability. Mutation might
involve changing one of the hyperparameters to another value within its range (e.g.,
changing the learning rate from 0.01 to 0.001).
Step 6: Replacement
• Replace the least fit chromosomes in the population with the new offspring, or use
other replacement strategies like elitism where some of the best individuals from the
old population are carried over to the new population.
Iteration
• Repeat the evaluation, selection, crossover, and mutation steps for several generations
until the performance converges or a maximum number of generations is reached.
• Number of layers: 3
By the end of these iterations, the genetic algorithm might produce a strategy that effectively
balances aggressive and defensive play, adapts to different opponent moves, and optimizes
piece positioning throughout the game.
Assets: Stocks (technology, healthcare, finance), government bonds, corporate bonds, gold,
oil.
Objective: Maximize the Sharpe ratio, considering historical returns and volatility data
for each asset class.
Problem Setup:
Portfolio Optimization for Investment Management.
Suppose you are an investment manager looking to create a diversified investment portfolio.
You want to determine the optimal allocation of funds across a set of available assets (e.g.,
stocks, bonds, commodities) to maximize returns while controlling risk, subject to various
constraints like budget limits or maximum exposure to certain asset types.
Initial Population:
Generate an initial population of chromosomes, each encoding a different allocation strategy,
ensuring that each portfolio adheres to the budget constraint (i.e., the total allocation sums
to 100%).
Fitness Function:
The fitness of each chromosome (portfolio) is typically evaluated based on its expected re-
turn and risk (often quantified as variance or standard deviation). A common approach to
measure fitness is to use the Sharpe ratio, which is the ratio of the excess expected return of
the portfolio over the risk-free rate, divided by the standard deviation of the portfolio returns.
After running the genetic algorithm for several generations, the GA might find an optimal
portfolio that, for example, allocates 20% to technology stocks, 15% to healthcare stocks,
10% to finance stocks, 20% to government bonds, 15% to corporate bonds, 10% to gold, and
10% to oil. This portfolio would have the highest Sharpe ratio found within the constraints
set by the algorithm.
Note: This lecture closely follows (and sometimes directly borrows from) Chapter 5 to 5.2.4
of Russell and Norvig, Artificial Intelligence: A Modern Approach.
5.1 Introduction
Now we would like to focus on competitive environments, in which two or more agents have
conflicting goals. The agents in these environments work against each other, minimizing the
evaluation value of each other. To tackle such environments, we use adversarial search. We will study minimax search, which finds the optimal move for an agent in this restricted game environment. We will also study alpha-beta pruning, which makes the search more efficient by ignoring portions of the tree that cannot affect the optimal decision.
Instead of handling real-life problems, which involve a lot of uncertainty and are difficult to model, we focus on games. Examples of such games are chess, Go, poker, etc. We will use much simpler games than these to demonstrate the adversarial setting.
• Two players: Only two agents working against each other. Example: Player A and
Player B.
• Turn taking: Agents take turns alternately. Example: After Agent (Player) A plays a move (chooses their action), Agent (Player) B gets to play their move (choose their action), and so on.
• Zero-Sum Games: The total sum of points of the players in the game is zero. If one
player wins (+1 points) that would mean the opponent player loses ( -1 points). The
total sum of the points in the game will be zero. When there is a draw, the point for
both players is zero, meaning the sum will be zero.
We will use the term move as a synonym for action, and position as a synonym for state.
Max aims to maximize its own game points. In a zero-sum game, the total payoff for all
players is constant, meaning one player’s gain is exactly equal to another’s loss. Therefore,
Max seeks to achieve the highest possible outcome for itself, knowing that this will occur at
the expense of the other player.
Max evaluates all possible moves by considering what its opponent might choose, coun-
tering their moves. It chooses the option that leads to the highest evaluated payoff under
optimal play.
On the other hand, Min aims to win by minimizing Max’s points. While choosing its
move, Min selects the move that will reduce the good options for Max, limiting Max’s
game points and forcing it towards an outcome where Max’s game points are minimal. This
strategy minimizes Min’s loss as well.
For example, in a simple game like tic-tac-toe, Max aims to align three of its marks in a row,
thereby maximizing its chance of winning and ensuring the other player’s loss. Conversely,
Min attempts to block Max’s efforts to align three marks, while also seeking opportunities
to create a threat that Max must respond to, thereby steering the game towards a draw or
a win for Min.
Max and Min alternately choose moves, each striving to reach their respective goals. The
minimax algorithm provides a way to determine the best possible move for a player (assum-
ing both players play optimally) by minimizing the possible loss for a worst-case (maximum
loss) scenario.
These concepts are crucial in designing agents capable of thinking several steps ahead, an-
ticipating and countering the opponent’s strategies effectively.
• RESULT(s, a): The transition model, specifying the state that results from executing
action a in state s.
• IS-TERMINAL(s): A test to determine whether state s is terminal, returning true if the game has concluded and false otherwise. States where the game has ended are called terminal states. Examples: win, lose, draw.
The initial state, the action function, and the result function together define the state space
graph. In this graph, the vertices represent states and the edges represent moves. A state
can be reached by multiple paths. From the search tree, we can construct the game tree that
traces every sequence of moves all the way to the terminal state.
The figure below illustrates part of the game tree for tic-tac-toe (noughts and crosses). Start-
ing from the initial state, max has nine possible moves. Players alternate turns, with max
placing an x and min placing an o, until reaching terminal states where either one player
has three in a row or the board is completely filled. The number on each leaf node represents
the utility value of the terminal state from max’s perspective; higher values benefit max and
disadvantage min (hence the players’ names).
For tic-tac-toe, the game tree is relatively small: fewer than 9! = 362,880 terminal nodes (with only 5,478 distinct states). But for chess there are over 10^40 nodes, so the game tree is best thought of as a theoretical construct that we cannot realize in the physical world.
Consider the trivial game in the figure. The possible moves for max at the root node are labeled a1, a2, a3, and so on. The possible replies for min are b1, b2, b3, and so on. This particular game ends after one move each by max and min.
NOTE: In some games, the word “move” means that both players have taken an action;
therefore the word ply is used to unambiguously mean one move by one player, bringing us
one level deeper in the game tree. The utilities of the terminal states in this game range from
2 to 14.
Given a game tree, the optimal strategy can be determined by working out the minimax
value of each state in the tree, which we write as minimax(s). The minimax value is the
utility (for max) of being in that state, assuming that both players play optimally from there
to the end of the game. The minimax value of a terminal state is just its utility. In a non-
terminal state, max prefers to move to a state of maximum value when it is max’s turn to
move, and min prefers a state of minimum value (that is, minimum value for max and thus
maximum value for min). So we have:
Minimax(s) =
  Utility(s, max)                                  if Is-Terminal(s)
  max_{a ∈ Actions(s)} Minimax(Result(s, a))       if To-Move(s) = max
  min_{a ∈ Actions(s)} Minimax(Result(s, a))       if To-Move(s) = min
The exponential complexity makes minimax impractical for complex games. For instance,
chess has a branching factor of about 35, and the average game has a depth of about 80 plies.
This means we would need to search almost 10^123 states, which is not feasible.
Algorithm 6 Minimax-Search
function Minimax-Search(game, state) returns an action
player ← game.To-Move(state)
value, move ← Max-Value(game, state)
return move
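The Max-Value and Min-Value functions called above are not shown in the algorithm box, so the following Python sketch fills them in, together with a tiny two-ply game used only for illustration. The game interface (is_terminal, utility, actions, result) mirrors the functions defined earlier; the leaf values under C are assumed, since only the first of them appears in the alpha-beta walkthrough below.

def minimax_search(game, state):
    # Returns the move with the best minimax value for the player to move.
    value, move = max_value(game, state)
    return move

def max_value(game, state):
    if game.is_terminal(state):
        return game.utility(state), None
    v, best = float('-inf'), None
    for a in game.actions(state):
        v2, _ = min_value(game, game.result(state, a))
        if v2 > v:
            v, best = v2, a
    return v, best

def min_value(game, state):
    if game.is_terminal(state):
        return game.utility(state), None
    v, best = float('inf'), None
    for a in game.actions(state):
        v2, _ = max_value(game, game.result(state, a))
        if v2 < v:
            v, best = v2, a
    return v, best

class TwoPlyGame:
    # MAX moves from the root to B, C, or D; MIN then picks a leaf.
    # Utilities are given from MAX's perspective.
    tree = {'root': ['B', 'C', 'D'], 'B': [3, 12, 8], 'C': [2, 4, 6], 'D': [14, 5, 2]}
    def is_terminal(self, state):
        return isinstance(state, int)
    def utility(self, state):
        return state
    def actions(self, state):
        return range(len(self.tree[state]))
    def result(self, state, a):
        return self.tree[state][a]

print(minimax_search(TwoPlyGame(), 'root'))   # 0, i.e. the move leading to B (value 3)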
In the minimax algorithm, every possible move and counter-move is explored to determine
the optimal strategy. However, this exhaustive search can become computationally expen-
sive, especially for complex games with large search spaces. Alpha-beta pruning addresses
this issue by eliminating branches in the game tree that do not influence the final decision.
Let’s revisit the two-ply game tree from the previous figure. This time, we will carefully
track what we know at each step. The steps are detailed in the figure below. The result
is that we can determine the minimax decision without needing to evaluate two of the leaf
nodes.
Stages in calculating the optimal decision for the game tree in the previous figure are shown
below. At each point, we display the range of possible values for each node.
(a) The first leaf under B has a value of 3. Thus, B, a min node, has a value of at most 3.
(b) The second leaf under B has a value of 12. Since min would avoid this move, B’s value
remains at most 3.
(c) The third leaf under B has a value of 8. After evaluating all of B’s successor states,
B’s value is exactly 3. This means the root value is at least 3, because max can choose
a move worth 3 at the root.
(d) The first leaf under C has a value of 2. Therefore, C, a min node, has a value of at
most 2. Since B is worth 3, max would never choose C. Hence, there is no need to
examine the other successors of C. This demonstrates alpha-beta pruning.
(e) The first leaf under D has a value of 14, making D worth at most 14. This is still higher
than max’s best alternative (i.e., 3), so we need to continue exploring D’s successors.
Now, we also know the root’s value is at most 14.
(f) The second successor of D has a value of 5, so we continue exploring. The third
successor is worth 2, so D’s exact value is 2. Max’s decision at the root is to move to
B, yielding a value of 3.
Another way is to see this as a simplification of the Minimax formula. Consider the two
unevaluated successors of node C in the figure above. Let these successors have values x and
y. Then the value of the root node is given by
Minimax(root) = max( min(3, 12, 8), min(2, x, y), min(14, 5, 2) )
             = max( 3, min(2, x, y), 2 )
             = max( 3, z, 2 )    where z = min(2, x, y) ≤ 2
             = 3.
In other words, the value of the root, and hence the minimax decision, is independent of the
values of the leaves x and y, and therefore they can be pruned.
Algorithm 7 Alpha-beta Search Algorithm
function Alpha-Beta-Search(game,state) returns an action
player ←To-Move(state)
value, move ← Max-Value(game, state, −∞, +∞)
return move
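A matching Python sketch of Alpha-Beta-Search, using the same hypothetical game interface as the minimax sketch above.

def alpha_beta_search(game, state):
    value, move = ab_max_value(game, state, float('-inf'), float('inf'))
    return move

def ab_max_value(game, state, alpha, beta):
    if game.is_terminal(state):
        return game.utility(state), None
    v, best = float('-inf'), None
    for a in game.actions(state):
        v2, _ = ab_min_value(game, game.result(state, a), alpha, beta)
        if v2 > v:
            v, best = v2, a
            alpha = max(alpha, v)
        if v >= beta:          # MIN already has a better option elsewhere: prune
            return v, best
    return v, best

def ab_min_value(game, state, alpha, beta):
    if game.is_terminal(state):
        return game.utility(state), None
    v, best = float('inf'), None
    for a in game.actions(state):
        v2, _ = ab_max_value(game, game.result(state, a), alpha, beta)
        if v2 < v:
            v, best = v2, a
            beta = min(beta, v)
        if v <= alpha:         # MAX already has a better option elsewhere: prune
            return v, best
    return v, best

Run on the two-ply example from the minimax sketch, this returns the same move while never evaluating the two remaining leaves under C, exactly as described in step (d) above.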
Alpha–beta pruning can be used on trees of any depth, and it often allows for pruning
entire subtrees, not just leaves. The basic principle is this: consider a node n in the tree (see
Figure below) where Player can choose to move to n. If Player has a better option either at
the same level (e.g., m′ in the figure below) or higher up in the tree (e.g., m in the figure
below), Player will not move to n. Once we learn enough about n by examining some of its
successor state to make this decision, we can prune n.
Minimax search uses a depth-first approach, so we only consider the nodes along a single
path in the tree at any given time. Alpha-beta pruning uses two additional parameters in
Max-Value(state, α, β), which set bounds on the backed-up values along the path.
α = the value of the best (i.e., highest-value) choice we have found so far at any choice
point along the path for max. Think: α = “ at least.”
β = the value of the best (i.e., lowest-value) choice we have found so far at any choice
point along the path for min. Think: β = “at most".
As the search progresses, these values are updated to reflect the highest and lowest scores
that MAX and MIN can achieve, respectively. When a move is found that makes a branch
less favorable than previously examined branches, that branch is pruned, meaning it is not
explored further.
This pruning does not affect the final result of the minimax algorithm but significantly reduces the number of nodes evaluated, making it more efficient. In the best case, alpha-beta pruning can reduce the time complexity from O(b^m) to O(b^(m/2)), where b is the branching factor and m is the maximum depth of the tree.
Alpha-beta pruning enables more effective decision-making in complex games, allowing for
deeper searches and better strategic planning within practical time limits.
Key Points
• Branch Ordering: The worst case occurs when the best moves are always consid-
ered last. Alpha-beta pruning relies on evaluating the most promising moves first to
maximize pruning. If the least promising moves are evaluated first, fewer branches are
pruned.
• No Pruning: When the algorithm does not prune any branches, it evaluates every
possible move at each depth level, leading to the maximum number of nodes being
explored.
• Time Complexity: In the worst case, the time complexity of alpha-beta pruning is the same as that of the minimax algorithm without pruning, which is O(b^m). Here, b is the branching factor (the number of legal moves at each point), and m is the maximum depth of the tree.
• Killer Moves: Killer moves are specific moves that have previously caused significant
pruning in similar positions and are tried early in the search to maximize pruning
opportunities. These moves are stored and reused, assuming that moves which were
effective in the past are likely to be effective again.
• Iterative Deepening: Combines the advantages of depth-first search (using less memory) and breadth-first search (finding the optimal solution). It provides a way to use the best move from shallow searches to improve move-ordering in deeper searches.
CHAPTER 1
MACHINE LEARNING BASICS
NOTE: The following chapters closely follow the textbook by Russell and Norvig and
the previous slides of CSE422.
1.1 Introduction
Machine learning is a subfield of artificial intelligence (AI) that focuses on developing algo-
rithms and models that enable computers to learn patterns from data and make decisions or
predictions without being explicitly programmed for specific tasks.
In simple terms, machine learning allows systems to improve their performance over time
by using past experiences (i.e., data) to generalize and adapt to new situations.
– New email: The system has never seen this exact message, but it recognizes
similar keywords and sender behavior from past spam.
– Action Machine learning can help predict likely drop or recovery patterns, iden-
tify safer investment sectors and can recommend adjustment of portfolio based
on learned behavior from similar past events.
• Unknown Solutions Some tasks are easy for humans but very hard to explicitly pro-
gram using traditional rules. These tasks often involve complex patterns, uncertainty,
or high variability—things that machine learning handles well by learning from data
rather than following fixed instructions.
Example: Handwriting Recognition
– Hard to program: Every person writes differently; even the same letter can
look very different.
Rote learning is simple memorization, like a model storing exact inputs and outputs—for
example, a chatbot recalling predefined replies.
Induction involves generalizing from examples; a spam filter learns patterns in labeled emails
to classify new ones.
Clustering is an unsupervised method that groups similar data, such as organizing cus-
tomers into segments based on purchase behavior.
Analogy and discovery use representation mapping—like a system recognizing that the
relationship between Earth and Moon is similar to Jupiter and its moons.
Genetic algorithms apply evolutionary search, creating and evolving solutions over gener-
ations, useful in optimizing investment strategies, parameter tuning etc.
Finally, reinforcement learning is reward-based; for instance, a game-playing agent learns
to win by receiving points and improving its moves over time.
• Unsupervised learning: Works with data that has no labels, helping to find hidden
patterns—like clustering customers based on buying behavior without knowing their
categories.
In supervised learning, the model learns from input-output pairs, also known as feature-
label pairs.
• Input (Features): These are the measurable variables that form the input vector,
often denoted by X.
Each training example is represented as a pair:(X, Y ) where X is the input vector con-
taining features x1 , x2 , .... and Y is the corresponding output label.
1. Collect Data: Gather labeled data where each input has a corresponding correct
output.
2. Preprocess Data: Clean the data, handle missing values, normalize features, and
encode categorical variables if necessary.
3. Split the Data: Divide the dataset into training and test sets (optionally a validation
set).
4. Train the Model: Use the training set to teach the model to learn the mapping from
input to output.
5. Evaluate the Model: Use the test (or validation) set to measure performance using
appropriate metrics.
7. Deploy the Model: Use the trained model to make predictions on new, unseen data.
• Regression: the target output is a continuous numerical value (e.g., predicting a price or a score).
• Classification: the target output is a discrete class label (e.g., spam vs. not spam).
2. Choosing Candidate Models: Based on task complexity, data size, and inter-
pretability. Common models include:
• Linear/Logistic Regression
• Decision Trees
• Neural Networks
5. Select the Best Model: Choose the model that balances performance and complexity.
h:X→Y
The goal is to find the best hypothesis h from a set called the hypothesis space H, which
includes all models that the learning algorithm can select from.
Example: In linear regression, a possible hypothesis could be:
h(x) = w1 x1 + w2 x2 + . . . + wn xn + b
1. Define a loss function – Measures the error between predicted and true outputs.
• Example: Mean Squared Error (MSE) for regression, Cross-Entropy for classifi-
cation.
2. Train the model – Minimize the loss function on training data using optimization
techniques (e.g., gradient descent).
3. Validate – Use a separate validation set to test the generalization ability of the hy-
pothesis.
4. Select – Choose the hypothesis with the best performance on the validation set.
5. Test – Evaluate the final hypothesis on unseen test data to estimate real-world perfor-
mance.
Supervised learning aims to find the best hypothesis that accurately maps inputs to outputs
by learning from labeled data.
Bias
• Bias is the error due to overly simplistic assumptions in the model.
Variance
• Variance is the error from excessive sensitivity to small fluctuations in the training
data.
Overfitting
• Caused by a model that is too complex and fits the noise in the training data (high
variance). Example: 1.4 subplot c and subplot d.
• Low generalization error: It performs well on both the training set and unseen test
data.
Example
• A good fit curve for noisy sine-wave data 1.4 would follow the overall shape of the sine
wave (subplot b) while ignoring random noise in individual data points.
A good fit is one that generalizes well — it learns the signal, not the noise.
Models like Bayesian networks, Gaussian mixtures, and Markov models use probability the-
ory to represent complex relationships and uncertainty in data. Techniques like Bayesian
inference enable models to update predictions as new data is observed, refining the model
over time.
Sample Space (S): The set of all possible outcomes of an experiment. For example, for a single coin toss:
S = {Heads, Tails}.
Event (E): A subset of the sample space. An event may consist of one or more outcomes.
For example, getting heads in a coin toss is an event.
Probability of an Event (P(E)): The measure of how likely an event is to occur. Proba-
bility values range from 0 (impossible event) to 1 (certain event).
S = {Heads, Tails}.
E = {Tails}.
Probability of an Event P (E): The probability of getting “Tails” on the toss (assuming a
fair coin).
P(E) = 1/2.
E = {3, 4, 5, 6}.
P(E) = 4/6 = 2/3.
Value                  1    2    3    4    5    6
No. of times appeared  95   105  110  94   97   99
Table 2.1: Number of times each value appeared in 600 rolls of a die
P(E) = (110 + 94 + 97 + 99)/600 = 400/600 = 2/3.
0 ≤ P (A) ≤ 1. (2.2)
Complement Rule: The probability that event E does not occur is:
P (E c ) = 1 − P (E) (2.4)
where E c denotes the complement of event E (i.e., all outcomes where E does not
happen).
Addition Rule: For two events A and B, the probability that either A or B occurs is:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Multiplication Rule: If two events A and B are independent (i.e., the outcome of one does not affect the other), the probability that both events occur is:
P(A ∩ B) = P(A) · P(B)
You want to find the probability that a student attended the extra review session given that
they passed the test.
Chain Rule: We can further use conditional probability to find the joint probability of multiple events x1, x2, . . . , xn:
P(x1, x2, . . . , xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) · · · P(xn | x1, . . . , xn−1)
P(A|B) = P(A ∩ B) / P(B)    (2.11)
P(B|A) = P(A ∩ B) / P(A)    (2.12)
P(A|B) = P(B|A) · P(A) / P(B)    (2.14)
• P(A) = 0.30 (the prior probability that it rains),
• P(B|A) = 0.80 (the likelihood: the probability that the sky will be cloudy given that
it rains),
• P(B) = 0.60 (the probability that the sky is cloudy, regardless of whether it rains or
not).
We want to calculate P (A|B) (Posterior Probability), the probability that it will rain given
that the sky is cloudy.
Bayes’ Rule tells us:
P(A|B) = P(B|A) · P(A) / P(B) = (0.80 × 0.30) / 0.60 = 0.40
Thus, the probability that it will rain tomorrow given that the sky is cloudy is 0.40, or 40%.
• A = Positive sentiment
• B = The word "good" appears in the tweet
We want to calculate P(A|B), the probability that a tweet has a positive sentiment, given
that the word "good" appears.
From observed data, we know:
• P(A) = 0.70 (the prior probability that a tweet has a positive sentiment),
• P(B|A) = 0.60 (likelihood: the probability that the word "good" appears in a positive
tweet),
• P(B) = 0.50 (probability that the word "good" appears in any tweet, regardless of
sentiment).
Applying Bayes’ Rule:
P(A|B) = P(B|A) · P(A) / P(B) = (0.60 × 0.70) / 0.50 = 0.84
Thus, the probability that the tweet has a positive sentiment, given that the word "good"
appears, is 0.84 or 84%.
Possible Values
The possible values that a discrete random variable can take are denoted by lowercase letters,
such as x1 , x2 , x3 , . . ..
F (xi ) = P (X ≤ xi ) (2.17)
which sums the probabilities for all values less than or equal to xi .
F (Sunny) = P (X ≤ Sunny)
= P (X = Sunny) = 0.5
F (Cloudy) = P (X ≤ Cloudy)
= P (X = Sunny) + P (X = Cloudy)
= 0.5 + 0.3 = 0.8
F (Rainy) = P (X ≤ Rainy)
= P (X = Sunny) + P (X = Cloudy) + P (X = Rainy)
= 0.5 + 0.3 + 0.2 = 1
Table 2.3: Probability distribution from Survey Data for TV Show Viewership by Gender
This is the value on the cell for TBBT on the Total column.
P (T F ) = 0.05
P(G|M) = P(G ∩ M) / P(M) = 0.16/0.46 ≈ 0.347
equivalently,
P (A) = P (A|B) and P (B) = P (B|A) (2.28)
Intuition
Without conditioning on C, the variables A and B may be dependent. However, once C is known, learning about A gives no extra information about B, and vice versa. This means
P(A | B, C) = P(A | C)   and   P(B | A, C) = P(B | C).
Equivalently,
P(A, B | C) = P(A | C) · P(B | C).
Applications
Conditional independence is a fundamental concept in:
• Naive Bayes Classifier: Assumes features are conditionally independent given the
class.
P(M ∩ GOT) = 0.16
P(M) = 0.46
P(GOT) = 0.40
P(M) · P(GOT) = 0.46 × 0.40 = 0.184
Since P(M ∩ GOT) = 0.16 ≠ 0.184 = P(M) · P(GOT), the two events are not independent.
It states that the probability of an event Y is the sum of the conditional probabilities of
Y given each possible outcome of another variable Z, weighted by the probability of each
outcome of Z.
P(Y = i) = Σ_z P(Y = i | Z = z) P(Z = z)
• Z = 1: Sunny
• Z = 2: Cloudy
• Z = 3: Rainy
P (Y = 1 | Z = 1) = 0.1,
P (Y = 1 | Z = 2) = 0.3,
P (Y = 1 | Z = 3) = 0.9.
P (Z = 1) = 0.4,
P (Z = 2) = 0.3,
P (Z = 3) = 0.3.
P (Y = 1) = P (Y = 1 | Z = 1)P (Z = 1) + P (Y = 1 | Z = 2)P (Z = 2)
+ P (Y = 1 | Z = 3)P (Z = 3)
Substituting the values:
P (Y = 1) = (0.1)(0.4) + (0.3)(0.3) + (0.9)(0.3)
P (Y = 1) = 0.04 + 0.09 + 0.27 = 0.4
Naive Bayes Classification is a probabilistic model used in supervised learning, where it learns
from labeled training data to classify new, unseen data points. It applies Bayes’ Theorem
with the naive assumption that all input features are conditionally independent given the
class label.
The predicted class is the one with the highest posterior probability.
• Only 3 probabilities, P (x1 |Spam), P (x2 |Spam), P (x3 |Spam) need estimation.
• Likelihoods P (xi | Ck )
Given:
• HIV prevalence, P (HIV ) = 0.008
Conclusion: Even with a positive result, the probability of having HIV is low due to the
low prior probability.
Prior Probabilities:
P(S) = Total Spam / Total Emails = 50/100 = 0.5
P(¬S) = Total Non-spam / Total Emails = 50/100 = 0.5
Likelihood Probabilities:
P(Free|S) = Spam emails with "Free" / Total Spam = 30/50 = 0.6
Comparing the results, we can say that there is higher probability that the email
is spam if all three words "Free", "Win" and "Money" are present in the email.
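A small Python sketch of this comparison. The priors and P(Free|S) come from the counts above; the remaining likelihood values are assumed for illustration only.

# Priors from the counts above; likelihoods other than P(Free|Spam) are assumed.
prior = {'spam': 0.5, 'not_spam': 0.5}
likelihood = {
    'spam':     {'Free': 0.6, 'Win': 0.4, 'Money': 0.5},
    'not_spam': {'Free': 0.2, 'Win': 0.1, 'Money': 0.1},
}

def posterior_scores(words):
    # Naive Bayes: multiply the prior by each word likelihood
    # (conditional independence of words given the class).
    scores = {}
    for c in prior:
        p = prior[c]
        for w in words:
            p *= likelihood[c][w]
        scores[c] = p
    return scores

scores = posterior_scores(['Free', 'Win', 'Money'])
print(scores, max(scores, key=scores.get))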
Dataset:
Predict whether to play tennis given: Outlook = Sunny, Temperature = Hot, Humidity =
High, Windy = False
Play   Count   Prior
Yes    9       P(Yes) = 9/14 ≈ 0.64
No     5       P(No) = 5/14 ≈ 0.36
Conditional Probabilities
Outlook        Count   P(·|Yes)   P(·|No)
Sunny          5       2/9        3/5
Overcast       4       4/9        0/5
Rainy          5       3/9        2/5
Temperature
Hot            4       2/9        2/5
Mild           6       4/9        2/5
Cool           4       3/9        1/5
Humidity
High           7       3/9        4/5
Normal         7       6/9        1/5
Windy
False          8       6/9        2/5
True           6       3/9        3/5
Final Decision
We compute the posterior probabilities using Bayes’ Rule. For X = (Sunny, Hot, High, False):
P(Yes | X) ∝ P(Yes) · P(Sunny|Yes) · P(Hot|Yes) · P(High|Yes) · P(False|Yes) = (9/14)(2/9)(2/9)(3/9)(6/9) ≈ 0.0071
P(No | X) ∝ P(No) · P(Sunny|No) · P(Hot|No) · P(High|No) · P(False|No) = (5/14)(3/5)(2/5)(4/5)(2/5) ≈ 0.0274
Since P(No | X) > P(Yes | X), the prediction is Play Tennis = No.
• Multinomial Naive Bayes: - For count data (e.g., word counts). - Used in text
classification.
3.9.2 Limitations
• Assumes independence of features.
4.1 Introduction
A Decision Tree is a supervised learning model used for both classification and regression
tasks. It is tree-structured with:
• Branches representing the outcomes of those tests. Example: 4.1 Yes, No, Sunny,
Overcast etc.
H(D) = − Σ_{i=1}^{k} P_i log₂ P_i    (4.1)
Extending this idea to the general case where the classes C1 , C2 , . . . , Ck have arbitrary
probabilities P1 , P2 , . . . , Pk , we can measure the total entropy as:
H(D) = − Σ_{i=1}^{k} P_i log₂ P_i
We are measuring information in units of bits, which is why we use base 2 in the
logarithm.
• Nats, which use the natural logarithm (loge or ln), commonly used in physics
and information theory when working with continuous distributions.
In the context of classification, entropy reflects the unpredictability of the class label.
• High entropy (close to 1 bit for binary classification) means classes are equally
mixed, making the outcome uncertain.
• Low entropy (0) means all samples belong to one class, hence fully predictable.
High entropy in the training data indicates a rich diversity of examples across different classes.
This is good for training because it provides the model with sufficient information to learn
meaningful decision boundaries.
However, when splitting data at a node, high entropy is undesirable—it means the split has
not made the data more pure or homogeneous. The goal of each split in a decision tree is to
reduce entropy (increase purity), so a good split is one that creates low-entropy subsets.
Figure 4.2: Visual comparison of an impure split (left) vs a pure split (right).
Example in fig: 4.2 we can see, in the impure split, both child nodes contain a mix of classes
(Yes in green, No in red), indicating high entropy. In the pure split, the data is cleanly
separated by class, resulting in low entropy and a more informative split.
• High entropy in training data ensures the model encounters all class types.
• Low entropy in split subsets ensures the decision tree is making clear distinc-
tions that improve classification.
H(D|X) = Σ_{j=1}^{m} (number of instances where X = X_j / total number of instances) · H(D | X = X_j)    (4.3)
H(D | X_j) = − Σ_{i=1}^{k} P(C_i | X_j) log₂ P(C_i | X_j)    (4.4)
Here, P (Ci |Xj ) is the probability of the i’th class, Ci in the subset where feature X = Xj .
Using all these, we can compute the Information Gain (IG) which measures reduction in
entropy.
IG(D, X) = H(D) − H(D|X) (4.5)
H(D|Weather) = (3/5) · 0.918 + (2/5) · 1 = 0.9508
Information Gain
IG(D, Weather) = 0.971 − 0.9508 = 0.0202
H(D|Temp) = (4/14) · 1 + (6/14) · 0.918 + (4/14) · 0.811 ≈ 0.286 + 0.393 + 0.232 = 0.911
IG(D, Temp) = 0.940 − 0.911 = 0.029
H(D|Humidity) = (7/14) · 0.985 + (7/14) · 0.592 = 0.5 · (0.985 + 0.592) = 0.789
IG(D, Humidity) = 0.940 − 0.789 = 0.151
H(D|Wind) = (8/14) · 0.811 + (6/14) · 1 = 0.463 + 0.429 = 0.892
IG(D, Wind) = 0.940 − 0.892 = 0.048
• IG(Outlook) = 0.247
• IG(Humidity) = 0.151
• IG(Wind) = 0.048
• IG(Temperature) = 0.029
Therefore, Outlook has the highest information gain and is selected as the root node.
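A short Python sketch of the entropy and information-gain computations used above. The label counts (9 Yes, 5 No) come from the Play Tennis dataset; the feature rows themselves are not reproduced here, so the information_gain function is shown only as a sketch of how the split would be scored.

from collections import Counter
from math import log2

def entropy(labels):
    # H(D) = -sum of p_i * log2(p_i) over the class proportions in `labels`.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    # IG(D, X) = H(D) minus the weighted entropy of the subsets created by feature X.
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# 9 "Yes" and 5 "No" labels, as in the Play Tennis dataset.
print(round(entropy(['Yes'] * 9 + ['No'] * 5), 3))   # about 0.94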
• Humidity:
– IG = 0.971 - 0 = 0.971
• Wind:
– IG = 0.971 − 0 = 0.971
• Humidity:
– Weighted entropy:
H = (2/5) · 1 + (3/5) · 0.918 ≈ 0.951
• Temp:
– Weighted entropy:
H = (3/5) · 0.918 + (2/5) · 1 ≈ 0.951
Figure 4.3: Final ID3 Decision Tree for the Play Tennis dataset. Outlook is the root; the Sunny branch splits on Humidity (High → No, Normal → Yes), the Overcast branch predicts Yes, and the Rain branch splits on Wind (False → Yes, True → No).
Class Frequency
A 3
B 2
C 1
Total instances: N = 6
Class probabilities:
P_A = 3/6 = 0.5,   P_B = 2/6 ≈ 0.333,   P_C = 1/6 ≈ 0.167
Entropy is calculated as:
H(D) = −(0.5 log₂ 0.5 + 0.333 log₂ 0.333 + 0.167 log₂ 0.167) ≈ 1.46 bits
For, k number of class labels, we will have to use probability distribution [P1 . . . Pk ], Pi
representing the probability of the i’th class label.
(Figure: an example of an overfitted tree rooted at Outlook, with deep branches that test very specific combinations such as Temp = Hot, Temp = Mild, and Humidity = High.)
In this tree, the model makes decisions based on very specific combinations of attributes such
as Temp = Hot and Humidity = High or Wind = Strong and Humidity = High, rather than
broader, generalizable patterns. This level of detail may capture noise in the training dataset
rather than meaningful trends.
• But for Outlook = Sunny, Humidity = High, and Temp = Mild, the player did play.
If the tree tries to fit such fine distinctions, it may become too sensitive to slight variations
and overfit.
Underfitting
Underfitting occurs when a decision tree is too shallow or too simple to capture the underlying
structure of the data. It fails to learn the relationships between features and the target class,
resulting in poor performance on both the training and test sets.
An underfit model makes overly broad generalizations and may return the same prediction
across many different inputs. This can happen due to early stopping, excessive pruning, or
not including enough splits to isolate relevant patterns.
Example of Underfitting
Imagine trying to model a dataset using only the root node feature, without allowing the
tree to explore deeper levels. For instance, if we split solely on Outlook, but ignore important
distinctions made by Humidity or Wind, the tree might generalize:
While this may cover dominant patterns, it ignores the nuances (e.g., differences based on
humidity or wind strength), leading to high error on both known and new data.
(Figure: an underfitted tree that splits only on Outlook.)
Solution
Allow the tree to grow deeper and use more features until training performance improves,
followed by pruning to improve generalization.
Post-Pruning
To combat overfitting, we apply pruning strategies:
• Post-pruning: Build the full tree and then remove branches that do not improve
accuracy on a validation set.
• Pre-pruning (early stopping): Stop tree construction early if further splits do not
add significant value.
• Ignore the instance: Remove records with missing values, especially if they form a
small portion of the dataset.
• Impute the missing value: Replace with the mean (for numerical attributes), mode
(for categorical attributes), or a more advanced method like KNN or regression-based
imputation.
Example: Given temperatures [60, 70, 80] and their associated play outcomes, try a thresh-
old like Temp ≤ 65 to split. If this improves the purity of child nodes, the threshold is
retained.
• These are suitable for tasks like predicting house prices, exam scores, or temperature.
• The tree is constructed by splitting the data at each node based on a feature and a
threshold that minimizes prediction error — typically measured using metrics like Mean
Squared Error (MSE) or Mean Absolute Error (MAE).
• Unlike classification trees that output a class label, regression tree leaves output a
real-valued prediction (usually the average value of target variables in that node).
• Suppose we’re predicting house prices based on area, number of bedrooms, and location
score.
• A node might split on the rule "Area ≤ 2000 sq ft", separating smaller and larger
homes.
• The leaf nodes would return values such as $150,000 for smaller homes and $320,000
for larger ones, based on the average price within each group.
5.1 Introduction
Linear regression is a supervised learning algorithm in machine learning and statistics. It is
used to model the relationship between a dependent variable (Y) and one or more independent variables (x1, x2, ...) by fitting a linear equation to observed data.
y = w 0 + w 1 x1 + w 2 x2 + · · · + w n xn + ϵ (5.1)
Or in vectorized form:
y = w⊤ x + ϵ (5.2)
Here,
• w0 is the intercept.
To find the optimal weights w, we compute the gradient of L(w) with respect to w and set
it to zero:
∂L/∂w = 0    (5.5)
this provides the condition for the minimum.
Solving this system (either analytically or using optimization methods like gradient descent)
yields the weight vector w that minimizes the prediction error across the training data.
Let us define the weight vector W = ⟨w0, w1⟩ and define the hypothesis (linear model) with these weights,
h_w(x) = w1 x + w0    (5.6)
can be used to predict the value of the output variable. We are given the following data of
students’ exam scores based on the number of hours they studied:
Analytical solution
Setting ∂L/∂w1 = 0 and ∂L/∂w0 = 0, i.e.
∂/∂w1 Σ_{i=1}^{m} (y − (w1 x + w0))² = 0   and   ∂/∂w0 Σ_{i=1}^{m} (y − (w1 x + w0))² = 0,
we solve for w0 and w1 and find that the best-fit line for this data is:
ŷ = 47.57 + 4.57x.
(Figure: scatter plot of Exam Score (y) against Hours Studied (x), showing the actual data points and the best-fit line.)
considered optimal when the partial derivatives of the loss function L with respect to each
weight are zero:
∂L/∂w1 = 0,  ∂L/∂w2 = 0,  . . . ,  ∂L/∂wm = 0    (5.8)
Linear regression has a closed-form solution using the normal equation:
w = (XᵀX)⁻¹ Xᵀ y
This solution directly computes the optimal weights that minimize the loss function.
However, computing this closed form solution to find the optimal solution can be complex
specially in high dimensional data.
While the analytical solution is faster for small, simple problems, we introduce gradient
descent algorithm for modern, large-scale, and complex machine learning models due to
its flexibility and efficiency.
w_i^(t+1) = w_i^(t) − α · ∂L/∂w_i |_{w_i = w_i^(t)}    (5.11)
Here, α is the learning rate that controls the step size, t represents the iteration number and
i refers to the specific parameter in the weight vector being updated.
5: Update weights: w ← w − α ∂L/∂w
6: end while
7: return W
Learning Parameter, α
The parameter α in Equation 5.11 is called the learning rate or step size. It controls how
quickly the algorithm updates the weights and how fast it converges to a minimum.
Choosing an appropriate learning rate is important. If α is too large, the algorithm may
overshoot or oscillate around the minimum and fail to converge. If it is too small, the
algorithm will take a long time to reach the minimum.
The learning rate can be a constant or a decaying parameter. A decaying learning rate helps
the algorithm explore more widely in the early stages and fine-tune near the minimum.
Computing the Gradient of the Loss (∂L/∂w_k) in Univariate Linear Regression with Mean Squared Error
Let us consider a univariate linear regression with mean squared error as the loss function.
We already know
ŷ = h_w(x) = w0 + w1 x
and
L = (1/m) Σ_{i=1}^{m} (y − ŷ)²
∂L/∂w0 = ∂/∂w0 [ (1/m) Σ_{i=1}^{m} (y − ŷ)² ]
       = (2/m) Σ_{i=1}^{m} (y − ŷ) · ∂/∂w0 (y − ŷ)
       = −(2/m) Σ_{i=1}^{m} (y − ŷ) · ∂ŷ/∂w0
       = −(2/m) Σ_{i=1}^{m} (y − ŷ) · ∂/∂w0 (w0 + w1 x)
∂L/∂w0 = −(2/m) Σ_{i=1}^{m} (y − ŷ)    (5.13)
∂L/∂w1 = ∂/∂w1 [ (1/m) Σ_{i=1}^{m} (y − ŷ)² ]
       = (2/m) Σ_{i=1}^{m} (y − ŷ) · ∂/∂w1 (y − ŷ)
       = −(2/m) Σ_{i=1}^{m} (y − ŷ) · ∂ŷ/∂w1
       = −(2/m) Σ_{i=1}^{m} (y − ŷ) · ∂/∂w1 (w0 + w1 x)
∂L/∂w1 = −(2/m) Σ_{i=1}^{m} (y − ŷ) x    (5.14)
Computing the Gradient of the Loss (∂L/∂w_k) in Multivariate Linear Regression with Mean Squared Error
∂L/∂w0 = −(2/m) Σ_{i=1}^{m} (y − ŷ)    (5.15)
∂L/∂w_k = −(2/m) Σ_{i=1}^{m} (y − ŷ) x_k    (5.16)
Here, xk denotes the k’th input variable and wk is its corresponding weight.
w_k^(t+1) = w_k^(t) + α (1/m) Σ_{i=1}^{m} (y − ŷ) x_k    (5.18)
We compute the gradients of the MSE loss with respect to w0 and w1 as follows:
∂L/∂w0 = −(2/m) Σ_{i=1}^{m} (y_i − ŷ_i),   ∂L/∂w1 = −(2/m) Σ_{i=1}^{m} (y_i − ŷ_i) x_i
x    y    ŷ    y − ŷ    (y − ŷ)·x
1    52   0    52       52
2    55   0    55       110
3    61   0    61       183
4    66   0    66       264
5    71   0    71       355
6    75   0    75       450
Total:         380      1414
Σ_{i=1}^{6} (y − ŷ) = 380,   Σ_{i=1}^{6} (y − ŷ) x_i = 1414
Let’s take α = 0.1. With m = 6, after the first iteration the updated weights will be
w0^(2) = w0^(1) + α · (1/m) · Σ (y − ŷ) = 0 + (0.1) · (1/6) · 380 ≈ 6.33
and
w1^(2) = w1^(1) + α · (1/m) · Σ (y − ŷ) x = 0 + (0.1) · (1/6) · 1414 ≈ 23.57
In the second iteration, we will use the updated weights w0^(2) and w1^(2) to predict ŷ and
compute the gradient.
The gradient descent algorithm will repeat this process until convergence.
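A minimal Python sketch of this loop on the same data (the x and y values read off the table above), with α = 0.1 and 1000 iterations as in the example. It is a sketch rather than a production implementation; the factor of 2 in the gradient is folded into the learning rate, as in the notes.

# Data from the worked example: x = hours studied, y = exam score.
xs = [1, 2, 3, 4, 5, 6]
ys = [52, 55, 61, 66, 71, 75]

def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    w0, w1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        preds = [w0 + w1 * x for x in xs]
        # Averaged errors, used as the (sign-flipped) MSE gradients.
        grad0 = sum(y - p for y, p in zip(ys, preds)) / m
        grad1 = sum((y - p) * x for x, y, p in zip(xs, ys, preds)) / m
        w0 += alpha * grad0
        w1 += alpha * grad1
    return w0, w1

print(gradient_descent(xs, ys))   # gradually approaches the least-squares fit for this data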
Geometric Intuition
Gradient descent can be viewed as a ball rolling downhill on the loss surface:
• The gradient points in the direction of steepest ascent.
• Each iteration moves the parameter vector w closer to the minimum of the loss function.
Because of the quadratic nature of the loss function, the loss surface is convex, and the minimum (the point where the loss is lowest) lies at w0 = 47.57 and w1 = 4.57.
w ← w − η ∂/∂w L_full-batch(w)
Pros: Accurate direction of descent.
Cons: Can be slow and memory-intensive for large datasets.
• It forms the basis of more complex models like logistic regression and neural networks.
NOTE: This chapter is written based on chapter 19.6.4 and 19.6.5 of Artificial
Intelligence: A Modern Approach by Stuart Russell and Peter Norvig.
Our reference book shows an example of classification between two types of seismic events:
The goal is to learn a hypothesis, h that can correctly classify new points, (x1 , x2 ), labeling
earthquakes as 0 and explosions as 1. The linear separator for this example dataset is
−4.9 + 1.7x1 − x2 = 0
From the plots we can see that the data points for explosions are below the line, meaning
−4.9 + 1.7x1 − x2 > 0,
and the data points for earthquakes are above the line, meaning
−4.9 + 1.7x1 − x2 < 0.
In Linear Classification with threshold we want to develop an equation for the decision
boundary.
Z = W · X + W0 (6.1)
by predicting the weights, W and W0 . This equation is then used to construct the hypothesis,
h_W(x) = { 1 if Z > 0;  0 if Z < 0 }    (6.2)
Here, the hard threshold makes h_W(x) a step function: it is not differentiable (its derivative is zero everywhere except at the boundary), so gradient descent will not work in this case. To solve this problem we introduce perceptron learning. In this process the weights are only adjusted when the prediction is incorrect.
W0 = W0 + α · (y − hW (X)). (6.3)
Wj = Wj + α · (y − hW (X))Xj . (6.4)
Typically, perceptron rule uses a randomly chosen example from the data to compute the
updated weight.
A training curve plots how well the classifier is doing on the same training data as it keeps
learning, one update at a time.
When the data are not linearly separable, there is no perfect line that can divide the two
classes without mistakes. The perceptron algorithm is designed to adjust its weights whenever
it makes a mistake. Since mistakes can never be fully avoided in this case, the perceptron
keeps finding errors and keeps adjusting the weights forever, even if it often comes close to
the best possible solution. This stops the perceptron rule from converging.
Plot (a) of the training-curve figure shows how the perceptron learning rule improves over time when it is trained on linearly separable data.
In this case, the curve shows that the perceptron eventually finds a perfect (zero-error)
straight-line separator between the two classes.
However, the learning process is not smooth—it jumps around before finally getting it right.
The number of updates can vary between different training runs. Plots (b) and (c) show that the perceptron rule fails to converge even after 10,000 updates when dealing with noisy data.
Plot (c) of the figure shows the training process when using a decaying learning rate. It still doesn’t
reach perfect convergence even after 100,000 updates, but it does much better than when
using a fixed learning rate.
Normally, the perceptron rule may not converge to a stable solution if the learning rate stays fixed.
However, if we make the learning rate smaller over time, for example using

α = 1000 / (1000 + t)

(where t is the number of updates), then the perceptron can be shown to converge to a minimum-error solution, as long as the training examples are presented in a random order.
Still, finding the true minimum-error solution is a very hard problem (NP-hard), so we expect that it will take many passes over the data before convergence happens.
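A small sketch of this schedule, assuming the same constant 1000 mentioned above; using it inside the perceptron loop sketched earlier, in place of the fixed α, gives the decaying-learning-rate variant.

```python
def decaying_alpha(t, c=1000):
    """Learning-rate schedule alpha = c / (c + t), t = number of updates so far."""
    return c / (c + t)

# alpha starts at 1.0 and shrinks slowly as updates accumulate
print([round(decaying_alpha(t), 3) for t in (0, 1000, 10000, 100000)])
```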
The sigmoid function outputs the probability that an example belongs to Class 1.
Unlike the hard threshold in 6.2, logistic regression uses a continuous function for its model,
hw (X).
Figure 6.1
Therefore, logistic regression minimizes a continuous loss function (MSE or binary cross-
entropy), which is smooth and differentiable.
This allows the use of gradient descent, leading to more stable and gradual updates.
In contrast to perceptron learning, logistic regression is able to find a best-fit boundary even with noisy, linearly non-separable data.
w0^(t+1) = w0^(t) − α · ∂L/∂w0        (6.7)

w1^(t+1) = w1^(t) − α · ∂L/∂w1        (6.8)
• Binary Cross-Entropy: L = −(1/m) Σ_{i=1}^{m} [ y^(i) ln ŷ^(i) + (1 − y^(i)) ln(1 − ŷ^(i)) ].
Here, ŷ = hw(Z) = 1/(1 + e^(−Z)) and Z = W0 + W1 X1 + W2 X2 + ... + Wn Xn, where n is the number of input features.
So,

∂Z/∂W0 = 1        (6.9)

and

∂Z/∂Wj = Xj        (6.10)

where Xj is the j'th input and Wj is the associated weight.
Computing ∂ŷ/∂W0:

∂ŷ/∂W0 = ∂/∂W0 [ 1/(1 + e^(−z)) ]
       = −1/(1 + e^(−z))² · ∂/∂W0 (1 + e^(−z))
       = −1/(1 + e^(−z))² · ∂(e^(−z))/∂W0
       = e^(−z)/(1 + e^(−z))² · ∂z/∂W0
       = 1/(1 + e^(−z)) · e^(−z)/(1 + e^(−z)) · ∂z/∂W0
       = 1/(1 + e^(−z)) · (e^(−z) + 1 − 1)/(1 + e^(−z)) · ∂z/∂W0
       = 1/(1 + e^(−z)) · [ 1 − 1/(1 + e^(−z)) ] · ∂z/∂W0
       = 1/(1 + e^(−z)) · [ 1 − 1/(1 + e^(−z)) ]        [using 6.9]

∂ŷ/∂W0 = ŷ (1 − ŷ)        (6.11)

Similarly,

∂ŷ/∂Wj = ŷ (1 − ŷ) Xj        (6.12)
For the mean squared error loss, L = (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i))², we have

∂L/∂W0 = ∂/∂W0 [ (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i))² ]
       = −(2/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i)) · ∂ŷ/∂W0

Similarly,

∂L/∂Wj = −(2/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i)) · ∂ŷ/∂Wj        (6.13)

Substituting 6.11 and 6.12,

∂L/∂W0 = −(2/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i)) ŷ^(i) (1 − ŷ^(i))        (6.14)

and

∂L/∂Wj = −(2/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i)) ŷ^(i) (1 − ŷ^(i)) Xj^(i)        (6.15)
Finally, we can write the update rule in the gradient descent algorithm for logistic regression with mean squared error loss (the constant factor 2 is absorbed into the learning rate α):

W0^(t+1) = W0^(t) + α (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i)) ŷ^(i) (1 − ŷ^(i))        (6.16)

and

Wj^(t+1) = Wj^(t) + α (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i)) ŷ^(i) (1 − ŷ^(i)) Xj^(i)        (6.17)
Now,

∂L/∂W0 = −(1/m) Σ_{i=1}^{m} ∂/∂W0 [ y^(i) ln ŷ^(i) + (1 − y^(i)) ln(1 − ŷ^(i)) ]        (6.19)
Computing ∂/∂W0 (y ln ŷ):

∂/∂W0 (y ln ŷ) = (y/ŷ) · ∂ŷ/∂W0
              = (y/ŷ) · ŷ (1 − ŷ)        [using 6.11]

∂/∂W0 (y ln ŷ) = y − y ŷ        (6.20)
Computing ∂/∂W0 ((1 − y) ln(1 − ŷ)):

∂/∂W0 ((1 − y) ln(1 − ŷ)) = (1 − y)/(1 − ŷ) · ∂(1 − ŷ)/∂W0
                          = −(1 − y)/(1 − ŷ) · ∂ŷ/∂W0
                          = −(1 − y)/(1 − ŷ) · ŷ (1 − ŷ)        [using 6.11]

∂/∂W0 ((1 − y) ln(1 − ŷ)) = y ŷ − ŷ        (6.21)
Using 6.20 and 6.21 in 6.19 we get

∂L/∂W0 = −(1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i))        (6.22)

and, similarly,

∂L/∂Wj = −(1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i)) Xj^(i)        (6.23)
Finally, using 6.22 and 6.23 we get the update rule for the gradient descent algorithm in logistic regression with binary cross-entropy:

W0^(new) = W0^(old) + α (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i))        (6.24)

and

Wj^(new) = Wj^(old) + α (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i)) Xj^(i)        (6.25)
x1 x2 y
0.5 1.2 0
1.0 0.8 0
1.5 1.3 1
2.0 1.7 1
The logistic regression model is:

z = w0 + w1 x1 + w2 x2 ,    ŷ = 1/(1 + e^(−z))
Initial weights: w0 = 0, w1 = 0, w2 = 0. With all weights at zero, z = 0 and ŷ = σ(0) = 0.5 for every point, so from the table:

Table 6.1: Gradient Computation at Initial Weights

Σ_{i=1}^{4} (y^(i) − ŷ^(i)) = 0;
Σ_{i=1}^{4} (y^(i) − ŷ^(i)) x1^(i) = 1;
Σ_{i=1}^{4} (y^(i) − ŷ^(i)) x2^(i) = 0.50.

(The sketch below reproduces these sums and applies one update.)
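The NumPy sketch below computes these sums for the four points in the table and then applies one batch update with rules 6.24 and 6.25; the learning rate α = 0.1 is an assumed value, not one fixed by the example.

```python
import numpy as np

X = np.array([[0.5, 1.2],
              [1.0, 0.8],
              [1.5, 1.3],
              [2.0, 1.7]])
y = np.array([0, 0, 1, 1], dtype=float)
w0, w = 0.0, np.zeros(2)
alpha, m = 0.1, len(y)                       # alpha is an assumed choice

y_hat = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))  # all 0.5 at zero weights
print((y - y_hat).sum())                     # 0.0
print(((y - y_hat) * X[:, 0]).sum())         # 1.0
print(((y - y_hat) * X[:, 1]).sum())         # 0.5

# one batch update with binary cross-entropy (eqs. 6.24 and 6.25)
w0 += alpha / m * (y - y_hat).sum()
w += alpha / m * ((y - y_hat)[:, None] * X).sum(axis=0)
print(w0, w)
```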
7.1 Introduction
Neural networks are a foundational component of modern machine learning and artificial
intelligence. Inspired by the structure of the human brain, a neural network consists of
layers of interconnected units called neurons that process data in stages. These models
are particularly effective in tasks like image classification, natural language processing, and
regression analysis.
A single perceptron computes

y = f(w1 x1 + w2 x2 + ... + wn xn + b)

Here:
• xi are the input values and wi are the corresponding weights
• b is a bias term
• f is the activation function
A perceptron takes several input values, applies weights to them, sums them together with
a bias term, and then passes the result through an activation function — in our case, a
step function in the original model. It can be used to separate linearly separable data.
Weight Matrix: The full set of weights between two layers can be written as a matrix W (l) ,
where each row corresponds to the weights leading into a single neuron in the current layer.
If the previous layer has n neurons and the current layer has m neurons, then W (l) is an
m × n matrix and the bias vector will have m entries each associated with one neuron.
W^(l) = [ w11^(l)  w12^(l)  · · ·  w1n^(l) ;
          w21^(l)  w22^(l)  · · ·  w2n^(l) ;
          ...
          wm1^(l)  wm2^(l)  · · ·  wmn^(l) ]   (m × n)

b^(l) = [ b1^(l) ;  b2^(l) ;  ... ;  bm^(l) ]   (m × 1)
For a single neuron j in layer l, the pre-activation is

zj^(l) = Σ_i wji^(l) ai^(l−1) + bj^(l)

Where:
• ai^(l−1) is the activation from neuron i in the previous layer
• wji^(l) is the weight from neuron i in layer l − 1 to neuron j in layer l
• bj^(l) is the bias for neuron j in layer l

More concisely, in matrix form,

z^(l) = W^(l) · a^(l−1) + b^(l)        (7.5)
• W^(l): Weight matrix of shape m × n, where each row corresponds to a neuron in the current layer, and each column to a neuron in the previous layer.
• a^(l−1): Activation vector from the previous layer, of shape n × 1. For the first layer the activation vector is the original input vector X with n attributes.
• b^(l): Bias vector for the current layer, with one bias per neuron, of shape m × 1.
• z^(l): Pre-activation vector of the current layer, of shape m × 1, computed as the weighted sum of the previous activations plus the bias.
• a^(l): The output activations from the current layer, used as input to the next layer (or as the final prediction if it is the output layer).
It introduces non-linearity into the model, allowing it to learn complex patterns. Common
activation functions include:
• Step: step(z) = 1 if z ≥ 0, and 0 otherwise
• Sigmoid: σ(z) = 1/(1 + e^(−z)), maps the input to (0, 1)
• Tanh: tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
• ReLU: ReLU(z) = max(0, z)
Each has different properties affecting training dynamics and convergence.
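For reference, here is a small NumPy sketch of these activation functions, together with the two derivatives that will be needed later for backpropagation; the test vector is arbitrary.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # sigma'(z) = sigma(z)(1 - sigma(z))

def relu_prime(z):
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, np.tanh, relu):
    print(f.__name__, np.round(f(z), 3))
```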
For a given layer l, the weight matrix W (l) has dimensions m × n, where:
• n is the number of neurons in the previous layer l − 1
• m is the number of neurons in the current layer l
The output of the linear transformation is:
z (l) = W (l) a(l−1) + b(l) (7.7)
This vector z (l) is then passed through an activation function f to introduce non-linearity:
a(l) = f (z (l) ) (7.8)
The activation function thresholds the result of each neuron’s computation, allowing the
network to model logical decision boundaries like AND, OR, or XOR (with enough layers).
Thus, the weight matrix defines how signals propagate and interact, and the activation
function defines how these signals are interpreted at each neuron.
ŷ = f(z^(2)) = sigmoid(−0.2) = 1/(1 + e^(0.2)) ≈ 0.45
In this example:
• The first layer transformed the input using weights and biases.
• The step activation introduced non-linearity, turning negative values to 0 and non-
negatives to 1.
• The second layer used the resulting activations to compute a final output.
Each activation a(l) is passed to the next layer until the final output is produced.
General Structure
Assume a network with L layers (excluding the input layer), where each layer l has n^(l) neurons.
Given an input x ∈ R^(n^(0)) (e.g., a feature vector), the feedforward steps for l = 1, ..., L are as follows (see the sketch after this list):
1. Compute the pre-activation z^(l) = W^(l) a^(l−1) + b^(l).
2. Apply the activation a^(l) = f(z^(l)).
Here:
• a^(l−1) ∈ R^(n^(l−1)) is the activation output from the previous layer (or the input x if l = 1)
• f is the activation function, e.g. the sigmoid σ(z) = 1/(1 + e^(−z))
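A minimal sketch of this feedforward loop is given below. The 2-3-1 architecture and all weights and biases are hypothetical values chosen only to show the shapes involved.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases, activation=sigmoid):
    """Apply z(l) = W(l) a(l-1) + b(l) and a(l) = f(z(l)) for l = 1..L."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b            # pre-activation, eq. (7.7)
        a = activation(z)        # activation,     eq. (7.8)
    return a

# hypothetical 2-3-1 network
W1 = np.array([[0.2, -0.4],
               [0.7, 0.1],
               [-0.3, 0.5]])
b1 = np.array([0.0, 0.1, -0.2])
W2 = np.array([[0.5, -0.6, 0.3]])
b2 = np.array([0.1])

x = np.array([1.0, 2.0])
print(feedforward(x, [W1, W2], [b1, b2]))
```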
Input vector:

x = [5, 7]⊤ ,   W = [0.4  0.6] ,   b = −3

Compute the linear combination:

z = 0.4 · 5 + 0.6 · 7 − 3 = 2 + 4.2 − 3 = 3.2
Input vector:

x = [3, 1]⊤

z = 0.7 · 3 + 1.5 · 1 − 2
  = 2.1 + 1.5 − 2 = 1.6

y = σ(z) = 1/(1 + e^(−1.6)) ≈ 0.832
Input vector:

x = [0.8, 0.6, 2]⊤

Output Layer:
Pre-activation: z^(3) = 1.0 · 0.5785 − 1.5 · 0.7138 + 0.6 · 0.5875 + 0.1 = −0.0397
Apply sigmoid activation: ŷ = σ(z^(3)) = 1/(1 + e^(0.0397)) ≈ 0.4901
The model predicts a 49.01% probability of loan approval.
Commonly used output activations and loss functions (a short code sketch follows this list):
• Sigmoid: σ(z) = 1/(1 + e^(−z))
• Softmax: ŷj = e^(zj) / Σ_k e^(zk)
• Categorical cross-entropy: L_CCE = −Σ_j yj log(ŷj)
• Linear output: ŷ = W · X + b
• Mean squared error: L_MSE = (1/N) Σ_{i=1}^{N} (yi − ŷi)²
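A short NumPy sketch of the softmax output and the categorical cross-entropy listed above; the logits and the one-hot target are arbitrary illustrative values, and the softmax is shifted by max(z) purely for numerical stability.

```python
import numpy as np

def softmax(z):
    """y_j = e^{z_j} / sum_k e^{z_k}, shifted by max(z) for stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

def categorical_cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

z = np.array([2.0, 1.0, 0.1])        # arbitrary logits
y = np.array([1.0, 0.0, 0.0])        # one-hot target
y_hat = softmax(z)
print(np.round(y_hat, 3), round(float(categorical_cross_entropy(y, y_hat)), 3))
```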
For simplicity, let us consider L = 2, with activation function a^(1) for the hidden layer and a^(2) for the output layer. The weight matrix for the hidden layer is w^(1) and for the output layer is w^(2), so the forward computation forms the chain

x →[w^(1)]→ z^(1) → a^(1)(z^(1)) →[w^(2)]→ z^(2) → a^(2)(z^(2)) = ŷ
The output layer has a single neuron. The linear equation for this neuron is given by:

Z1^(L) = W^(L) · a^(L−1) + b^(L) = w1^(L) a1^(L−1) + w2^(L) a2^(L−1) + · · · + wp^(L) ap^(L−1) + b^(L)        (7.13)

Here, p is the number of neurons in the previous layer (layer L − 1), and each ai^(L−1) is the activation output from neuron i in that layer.
By the chain rule,

∂L/∂wi^(L) = (∂L/∂a^(L)) · (∂a^(L)/∂Z1) · (∂Z1/∂wi^(L)).    Recall, ŷ = a^(L)(Z1).

Recall that Z1^(L) = Σ_i w1i^(L) ai^(L−1) + b1^(L), so

∂Z1^(L)/∂w1i^(L) = ai^(L−1).

Hence,

∂L/∂wi^(L) = (∂L/∂a^(L)) · (∂a^(L)/∂Z1) · ai^(L−1)

and, by using ∂Z1/∂b^(L) = 1,

∂L/∂b^(L) = (∂L/∂a^(L)) · (∂a^(L)/∂Z1) · 1.

Generally, this can be written as:

∂L/∂W^(L) = (∂L/∂a^(L)) · (∂a^(L)/∂Z1) · a^(L−1) ;        (7.14)

∂L/∂b^(L) = (∂L/∂a^(L)) · (∂a^(L)/∂Z1).        (7.15)
Finally, the update rules at the output layer L can be written as:

W^(L) ← W^(L) − α (∂L/∂a^(L)) (∂a^(L)/∂Z1) (a^(L−1))⊤        (7.16)

b^(L) ← b^(L) − α (∂L/∂a^(L)) (∂a^(L)/∂Z1).        (7.17)

Note: Here, the activation vector a^(L−1) is transposed to turn it into a row vector so it can be multiplied with the error. This gives a full matrix of weight updates, matching the shape of W^(L).
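As an illustration of rules 7.16 and 7.17, the sketch below performs one output-layer update with a sigmoid output and the squared-error loss introduced just below (eqs. 7.19, 7.21, 7.22). The previous-layer activations, weights, target, and α are assumed values, loosely echoing the loan-approval numbers above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_prev = np.array([0.58, 0.71, 0.59])   # a^(L-1), assumed hidden activations
W_L = np.array([[1.0, -1.5, 0.6]])      # W^(L), a single output neuron
b_L = np.array([0.1])
y, alpha = 1.0, 0.1                     # assumed target and learning rate

z_L = W_L @ a_prev + b_L                # eq. (7.13)
y_hat = sigmoid(z_L)

dL_da = y_hat - y                       # dL/da^(L) for L = 1/2 (y_hat - y)^2
da_dz = y_hat * (1 - y_hat)             # da^(L)/dZ for the sigmoid

# update rules (7.16) and (7.17)
W_L = W_L - alpha * np.outer(dL_da * da_dz, a_prev)
b_L = b_L - alpha * dL_da * da_dz
print(W_L, b_L)
```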
Examples:
• Sigmoid activation:
ϕ(z) = 1/(1 + e^(−z))   ⇒   ϕ′(z) = ϕ(z)(1 − ϕ(z))
So,
da^(i)/dZ^(i) = a^(i) (1 − a^(i))        (7.19)
• ReLU activation:
ϕ(z) = max(0, z)   ⇒   ϕ′(z) = 1 if z > 0, and 0 if z ≤ 0
So,
da^(i)/dZ^(i) = 1 if Z^(i) > 0, and 0 otherwise        (7.20)
For the squared-error loss,

L = (1/2)(ŷ − y)²        (7.21)

dL/da^(L) = ŷ − y        (7.22)
7.7.5 Defining Error, dL/dZ^(l)
Let us define

∂L/∂Z^(l) = (∂L/∂a^(l)) · (∂a^(l)/∂Z^(l)) = D^(l)        (7.25)

where l is the layer.
The term dL/dZi^(l) = Di^(l) represents how the loss L changes with respect to the pre-activation Zi^(l) of the i'th neuron at layer l.
The error D^(l) combines:
• how the loss changes with the layer's activations, ∂L/∂a^(l), and
• how those activations change with the pre-activations, ∂a^(l)/∂Z^(l).
Recall that ∂L/∂a^(2) · ∂a^(2)/∂Z^(2) = D^(2) is already computed at the output layer. Now, to compute ∂Z^(2)/∂a^(1), you may recall that the output layer contains only one neuron and the linear equation at that layer is written as:

Z1^(2) = W^(2) · a^(1) + b1^(2) = Σ_j W1j^(2) aj^(1) + b1^(2)
Now,

∂Z1^(2)/∂ak^(1) = W1k^(2).        (7.28)

More generally, for neuron p of the next layer,

∂Zp^(2)/∂ak^(1) = Wpk^(2)        (7.29)

so, in matrix form,

∂Z^(2)/∂a^(1) = W^(2).        (7.30)
The derivative of the activation a^(i) with respect to the input Z^(i) at layer i is given by:

da^(i)/dZ^(i) = ϕ′(Z^(i))        (7.31)

This results in a column vector whose entries are ϕ′(Zk^(i)), one for each neuron k of the layer. Lastly,

∂Z^(l)/∂W^(l) = a^(l−1).        (7.32)
Putting these together for the hidden layer,

∂L/∂W^(1) = [ D^(2) · (W^(2))⊤ ⊙ ∂a^(1)/∂Z^(1) ] · (a^(0))⊤        (7.33)
          = [ D^(2) · (W^(2))⊤ ⊙ ∂a^(1)/∂Z^(1) ] · x⊤        (7.34)

and,

∂L/∂b^(1) = D^(2) · (W^(2))⊤ ⊙ ∂a^(1)/∂Z^(1).        (7.35)
and, component-wise,

∂L/∂Wik^(1) = D1^(2) · W1i^(2) ϕ′(Zi^(1)) xk ,
∂L/∂bi^(1) = D1^(2) · W1i^(2) ϕ′(Zi^(1))        (7.38)
Here:
• Wik^(1) is the weight connecting the k'th input xk to the i'th neuron in the first hidden layer.
• ∂L/∂Wik^(1) shows how this weight affects the total loss L.
• D1^(2) · W1i^(2) represents how much the i'th neuron in the first layer influences the output layer's error.
This shows that, for any number of layers, the error at a given layer can be computed from
the error and weights of the next layer. This process of sending the error backward through
the network is called backpropagation.
D^(l) = D^(l+1) · (W^(l+1))⊤ ⊙ ∂a^(l)/∂z^(l)        (7.40)
Here,
• The term Σ_p Dp^(l+1) Wpk^(l+1) is the backpropagation term. It represents the total contribution of the k'th neuron in layer l to the errors in all neurons p in the next layer (l + 1), weighted by the corresponding weights Wpk^(l+1).
• ϕ′ = ∂ak^(l)/∂zk^(l) is the derivative of the activation function at the k'th neuron of layer l.
In matrix form,

∂L/∂w^(l) = [ (W^(l+1))⊤ δ^(l+1) ⊙ ∂a^(l)/∂z^(l) ] · (a^(l−1))⊤        (7.42)

∂L/∂b^(l) = (W^(l+1))⊤ δ^(l+1) ⊙ ∂a^(l)/∂z^(l)        (7.43)

and, written with the error D^(l),

∂L/∂wik^(l) = Di^(l) ak^(l−1)        (7.44)

∂L/∂b^(l) = D^(l)        (7.45)
and,

a^(1) = σ(z^(1)) = [ 1/(1 + e^(0.3)) ,  1/(1 + e^(−0.75)) ]⊤ ≈ [ 0.4256 ,  0.6792 ]⊤        (7.50)
w12^(1) ← w12^(1) − α D1 x2        (7.59)
        = −0.3 − 0.1 · (−0.0085) · 2        (7.60)
        = −0.2983        (7.61)

b1^(1) ← b1^(1) − α D1        (7.62)
       = 0.1 − 0.1 · (−0.0085)        (7.63)
       = 0.10085        (7.64)

Neuron 2:
w21^(1) ← w21^(1) − α D2 x1        (7.65)
        = −0.1 − 0.1 · 0.0051 · 1 = −0.10051        (7.66)
w22^(1) ← w22^(1) − α D2 x2        (7.67)
        = 0.4 − 0.1 · 0.0051 · 2 = 0.39898        (7.68)
b2^(1) ← b2^(1) − α D2        (7.69)
       = 0.05 − 0.1 · 0.0051 = 0.04949        (7.70)
Layer 1 delta:

Dk^(1) = ( Σ_p Dp^(2) Wpk^(2) ) · σ′(zk^(1))

D1^(1) = (−0.0085 · 0.2 + 0.0057 · (−0.4)) · σ′(−0.3)
       = (−0.0017 − 0.00228) · 0.2445
       = −0.00097

D2^(1) = (−0.0085 · (−0.1) + 0.0057 · 0.2) · σ′(1.1)
       = (0.00085 + 0.00114) · 0.1879
       = 0.00037

D3^(1) = (−0.0085 · 0.3 + 0.0057 · 0.1) · σ′(−0.2)
       = (−0.00255 + 0.00057) · 0.2475
       = −0.00049
The corresponding weight and bias updates for Neuron 1 (with α = 0.1, x1 = 1.0, x2 = 2.0) are then

ΔW1,1^(1) = −α D1^(1) x1 = −0.1 · (−0.00097) · 1.0 = 0.000097

ΔW1,2^(1) = −α D1^(1) x2 = −0.1 · (−0.00097) · 2.0 = 0.000194

Δb1^(1) = −α D1^(1) = −0.1 · (−0.00097) = 0.000097
Similarly, compute updates for all weights and biases.
• In the next forward pass, the updated weights and biases are used to make a new
prediction.
• If the prediction is still incorrect, new gradients are computed based on the updated
prediction, and weights are further updated.
• This cycle repeats over many training examples and epochs until the network con-
verges to a solution that minimizes the loss.
• Once training is complete, the final set of weights is frozen and used for inference
on unseen data.
Backpropagation trains a neural network by making small improvements to its weights with
each example it sees. These updates accumulate over time, gradually improving the network’s
predictions.
• Choose a learning rate η, number of epochs, and batch size (if applicable).
Step 4: Backpropagation
• Compute gradients of the loss w.r.t. output layer and recursively propagate errors
backward:
δ^(L) = (ŷ − y) ⊙ σ′(z^(L))
δ^(l) = (W^(l+1))⊤ δ^(l+1) ⊙ σ′(z^(l))
• After each pass through the entire dataset, increment the epoch counter. (A complete training-loop sketch following these steps is given below.)
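Putting the pieces together, here is a minimal NumPy training-loop sketch that follows the forward pass, the error recursion δ, and the update rules derived in this chapter. The XOR data, the 2-4-1 architecture, α = 0.5, the epoch count, and the random initialization are all assumed choices; convergence on any particular random seed is not guaranteed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, sizes=(2, 4, 1), alpha=0.5, epochs=20000, seed=0):
    """Per-example training loop with sigmoid activations and squared-error
    loss, using the delta recursion and the gradients dL/dW = D a_prev^T."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0.0, 1.0, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]

    for _ in range(epochs):
        for x, y in zip(X, Y):
            # Forward pass: keep every activation for the backward pass.
            activs = [x]
            a = x
            for W, b in zip(Ws, bs):
                a = sigmoid(W @ a + b)
                activs.append(a)

            # Output-layer error: delta = (y_hat - y) * sigma'(z),
            # with sigma'(z) written as a * (1 - a) for the sigmoid.
            D = (activs[-1] - y) * activs[-1] * (1 - activs[-1])

            # Walk backwards through the layers.
            for l in reversed(range(len(Ws))):
                grad_W = np.outer(D, activs[l])   # dL/dW = D * a_prev^T
                grad_b = D                        # dL/db = D
                if l > 0:
                    a_prev = activs[l]
                    # Propagate the error to the previous layer (pre-update weights).
                    D = (Ws[l].T @ D) * a_prev * (1 - a_prev)
                Ws[l] -= alpha * grad_W
                bs[l] -= alpha * grad_b
    return Ws, bs

def predict(x, Ws, bs):
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
    return a

# XOR: a function that a single linear unit cannot represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
Ws, bs = train(X, Y)
for x, y in zip(X, Y):
    # Predictions should move toward 0, 1, 1, 0; more epochs or another
    # random seed may be needed, since convergence is not guaranteed.
    print(x, y, np.round(predict(x, Ws, bs), 2))
```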