Artificial Intelligence
E[X] = ∑_x x P(X = x) = ∑_x x p[x]   (discrete)

E[X] = ∫_{−∞}^{∞} x p(x) dx   (continuous)
Here, P(·) represents probability for a random variable taking a value or falling within a range of
values, p(·) or p[·] represent probability density or mass functions for continuous or discrete observed
variables (realizations), respectively. X is a random variable, and x is a specific realization of the
random variable X. A realization of a random variable is the specific value that the variable takes in
a particular trial or experiment. The horizontal axis of a PDF or a PMF represents the realization x. Are P(X), p(x), and p[x] the same, apart from P(X) referring to the random variable (population) and p(x), p[x] referring to observed (sample) values? Finally, if we are not computing the probability of a random variable, or if we have not assigned a number to an event, we can denote the probability of the event directly, like this:
P(Head) = (number of Head outcomes) / (all outcomes)
——————————— Volume 1 ———————————
Optimization
Discrete optimization: Find the best discrete objects.
Continuous optimization: Find the best vector of real numbers, e.g.,

min_{w⃗ ∈ ℝ^d} Cost(w⃗)
A demo problem: computing edit distance. As inputs you have two strings s and t. You need to find the minimum number of edits it takes to change s into t. By 'edit' here, we mean any of three operations: 1) insertion, 2) deletion, 3) substitution. How would you solve it? What is the optimal strategy? Is this an optimization problem? How? If yes, is it discrete or continuous?
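As a concrete illustration, here is a minimal dynamic-programming sketch for edit distance; the function name and the example strings are only illustrative, not part of the text above.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn s into t."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i remaining characters of s
    for j in range(n + 1):
        dp[0][j] = j          # insert all j remaining characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # characters match, no edit needed
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],     # deletion
                                   dp[i][j - 1],     # insertion
                                   dp[i - 1][j - 1]) # substitution
    return dp[m][n]

print(edit_distance("cats", "the cats"))  # 4 (four insertions)
```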
Consider another problem: imagine you're managing a fleet of trucks, and fuel efficiency depends on the speed at which a truck travels. Driving too fast wastes fuel due to air resistance, while driving too slowly is inefficient due to the engine's lower performance. The goal is to identify the optimal speed that minimizes fuel usage, which can be modeled as a/v + bv², with v being the speed. How would you solve it? Is this also an optimization problem? How? If yes, is it discrete or continuous?
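As a quick check of the continuous case (assuming a, b > 0), setting the derivative of the fuel model to zero gives the optimal speed in closed form:

f(v) = a/v + bv² ,   f′(v) = −a/v² + 2bv = 0   ⟹   v* = ( a / (2b) )^{1/3} ,

and f″(v) = 2a/v³ + 2b > 0 for v > 0, confirming that this is a minimum.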
Machine Learning
Machine learning can be seen from three perspectives. 1) Modeling: what is the type of the machine
learning model hw ( x )? 2) Learning: How are the model parameters determined given a set of training
data? 3) Inference: Given a trained model, what is the predicted output for the test data? Here, the
output y, called the target, can be discrete or continuous depending on the type of the problem. The
input x, generally called the feature, can also vary based on the problem.
Linear regression
If ⃗x represents a feature vector and y represents a continuous target variable, the linear regression model
can be expressed mathematically as:
ŷ = h(x⃗) = w⃗ · x⃗ ,   where   w⃗ = [ w₁, w₂, · · · , w_d ]ᵀ
Once we have the model, the next step is to train the model, which means to compute the model
parameters of the model, given a set of training data {(⃗x (1) , y(1) ), (⃗x (2) , y(2) ), · · · , (⃗x (m) , y(m) )}. For
that, it is a common practice to define an optimization problem and then solve that problem to find
out the values of the model parameters. A commonly used loss (or cost) function is squared error loss
as follows:
J(w⃗) = (1/2m) ∑_{i=1}^{m} ( h(x⃗^(i)) − y^(i) )² = (1/2m) ∑_{i=1}^{m} ( w⃗ · x⃗^(i) − y^(i) )²

min_{w⃗} J(w⃗) = min_{w⃗} (1/2m) ∑_{i=1}^{m} ( w⃗ · x⃗^(i) − y^(i) )²
and the optimum model parameter vector (i.e., the goal solution in training) is
w⃗* = arg min_{w⃗} (1/2m) ∑_{i=1}^{m} ( w⃗ · x⃗^(i) − y^(i) )²
One way to solve the above optimization problem is to use gradient descent. Gradient descent starts with a random initialization of w⃗, then iteratively updates it using the following update equation:

w⃗ ← w⃗ − η ∇_{w⃗} J(w⃗)   or,   w⃗ ← w⃗ − η (∂/∂w⃗) J(w⃗)   or,   w⃗ ← w⃗ − (η/m) ∑_{i=1}^{m} ( h(x⃗^(i)) − y^(i) ) x⃗^(i)
The parameter η represents the gradient descent step-size. The iteration stops when ∆J (⃗ w) < ϵ.
Once we have the optimum model parameters w⃗*, the next and final step is inference, which is to use the model to predict the target of a test sample as follows:

ŷ_test = h(x⃗_test) = w⃗* · x⃗_test
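A minimal sketch of this train-and-predict pipeline in code (batch gradient descent on the squared-error loss above; the synthetic data, learning rate, and tolerance are illustrative assumptions):

```python
import numpy as np

def train_linear_regression(X, y, eta=0.1, eps=1e-8, max_iters=10_000):
    """Gradient descent on J(w) = (1/2m) * sum_i (w . x_i - y_i)^2."""
    m, d = X.shape
    w = np.random.randn(d) * 0.01            # random initialization
    prev_J = np.inf
    for _ in range(max_iters):
        residual = X @ w - y                 # h(x_i) - y_i for all i
        J = (residual ** 2).sum() / (2 * m)
        w -= (eta / m) * (X.T @ residual)    # w <- w - (eta/m) * sum_i (h(x_i) - y_i) x_i
        if abs(prev_J - J) < eps:            # stop when Delta J < eps
            break
        prev_J = J
    return w

# Illustrative data: y ~ 2*x1 - 3*x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 0.01 * rng.normal(size=200)
w_star = train_linear_regression(X, y)
x_test = np.array([1.0, 1.0])
print(w_star, w_star @ x_test)               # prediction y_hat = w* . x_test
```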
Linear classification
If ⃗x represents a feature vector and y represents a discrete target variable, the linear classification model
can be expressed mathematically as:
ŷ = h(x⃗) = sign(w⃗ · x⃗)   or,   ŷ = h(x⃗) = σ(w⃗ · x⃗),   σ(z) = 1 / (1 + e^{−z})   (sigmoid)
where, w1 , · · · wd are the model parameters. Like sign(·) or σ(·) here, other nonlinear functions can
also be used.
The next step, naturally, is to train the model, i.e., compute the model parameters, given a set of training data {(x⃗^(1), y^(1)), (x⃗^(2), y^(2)), · · · , (x⃗^(m), y^(m))}. For that, as in the regression example above, it is common practice to define an optimization problem first, and then solve that problem to find the values of the model parameters. Once we have the optimum model parameters w⃗*, the final step is inference, which is to use the model to predict the target of a test sample as follows:

ŷ_test = h(x⃗_test) = sign(w⃗* · x⃗_test)   or,   ŷ_test = σ(w⃗* · x⃗_test)
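As one common (but not the only) concrete choice, the parameters can be trained by gradient descent on the logistic (cross-entropy) loss; the loss, data, and hyperparameters below are assumptions for illustration, not prescribed by the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.5, iters=2000):
    """Gradient descent on the average cross-entropy loss; y contains 0/1 labels."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)                    # predicted probabilities
        w -= (eta / m) * (X.T @ (p - y))      # gradient of the cross-entropy loss
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)        # illustrative labels
w_star = train_logistic_regression(X, y)
x_test = np.array([0.5, -1.0])
y_hat = 1 if sigmoid(w_star @ x_test) > 0.5 else 0   # inference on a test sample
print(w_star, y_hat)
```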
Neural network: a cascade
In fact, we can expand the extension above a bit to build a concatenation (cascade) of nonlinear computation units, which becomes a simple form of what is widely known as a neural network. The modeling equation then takes, for example with two layers, the following form:

ŷ = h(x⃗) = σ( w⃗₂ · σ( W₁ x⃗ ) )

We can further expand this to build a longer series of computational units and add more components and architectures, converging toward a practically used neural network. Once we have the model, the ideas of training, optimization, and inference are similar to before (though not necessarily identical).
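A minimal sketch of such a cascade's forward pass, assuming the two-layer form above with arbitrary placeholder sizes and weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_forward(x, W1, w2):
    """y_hat = sigma(w2 . sigma(W1 x)): a cascade of nonlinear computation units."""
    hidden = sigmoid(W1 @ x)     # first layer of nonlinear units
    return sigmoid(w2 @ hidden)  # output unit

rng = np.random.default_rng(0)
x = rng.normal(size=4)           # d = 4 input features (arbitrary)
W1 = rng.normal(size=(3, 4))     # 3 hidden units (arbitrary)
w2 = rng.normal(size=3)
print(two_layer_forward(x, W1, w2))
```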
Regularization
Add a penalty term with the cost function
J(w⃗) = J_regular(w⃗) + (λ/2) ||w⃗||²_{L2} ,   where   ||w⃗||²_{L2} = w₁² + · · · + w_d².
Unsupervised learning
Here, y_train is also unknown. The goal is to find structure in the data. One example of unsupervised learning is clustering: grouping data samples based on their features. Let's take a look at a well-known clustering algorithm: the k-means clustering algorithm.
Let’s assume we have m number of data points and we have to find out K clusters. We have d
features, i.e., the feature vector ⃗x lives in a d-dimensional space, i.e., ⃗x ∈ Rd . Let ⃗µ(k) ∈ Rd be the
centroid of the k-th cluster. Our goal is to minimize the following loss function
J(µ⃗, z) = ∑_{i=1}^{m} || x⃗^(i) − µ⃗^(z_i) ||²_{L2}
Here, we need to choose the centroids µ⃗ and the assignments z jointly. Here are the steps:

Step 1: Initialize the K centroids µ⃗^(1), · · · , µ⃗^(K) (e.g., randomly).

Step 2: Assign each data point to its nearest centroid: z_i ← arg min_k || x⃗^(i) − µ⃗^(k) ||²_{L2}.

Step 3: Update each centroid as the mean of the points assigned to it:

µ⃗^(k) ← ( 1 / |{i : z_i = k}| ) ∑_{i : z_i = k} x⃗^(i) ,   k ∈ [1, K].
Step 1 above will run only once. Next, we will run the updates in Steps 2 and 3 sequentially back and
forth (2 → 3 → 2 → 3 → · · · ) until convergence.
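A compact sketch of these alternating updates (the data, K, and the tie/empty-cluster handling are illustrative simplifications):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Alternate assignment (step 2) and centroid update (step 3) after a one-time init (step 1)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    mu = X[rng.choice(m, size=K, replace=False)]          # step 1: initialize centroids
    for _ in range(iters):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)                           # step 2: assign each point
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])             # step 3: recompute centroids
        if np.allclose(new_mu, mu):                        # convergence
            break
        mu = new_mu
    return mu, z

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, K=2)
print(centroids)
```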
Search
So far, we’ve talked about machine learning models, which are reflex-based: one input leads to one
output (single action). Now, let’s discuss search models, which are state-based: a sequence of actions
with one input leading to multiple outputs over time (see Fig. 2).
For example, consider a pathfinding problem. Suppose you're in Gulshan and want to reach Dhanmondi. Your goal might be to find the shortest path (objective). However, priorities can change depending on the situation – maybe you now care more about finding the fastest path (another objective), even if it's not the shortest. The solution depends on the defined objective, and we need to
consider the future impact of each action.
For instance, an empty road might seem like a good choice initially (locally optimal). But if it leads
to a road with heavy traffic due to construction, it’s a poor decision overall (globally suboptimal). The
key is to evaluate actions not just in the moment but also based on their future consequences. How
can we do that?
The farmer's small boat (B) can only carry him (F) and one item – either the goat (G), the plant (P), or the wolf (W) – at a time. The situation is tricky: if the G is left alone with the P, it will eat the P; if the W is left alone with the G, it will attack the G. What should he do?

[Figure: the farmer (F), plant (P), goat (G), wolf (W), and boat (B) on the river bank.]
Search tree: starts from the initial state, then branches out based on all possible actions. A search
tree is a structure where each node represents a state, and the branches represent possible moves or
actions that lead to other states. Each action has a cost ($1 in this case). Any path that takes from
the initial state to the goal state is a valid path. The path that corresponds to the minimum total cost
is the solution. Can you now think of an algorithm that can provide these solutions?
[Figure: search tree for the river-crossing problem, rooted at the initial state FPGW || and ending at the goal state || FPGW. Each action costs $1. Two solution paths reach the goal, each with total cost $7 – Solution 1: FG →, F ←, FP →, FG ←, FW →, F ←, FG →; Solution 2: FG →, F ←, FW →, FG ←, FP →, F ←, FG →.]

In this problem, we start with everyone on the left side of the river, which is the initial state (FPGW ||). The goal is to get everyone safely to the right side of the river (goal state: || FPGW). To achieve this, we need to consider all the possible valid actions. For example, the farmer can go alone from left to right (F →), or he can take the goat with him from left to right (FG →). Similarly, the farmer can take the plant back with him from right to left (FP ←), and so on. The complete list of valid actions is: A = { F →, FP →, FG →, FW →, F ←, FP ←, FG ←, FW ← }.

Note that FPG →, for example, is not a valid action, as it breaks the condition stated in the problem. Also, actions in A are not valid from every state, and some may lead to an invalid state, meaning they could result in a state that violates the conditions. As you can see in the tree above, there are two paths from the initial state to the goal state. Both paths have the same total cost ($7), so either path can be considered the solution.

A transportation problem

Imagine you are walking along a street divided into blocks, starting at block 1 and aiming to reach block N. At each block, you have two options to move forward: a) you can walk to the next block, which costs 1 unit, or b) you can bike to the next block, which costs 2 units. Your goal is to reach block N while minimizing the total cost. To represent this, think of each block as a state s. At each state, you must choose whether to walk or bike, determining the sequence of actions that leads to the lowest total cost by the time you arrive at your destination, block N.
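A minimal dynamic-programming sketch for this block problem; with the stated costs walking is always cheapest, so the point here is the state/action/cost recurrence rather than the particular answer (N = 5 is an arbitrary example):

```python
def min_cost_to_reach(N):
    """Minimum total cost to go from block 1 to block N.

    Actions per the text: 'walk' to the next block (cost 1) or 'bike' to the
    next block (cost 2). The recurrence is the key idea:
    cost(s) = min over actions of (step cost + cost(next state)).
    """
    INF = float("inf")
    cost = [INF] * (N + 1)
    cost[N] = 0                        # already at the destination
    best_action = [None] * (N + 1)
    for s in range(N - 1, 0, -1):      # work backwards from the goal
        for action, step_cost in (("walk", 1), ("bike", 2)):
            c = step_cost + cost[s + 1]
            if c < cost[s]:
                cost[s], best_action[s] = c, action
    return cost[1], best_action[1:N]

total, plan = min_cost_to_reach(5)
print(total, plan)   # 4, ['walk', 'walk', 'walk', 'walk']
```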
Algorithms for search
Backtracking search
Backtracking search is one of the simplest and probably one of the most intuitive algorithms that comes
to mind when solving a search tree. It’s a straightforward method used to explore all possible solutions
in a search tree. The goal of backtracking is to systematically explore this tree to find a solution, if one
exists.
The process starts at the root (or initial state) of the search tree and explores one branch at a time.
If you reach a point where no solution is possible (like a dead end), you “backtrack” to the previous
node and try a different branch. This way, you explore the tree depth-first, checking all possible paths
one by one. It’s like trying all the keys on a keyring one at a time.
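A minimal recursive sketch of backtracking over an explicitly given tree (the tree and node names are made up for illustration); it walks one branch at a time, backtracks at dead ends, and collects every root-to-goal path:

```python
def backtracking_search(tree, node, goal, path=None, solutions=None):
    """Depth-first exploration of all paths; backtrack when a branch dead-ends."""
    path = (path or []) + [node]
    solutions = solutions if solutions is not None else []
    if node == goal:
        solutions.append(path)             # record a complete solution
        return solutions
    for child in tree.get(node, []):       # try each branch in turn
        backtracking_search(tree, child, goal, path, solutions)
    return solutions                       # reaching here means we backtrack

# Illustrative search tree (adjacency lists)
tree = {"s0": ["s11", "s12"], "s11": ["s21", "s22"], "s12": ["goal"], "s22": ["goal"]}
print(backtracking_search(tree, "s0", "goal"))
# [['s0', 's11', 's22', 'goal'], ['s0', 's12', 'goal']]
```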
Breadth First Search (BFS)

The algorithm uses a queue data structure to keep track of nodes that need to be explored. It begins by adding the starting node to the queue. Then, in each step, it removes the front node from the queue, examines it, and adds its unvisited neighbors to the back of the queue. This ensures that nodes are visited in order of their depth in the tree. While BFS guarantees finding the shortest path in terms of the number of steps, it can consume significant memory, especially for trees or graphs with many nodes at each level.
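A sketch of this queue-based exploration on the same kind of toy tree (again illustrative); it returns the first path found to the goal, which has the fewest steps:

```python
from collections import deque

def bfs(tree, start, goal):
    """Explore nodes level by level using a FIFO queue; return a shortest path in steps."""
    queue = deque([[start]])               # queue of partial paths
    while queue:
        path = queue.popleft()             # remove the front node's path
        node = path[-1]
        if node == goal:
            return path
        for child in tree.get(node, []):   # add neighbors to the back of the queue
            queue.append(path + [child])
    return None

tree = {"s0": ["s11", "s12"], "s11": ["s21", "s22"], "s12": ["goal"], "s22": ["goal"]}
print(bfs(tree, "s0", "goal"))   # ['s0', 's12', 'goal']
```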
Depth First Search (DFS)
Depth-First Search (DFS) is an algorithm that explores a tree or graph by going as deep as possible
along each branch before backtracking. It starts at the root and explores one path completely, moving
down to the next node and continuing until it reaches a dead end or a solution. When it hits a dead
end, DFS backtracks to the last decision point and tries a different path. A flowchart for the DFS
algorithm is provided in Fig. 4.
Figure 4: DFS Explore Seq.: s0 → s11 → s21 → s11 → s22 → s11 → s0 → s12 → s23 → s12 → s0.
DFS uses a stack to keep track of which nodes to visit next. It starts by adding the starting node
to the stack, then picks the top node, checks it, and adds its neighbors to the stack. The algorithm
continues down the current path until it can’t go further, then it pops the stack to go back and explore
other branches. DFS is useful when you need to explore all possible solutions or paths, but it doesn’t
always find the shortest path and can take up a lot of memory if the tree is very deep.
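The same toy tree explored with an explicit stack instead of a queue (illustrative sketch); note that the path found is not necessarily the shortest:

```python
def dfs(tree, start, goal):
    """Explore as deep as possible along each branch using a LIFO stack."""
    stack = [[start]]                      # stack of partial paths
    while stack:
        path = stack.pop()                 # take the most recently added path
        node = path[-1]
        if node == goal:
            return path
        for child in reversed(tree.get(node, [])):  # push children, leftmost explored first
            stack.append(path + [child])
    return None

tree = {"s0": ["s11", "s12"], "s11": ["s21", "s22"], "s12": ["goal"], "s22": ["goal"]}
print(dfs(tree, "s0", "goal"))   # ['s0', 's11', 's22', 'goal']
```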
Uniform Cost Search (UCS)

Uniform Cost Search (UCS) explores a tree or graph by always expanding the node with the lowest cumulative cost, ensuring that the first time it reaches a node, it does so with the least possible cost. A flowchart for the UCS algorithm is provided in Fig. 5.
Figure 5: UCS Explore Sequence: s0 → s12 → s23 → s11 → s21. Solution: s0, s11, s21: 6.
UCS uses a priority queue (often implemented as a min-heap) to keep track of the nodes to be
explored. Each node in the queue is prioritized based on its path cost, and the algorithm removes the
node with the lowest cost to explore it further. UCS guarantees finding the optimal solution if the
graph has non-negative edge weights, but it can be slower and require more memory than algorithms
like BFS, especially in graphs with many nodes or high branching factors. It is ideal for problems where
the goal is to find the least-cost path, such as in route planning or network optimization.
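A sketch of UCS with a min-heap priority queue; the graph below uses edge costs chosen so that the result matches the solution in Fig. 5 (cost 6 via s0, s11, s21), but the exact costs are an assumption:

```python
import heapq

def ucs(graph, start, goal):
    """Expand the node with the lowest cumulative path cost first."""
    frontier = [(0, start, [start])]           # (path cost, node, path)
    best_cost = {}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in best_cost and best_cost[node] <= cost:
            continue                            # already reached more cheaply
        best_cost[node] = cost
        for child, edge_cost in graph.get(node, []):
            heapq.heappush(frontier, (cost + edge_cost, child, path + [child]))
    return None

# Illustrative edge costs; pop order is s0, s12, s23, s11, s21, matching Fig. 5
graph = {"s0": [("s11", 5), ("s12", 2)], "s11": [("s21", 1)], "s12": [("s23", 2)]}
print(ucs(graph, "s0", "s21"))   # (6, ['s0', 's11', 's21'])
```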
A* Search
A* Search is an algorithm that combines the strengths of Uniform Cost Search (UCS) with the concept
of heuristics to find the most efficient path in a graph. Like UCS, A* ensures that it explores the least
costly path by considering the cumulative cost of reaching a node. However, A* also uses a heuristic,
which is an estimate of the remaining cost to reach the goal from a node. This helps guide the search
more efficiently toward the goal. In A*, path costs are modified (penalized) based on the heuristic values of each node:

C(s → s′) = C0(s → s′) + h(s′) − h(s)

Here, C(·) denotes the new cost and C0(·) denotes the old cost. Consider the tree on the left. Let's say the heuristic values are given. Using the modified cost calculations, we redraw the tree (see the tree on the right).
Given heuristic values: h(s0) = 2, h(s11) = 3, h(s12) = 1, h(s21) = 4, h(s22) = 0.

C(s0 → s11) = C0(s0 → s11) + h(s11) − h(s0) = 2
C(s11 → s21) = C0(s11 → s21) + h(s21) − h(s11) = 2
C(s0 → s12) = C0(s0 → s12) + h(s12) − h(s0) = 0
C(s12 → s22) = C0(s12 → s22) + h(s22) − h(s12) = 0

Figure 6: UCS on the modified tree with the new costs will now be more efficient. [Left: the original tree from the initial state to the goal state with unit edge costs; right: the same tree with the modified costs 2 and 0.]
A* uses a priority queue, where each node is assigned a priority based on the sum of two factors:
the path cost to reach the node (like UCS) and the heuristic estimate of the cost to reach the goal. The
algorithm always expands the node with the lowest total cost (path cost + heuristic). This combination
of cost and heuristic allows A* to prioritize paths that are both promising and efficient, making it faster
than UCS in many cases. A* is guaranteed to find the optimal solution if the heuristic is admissible (it
never overestimates the true cost). It is widely used in route planning, game AI, and other optimization
problems.
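A* can be obtained from the UCS sketch above by prioritizing nodes on g + h; the small graph below reuses the heuristic values of Fig. 6, with unit edge costs assumed:

```python
import heapq

def astar(graph, h, start, goal):
    """Expand the node with the lowest g(n) + h(n), with g the path cost so far."""
    frontier = [(h[start], 0, start, [start])]     # (g + h, g, node, path)
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in best_g and best_g[node] <= g:
            continue
        best_g[node] = g
        for child, edge_cost in graph.get(node, []):
            g2 = g + edge_cost
            heapq.heappush(frontier, (g2 + h[child], g2, child, path + [child]))
    return None

graph = {"s0": [("s11", 1), ("s12", 1)], "s11": [("s21", 1)], "s12": [("s22", 1)]}
h = {"s0": 2, "s11": 3, "s12": 1, "s21": 4, "s22": 0}   # heuristic values from Fig. 6
print(astar(graph, h, "s0", "s22"))   # (2, ['s0', 's12', 's22']); s11 is never expanded
```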
——————————— Volume 2 ———————————
Markov decision process (MDP)
A dice game: stay or quit?
How can you get to Mountain View on Friday night in the shortest time? You could bike, drive, take a
train, call an Uber, or fly. Each option, however, comes with uncertainties. For instance, if you decide
to drive, you might encounter traffic.
Traditional search problems are typically deterministic, assuming no uncertainty in outcomes. Real-
world scenarios like this one require a probabilistic or stochastic approach to account for the inherent
uncertainties. How can we adapt this problem to include these considerations? Now, say you are playing a dice game with the following rules: in each round, you choose to stay or quit. If you quit, you receive $10 and the game ends. If you stay, you receive $4 and a die is rolled: if it shows 1 or 2, the game ends; otherwise (3 or higher), you continue to the next round.
Let us denote the dice outcome as a random variable D with domain (sample space) D =
{1, 2, 3, 4, 5, 6}. Let us pay attention to the following figure. It represents the state-diagram of the
problem. The probability of obtaining 1 or 2, i.e., P( D < 3) = 2/6 = 1/3. Similarly, the probability
of obtaining 3 or higher, i.e., P( D ≥ 3) = 4/6 = 2/3.
Let us denote the transition probability and reward by T (s, a, s′ ) and
R(s, a, s′ ), where s is the current state (either ‘in’ or ‘end’), s′ is the
next state (either ‘in’ or ‘end’), a is the action chosen (either ‘stay’
or ‘quit’) that takes you from s to s′ . Therefore,
T(in, stay, end) = 1/3, T(in, quit, end) = 1, T(in, stay, in) = 2/3, and R(in, stay, end) = $4, R(in, quit, end) = $10, R(in, stay, in) = $4.
Policy evaluation
The value of a policy π at state s can be calculated by
v_π(s) = 0 if s is the end state, and v_π(s) = Q_π(s, a) otherwise, where

Q_π(s, a) = ∑_{s′} T(s, a, s′) [ R(s, a, s′) + γ v_π(s′) ]
where s is the current state, s′ is the next state, a is the action chosen that takes from s to s′ . T (·)
and R(·) are the transition probabilities and rewards (mentioned before). γ here is called the discount
factor. For now, we will hold γ = 1. Now let’s evaluate the policies for our dice game above using this
formula. When the policy is to stay (π = a = stay), vπ (in) = Qπ (in, stay). Therefore,
v_{π=stay}(in) = ∑_{s′} T(in, stay, s′) [ R(in, stay, s′) + v_{π=stay}(s′) ]
= T(in, stay, in) [ R(in, stay, in) + v_{π=stay}(in) ] + T(in, stay, end) [ R(in, stay, end) + v_{π=stay}(end) ]
= (2/3) [ $4 + v_{π=stay}(in) ] + (1/3) [ $4 + 0 ] = $4 + (2/3) v_{π=stay}(in)
Therefore, vπ =stay (in) = $ 12. This means the value v for policy stay at state ‘in’ is $ 12. Let’s do
the same for policy quit (π = a = quit). We calculated this a little before using PMF and the value
v came out to be $ 10. Let’s try to find it again using this new formula and see if it matches the old
result:
v_{π=quit}(in) = ∑_{s′} T(in, quit, s′) [ R(in, quit, s′) + v_{π=quit}(s′) ]
= T(in, quit, in) [ R(in, quit, in) + v_{π=quit}(in) ] + T(in, quit, end) [ R(in, quit, end) + v_{π=quit}(end) ]
= 0 · [ $0 + v_{π=quit}(in) ] + 1 · [ $10 + 0 ] = $10
Therefore, vπ =quit (in) = $ 10. This means the value v for policy quit at state ‘in’ is $ 10, which
matches the old result calculated using PMF.
Optimum policy
The optimum policy π ∗ (in) = arg maxπ vπ (in). As vπ =stay (in) > vπ =quit (in)($ 12 > $ 10), the
optimum policy π ∗ (in) = stay at state ‘in’.
Value iteration
Iterative solution. For policy evaluation, begin by initializing vπ =a (s) randomly (or maybe with just
zeros). Then perform the following update:
vπ =a (s) := Qπ =a (s, a), stopping criteria: ∆vπ =a (s) < ϵ.
For calculating the optimum value v∗ and optimum policy π ∗ , perform the following update:
v*(s) := max_a Q_{π=a}(s, a),   π*(s) := arg max_a Q_{π=a}(s, a);   stopping criterion: ∆v*(s) < ϵ.
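A sketch of these updates on the dice game above (γ = 1, with T and R as defined earlier; the tolerance is an arbitrary choice):

```python
# States: 'in', 'end'. Actions from 'in': 'stay', 'quit'.
T = {('in', 'stay', 'in'): 2/3, ('in', 'stay', 'end'): 1/3, ('in', 'quit', 'end'): 1.0}
R = {('in', 'stay', 'in'): 4.0, ('in', 'stay', 'end'): 4.0, ('in', 'quit', 'end'): 10.0}
gamma, eps = 1.0, 1e-6

def Q(V, s, a):
    """Q(s, a) = sum over s' of T(s, a, s') [ R(s, a, s') + gamma * V(s') ]."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2])
               for (s1, a1, s2), p in T.items() if s1 == s and a1 == a)

V = {'in': 0.0, 'end': 0.0}
while True:
    V_new = {'end': 0.0,
             'in': max(Q(V, 'in', 'stay'), Q(V, 'in', 'quit'))}
    if abs(V_new['in'] - V['in']) < eps:
        break
    V = V_new

pi_star = max(['stay', 'quit'], key=lambda a: Q(V, 'in', a))
print(V['in'], pi_star)   # approximately 12.0, 'stay'
```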
Estimating Q̂_π from data: the utility u of an observed episode is the sum of its rewards,

u = ∑_i r_i   ⇒ utility

training data → in; stay, 4, end.   u = 4
training data → in; stay, 4, in; stay, 4, end.   u = 4 + 4 = 8
training data → in; stay, 4, · · · , end.   u = 4 + 4 + 4 + 4 = 16

average u = (4 + 8 + 16) / 3 = Q̂_π
Bootstrapping

Q̂_π(s, a) ← (1 − η) Q̂_π(s, a) + η [ r + γ Q̂_π(s′, a′) ]   (SARSA)

This also estimates Q̂_π, not Q̂_opt.
Q-learning
Allows us to get Q̂_opt (not Q̂_π). On each (s, a, r, s′):

Q̂_opt(s, a) ← (1 − η) Q̂_opt(s, a) + η [ r + γ v̂_opt(s′) ] ,   where   v̂_opt(s′) = max_{a′ ∈ Action set} Q̂_opt(s′, a′)
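A sketch of this update on the dice game, learning Q̂_opt from simulated (s, a, r, s′) transitions; the random exploration, learning rate, and number of episodes are illustrative assumptions:

```python
import random

def simulate(s, a):
    """One dice-game transition: returns (reward, next_state). Used only to generate data."""
    if a == 'quit':
        return 10.0, 'end'
    return 4.0, ('end' if random.random() < 1/3 else 'in')   # stay: $4, ends w.p. 1/3

Q = {('in', 'stay'): 0.0, ('in', 'quit'): 0.0}
eta, gamma, episodes = 0.1, 1.0, 20000

for _ in range(episodes):
    s = 'in'
    while s != 'end':
        a = random.choice(['stay', 'quit'])               # explore randomly
        r, s2 = simulate(s, a)
        v_opt = 0.0 if s2 == 'end' else max(Q[(s2, a2)] for a2 in ('stay', 'quit'))
        Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * v_opt)
        s = s2

print(Q)   # Q[('in','stay')] should approach 12, Q[('in','quit')] should approach 10
```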
——————————— Volume 3 ———————————
Game playing

Now you (the agent) will play against an active counterpart (the opponent), who is as intelligent as you. Consider a simple bin game: there are three bins A, B, and C; each bin contains two balls, a left one (L) and a right one (R), with the values shown in the game trees below. You pick a bin, and the opponent then picks which ball from that bin determines your utility.

First, you will have to make assumptions about what I (the opponent) might do. In other words, you will have to assume the opponent's policy. To formally denote a policy π, of either the agent or the opponent, first you need to build the action set (the set of all possible actions). Here, we are talking about the opponent's policy π_opp, so let's build the action set for the opponent. What are all possible actions of the opponent? Two actions: choosing the left ball (L) or choosing the right ball (R). Therefore, the action set of the opponent is A_opp = { L, R }. The policy π_opp is a vector containing the probabilities of each of the actions in the action set A_opp:

π_opp = [ P(L), P(R) ],   or   π_opp(s) = [ P(L|s), P(R|s) ],   π_opp(s, L) = P(L|s)

More generally (for both agent and opponent):   π(s) = [ P(a|s) ], ∀a ∈ A;   π(s, a) = P(a|s), ∀a ∈ A
2. Now, assume that the opponent is playing adversarially, then πopp = [1, 0]. In that case,
For bin A, value v = E[u] = ∑ p[u]u = 1 × (−50) + 0 × 50 = −50.
For bin B, value v = E[u] = ∑ p[u]u = 1 × 1 + 0 × 3 = 1.
For bin C, value v = E[u] = ∑ p[u]u = 1 × (−5) + 0 × 15 = −5.
You are getting the maximum value if you choose bin B. Therefore, for this assumption, you will
choose bin B.
3. Now, assume that the opponent is playing randomly, sometimes picking L, sometimes picking R,
then say πopp = [1/2, 1/2]. In that case,
For bin A, value v = E[u] = ∑ p[u]u = (1/2) × (−50) + (1/2) × 50 = 0.
For bin B, value v = E[u] = ∑ p[u]u = (1/2) × 1 + (1/2) × 3 = 2.
For bin C, value v = E[u] = ∑ p[u]u = (1/2) × (−5) + (1/2) × 15 = 5.
Here, you are getting the maximum value if you choose bin C. Therefore, for this assumption,
you will choose bin C.
Therefore, your policy depends on your assumption of your opponent’s policy. Your assumptions can
be right or wrong, that is another discussion.
First, let us consider two-player, turn taking, zero-sum, fully observable games.
Zero-sum game: ∑_{agent, opponent} utility = 0, i.e., the agent's utility plus the opponent's utility is zero. In game playing, utility only comes at the end state.
Game evaluation
Both π ag and πopp given. Value of the game is given as
V(s) =
  utility(s),                      end state
  ∑_a π_ag(s, a) V(s′),            agent active
  ∑_a π_opp(s, a) V(s′),           opponent active
Say for example, π ag (s) = [1, 0, 0] and πopp (s) = [1/2, 1/2]. In other words, π ag (s, A) = 1, π ag (s, B) =
0, π ag (s, C ) = 0, πopp (s, L) = 1/2 and πopp (s, R) = 1/2.
[Game tree: root s0 (agent active, value 0) with children s_A (value 0), s_B (value 2), s_C (value 5), where the opponent is active; each bin's children are the leaves s_AL, s_AR, s_BL, s_BR, s_CL, s_CR with utilities −50, 50, 1, 3, −5, 15.]

V(s_AL) = utility(s_AL) = −50
V(s_B) = ∑_a π_opp(s_B, a) V(s′) = π_opp(s_B, L) V(s_BL) + π_opp(s_B, R) V(s_BR) = 2
V(s0) = ∑_a π_ag(s0, a) V(s′) = 0
Expectimax policy
π ag not given (need to find out) and πopp is assumed. Value of the game is given as
V(s) =
  utility(s),                      end state
  max_a V(s′),                     agent active
  ∑_a π_opp(s, a) V(s′),           opponent active

π*_ag(s) = arg max_a V(s′)
[Game tree: root s0 (agent active, value 5) with children s_A (value 0), s_B (value 2), s_C (value 5), where the opponent is active; leaves s_AL, s_AR, s_BL, s_BR, s_CL, s_CR with utilities −50, 50, 1, 3, −5, 15.]

Here, π_opp(s) = [1/2, 1/2] is assumed, and π_ag is to be determined.

V(s_BL) = utility(s_BL) = 1
V(s_C) = ∑_a π_opp(s_C, a) V(s′) = π_opp(s_C, L) V(s_CL) + π_opp(s_C, R) V(s_CR) = 5
V(s0) = max_a V(s′) = max( V(s_A), V(s_B), V(s_C) ) = 5,   so π*_ag(s0) picks bin C.
Minimax policy
π ag not given (need to find out) and πopp not given (need to find out). But motives given for both
agent and opponent. Value of the game is then given as
V(s) =
  utility(s),        end state
  max_a V(s′),       agent active
  min_a V(s′),       opponent active

π*_ag(s) = arg max_a V(s′),   π*_opp(s) = arg min_a V(s′)
[Game tree: root s0 (agent active, value 1) with children s_A (value −50), s_B (value 1), s_C (value −5), where the opponent is active; leaves s_AL, s_AR, s_BL, s_BR, s_CL, s_CR with utilities −50, 50, 1, 3, −5, 15.]

Here, both π_ag and π_opp are to be determined.

V(s_A) = min( V(s_AL), V(s_AR) ) = min(−50, 50) = −50
V(s_B) = min(1, 3) = 1,   V(s_C) = min(−5, 15) = −5
V(s0) = max_a V(s′) = max( V(s_A), V(s_B), V(s_C) ) = 1,   so π*_ag(s0) picks bin B, and π*_opp picks the minimizing ball in each bin.
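A small sketch that computes these game values on the bin example (leaf utilities as in the figures); switching the opponent rule between an expectation under π_opp and a min reproduces the expectimax and minimax results:

```python
# Leaf utilities: bin -> (left ball, right ball)
LEAVES = {'A': (-50, 50), 'B': (1, 3), 'C': (-5, 15)}

def game_value(opponent='min', pi_opp=(0.5, 0.5)):
    """Root value: the agent picks a bin (max), then the opponent picks a ball."""
    bin_values = {}
    for b, (left, right) in LEAVES.items():
        if opponent == 'min':                        # minimax opponent
            bin_values[b] = min(left, right)
        else:                                         # expectimax: opponent plays pi_opp
            bin_values[b] = pi_opp[0] * left + pi_opp[1] * right
    best_bin = max(bin_values, key=bin_values.get)    # agent's arg max
    return bin_values, best_bin, bin_values[best_bin]

print(game_value(opponent='expect'))  # ({'A': 0.0, 'B': 2.0, 'C': 5.0}, 'C', 5.0)
print(game_value(opponent='min'))     # ({'A': -50, 'B': 1, 'C': -5}, 'B', 1)
```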
Expectiminimax policy
Three player = agent, opponent, and random nature. π ag not given (need to find out), πopp not given
(need to find out), and πnature is assumed. But motives given for both agent and opponent. Value of
the game is then given as
V(s) =
  utility(s),                        end state
  max_a V(s′),                       agent active
  min_a V(s′),                       opponent active
  ∑_a π_nature(s, a) V(s′),          nature active

π*_ag(s) = arg max_a V(s′),   π*_opp(s) = arg min_a V(s′)
Speeding up minimax
Evaluation: you can evaluate Vminmax (s) on a depth-limited tree adding artificial values to the inter-
mediate state (as no real utility until end state).
For example, in a chess game, the utility (and value) is +∞, 0, or −∞ at checkmate (or a draw); there are no real values at intermediate states. But you can allocate artificial values at intermediate states based on how many chess pieces the agent or the opponent has at a particular state: how many pawns, rooks, knights, etc., whether the queen has been taken, and so on. For example,

Eval(s) = 10^100 (K − K′) + 9 (Q − Q′) + 5 (R − R′) + 3 (B − B′ + N − N′) + 1 (P − P′)

or, Eval(s) = (number of legal moves of the agent − that of the opponent), and so on.
Alpha-beta pruning

Keep track of α_s (a lower bound on the value of a max node s) and β_s (an upper bound on the value of a min node s). If at any node,

1. the intervals (≥ α_s) and (≤ β_s) do not overlap → do not explore that node (prune it);
2. the intervals (≥ α_s) and (≤ β_s) overlap → keep exploring.

Thus, we are reducing the search space.
If v(s; w⃗) = w⃗ · ϕ⃗(s) (linear), then ∇_{w⃗} v(s; w⃗) = ϕ⃗(s). Therefore, the update equation becomes

w⃗ := w⃗ − η [ v(s; w⃗) − ( r + γ v(s′; w⃗) ) ] ϕ⃗(s)
Simultaneous games
Example: Rock-paper-scissors. Another similar game: Two-Finger Morra Game. Rules: Consider a
simultaneous game where the rule is:
- Player A and B show 1 or 2 fingers simultaneously.
- If both choose 1, A gets a reward of 2.
- If both choose 2, A gets a reward of 4.
- Otherwise, A gets a reward of -3.
Payoff Matrix: Let A be the agent and B be the opponent. The actions of A and B are denoted as
a ∈ {1, 2} and b ∈ {1, 2}, respectively. The payoff matrix V ( a, b) is given by:
                 B = 1    B = 2
V(a, b) =  A = 1    2       −3
           A = 2   −3        4
Pure Strategy: (similar to a deterministic policy) a single action a is chosen from the action set.

Mixed Strategy: (a probability distribution) a probability distribution π(a) is defined over the action set, satisfying 0 ≤ π(a) ≤ 1 and ∑_a π(a) = 1.

For the Two-Finger Morra game, if a player always chooses 2 fingers, it corresponds to a pure strategy with π(1) = 0, π(2) = 1. If the strategy is to show 1 finger half the time and 2 fingers the other half, it is a mixed strategy: π = [ π(1), π(2) ] = [ 1/2, 1/2 ].
Given the strategies of agent A and opponent B, how do we evaluate them? We compute the value

V(π_A, π_B) = ∑_{a,b} π_A(a) π_B(b) V(a, b).

Von Neumann's minimax theorem states that, for zero-sum games with mixed strategies, the value of the game is the same regardless of which player commits to their mixed strategy first. Non-Zero-Sum Games: utility of agent + utility of opponent ≠ 0.
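A quick numeric check of this formula on the Morra payoff matrix above, with both players mixing 50/50 (an arbitrary illustrative pair of strategies):

```python
# Payoff matrix V[a][b] for A (rows: a = 1, 2; columns: b = 1, 2), per the stated rules
V = [[2, -3],
     [-3, 4]]
pi_A = [0.5, 0.5]          # A's mixed strategy
pi_B = [0.5, 0.5]          # B's mixed strategy

value = sum(pi_A[a] * pi_B[b] * V[a][b] for a in range(2) for b in range(2))
print(value)   # 0.0 for this particular pair of strategies
```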
Factor Graphs
A variable-based model (a transition from state-based models, where the idea was to specify locally, optimize globally, and find an action sequence). Here, instead of an action sequence, we will look for an assignment of values to variables.
Question: how to 3-color the 7 provinces of Australia so that no two neighboring provinces have the same color? The colors available are {Red, Green, Blue} (the domain). The 7 provinces are: 1) WA (Western Australia), 2) NT (Northern Territory), 3) SA (South Australia), 4) QLD (Queensland), 5) NSW (New South Wales), 6) VIC (Victoria), 7) TAS (Tasmania).

[Figure: map of Australia showing WA, NT, SA, QLD, NSW, VIC, TAS and their adjacencies.]

Now how do we solve such problems? Each province here is treated as a variable, and the solution to the problem is an assignment of values (choices of colors here) to these variables while respecting the constraints. Here,

Variables X = {WA, NT, SA, QLD, NSW, VIC, TAS};   Domain_i = {R, G, B}.
Factors f_1(X) = [WA ≠ NT], f_2(X) = [NT ≠ QLD], · · · and so on · · ·
Our objective is to find an assignment x such that the weight, Weight(x) = ∏_j f_j(x), is maximized. Therefore, the optimum assignment is

x* = arg max_x ∏_j f_j(x).
Dependent factors D(partial assignment, new variable) = the set of factors that depend on the new variable and on the partially assigned variables, e.g., for the map above, D({WA: R}, NT) = { f_1 }, since f_1 = [WA ≠ NT] involves NT and the already-assigned WA.
This means that factors f determine the likelihood or preference of a certain color choice based on
dependencies.
Consider three variables P1, P2, P3, each picking a color, R or B. The conditions: P1 must pick B, and P3 prefers R; P1 and P2 must pick the same, and P2 and P3 prefer to pick the same. The f_j(P) ≥ 0 are factors, i.e., functions containing these constraints or preferences:

f_1(P1) = [P1 == B],   f_2(P1, P2) = [P1 == P2],   f_3(P2, P3) = [P2 == P3] + 2,   f_4(P3) = [P3 == R] + 1.

[Figure: factor graph with variables P1, P2, P3 and factors f_1(P1), f_2(P1, P2), f_3(P2, P3), f_4(P3).]

One thing we can do is to start with a partial assignment and then extend it. But we cannot do it randomly.
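A brute-force sketch that scores every assignment of the P1, P2, P3 example by the product of its factors and returns the best one:

```python
from itertools import product

# Factors from the example: [condition] is 1 if true, 0 if false
def f1(p1):      return 1 if p1 == 'B' else 0          # P1 must pick B
def f2(p1, p2):  return 1 if p1 == p2 else 0           # P1 and P2 must match
def f3(p2, p3):  return (1 if p2 == p3 else 0) + 2     # P2 and P3 prefer to match
def f4(p3):      return (1 if p3 == 'R' else 0) + 1    # P3 prefers R

def weight(p1, p2, p3):
    return f1(p1) * f2(p1, p2) * f3(p2, p3) * f4(p3)

best = max(product('RB', repeat=3), key=lambda x: weight(*x))
print(best, weight(*best))   # ('B', 'B', 'R') with weight 4
```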
Chain-structured factor graph: variables {X_i}, observation factors O_i(x_i), and transition factors t_i(x_i, x_{i+1}).

[Figure: chain X1 – X2 – X3 with transition factors t_1, t_2 between neighbors and observation factors O_1, O_2, O_3 attached to each variable.]
Conditioning: goal – try to disconnect the graph and thus simplify the problem.

Markov Blanket: the Markov blanket of a set of variables is another set of variables that needs to be conditioned on to separate the first set from the rest of the graph. For the map example, the Markov blanket of {VIC} is {SA, NSW}; the Markov blanket of {WA, NT} is {SA, QLD}; and so on.
Conditioning on X2 = B: remove X2 and the factor f(x1, x2), and add a new factor g(x1) = f(x1, B).

[Figure: the factor graph X1 – f(x1, x2) – X2 becomes X1 – g(x1) after conditioning on {X2: B}.]

Elimination: goal – same as conditioning: disconnect the graph and simplify. But it considers all values of the variable instead of conditioning on one value.

Eliminating X2: remove X2 and the factor f(x1, x2), and add a new factor h(x1) = max_{x2} f(x1, x2).

[Figure: the factor graph X1 – f(x1, x2) – X2 becomes X1 – h(x1) after eliminating X2.]
——————————— Volume 4 ———————————
Bayesian Network
Factor graphs + Probability.
Probability Review: Random variables (denoted X, Y, Z, etc.) are variables with unknown values, but
there is a known probability distribution over a random variable.
Joint distribution: Joint distribution P( X, Y, Z ) is the probability distribution over multiple random
variables, e.g., X, Y, Z, etc.
• Notation: P(S = s)
Probabilistic Inference
We consider a joint distribution P(S, R, T, A, . . . ) as a probabilistic database, an oracle or source of
truth. Suppose the joint distribution is P(S, R, T, A). What we are going to do is probabilistic inference.
P( R | T = 1, A = 0)
Here, R is the query variable. The expression is conditioned on evidence (T = 1, A = 0). S is marginal-
ized out (not a query, not a condition).
So, probabilistic inference is the process of finding the probability of some set of query variables,
conditioned on some set of conditioning variables, which are set to particular values. It is a combination
of marginalization and conditioning.
[Bayesian network: B → A ← E, i.e., nodes B and E each have an arrow into A.]

The arrows here mean that B (or E) causes A, not the opposite. The joint distribution is given by:

P(B = b, E = e, A = a) = p(b) p(e) p(a | b, e)

p(b), p(e), p(a | b, e) are local conditional distributions (and need to be defined).
Formalization
The variables involved are: Earthquake (E), Burglary (B), Alarm (A). B and E are independent,
so there is no edge between them in a Bayesian network. However, both B and E are parents of A,
meaning they cause A, not the other way around. From this representation, we can perform an inference.
P(B = b, E = e) = ∑_a P(B = b, E = e, A = a) = ∑_a p(b) p(e) p(a | b, e) = p(b) p(e) ∑_a p(a | b, e) = p(b) p(e)
This shows that marginalization of a leaf node yields a Bayesian network without that
node.
[Bayesian network: C → H ← A and A → I, with local conditional distributions p(c), p(a), p(h | c, a), p(i | a).]

Joint distribution:

P(C = c, A = a, H = h, I = i) = p(c) p(a) p(h | c, a) p(i | a)
Markov Model
Hidden Markov Model (HMM): here, the H_i are hidden nodes (not observed) and the E_i are observed nodes. [Figure: chain H1 → H2 → H3 → H4, with an emission H_i → E_i at each step.]

Naive Bayes model: here, Y is not observed, while W_1, W_2, · · · , W_L are observed. In classification, given W (the feature vector), find Y (the class label). [Figure: Y with children W_1, W_2, · · · , W_L.]
Generally, a Bayesian network over variables X_1, · · · , X_n defines the joint distribution

P(X_1 = x_1, · · · , X_n = x_n) = ∏_{i=1}^{n} p( x_i | x_{Parents(i)} )
Inference
P( X3 = x3 | X2 = 5) =?, ∀ x3
Probabilistic Inference
Input: Bayesian network P( X1 = x1 , · · · , Xn = xn ); Evidence E = e, E ⊆ X, Query Q ⊆ X.
Output: P( Q| E = e) or P( Q = q| E = e) for all q.
For example, in a Hidden Markov Model (HMM), H_i ∈ {1, · · · , k} is the actual but hidden quantity at time i, and E_i ∈ {1, · · · , k} is the sensor reading – observable, but not the actual quantity – at time i.

Start p(h_1); transition p(h_i | h_{i−1}); emission p(e_i | h_i) → the local conditional distributions.

We can ask many types of questions here. A filtering question: P(H_3 | E_1 = e_1, E_2 = e_2, E_3 = e_3); a smoothing question: P(H_3 | E_1 = e_1, E_2 = e_2, E_3 = e_3, E_4 = e_4, E_5 = e_5). We can ask any arbitrary question of this form.
Now, where do the local conditional distributions come from? They are analogous to model parameters and need to be learned from training data. Therefore, it's a supervised learning problem where the training data D_train consist of example assignments to X, and the parameters to be learned are θ, i.e., the local conditional probability distributions.
Example problem 1
Consider the simplest Bayesian network P( R = r ) = p(r ). Domain = {1, 2, 3, 4, 5}.
Therefore, here, the parameters are θ = ( p(1), p(2), · · · , p(5)).
If the training data (observed assignments to R) are D_train = {1, 3, 4, 4, 4, 4, 4, 5, 5, 5}, then the parameter estimate is θ = (1/10, 0/10, 1/10, 5/10, 3/10) = (0.1, 0, 0.1, 0.5, 0.3).
Example problem 2
Consider another Bayesian network P( G = g, R = r ) = p( g) p(r | g), G ∈ {d, c}, R ∈ {1, 2, 3, 4, 5}.
Let’s say the training data Dtrain = {(d, 4), (d, 4), (d, 5), (c, 1), (c, 5)}. Therefore, the model param-
eter
θ = ( p(d), p(c), p(1|d), p(2|d), p(3|d), p(4|d), p(5|d), p(1|c), p(2|c), p(3|c), p(4|c), p(5|c)), i.e.,
θ = (3/5, 2/5, 0, 0, 0, 2/3, 1/3, 1/2, 0, 0, 0, 1/2).
Example problem 3
Consider another Bayesian network P( G = g, A = a, R = r ) = p( g) p( a) p(r | g, a), G ∈ {d, c}, A ∈
{ ao , ai }, R ∈ {1, 2, 3, 4, 5}. Let’s say the training data Dtrain = {(d, ao , 3), (d, ai , 5), (d, ao , 1), (c, ao , 5), (c, ai , 4)}.
Therefore, the model parameter θ = { p( g), p( a), p(r | g, a)} → enumerate like before.
From the training data, p(d) = 3/5, p(c) = 2/5, p( ao ) = 3/5, p( ai ) = 2/5, p(1|d, ao ) =
1/2, p(3|d, ao ) = 1/2, p(5|d, ai ) = 1, p(5|c, ao ) = 1, p(4|c, ai ) = 1.
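These counts are easy to automate; a sketch for Example problem 2, reproducing the numbers above by simple counting:

```python
from collections import Counter

D_train = [('d', 4), ('d', 4), ('d', 5), ('c', 1), ('c', 5)]

g_counts = Counter(g for g, r in D_train)     # counts of g values
gr_counts = Counter(D_train)                  # counts of (g, r) pairs

p_g = {g: c / len(D_train) for g, c in g_counts.items()}                  # p(g)
p_r_given_g = {(r, g): gr_counts[(g, r)] / g_counts[g]                    # p(r | g)
               for g in g_counts for r in range(1, 6)}

print(p_g)                      # {'d': 0.6, 'c': 0.4}
print(p_r_given_g[(4, 'd')])    # 2/3
print(p_r_given_g[(5, 'c')])    # 1/2
```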
Logic
Logical models, inference, learning. If a + b = 5, a − b = 3, what is a =?
Deterministic and rule-based.
Logic language goals: 1) represent knowledge, 2) reason with it.
Ingredients of logic: 1) syntax (the set of valid formulas, or expressions), 2) semantics (specify the models for each formula, i.e., the meanings of expressions), 3) inference rules (operations on formulas).
Model: a model w in propositional logic is an assignment of truth values to the propositional symbols.
Example: for propositional symbols A, B (n = 2), there are 2^n = 2² = 4 possible models
W: {A: 0, B: 0}, {A: 0, B: 1}, {A: 1, B: 0}, {A: 1, B: 1}.
Interpretation function I(f, w): I takes a formula f and a model w as inputs and returns true (1) if w satisfies f and false (0) if it doesn't.
Example problem 1:
Given formula f = (¬ A ∧ B) ↔ C, model w : { A : 1, B : 1, C : 0}, what is I ( f , w) =?
I (¬ A, w) = 0
=⇒ I (¬ A ∧ B, w) = 0
=⇒ I ((¬ A ∧ B) ↔ C, w) = 1 ( ans.)
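A tiny sketch of the interpretation function for this example, with formulas represented as nested tuples (an ad-hoc representation chosen purely for illustration):

```python
def interp(f, w):
    """Return 1 if model w (dict of truth values) satisfies formula f, else 0."""
    if isinstance(f, str):                  # propositional symbol
        return w[f]
    op = f[0]
    if op == 'not':  return 1 - interp(f[1], w)
    if op == 'and':  return interp(f[1], w) & interp(f[2], w)
    if op == 'or':   return interp(f[1], w) | interp(f[2], w)
    if op == 'iff':  return 1 if interp(f[1], w) == interp(f[2], w) else 0
    raise ValueError(op)

# f = (not A and B) <-> C,  w = {A: 1, B: 1, C: 0}
f = ('iff', ('and', ('not', 'A'), 'B'), 'C')
w = {'A': 1, 'B': 1, 'C': 0}
print(interp(f, w))   # 1, matching the worked example above
```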
M(f) denotes the set of models w that satisfy f (i.e., I(f, w) = 1). For a knowledge base KB (a set of formulas), M(KB) = ∩_{f ∈ KB} M(f).
Example problem 2:
If f = r ∨ w, calculate M( f ).
w=0 w=1
r=0 0 1
r=1 1 1
Example problem 3:
Given f_1 = r, f_2 = r → w, calculate M(KB).

M(f_1) = the models with r = 1; M(f_2) = the models with r = 0 or w = 1. Therefore, M(KB) = M(f_1) ∩ M(f_2) = { {r: 1, w: 1} }, a single model.

As you add new knowledge (a new formula f) to your knowledge base KB, the set of models M(KB) shrinks if f is genuinely new; if f is not new, M(KB) doesn't shrink.
Entailment
KB entails f (written KB |= f ) if M(KB) ⊆ M( f ). Example: r ∧ s |= s.
Contradiction
KB contradicts f iff M(KB) ∩ M(f) = ∅. Example: r ∧ s contradicts ¬s.
Contingency
If f adds non-trivial information to KB
∅ ⊊ M (KB) ∩ M ( f ) ⊊ M(KB)
Example: KB = {r }, f = {s}.
Contradiction
KB contradicts f iff KB entails ¬f.
Tell Function
For a KB and f : Entailment ⇒ already knew that, Contradiction ⇒ do not believe that, Contingency
⇒ learned something new.
Ask Function
For a KB and f : Entailment ⇒ yes, Contradiction ⇒ no, Contingency ⇒ don’t know.
Satisfiability
A knowledge base KB is satisfiable if M(KB) ≠ ∅.

To categorize a new formula f with respect to KB, check satisfiability:
If KB ∪ {¬f} is not satisfiable → entailment.
If KB ∪ {¬f} is satisfiable, then check KB ∪ {f}:
  satisfiable → contingency;
  not satisfiable → contradiction.
Inference Rules
Given a KB, what new formulas f can be derived?

Example (modus ponens): from the premises r and r → w, we conclude w.

A general inference rule has premises f_1, . . . , f_k and a conclusion g, written (f_1, . . . , f_k) / g, i.e., if f_1, . . . , f_k are true, then g is also true.

Soundness (nothing but the truth): every derived formula is entailed,
{ f : KB ⊢ f } ⊆ { f : KB ⊨ f }.
With premises → KB and conclusion → f, a rule is sound if KB entails f, i.e., M(KB) ⊆ M(f).

Completeness (the whole truth): every entailed formula can be derived,
{ f : KB ⊨ f } ⊆ { f : KB ⊢ f }.
Definite Clause
P1 ∧ . . . ∧ Pk → q
If P1 . . . Pk hold, q holds.
Horn Clause
Either a definite clause (P_1 ∧ . . . ∧ P_k → q) or a goal clause (P_1 ∧ . . . ∧ P_k → False).
De Morgan’s Laws
1)¬( p ∧ q) = ¬ p ∨ ¬q, 2)¬( p ∨ q) = ¬ p ∧ ¬q
Literal: Either p or ¬ p are literals, where p is a propositional symbol.
Clause: Disjunction of literals.
Horn Clause: A clause with at most one positive literal.
General Clauses
Can have any number of literals: ¬ A ∨ ¬ B ∨ C ∨ ¬ D ∨ E ∨ · · ·
Resolution rule: from f_1 ∨ . . . ∨ f_h ∨ g and ¬g ∨ v_1 ∨ . . . ∨ v_m, derive f_1 ∨ . . . ∨ f_h ∨ v_1 ∨ . . . ∨ v_m.
Conjunctive Normal Form (CNF)
A CNF formula is a conjunction of clauses:
( A ∨ B ∨ ¬C ) ∧ (¬ B ∨ D )
Each formula in the KB (knowledge base) is a clause. Every formula in propositional logic can be converted to an equivalent CNF formula.

Example: convert (m → s) → b into CNF.

Solution:
(m → s) → b = ¬(m → s) ∨ b = ¬(¬m ∨ s) ∨ b
= (m ∧ ¬s) ∨ b = (m ∨ b) ∧ (¬s ∨ b)   (by the distributive law)

Answer: (m ∨ b) ∧ (¬s ∨ b)
Useful equivalences for converting to CNF:
f ↔ g = (f → g) ∧ (g → f)
f → g = ¬f ∨ g
¬(f ∧ g) = ¬f ∨ ¬g
¬(f ∨ g) = ¬f ∧ ¬g
¬¬f = f
f ∨ (g ∧ h) = (f ∨ g) ∧ (f ∨ h)