Artificial Intelligence

The document discusses foundational concepts in probability, optimization, and machine learning, covering both discrete and continuous cases. It explains the modeling, learning, and inference processes in machine learning, detailing linear regression, classification, and neural networks, along with optimization techniques like gradient descent and regularization. Additionally, it introduces unsupervised learning through clustering and explores search problems with examples like the river crossing and transportation problems.

Preliminaries

For the discrete case:

∑_x p[x] = 1,   P(X = x) = p[x]

E[X] = ∑_x x P(X = x) = ∑_x x p[x]

For the continuous case:

∫_{−∞}^{∞} p(x) dx = 1,   P(a ≤ X ≤ b) = ∫_a^b p(x) dx

E[X] = ∫_{−∞}^{∞} x p(x) dx

Here, P(·) represents probability for a random variable taking a value or falling within a range of
values, p(·) or p[·] represent probability density or mass functions for continuous or discrete observed
variables (realizations), respectively. X is a random variable, and x is a specific realization of the
random variable X. A realization of a random variable is the specific value that the variable takes in
a particular trial or experiment. The horizontal axis of a PDF or a PMF represents the realization x. Are P(X), p(x), and p[x] the same, except that P(X) refers to the population (the random variable itself) while p(x) and p[x] are evaluated at sample (observed) values? Again, if we are not computing the probability of a random variable, or have not assigned a number to an event, we can denote the probability of the event directly, like this:

P(Head) = (number of Head outcomes) / (all outcomes)

——————————— Volume 1 ———————————
Optimization
Discrete optimization: Find the best discrete objects

min_{p ∈ D} Cost(p),   D = a discrete set

Continuous optimization: Find the best vector of real numbers

min_{⃗w ∈ R^d} Cost(⃗w)

A demo problem: Computing edit distance. As inputs you have two strings s and t. You need to find
the minimum number of edits it takes to change s into t. By ‘edit’ here, we mean any of the three
operations: 1) insertion, 2) deletion, 3) substitution. How would you solve it? What is the optimal strategy? Is this an optimization problem? How? If yes, is it discrete or continuous?
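One way to see the discrete structure of the edit distance problem is the classical dynamic-programming solution. The sketch below is only an illustration; the function name and the example strings are our own choices, not part of the problem statement.

def edit_distance(s: str, t: str) -> int:
    # dp[i][j] = minimum number of edits to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters of s
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution (or match)
    return dp[m][n]

print(edit_distance("cat", "cart"))   # 1 (one insertion)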

Consider another problem: Imagine you’re managing a fleet of trucks, and fuel efficiency depends on
the speed at which the truck travels. Driving too fast wastes fuel due to air resistance, while driving
too slowly is inefficient due to the engine’s lower performance. The goal is to identify the optimal speed
that minimizes fuel usage, which can be modeled as a/v + b v², with v being the speed. How would you solve it? Is this also an optimization problem? How? If yes, is it discrete or continuous?
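As a quick sketch of the continuous case (assuming a, b > 0), setting the derivative of the fuel model to zero gives the optimal speed:

d/dv ( a/v + b v² ) = −a/v² + 2bv = 0   ⟹   v* = ( a / (2b) )^{1/3},

and since the second derivative 2a/v³ + 2b > 0, this stationary point is indeed a minimum.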

Machine Learning
Machine learning can be seen from three perspectives. 1) Modeling: what is the type of the machine
learning model hw ( x )? 2) Learning: How are the model parameters determined given a set of training
data? 3) Inference: Given a trained model, what is the predicted output for the test data? Here, the
output y, called the target, can be discrete or continuous, depending on the type of problem. The input x, generally called the feature, can also vary based on the problem.

Figure 1: Three paradigms: Modeling (h_w(x) = ?), Learning (find w given h(·) and the training data x_train, y_train), Inference (ŷ_test = h_w(x_test)).

Linear regression
If ⃗x represents a feature vector and y represents a continuous target variable, the linear regression model
can be expressed mathematically as:

ŷ = h(⃗x) = ⃗w · ⃗x;   ⃗w = [w_1, w_2, · · · , w_d]^T

where w_1, · · · , w_d are the model parameters of the linear regression model.

Once we have the model, the next step is to train it, which means to compute the model parameters given a set of training data {(⃗x^(1), y^(1)), (⃗x^(2), y^(2)), · · · , (⃗x^(m), y^(m))}. For that, it is common practice to define an optimization problem and then solve it to find the values of the model parameters. A commonly used loss (or cost) function is the squared error loss:

J(⃗w) = (1/2m) ∑_{i=1}^{m} ( h(⃗x^(i)) − y^(i) )² = (1/2m) ∑_{i=1}^{m} ( ⃗w · ⃗x^(i) − y^(i) )²

The optimization problem is then

min_{⃗w} J(⃗w) = min_{⃗w} (1/2m) ∑_{i=1}^{m} ( ⃗w · ⃗x^(i) − y^(i) )²

and the optimum model parameter vector (i.e., the goal solution in training) is

⃗w* = arg min_{⃗w} (1/2m) ∑_{i=1}^{m} ( ⃗w · ⃗x^(i) − y^(i) )²

One way to solve the above optimization problem is to use gradient descent. Gradient descent starts with a random initialization of ⃗w, then iteratively updates it using the following update equation:

⃗w ← ⃗w − η ∇_⃗w J(⃗w),   or   ⃗w ← ⃗w − η (∂/∂⃗w) J(⃗w),   or   ⃗w ← ⃗w − (η/m) ∑_{i=1}^{m} ( h(⃗x^(i)) − y^(i) ) ⃗x^(i)

The parameter η represents the gradient descent step size. The iteration stops when ∆J(⃗w) < ϵ.
Once we have the optimum model parameters ⃗w*, the next and final step is inference, which is to use the model to predict the target of a test sample as follows:

ŷ_test = h(⃗x_test) = ⃗w* · ⃗x_test
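A minimal NumPy sketch of these training and inference steps, using batch gradient descent on the squared-error loss. The toy data, the step size η = 0.1, and the stopping tolerance are illustrative choices, not values from the text.

import numpy as np

def train_linear_regression(X, y, eta=0.1, eps=1e-8, max_iters=10000):
    # X: (m, d) matrix of feature vectors, y: (m,) vector of targets
    m, d = X.shape
    w = np.random.randn(d)                     # random initialization of w
    prev_J = np.inf
    for _ in range(max_iters):
        residual = X @ w - y                   # h(x^(i)) - y^(i) for all i
        J = (residual ** 2).sum() / (2 * m)    # squared-error cost J(w)
        w -= eta * (X.T @ residual) / m        # w <- w - (eta/m) sum (h(x)-y) x
        if abs(prev_J - J) < eps:              # stop when the change in J is small
            break
        prev_J = J
    return w

# Toy data generated from y = 2*x1 - 3*x2 (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = X_train @ np.array([2.0, -3.0])
w_star = train_linear_regression(X_train, y_train)
x_test = np.array([1.0, 1.0])
print(w_star, w_star @ x_test)                 # inference: y_test = w* . x_test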

Linear classification
If ⃗x represents a feature vector and y represents a discrete target variable, the linear classification model
can be expressed mathematically as:
ŷ = h(⃗x) = sign(⃗w · ⃗x)   or   ŷ = h(⃗x) = σ(⃗w · ⃗x),   σ(z) = 1 / (1 + e^{−z})   (sigmoid)

where w_1, · · · , w_d are the model parameters. Like sign(·) or σ(·) here, other nonlinear functions can also be used.
The next step naturally is to train the model or compute the model parameters of the model, given
a set of training data {(⃗x (1) , y(1) ), (⃗x (2) , y(2) ), · · · , (⃗x (m) , y(m) )}. For that, like the regression example
above, it is a common practice to define an optimization problem first, and then solve that problem to
find out the values of the model parameters. Once we have the optimum model parameters w ⃗ ∗ , the
final step is inference, which is to use the model to predict the target of a test sample as follows:

ŷ_test = h(⃗x_test) = sign(⃗w* · ⃗x_test)   or   σ(⃗w* · ⃗x_test)

Neural network: a cascade
In fact, we can extend the model above a bit to build a concatenation of nonlinear computation units, which becomes a simple form of what is widely known as a neural network. The modeling equation is then:

ŷ = h(⃗x) = ∑_j v_j σ(⃗w_j · ⃗x)

We can extend this further to build a series of computational units and add more components and architectures, which converges toward a practically used neural network. Once we have the model, the ideas of training, optimization, and inference are similar to before (though not exactly the same).
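A minimal sketch of the forward pass of this one-hidden-layer model. The layer sizes, the random weights, and the use of NumPy are our own illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, W, v):
    # W: (k, d) matrix whose rows are the vectors w_j; v: (k,) output weights
    return v @ sigmoid(W @ x)      # y_hat = sum_j v_j * sigma(w_j . x)

rng = np.random.default_rng(0)
d, k = 3, 4                        # input dimension and number of hidden units
W = rng.normal(size=(k, d))
v = rng.normal(size=k)
x = np.array([1.0, -2.0, 0.5])
print(predict(x, W, v))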

Regularization
Add a penalty term to the cost function:

J(⃗w) = J_regular(⃗w) + (λ/2) ||⃗w||²_{L2},   where ||⃗w||²_{L2} = w_1² + · · · + w_d².

Unsupervised learning
Here, y_train is also unknown. The goal is to find structure in the data. One example of unsupervised learning is clustering. Clustering is the grouping of data samples based on their features. Let's take a look at a well-known clustering algorithm: the k-means clustering algorithm.
Let's assume we have m data points and we have to find K clusters. We have d features, i.e., the feature vector ⃗x lives in a d-dimensional space, ⃗x ∈ R^d. Let ⃗µ^(k) ∈ R^d be the centroid of the k-th cluster. Our goal is to minimize the following loss function:

J(⃗µ, z) = ∑_{i=1}^{m} ||⃗x^(i) − ⃗µ^(z_i)||²_{L2}

Here, we need to choose the centroid ⃗µ and assignment z jointly. Here are the steps:

1. Randomly initialize the centroids ⃗µ(k) , k ∈ [1, K ].

2. Assign each data point to the closest centroid:

   z_i ← arg min_{k ∈ [1,K]} ||⃗x^(i) − ⃗µ^(k)||²_{L2},   i ∈ [1, m].

3. Recalculate the centroids, setting ⃗µ^(k) equal to the average of all assigned points:

   ⃗µ^(k) ← (1 / |{i : z_i = k}|) ∑_{i : z_i = k} ⃗x^(i),   k ∈ [1, K].

Here, | · | denotes the cardinality of a set.

Step 1 above will run only once. Next, we will run the updates in Steps 2 and 3 sequentially back and
forth (2 → 3 → 2 → 3 → · · · ) until convergence.
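The two alternating updates can be written down almost verbatim. A minimal NumPy sketch; the convergence test, the random seed, and the toy data are illustrative choices.

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    # X: (m, d) data matrix; returns centroids (K, d) and assignments z (m,)
    rng = np.random.default_rng(seed)
    m, d = X.shape
    mu = X[rng.choice(m, size=K, replace=False)]       # Step 1: random centroids
    for _ in range(max_iters):
        # Step 2: assign each point to the closest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (m, K)
        z = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                     # converged
            break
        mu = new_mu
    return mu, z

X = np.vstack([np.random.randn(50, 2) + [0, 0],
               np.random.randn(50, 2) + [5, 5]])
mu, z = kmeans(X, K=2)
print(mu)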

Search
So far, we’ve talked about machine learning models, which are reflex-based: one input leads to one
output (single action). Now, let’s discuss search models, which are state-based: a sequence of actions
with one input leading to multiple outputs over time (see Fig. 2).

Figure 2: Machine Learning (reflex-based): single action, x → h(·) → y. Search (state-based): sequence of actions, x → h(·) → y_1, y_2, y_3, . . .

For example, consider a pathfinding problem. Suppose you’re in Gulshan and want to reach Dhan-
mondi. Your goal might be to find the shortest path (objective). However, priorities can change
depending on a situation – maybe you now care more about finding the fastest path (another objec-
tive), even if it’s not the shortest. The solution depends on the defined objective, and we need to
consider the future impact of each action.
For instance, an empty road might seem like a good choice initially (locally optimal). But if it leads
to a road with heavy traffic due to construction, it’s a poor decision overall (globally suboptimal). The
key is to evaluate actions not just in the moment but also based on their future consequences. How
can we do that?

A river crossing problem


Consider this problem. A farmer (F) needs to cross a river with three of his companions: a Goat (G), a Plant (P), and a Wolf (W). He can go back and forth as many times as he wants, but for each crossing, he has to pay $1. He is also a little low on cash, which is why he wants to get to the other side of the river with the minimum expense possible.

The farmer's small boat (B) can only carry him (F) and one item – either the G, the P, or the W – at a time. The situation is tricky: if the G is left alone with the P, it will eat the P. If the W is left alone with the G, it will attack the G. What should he do?

Search tree: starts from the initial state, then branches out based on all possible actions. A search
tree is a structure where each node represents a state, and the branches represent possible moves or
actions that lead to other states. Each action has a cost ($1 in this case). Any path that takes from
the initial state to the goal state is a valid path. The path that corresponds to the minimum total cost
is the solution. Can you now think of an algorithm that can provide these solutions?

In this problem, we start with everyone on the left side of the river, which is the initial state (FPGW ||). The goal is to get everyone safely to the right side of the river, the goal state (|| FPGW). To achieve this, we need to consider all the possible valid actions. For example, the farmer can go alone from left to right (F →), or he can take the goat with him from left to right (FG →). Similarly, the farmer can take the plant back with him from right to left (FP ←), and so on. The complete list of valid actions is: A = {F →, FP →, FG →, FW →, F ←, FP ←, FG ←, FW ←}. Note that FPG →, for example, is not a valid action, as it breaks the condition stated in the problem. Also, actions in A are not valid from every state, and some may lead to an invalid state, meaning they could result in a state that violates the conditions.

As the search tree for this problem shows, there are two paths from the initial state to the goal state: Solution 1 (FG →, F ←, FP →, FG ←, FW →, F ←, FG →) and Solution 2 (FG →, F ←, FW →, FG ←, FP →, F ←, FG →). Both paths have the same total cost ($7), so either path can be considered the solution.

A transportation problem
Imagine you are walking along a street divided into blocks, starting at block 1 and aiming to reach block N. At each block, you have two options to move forward: a) you can walk to the next block, which costs 1 unit, or b) you can bike, which takes you from block s to block 2s and costs 2 units. Your goal is to reach block N while minimizing the total cost. To represent this, think of each block as a state s. At each state, you must choose whether to walk or bike, determining the sequence of actions that leads to the lowest total cost by the time you arrive at your destination, block N.

Walking takes you from s → s + 1; biking takes you from s → 2s. Objective: go from 1 → N with minimum total cost.

Say, N = 10. One possible solution: walk → walk → walk → walk → bike, ∑ cost = 6. Another possible solution: walk → bike → walk → bike, ∑ cost = 6. Not a solution: bike → bike → bike → walk → walk, as ∑ cost = 8 > 6 (the minimum cost).

Algorithms for search
Backtracking search
Backtracking search is one of the simplest and probably one of the most intuitive algorithms that comes
to mind when solving a search tree. It’s a straightforward method used to explore all possible solutions
in a search tree. The goal of backtracking is to systematically explore this tree to find a solution, if one
exists.
The process starts at the root (or initial state) of the search tree and explores one branch at a time.
If you reach a point where no solution is possible (like a dead end), you “backtrack” to the previous
node and try a different branch. This way, you explore the tree depth-first, checking all possible paths
one by one. It’s like trying all the keys on a keyring one at a time.

Breadth First Search (BFS)


Breadth-First Search (BFS) is a tree search algorithm that explores all nodes at the same depth level
before moving to the next level. It systematically examines the search tree level by level, starting from
the root and exploring all neighboring nodes before progressing deeper. This makes BFS a great choice
for finding the shortest path in an unweighted graph or when searching for a solution that is closest to
the root. A flowchart for the BFS algorithm is provided in Fig. 3.

Figure 3: BFS explore sequence: s_0 → s_11 → s_12 → s_21 → s_22 → s_23 (the frontier at each step is kept in a queue).

The algorithm uses a queue data structure to keep track of nodes that need to be explored. It
begins by adding the starting node to the queue. Then, in each step, it removes the front node from
the queue, examines it, and adds its unvisited neighbors to the back of the queue. This ensures that
nodes are visited in order of their depth in the tree. While BFS guarantees finding the shortest path
in terms of the number of steps, it can consume significant memory, especially for trees or graphs with
many nodes at each level.
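A minimal Python sketch of BFS on a graph given as an adjacency dictionary. The small example graph (nodes s0, s11, s12, s21, s22, s23) is our own illustrative encoding, loosely mirroring Fig. 3.

from collections import deque

def bfs(graph, start, goal):
    # graph: dict mapping a node to the list of its children/neighbors
    queue = deque([[start]])          # queue of partial paths
    visited = {start}
    while queue:
        path = queue.popleft()        # take the front node's path
        node = path[-1]
        if node == goal:
            return path               # first time the goal is reached = fewest steps
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

graph = {"s0": ["s11", "s12"], "s11": ["s21", "s22"], "s12": ["s23"]}
print(bfs(graph, "s0", "s23"))        # ['s0', 's12', 's23']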

Depth First Search (DFS)
Depth-First Search (DFS) is an algorithm that explores a tree or graph by going as deep as possible
along each branch before backtracking. It starts at the root and explores one path completely, moving
down to the next node and continuing until it reaches a dead end or a solution. When it hits a dead
end, DFS backtracks to the last decision point and tries a different path. A flowchart for the DFS
algorithm is provided in Fig. 4.

Figure 4: DFS explore sequence: s_0 → s_11 → s_21 → s_11 → s_22 → s_11 → s_0 → s_12 → s_23 → s_12 → s_0 (the frontier is kept in a stack; the repeated states indicate backtracking).

DFS uses a stack to keep track of which nodes to visit next. It starts by adding the starting node
to the stack, then picks the top node, checks it, and adds its neighbors to the stack. The algorithm
continues down the current path until it can’t go further, then it pops the stack to go back and explore
other branches. DFS is useful when you need to explore all possible solutions or paths, but it doesn’t
always find the shortest path and can take up a lot of memory if the tree is very deep.
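The same search written with an explicit stack, as a minimal sketch (same illustrative graph encoding as the BFS example above). Note that the path it returns is not necessarily the shortest one.

def dfs(graph, start, goal):
    # graph: dict mapping a node to the list of its children/neighbors
    stack = [[start]]                 # stack of partial paths
    visited = set()
    while stack:
        path = stack.pop()            # take the most recently added node
        node = path[-1]
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # push children; reversed() so the leftmost child is explored first
        for nxt in reversed(graph.get(node, [])):
            if nxt not in visited:
                stack.append(path + [nxt])
    return None

graph = {"s0": ["s11", "s12"], "s11": ["s21", "s22"], "s12": ["s23"]}
print(dfs(graph, "s0", "s23"))        # ['s0', 's12', 's23']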

Depth-Limited Search and Iterative Deepening DFS


Depth-Limited Search (DLS) is a variation of Depth-First Search (DFS) where the algorithm is restricted
to exploring nodes only up to a certain depth, preventing it from going too deep and potentially using
excessive memory or time. This helps avoid getting stuck in infinite loops or exploring overly deep paths
in large search spaces. However, the limitation is that DLS might miss the solution if it’s deeper than
the specified limit.
Iterative Deepening Depth-First Search (IDDFS) combines the advantages of DFS and DLS by
performing multiple depth-limited searches with increasing depth limits. It starts by performing a DFS
with a depth limit of 1, then repeats the search with a limit of 2, then 3, and so on, until it finds
the solution. This approach ensures that IDDFS will eventually explore all nodes, similar to BFS,
but without requiring the large memory overhead. IDDFS is particularly useful when the solution is
unknown, and memory is limited, as it combines the memory efficiency of DFS with the completeness
of BFS.
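A minimal sketch of iterative deepening: a recursive depth-limited DFS called with increasing limits. The graph encoding is the same illustrative one used above, and for simplicity the sketch assumes a tree (no cycle checking).

def depth_limited_search(graph, node, goal, limit, path):
    if node == goal:
        return path
    if limit == 0:
        return None                   # hit the depth limit: give up on this branch
    for nxt in graph.get(node, []):
        found = depth_limited_search(graph, nxt, goal, limit - 1, path + [nxt])
        if found is not None:
            return found
    return None

def iddfs(graph, start, goal, max_depth=20):
    for limit in range(max_depth + 1):        # DLS with limits 0, 1, 2, ...
        result = depth_limited_search(graph, start, goal, limit, [start])
        if result is not None:
            return result
    return None

graph = {"s0": ["s11", "s12"], "s11": ["s21", "s22"], "s12": ["s23"]}
print(iddfs(graph, "s0", "s23"))      # ['s0', 's12', 's23']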

Uniform Cost Search (UCS)


Uniform Cost Search (UCS) is an algorithm used to find the shortest path in a weighted graph, where
the cost of moving between nodes may differ. Unlike BFS, which assumes all steps have the same cost,
UCS takes the path cost into account and always explores the least costly path first. It does this by

expanding the node with the lowest cumulative cost, ensuring that the first time it reaches a node, it
does so with the least possible cost. A flowchart for the UCS algorithm is provided in Fig. 5.

Figure 5: UCS explore sequence: s_0 → s_12 → s_23 → s_11 → s_21 (goal state reached: stop). Solution: s_0, s_11, s_21, with total cost 6.

UCS uses a priority queue (often implemented as a min-heap) to keep track of the nodes to be
explored. Each node in the queue is prioritized based on its path cost, and the algorithm removes the
node with the lowest cost to explore it further. UCS guarantees finding the optimal solution if the
graph has non-negative edge weights, but it can be slower and require more memory than algorithms
like BFS, especially in graphs with many nodes or high branching factors. It is ideal for problems where
the goal is to find the least-cost path, such as in route planning or network optimization.
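A minimal sketch of UCS using Python's heapq as the priority queue. The weighted graph below is an illustrative encoding loosely based on Fig. 5; treat the particular edge costs as assumptions.

import heapq

def ucs(graph, start, goal):
    # graph: dict node -> list of (neighbor, edge_cost)
    frontier = [(0, start, [start])]           # min-heap of (path_cost, node, path)
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path                  # first pop of the goal = least-cost path
        if node in explored:
            continue
        explored.add(node)
        for nxt, c in graph.get(node, []):
            if nxt not in explored:
                heapq.heappush(frontier, (cost + c, nxt, path + [nxt]))
    return None

graph = {"s0": [("s11", 5), ("s12", 2)],
         "s11": [("s21", 1), ("s22", 7)],
         "s12": [("s23", 4)]}
print(ucs(graph, "s0", "s21"))     # (6, ['s0', 's11', 's21'])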

A* Search
A* Search is an algorithm that combines the strengths of Uniform Cost Search (UCS) with the concept
of heuristics to find the most efficient path in a graph. Like UCS, A* ensures that it explores the least
costly path by considering the cumulative cost of reaching a node. However, A* also uses a heuristic,
which is an estimate of the remaining cost to reach the goal from a node. This helps guide the search
more efficiently toward the goal. In A*, path costs are modified (penalized) based on heuristic values
of each node:

C (s1 → s2 ) = C0 (s1 → s2 ) + h(s2 ) − h(s1 ), h(s) = heuristic value at state s.

Here, C(·) denotes the new cost and C_0(·) denotes the old cost. Consider the tree on the left of Fig. 6. Let's say the heuristic values are given. Using the modified cost calculations, we redraw the tree (see the tree on the right).

Given heuristic values h(s_0) = 2, h(s_11) = 3, h(s_12) = 1, h(s_21) = 4, h(s_22) = 0, and original edge costs of 1 on every edge:

C(s_0 → s_11) = C_0(s_0 → s_11) + h(s_11) − h(s_0) = 2
C(s_11 → s_21) = C_0(s_11 → s_21) + h(s_21) − h(s_11) = 2
C(s_0 → s_12) = C_0(s_0 → s_12) + h(s_12) − h(s_0) = 0
C(s_12 → s_22) = C_0(s_12 → s_22) + h(s_22) − h(s_12) = 0

Figure 6: UCS on the modified tree with the new costs will now be more efficient.

A* uses a priority queue, where each node is assigned a priority based on the sum of two factors:
the path cost to reach the node (like UCS) and the heuristic estimate of the cost to reach the goal. The
algorithm always expands the node with the lowest total cost (path cost + heuristic). This combination
of cost and heuristic allows A* to prioritize paths that are both promising and efficient, making it faster
than UCS in many cases. A* is guaranteed to find the optimal solution if the heuristic is admissible (it
never overestimates the true cost). It is widely used in route planning, game AI, and other optimization
problems.
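Since A* is essentially UCS with the priority changed from the path cost g to g + h, a minimal sketch only needs to change what is pushed into the heap. The small tree and the heuristic dictionary below reuse the illustrative values of Fig. 6, assuming s22 is the goal (its heuristic is 0).

import heapq

def a_star(graph, start, goal, h):
    # graph: dict node -> list of (neighbor, edge_cost); h: heuristic dict
    frontier = [(h[start], 0, start, [start])]    # (g + h, g, node, path)
    explored = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in explored:
            continue
        explored.add(node)
        for nxt, c in graph.get(node, []):
            if nxt not in explored:
                heapq.heappush(frontier, (g + c + h[nxt], g + c, nxt, path + [nxt]))
    return None

graph = {"s0": [("s11", 1), ("s12", 1)],
         "s11": [("s21", 1)], "s12": [("s22", 1)]}
h = {"s0": 2, "s11": 3, "s12": 1, "s21": 4, "s22": 0}
print(a_star(graph, "s0", "s22", h))   # (2, ['s0', 's12', 's22'])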

——————————— Volume 2 ———————————
Markov decision process (MDP)
A dice game: stay or quit?
How can you get to Mountain View on Friday night in the shortest time? You could bike, drive, take a
train, call an Uber, or fly. Each option, however, comes with uncertainties. For instance, if you decide
to drive, you might encounter traffic.
Traditional search problems are typically deterministic, assuming no uncertainty in outcomes. Real-
world scenarios like this one require a probabilistic or stochastic approach to account for the inherent
uncertainties. How can we adapt this problem to include these considerations? Now, say you are playing
a game, here are the rules:

For each round m = 1, 2, · · ·, you can stay or quit. If you quit, you will get $10; if you stay, you will get $4 and I roll a 6-sided dice. If the dice results in 1 or 2, the game is over. Otherwise, you can move to the next round. What would you do?

Optimum policy (π)


Imagine you’re starting the game in Round 1. At this point, what’s the best policy (π)? Should you
stay or quit? To answer this, we first need to define what we mean by ‘best.’ A policy is considered
the best if it maximizes the rewards you accumulate in the game. Let’s introduce some terminology to
formalize this idea. Each choice you make results in a reward, denoted as r, and the total reward you
collect over all rounds is called the utility, u, expressed as u = ∑ r. The best policy, then, is the one
that maximizes u.
However, the game is stochastic, meaning there’s an element of randomness. Even if you follow
the same policy π, the resulting u may differ because it’s a random variable (therefore, let’s denote
utility as U). To account for this uncertainty, we consider the expected utility, denoted as E[U ], and
determine which policy maximizes this expectation.
To clarify further, let’s introduce a parameter called the value, denoted as v, where v = E[U ].
Thus, the task of finding the best policy boils down to maximizing v with respect to the policy π. The
policy that achieves this is what we call the optimal policy for the current state. For the choice quit, we can easily find v, but for the choice stay, it is not that easy.
Let’s say our policy is to quit. In this case, we can define a probability mass p[u]
function (PMF) for U, as the probability of U = 10 is always 1. Therefore, 1
v = E[U ] = ∑ uP(U = u) = ∑ up[u] = 1 × $10 = $10. Okay, what if our 0.5
policy now is to stay. Do we have a PMF for it? Actually we don’t (at least
0 $ 10 u
not easy to find). Then how will we calculate v now?

Let us denote the dice outcome as a random variable D with domain (sample space) D =
{1, 2, 3, 4, 5, 6}. Let us pay attention to the following figure. It represents the state-diagram of the
problem. The probability of obtaining 1 or 2, i.e., P( D < 3) = 2/6 = 1/3. Similarly, the probability
of obtaining 3 or higher, i.e., P( D ≥ 3) = 4/6 = 2/3.

Let us denote the transition probability and reward by T (s, a, s′ ) and
R(s, a, s′ ), where s is the current state (either ‘in’ or ‘end’), s′ is the
next state (either ‘in’ or ‘end’), a is the action chosen (either ‘stay’
or ‘quit’) that takes you from s to s′ . Therefore,
T (in, stay, end) = 1/3, T (in, quit, end) = 1, T (in, stay, in) =
2/3 and R(in, stay, end) = $ 4, R(in, quit, end) =
$ 10, R(in, stay, in) = $ 4.

Policy evaluation
The value of a policy π at state s can be calculated by

v_π(s) = { 0, if s is an end state; Q_π(s, a), otherwise },   where   Q_π(s, a) = ∑_{s′} T(s, a, s′) [ R(s, a, s′) + γ v_π(s′) ]

where s is the current state, s′ is the next state, a is the action chosen that takes from s to s′ . T (·)
and R(·) are the transition probabilities and rewards (mentioned before). γ here is called the discount
factor. For now, we will hold γ = 1. Now let’s evaluate the policies for our dice game above using this
formula. When the policy is to stay (π = a = stay), vπ (in) = Qπ (in, stay). Therefore,
v_{π=stay}(in) = ∑_{s′} T(in, stay, s′) [ R(in, stay, s′) + v_{π=stay}(s′) ]
= T(in, stay, in) [ R(in, stay, in) + v_{π=stay}(in) ] + T(in, stay, end) [ R(in, stay, end) + v_{π=stay}(end) ]
= (2/3) [ $4 + v_{π=stay}(in) ] + (1/3) [ $4 + 0 ] = $4 + (2/3) v_{π=stay}(in)
Therefore, vπ =stay (in) = $ 12. This means the value v for policy stay at state ‘in’ is $ 12. Let’s do
the same for policy quit (π = a = quit). We calculated this a little before using PMF and the value
v came out to be $ 10. Let’s try to find it again using this new formula and see if it matches the old
result:
v_{π=quit}(in) = ∑_{s′} T(in, quit, s′) [ R(in, quit, s′) + v_{π=quit}(s′) ]
= T(in, quit, in) [ R(in, quit, in) + v_{π=quit}(in) ] + T(in, quit, end) [ R(in, quit, end) + v_{π=quit}(end) ]
= 0 × [ $0 + v_{π=quit}(in) ] + 1 × [ $10 + 0 ] = $10
Therefore, vπ =quit (in) = $ 10. This means the value v for policy quit at state ‘in’ is $ 10, which
matches the old result calculated using PMF.

Optimum policy
The optimum policy π ∗ (in) = arg maxπ vπ (in). As vπ =stay (in) > vπ =quit (in)($ 12 > $ 10), the
optimum policy π ∗ (in) = stay at state ‘in’.

Value iteration
Iterative solution. For policy evaluation, begin by initializing vπ =a (s) randomly (or maybe with just
zeros). Then perform the following update:
v_{π=a}(s) := Q_{π=a}(s, a),   stopping criterion: ∆v_{π=a}(s) < ϵ.

For calculating the optimum value v* and the optimum policy π*, perform the following update:

v*(s) := max_a Q_{π=a}(s, a),   π*(s) := arg max_a Q_{π=a}(s, a),   stopping criterion: ∆v*(s) < ϵ.
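A minimal sketch of value iteration for the dice game above (states 'in' and 'end', actions 'stay' and 'quit', γ = 1 as in the text; the tolerance is an illustrative choice). It should converge to v*(in) = 12 with π*(in) = stay.

# T[(s, a)] = list of (s_next, prob); R[(s, a, s_next)] = reward; gamma = 1
T = {("in", "stay"): [("in", 2/3), ("end", 1/3)],
     ("in", "quit"): [("end", 1.0)]}
R = {("in", "stay", "in"): 4, ("in", "stay", "end"): 4, ("in", "quit", "end"): 10}
gamma, eps = 1.0, 1e-9

def Q(s, a, v):
    return sum(p * (R[(s, a, s2)] + gamma * v[s2]) for s2, p in T[(s, a)])

v = {"in": 0.0, "end": 0.0}
while True:
    new_v = {"end": 0.0, "in": max(Q("in", a, v) for a in ("stay", "quit"))}
    if abs(new_v["in"] - v["in"]) < eps:      # stopping criterion on the change in v
        break
    v = new_v

pi_star = max(("stay", "quit"), key=lambda a: Q("in", a, v))
print(v["in"], pi_star)                       # approximately 12.0, 'stay'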

Reinforcement learning (RL)


Consider the dice game again. The transition probabilities and reward values, denoted as T and R, were
given earlier in the problem statement. But what if these values are not provided, and we need to find
them ourselves? In that case, they can be treated as model parameters within the MDP framework.
Now the model parameters are calculated during the learning (training) phase. This implies that we
need training data to estimate these parameters. But where does this training data come from?
Imagine someone already played the game, and you observed their actions: the choices they made, the outcomes of the dice rolls, and the rewards they received. These observations form your training data (called an episode), represented as s_0; a_1 r_1 s_1; a_2 r_2 s_2; · · · ; end, where s_{i−1} is the current state and s_i is the next state reached due to action a_i, obtaining reward r_i.

Monte Carlo method (model-based)

Estimate the transition probabilities and rewards:

T̂(s, a, s′) = (# times (s, a, s′) occurs) / (# times (s, a) occurs);   R̂(s, a, s′) = r observed in (s, a, r, s′)

Imagine we are given the episode data e_1 = in; stay, $4, in; stay, $4, in; stay, $4, in; stay, $4, end. Based on this data, T̂(in, stay, in) = 3/4, R̂(in, stay, in) = $4, T̂(in, stay, end) = 1/4, R̂(in, stay, end) = $4, while T̂(in, quit, end) and R̂(in, quit, end) cannot be found (absent in the episode).

Now imagine you get two more episodes: e_2 = in; stay, $4, in; stay, $4, end and e_3 = in; stay, $4, end. You need to update the estimates considering all of e_1, e_2, e_3. The new estimates become T̂(in, stay, in) = 4/7, R̂(in, stay, in) = $4, T̂(in, stay, end) = 3/7, R̂(in, stay, end) = $4.
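A minimal sketch of these counting estimates over the episodes e1, e2, e3. The encoding of an episode as a list of (s, a, r, s_next) tuples is our own choice.

from collections import defaultdict

def estimate(episodes):
    # episodes: lists of (s, a, r, s_next) tuples
    sa_counts = defaultdict(int)           # counts of (s, a)
    sas_counts = defaultdict(int)          # counts of (s, a, s_next)
    rewards = {}                           # observed reward for (s, a, s_next)
    for ep in episodes:
        for s, a, r, s2 in ep:
            sa_counts[(s, a)] += 1
            sas_counts[(s, a, s2)] += 1
            rewards[(s, a, s2)] = r
    T_hat = {k: v / sa_counts[k[:2]] for k, v in sas_counts.items()}
    return T_hat, rewards

e1 = [("in", "stay", 4, "in")] * 3 + [("in", "stay", 4, "end")]
e2 = [("in", "stay", 4, "in"), ("in", "stay", 4, "end")]
e3 = [("in", "stay", 4, "end")]
T_hat, R_hat = estimate([e1, e2, e3])
print(T_hat[("in", "stay", "in")], T_hat[("in", "stay", "end")])   # 4/7, 3/7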

Model-free Monte Carlo

Estimate Q_π → Q̂_π directly from observed utilities, rather than through the model-based form Q̂_opt(s, a) = ∑_{s′} T̂(s, a, s′) [ R̂(s, a, s′) + γ v̂_opt(s′) ].

u = ∑_i r_i ⟹ utility

training data → in; stay, 4, end.                        u = 4
training data → in; stay, 4, in; stay, 4, end.           u = 4 + 4 = 8
training data → in; stay, 4, · · · , end.                u = 4 + 4 + 4 + 4 = 16

average u = (4 + 8 + 16) / 3 = Q̂(in, stay)

Bootstrapping

Q̂_π(s, a) ← (1 − η) Q̂_π(s, a) + η [ r + γ Q̂_π(s′, a′) ]   (SARSA)

SARSA also estimates Q̂_π, not Q̂_opt.

Q-learning
Q-learning allows us to get Q̂_opt (not Q̂_π). On each (s, a, r, s′):

Q̂_opt(s, a) ← (1 − η) Q̂_opt(s, a) + η [ r + γ v̂_opt(s′) ],   where   v̂_opt(s′) = max_{a′ ∈ Action set} Q̂_opt(s′, a′)
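A minimal sketch of the Q-learning update applied to a stream of (s, a, r, s′) experience simulated from the dice game. The learning rate η, the number of episodes, and the random exploration are illustrative choices.

import random
from collections import defaultdict

gamma, eta = 1.0, 0.1
actions = ["stay", "quit"]
Q = defaultdict(float)                      # Q_opt estimates, start at 0

def v_opt(s):
    return 0.0 if s == "end" else max(Q[(s, a)] for a in actions)

def q_update(s, a, r, s2):
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * v_opt(s2))

random.seed(0)
for _ in range(20000):                      # generate experience by acting randomly
    s = "in"
    while s != "end":
        a = random.choice(actions)
        if a == "quit":
            r, s2 = 10, "end"
        else:
            r, s2 = 4, ("end" if random.random() < 1/3 else "in")
        q_update(s, a, r, s2)
        s = s2

print(Q[("in", "stay")], Q[("in", "quit")])  # roughly 12 and 10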

Deep reinforcement learning: use a neural network to estimate Q̂_opt.
——————————— Volume 3 ———————————
Now you (agent) will play against an active counterpart (opponent), as intelligent as you.

A simple turn-taking game


Consider this simple turn-based game. See how you decide your actions.

There are 3 bins – A, B, and C. Each bin has 2 balls (left and right). The numbers on the balls represent reward points: bin A holds (−50, 50), bin B holds (1, 3), and bin C holds (−5, 15). You're the agent, and I'm your opponent. First, you choose a bin. Then, I pick one of the 2 balls from that bin, and you'll get the reward points written on it. So, which bin would you choose?

First, you will have to make assumptions about what I (the opponent) might do. In other words, you will have to assume the opponent's policy. To formally denote a policy π, of either an agent or an opponent, first you need to build the action set (the set of all possible actions). Here, we are talking about the opponent's policy π_opp, therefore, let's build the action set for the opponent. What are all possible actions of the opponent? Two actions: choosing the left ball (L) or choosing the right ball (R). Therefore, the action set of the opponent is A_opp = {L, R}. The policy π_opp is a vector containing the probabilities of each of the actions in the action set A_opp:

π_opp = [P(L), P(R)],   or   π_opp(s) = [P(L|s), P(R|s)],   π_opp(s, L) = P(L|s)

More generally (for both agent and opponent), π(s) = [P(a|s)], ∀a ∈ A;   π(s, a) = P(a|s), ∀a ∈ A.

Here, we do not know πopp . So we have to assume it:


1. Assume that the opponent is playing favorably (in your favor - which is not usual, by the way), then π_opp = [0, 1]. In that case,
For bin A, value v = E[u] = ∑ p[u]u = 0 × (−50) + 1 × 50 = 50.
For bin B, value v = E[u] = ∑ p[u]u = 0 × 1 + 1 × 3 = 3.
For bin C, value v = E[u] = ∑ p[u]u = 0 × (−5) + 1 × 15 = 15.
You are getting the maximum value if you choose bin A. Therefore, for this assumption, you will
choose bin A.

2. Now, assume that the opponent is playing adversarially, then πopp = [1, 0]. In that case,
For bin A, value v = E[u] = ∑ p[u]u = 1 × (−50) + 0 × 50 = −50.
For bin B, value v = E[u] = ∑ p[u]u = 1 × 1 + 0 × 3 = 1.
For bin C, value v = E[u] = ∑ p[u]u = 1 × (−5) + 0 × 15 = −5.
You are getting the maximum value if you choose bin B. Therefore, for this assumption, you will
choose bin B.

3. Now, assume that the opponent is playing randomly, sometimes picking L, sometimes picking R,
then say πopp = [1/2, 1/2]. In that case,
For bin A, value v = E[u] = ∑ p[u]u = (1/2) × (−50) + (1/2) × 50 = 0.
For bin B, value v = E[u] = ∑ p[u]u = (1/2) × 1 + (1/2) × 3 = 2.
For bin C, value v = E[u] = ∑ p[u]u = (1/2) × (−5) + (1/2) × 15 = 5.

Here, you are getting the maximum value if you choose bin C. Therefore, for this assumption,
you will choose bin C.

Therefore, your policy depends on your assumption of your opponent’s policy. Your assumptions can
be right or wrong, that is another discussion.
First, let us consider two-player, turn-taking, zero-sum, fully observable games. Zero-sum game: ∑_{agent, opponent} utility = 0. In game playing, utility only comes at the end state.

The halving game


It’s a two-player turn-taking game. Start with any number N. Each player can choose between two
actions: 1) decrement by 1, 2) replace N with f loor ( N/2). The player who is left with 0 wins. How
will you play it? How will you make decisions?

Game evaluation
Both π_ag and π_opp are given. The value of the game is given as

V(s) = { utility(s), if s is an end state; ∑_a π_ag(s, a) V(s′), if the agent is active; ∑_a π_opp(s, a) V(s′), if the opponent is active }

Say, for example, π_ag(s) = [1, 0, 0] and π_opp(s) = [1/2, 1/2]. In other words, π_ag(s, A) = 1, π_ag(s, B) = 0, π_ag(s, C) = 0, π_opp(s, L) = 1/2, and π_opp(s, R) = 1/2.

The game tree: the root s_0 (agent active) has children s_A, s_B, s_C (opponent active), whose children are the end states s_AL = −50, s_AR = 50, s_BL = 1, s_BR = 3, s_CL = −5, s_CR = 15. With π_ag(s) = [1, 0, 0] and π_opp(s) = [1/2, 1/2], the computed values are V(s_A) = 0, V(s_B) = 2, V(s_C) = 5, and V(s_0) = 0. For example:

V(s_AL) = utility(s_AL) = −50
V(s_B) = ∑_a π_opp(s_B, a) V(s′) = π_opp(s_B, L) V(s_BL) + π_opp(s_B, R) V(s_BR) = 2
V(s_0) = ∑_a π_ag(s_0, a) V(s′) = π_ag(s_0, A) V(s_A) + π_ag(s_0, B) V(s_B) + π_ag(s_0, C) V(s_C) = 0

Expectimax policy
π_ag not given (need to find out) and π_opp is assumed. The value of the game is given as

V(s) = { utility(s), if s is an end state; max_a V(s′), if the agent is active; ∑_a π_opp(s, a) V(s′), if the opponent is active },   π_ag(s) = arg max_a V(s′)

Say our assumed π_opp(s) = [1/2, 1/2].

For the same game tree, with the assumed π_opp(s) = [1/2, 1/2], the node values become V(s_A) = 0, V(s_B) = 2, V(s_C) = 5, and V(s_0) = 5. For example:

V(s_BL) = utility(s_BL) = 1
V(s_C) = ∑_a π_opp(s_C, a) V(s′) = π_opp(s_C, L) V(s_CL) + π_opp(s_C, R) V(s_CR) = 5
V(s_0) = max_a V(s′) = max(V(s_A), V(s_B), V(s_C)) = 5

So the expectimax policy at the root is π_ag(s_0) = C.

Minimax policy
π_ag not given (need to find out) and π_opp not given (need to find out), but the motives of both the agent and the opponent are given. The value of the game is then given as

V(s) = { utility(s), if s is an end state; max_a V(s′), if the agent is active; min_a V(s′), if the opponent is active }

π_ag(s) = arg max_a V(s′),   π_opp(s) = arg min_a V(s′)

For the same game above, the minimax values are V(s_A) = min(−50, 50) = −50, V(s_B) = min(1, 3) = 1, V(s_C) = min(−5, 15) = −5, and V(s_0) = max(V(s_A), V(s_B), V(s_C)) = 1. So the minimax policy at the root is π_ag(s_0) = B, and the opponent's minimax policy is to pick the left ball in every bin.
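A minimal recursive sketch of minimax on the bins game. The encoding of the tree as nested dicts with numeric leaves is our own illustrative choice.

def minimax(node, agent_turn):
    # node: either a numeric utility (end state) or a dict of child nodes
    if not isinstance(node, dict):
        return node, None                       # utility(s) at an end state
    values = {a: minimax(child, not agent_turn)[0] for a, child in node.items()}
    best = max(values, key=values.get) if agent_turn else min(values, key=values.get)
    return values[best], best

# Root: the agent picks a bin; then the opponent picks the left or right ball.
game = {"A": {"L": -50, "R": 50},
        "B": {"L": 1,   "R": 3},
        "C": {"L": -5,  "R": 15}}
print(minimax(game, agent_turn=True))           # (1, 'B')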

Expectiminimax policy
Three players: agent, opponent, and random nature. π_ag not given (need to find out), π_opp not given (need to find out), and π_nature is assumed; the motives of both the agent and the opponent are given. The value of the game is then given as

V(s) = { utility(s), if s is an end state; max_a V(s′), if the agent is active; min_a V(s′), if the opponent is active; ∑_a π_nature(s, a) V(s′), if nature is active }

π_ag(s) = arg max_a V(s′),   π_opp(s) = arg min_a V(s′)

Speeding up minimax
Evaluation: you can evaluate V_minimax(s) on a depth-limited tree by adding artificial values to the intermediate states (as there is no real utility until an end state).
For example, in a chess game, the utility (and value) is +∞, 0, or −∞ only at the end of the game (checkmate or a draw), with no real values at intermediate states. But you can allocate artificial values at intermediate states based on how many chess pieces the agent or the opponent has at a particular state. How many pawns, rooks, knights, etc.? Is the queen taken? And so on. For example,

Eval(s) = 10^100 (K − K′) + 9(Q − Q′) + 5(R − R′) + 3(B − B′ + N − N′) + 1(P − P′)

or, Eval(s) = (number of legal moves of the agent − that of the opponent), and so on.

Alpha Beta Pruning


Lower bound on max nodes = α_s. Upper bound on min nodes = β_s. No interval? → drop that subtree. Has an interval? → keep updating α_s, β_s.

α_s = max_{s′ ⪯ s} a_{s′},   β_s = min_{s′ ⪯ s} b_{s′}

If at any node,
1. (≥ α_s) and (≤ β_s) do not overlap → do not explore that node.
2. (≥ α_s) and (≤ β_s) overlap → keep exploring.

Thus, we are reducing the search space.
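A minimal sketch of minimax with alpha-beta pruning on the same kind of game tree (same illustrative nested-dict encoding as the minimax sketch above).

def alphabeta(node, agent_turn, alpha=float("-inf"), beta=float("inf")):
    # node: numeric utility (end state) or dict of child nodes
    if not isinstance(node, dict):
        return node
    if agent_turn:                               # max node: raises alpha
        value = float("-inf")
        for child in node.values():
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                    # the intervals no longer overlap
                break                            # prune the remaining children
        return value
    else:                                        # min node: lowers beta
        value = float("inf")
        for child in node.values():
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

game = {"A": {"L": -50, "R": 50}, "B": {"L": 1, "R": 3}, "C": {"L": -5, "R": 15}}
print(alphabeta(game, agent_turn=True))          # 1, the same value as plain minimax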

Learning in the setting of games

Learn from data: Eval(s) = v(s; ⃗w) = ⃗w · ⃗ϕ(s). Here ⃗ϕ(s) → feature vector. For example,

⃗ϕ(s) = [ Q_ag − Q_opp,  Legal moves_ag − Legal moves_opp,  · · · ]^T

(s, a, r, s′) → say we have got this from some episode data. Then Pred(⃗w) = v(s; ⃗w) and Target = r + γ v(s′; ⃗w) (we will treat Target as a constant, not a function of ⃗w; r = 0 at intermediate states, γ = 1). Therefore,

Cost(⃗w) = (1/2) (Pred(⃗w) − Target)²,   ∇_⃗w Cost(⃗w) = (∂/∂⃗w) Cost(⃗w) = (Pred(⃗w) − Target) ∇_⃗w Pred(⃗w)

Therefore, the gradient descent update is

⃗w := ⃗w − η (Pred(⃗w) − Target) ∇_⃗w Pred(⃗w)

TD (Temporal Difference) learning


On each (s, a, r, s′),

⃗w := ⃗w − η [ v(s; ⃗w) − (r + γ v(s′; ⃗w)) ] ∇_⃗w v(s; ⃗w)

If v(s; ⃗w) = ⃗w · ⃗ϕ(s) (linear), then ∇_⃗w v(s; ⃗w) = ⃗ϕ(s). Therefore, the update equation becomes

⃗w := ⃗w − η [ v(s; ⃗w) − (r + γ v(s′; ⃗w)) ] ⃗ϕ(s)
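A minimal sketch of one TD update with a linear value function. The feature function, the learning rate, and the single made-up (s, a, r, s′) transition are illustrative assumptions.

import numpy as np

def td_update(w, phi, s, r, s_next, eta=0.1, gamma=1.0):
    # v(s; w) = w . phi(s); one temporal-difference update for (s, a, r, s')
    v_s, v_next = w @ phi(s), w @ phi(s_next)
    return w - eta * (v_s - (r + gamma * v_next)) * phi(s)

# Illustrative features: [piece-count difference, legal-move difference]
def phi(s):
    return np.array(s, dtype=float)

w = np.array([0.5, -0.2])
s, r, s_next = (1.0, 2.0), 0.0, (0.0, 3.0)      # a made-up intermediate transition
w = td_update(w, phi, s, r, s_next)
print(w)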

Simultaneous games
Example: rock-paper-scissors. Another similar game is the Two-Finger Morra game. Rules:
- Players A and B show 1 or 2 fingers simultaneously.
- If both choose 1, A gets a reward of 2.
- If both choose 2, A gets a reward of 4.
- Otherwise, A gets a reward of −3.

Payoff Matrix: Let A be the agent and B be the opponent. The actions of A and B are denoted as a ∈ {1, 2} and b ∈ {1, 2}, respectively. The payoff matrix V(a, b) is given by:

              B = 1    B = 2
V(a, b):  A = 1     2      −3
          A = 2    −3       4

Pure Strategy: (similar to a deterministic policy) a single action a is chosen from the action set.
Mixed Strategy: (a probability distribution) a probability distribution π(a) is defined over the action set, satisfying 0 ≤ π(a) ≤ 1 and ∑_a π(a) = 1.

For the Two-Finger Morra game, if a player always chooses 2 fingers, it corresponds to a pure strategy with:

π(1) = 0,   π(2) = 1.

If the strategy is to show 1 finger half the time and 2 fingers the other half, it is a mixed strategy:

π = [π(1), π(2)] = [1/2, 1/2].
Given the strategies of agent A and opponent B, how do we evaluate them? We compute the value:

V(π_A, π_B) = ∑_{a,b} π_A(a) π_B(b) V(a, b).

For the Two-Finger Morra game, given π_A = [1, 0] and π_B = [1/2, 1/2], we compute:

V(π_A, π_B) = π_A(1) π_B(1) V(1, 1) + π_A(1) π_B(2) V(1, 2) + π_A(2) π_B(2) V(2, 2) + π_A(2) π_B(1) V(2, 1)
= 1 × (1/2) × 2 + 1 × (1/2) × (−3) + 0 × (1/2) × 4 + 0 × (1/2) × (−3)
= −1/2   (bad for agent A under these strategies).
2
How will players A and B choose their strategies? Player A (agent) wants to maximize V(a, b); Player B (opponent) wants to minimize V(a, b). This leads to the minimax strategy:

max_{π_A} min_{π_B} V(π_A, π_B) = min_{π_B} max_{π_A} V(π_A, π_B).

This is known as von Neumann's minimax theorem. Non-zero-sum games: utility of agent + utility of opponent ≠ 0.

Factor Graphs
A variable-based model (a transition from state-based models). The idea in state-based models was to specify locally, optimize globally, and find an action sequence.
Question: How do we 3-color the 7 provinces of Australia so that no two neighboring provinces have the same color? The colors available are {Red, Green, Blue} (the domain).

The 7 provinces of Australia are: 1) WA (Western Australia), 2) NT (Northern Territory), 3) SA (South Australia), 4) QLD (Queensland), 5) NSW (New South Wales), 6) VIC (Victoria), 7) TAS (Tasmania). One possible solution: {WA : R, NT : G, SA : B, QLD : R, NSW : G, VIC : R, TAS : G}. There can be other solutions as well.

Now how do we solve such problems? Each province here is treated as a variable, and the solution to the problem assigns values (choices of colors here) to these variables while respecting the constraints. Here,

Variables X = {WA, NT, SA, QLD, NSW, VIC, TAS};   Domain_i = {R, G, B}.
Factors f_1(X) = [WA ≠ NT], f_2(X) = [NT ≠ QLD], · · · and so on.

Scope: the variables on which a factor depends. The scope of f_1(X) is {WA, NT}.
Arity: the number of variables in the scope. The arity of f_1(X) is 2.
Complete assignment: x = {WA : R, NT : G, · · · and so on} → one choice for every variable.

weight(x) = ∏_{j=1}^{m} f_j(x) → the product of all factors under a complete assignment x.

Our objective is to find an assignment x such that the weight is maximized. Therefore, the optimum assignment is

x* = arg max_x weight(x)

Dependent factors D(partial assignment, new variable) = the set of factors depending on the new variable and the partially assigned variables, e.g.,

D({WA : R, NT : G}, SA) = {[WA ≠ SA], [NT ≠ SA]}.

Another Example: Player Decision Making


Consider three players P1 , P2 , P3 choosing between Red (R) and Blue (B). Their choices depend on
preferences and constraints modeled using factors:

P1 , P2 , P3 are variables, and R, B are values assigned to these variables, Pi ∈ { R, B} (domain).

This means that factors f determine the likelihood or preference of a certain color choice based on
dependencies.

The constraints and preferences are: P_1 must pick B, and P_3 prefers R; P_1 and P_2 must pick the same, and P_2 and P_3 prefer to pick the same. f_j(P) ≥ 0 are factors → functions containing these constraints or conditions:

f_1(P_1) = [P_1 == B],   f_2(P_1, P_2) = [P_1 == P_2],   f_3(P_2, P_3) = [P_2 == P_3] + 2,   f_4(P_3) = [P_3 == R] + 1.
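With factors this small, the maximum-weight assignment can be found by brute force. A minimal sketch; the lambda encoding and the itertools enumeration are our own choices.

from itertools import product

# Factors for the three-player example (R = Red, B = Blue)
f1 = lambda p1: 1 if p1 == "B" else 0
f2 = lambda p1, p2: 1 if p1 == p2 else 0
f3 = lambda p2, p3: (1 if p2 == p3 else 0) + 2
f4 = lambda p3: (1 if p3 == "R" else 0) + 1

def weight(p1, p2, p3):
    return f1(p1) * f2(p1, p2) * f3(p2, p3) * f4(p3)   # product of all factors

best = max(product("RB", repeat=3), key=lambda x: weight(*x))
print(best, weight(*best))      # ('B', 'B', 'R') with weight 4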

Constraint Satisfaction Problem (CSP)


A CSP is a factor graph where all factors are constraints, i.e., f_j(x) ∈ {0, 1} ∀ j ∈ [1, m]. The constraints are satisfied iff f_j(x) = 1 ∀ j ∈ [1, m], i.e., weight(x) = 1.
If weight(x) = 1 → a consistent assignment; if weight(x) = 0 → an inconsistent assignment.
Now, how do we go about solving these?

Goal: Find an assignment x that is consistent.

One thing we can do is to start with a partial assignment and then extend it. But we cannot do it randomly.

CSP algorithm (a pseudocode)

input → partial assignment x, current domains D_i of the unassigned variables
if x is complete
    return x (a consistent assignment)
else
    choose an unassigned variable X_i
    for each value v in the domain D_i of the chosen X_i
        δ := ∏_{f_j ∈ D(x, X_i)} f_j(x ∪ {X_i : v})
        if δ = 0
            continue
        else
            update the domains of the unassigned variables given X_i : v
            repeat with partial assignment x ∪ {X_i : v} and the updated domains

Consider a new problem: object tracking. X_i → location variables.

This is a chain-structured factor graph with variables {X_i}, observation factors O_i(x_i), and transition factors t_i(x_i, x_{i+1}).

Conditioning: Goal - try to disconnect the graph and thus simplify the problem.
Markov Blanket: the Markov Blanket of a set of variables is another set of variables that need to be conditioned on to separate the first set from the graph. The Markov Blanket of {VIC} is {SA, NSW}; the Markov Blanket of {WA, NT} is {SA, QLD}, and so on.

Condition on X_2 = B: remove X_2 and the factor f(x_1, x_2), and add a new factor g(x_1) = f(x_1, B).

Elimination: Goal - same as conditioning: disconnect the graph and simplify. But it considers all values instead of conditioning on one value.

Eliminate X_2: remove X_2 and f, and add a new factor h(x_1) = max_{x_2} f(x_1, x_2).

——————————— Volume 4 ———————————
Bayesian Network
Factor graphs + Probability.
Probability Review: Random variables (denoted X, Y, Z, etc.) are variables with unknown values, but
there is a known probability distribution over a random variable.

Joint distribution: Joint distribution P( X, Y, Z ) is the probability distribution over multiple random
variables, e.g., X, Y, Z, etc.

• Notation: P(S = s)

– Uppercase S represents a random variable.


– Lowercase s represents a value that the random variable can take (a particular realization of
the random variable).

• P(S = s) represents a single probability value.

• P(S) represents the whole probability distribution.

• P(S) is a marginal distribution of a joint distribution P(S, R).

• P(S | R = r ) is a conditional distribution, conditioned on R = r.

Probabilistic Inference
We consider a joint distribution P(S, R, T, A, . . . ) as a probabilistic database, an oracle or source of
truth. Suppose the joint distribution is P(S, R, T, A). What we are going to do is probabilistic inference.

P( R | T = 1, A = 0)

Here, R is the query variable. The expression is conditioned on evidence (T = 1, A = 0). S is marginal-
ized out (not a query, not a condition).

So, probabilistic inference is the process of finding the probability of some set of query variables,
conditioned on some set of conditioning variables, which are set to particular values. It is a combination
of marginalization and conditioning.

An example: Alarm System


Let’s look at an example. Suppose you have an alarm in your house that detects earthquakes and
burglaries. Assume these two incidents are independent of each other. Now, suppose you have heard
the alarm. What are the probabilities of an earthquake and burglary in your mind?
Next, you check your home and find no missing items, but you hear on TV that there has been an
earthquake in the area. After this, how do the probabilities in your mind change?

The network has nodes B, E, and A, with edges B → A and E → A. Arrows here mean that B (or E) causes A, not the opposite. The joint distribution is given by:

P(B = b, E = e, A = a) = p(b) p(e) p(a | b, e)

p(b), p(e), p(a | b, e) are local conditional distributions (which need to be defined).

Formalization
The variables involved are: Earthquake (E), Burglary (B), Alarm (A). B and E are independent,
so there is no edge between them in a Bayesian network. However, both B and E are parents of A,
meaning they cause A, not the other way around. From this representation, we can perform an inference.

In summary, let X = ( X1 , X2 , . . . , Xn ) be a collection of random variables. A Bayesian network is


a directed acyclic graph that specifies a joint distribution over X as a product of local conditional
distributions, one for each node:
P(X_1 = x_1, . . . , X_n = x_n) = ∏_{i=1}^{n} P(x_i | x_{parents(i)})

Now, marginalizing out A:

P(B = b, E = e) = ∑_a P(B = b, E = e, A = a) = ∑_a p(b) p(e) p(a | b, e) = p(b) p(e) ∑_a p(a | b, e) = p(b) p(e)

This shows that marginalizing out a leaf node yields a Bayesian network without that node.

Some example networks

Consider a network with nodes C, A, H, I, edges C → H, A → H, and A → I, and local conditional distributions p(c), p(a), p(h | c, a), p(i | a). The joint distribution is:

P(C = c, A = a, H = h, I = i) = p(c) p(a) p(h | c, a) p(i | a)

Markov Model

Markov Model: a chain X_1 → X_2 → X_3 → X_4 → · · ·, where the X_i are all observed nodes.

Hidden Markov Model (HMM)

Hidden Markov Model (HMM): a chain of hidden nodes H_1 → H_2 → H_3 → H_4 → · · · (not observed), where each H_i emits an observed node E_i.

Naive Bayes

Naive Bayes Model: a class node Y with children W_1, W_2, · · · , W_L. Here, Y is not observed and W_1, W_2, · · · , W_L are observed. In classification, given W (the feature vector), find Y (the class label).

Generally,

Generally, we need to estimate H given E, where H is the hidden variable and E is the observed variable.

Inference
P( X3 = x3 | X2 = 5) =?, ∀ x3

Probabilistic Inference
Input: Bayesian network P( X1 = x1 , · · · , Xn = xn ); Evidence E = e, E ⊆ X, Query Q ⊆ X.
Output: P( Q| E = e) or P( Q = q| E = e) for all q.

For example, in a Hidden Markov Model (HMM), H_i ∈ {1, · · · , k} are the actual but hidden data at time i and E_i ∈ {1, · · · , k} are sensor readings - not the actual data, but observable - at time i.

Start p(h_1); transition p(h_i | h_{i−1}); emission p(e_i | h_i) → local conditional distributions.

Joint distribution: P(H = h, E = e) = p(h_1) ∏_{i=2}^{n} p(h_i | h_{i−1}) ∏_{i=1}^{n} p(e_i | h_i).

We can ask many types of questions from here. A filtering question: P(H_3 | E_1 = e_1, E_2 = e_2, E_3 = e_3); a smoothing question: P(H_3 | E_1 = e_1, E_2 = e_2, E_3 = e_3, E_4 = e_4, E_5 = e_5). We can have any arbitrary question.

Now where do the local conditional distributions come from? These are similar to model parameters and need to be learned from the training data. Therefore, it's a supervised learning problem where the training data D_train are formed with example assignments to X, and the parameters to be learned, θ, are the local conditional probability distributions.

Example problem 1
Consider the simplest Bayesian network P( R = r ) = p(r ). Domain = {1, 2, 3, 4, 5}.
Therefore, here, the parameters are θ = ( p(1), p(2), · · · , p(5)).

If the training data (observed assignments to R) are D_train = {1, 3, 4, 4, 4, 4, 4, 5, 5, 5}, then the parameter estimate is θ = (1/10, 0/10, 1/10, 5/10, 3/10) = (0.1, 0, 0.1, 0.5, 0.3).

Example problem 2
Consider another Bayesian network P( G = g, R = r ) = p( g) p(r | g), G ∈ {d, c}, R ∈ {1, 2, 3, 4, 5}.
Let’s say the training data Dtrain = {(d, 4), (d, 4), (d, 5), (c, 1), (c, 5)}. Therefore, the model param-
eter
θ = ( p(d), p(c), p(1|d), p(2|d), p(3|d), p(4|d), p(5|d), p(1|c), p(2|c), p(3|c), p(4|c), p(5|c)), i.e.,
θ = (3/5, 2/5, 0, 0, 0, 2/3, 1/3, 1/2, 0, 0, 0, 1/2).

Example problem 3
Consider another Bayesian network P( G = g, A = a, R = r ) = p( g) p( a) p(r | g, a), G ∈ {d, c}, A ∈
{ ao , ai }, R ∈ {1, 2, 3, 4, 5}. Let’s say the training data Dtrain = {(d, ao , 3), (d, ai , 5), (d, ao , 1), (c, ao , 5), (c, ai , 4)}.
Therefore, the model parameter θ = { p( g), p( a), p(r | g, a)} → enumerate like before.

From the training data, p(d) = 3/5, p(c) = 2/5, p( ao ) = 3/5, p( ai ) = 2/5, p(1|d, ao ) =
1/2, p(3|d, ao ) = 1/2, p(5|d, ai ) = 1, p(5|c, ao ) = 1, p(4|c, ai ) = 1.
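The parameter estimates in these examples are just normalized counts. A minimal sketch for Example problem 2; the data encoding as Python tuples is our own choice.

from collections import Counter

D_train = [("d", 4), ("d", 4), ("d", 5), ("c", 1), ("c", 5)]

g_counts = Counter(g for g, r in D_train)
gr_counts = Counter(D_train)

p_g = {g: c / len(D_train) for g, c in g_counts.items()}            # p(g)
p_r_given_g = {(r, g): gr_counts[(g, r)] / g_counts[g]              # p(r | g)
               for (g, r) in gr_counts}

print(p_g)                      # {'d': 0.6, 'c': 0.4}
print(p_r_given_g[(4, "d")])    # 2/3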

Maximum likelihood objective, Maximum marginal likelihood, Expectation Maximization


Please cover this part from the handwritten note.

Logic
Logical models, inference, and learning. If a + b = 5 and a − b = 3, what is a?
Deterministic and rule-based.
Logic language goals: 1) represent knowledge, 2) reason with it.
Ingredients of logic: 1) syntax (the set of valid formulas, valid expressions), 2) semantics (specify models for each formula, the meanings of expressions), 3) inference rules (operations on formulas).

Propositional symbols (atomic formulas): A, B, C


Logical connectives: ¬, ∧, ∨, →, ↔ (not/negation, and/conjunction, or/disjunction, implication, bidi-
rectional implication)
Build-up formulas: if f and g are formulas, so are the following: ¬ f , f ∧ g, f ∨ g, f → g, f ↔ g, etc.

Examples of formulas: A, ¬A, ¬B → C, ¬A ∧ (¬B → C) ∨ (¬B → D), ¬¬A, etc.

Not formulas: A¬B, A + B, etc.

Model: a model w in propositional logic is an assignment of truth values to the propositional symbols. Example: for propositional symbols A, B (n = 2), there will be 2^n = 2^2 = 4 possible models W: {A : 0, B : 0}, {A : 0, B : 1}, {A : 1, B : 0}, {A : 1, B : 1}.

Interpretation function I(f, w): I takes a formula f and a model w as inputs and returns true (1) if w satisfies f and false (0) if it doesn't.

Example problem 1:
Given formula f = (¬ A ∧ B) ↔ C, model w : { A : 1, B : 1, C : 0}, what is I ( f , w) =?

Solution: I ( A, w) = 1, I ( B, w) = 1, I (C, w) = 0 (given). Therefore,

I (¬ A, w) = 0
=⇒ I (¬ A ∧ B, w) = 0
=⇒ I ((¬ A ∧ B) ↔ C, w) = 1 ( ans.)

M ( f ): set of models w where I ( f , w) = 1.

Knowledge base (KB): set of formulas representing their conjunction (intersection)

M(KB) = ∩ f ∈KB M ( f ).

Example problem 2:
If f = r ∨ w, calculate M( f ).

Solution: Given that f = r ∨ w, i.e.,

w=0 w=1
r=0 0 1
r=1 1 1

Therefore, M( f ) = {{r : 0, w : 1}, {r : 1, w : 0}, {r : 1, w : 1}}.

Example problem 3:
Given, f 1 = r, f 2 = r → w, calculate M (KB).

Solution: Given that f 1 = r, i.e.,


w=0 w=1
r=0 0 0
r=1 1 1

Therefore,

M ( f 1 ) = {{r : 1, w : 0}, {r : 1, w : 1}} (1)

Also given, f 2 = r → w, i.e.,


w=0 w=1
r=0 1 1
r=1 0 1
Therefore,

M( f 2 ) = {{r : 0, w : 0}, {r : 0, w : 1}, {r : 1, w : 1}} (2)

Finally from equations (1) and (2), KB = { f 1 , f 2 },

M(KB) = M( f 1 ) ∩ M ( f 2 ) = {{r : 1, w : 1}} .
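A minimal sketch that enumerates models and computes M(f), M(KB), and entailment for the example above. Representing a model as a dict and a formula as a Python function of that dict is our own encoding.

from itertools import product

symbols = ["r", "w"]
models = [dict(zip(symbols, vals)) for vals in product([0, 1], repeat=len(symbols))]

def M(formula):
    # set of models (stored as frozensets of items) where the formula is true
    return {frozenset(w.items()) for w in models if formula(w)}

f1 = lambda w: w["r"] == 1                         # f1 = r
f2 = lambda w: (not w["r"]) or w["w"]              # f2 = r -> w  (i.e., not r or w)

M_KB = M(f1) & M(f2)                               # M(KB) = intersection
print(M_KB)                                        # {frozenset({('r', 1), ('w', 1)})}

# Entailment check: KB |= w iff M(KB) is a subset of M(w)
print(M_KB <= M(lambda m: m["w"] == 1))            # True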

As you add new knowledge (a new formula f) to your knowledge base KB, the set of models M(KB) shrinks if f adds new information. If f is not new, M(KB) doesn't shrink.

Entailment
KB entails f (written KB |= f ) if M(KB) ⊆ M( f ). Example: r ∧ s |= s.

Contradiction
KB contradicts f iff M(KB) ∩ M(f) = ∅. Example: r ∧ s contradicts ¬s.

Contingency
If f adds non-trivial information to KB

∅ ⊊ M (KB) ∩ M ( f ) ⊊ M(KB)
Example: KB = {r }, f = {s}.

Contradiction
KB contradicts f iff KB entails ¬f.

Tell Function
For a KB and f : Entailment ⇒ already knew that, Contradiction ⇒ do not believe that, Contingency
⇒ learned something new.

Ask Function
For a KB and f : Entailment ⇒ yes, Contradiction ⇒ no, Contingency ⇒ don’t know.

Satisfiability
A knowledge base KB is satisfiable if M(KB) ≠ ∅.

To relate KB and a formula f:
If KB ∪ {¬f} is not satisfiable → entailment.
If KB ∪ {¬f} is satisfiable, and KB ∪ {f} is satisfiable → contingency.
If KB ∪ {¬f} is satisfiable, and KB ∪ {f} is not satisfiable → contradiction.

Inference Rules
Given a KB, what new formula f can be derived?

1. It’s raining (r), If it’s raining, it’s wet (r → w) ⇒ premises

2. It’s wet (w) ⇒ conclusion

r, r → w   (premises)   ⟹   w   (conclusion)

Modus ponens inference rule


For any propositional symbol p & q,
p, p → q
q
Generally, if f 1 , . . . , f k , g are formulas, then the following is an inference rule:

f_1, . . . , f_k
g

i.e., if f_1, . . . , f_k are true, then g is also true.

If we see f 1 , . . . , f k , we can add g. Start: KB = { p, p ⇒ q}


After modus ponens inference, KB = { p, p ⇒ q, q}

Derivation: KB derives/proves f (written KB ⊢ f ) iff f eventually gets added to KB by inference rule.

Soundness: A set of inference rules is sound iff:

{ f : KB ⊢ f } ⊆ { f : KB |= f }

Nothing but truth. With premise → KB and conclusion → f . It’s sound if KB entails f:

M(KB) ⊆ M( f )

Completeness: “whole truth”. A set of inference rules is complete if:

{ f : KB |= f } ⊆ { f : KB ⊢ f }

Definite Clause
P1 ∧ . . . ∧ Pk → q
If P1 . . . Pk hold, q holds.

Horn Clause
A definite clause (P_1 ∧ . . . ∧ P_k → q) or a goal clause (P_1 ∧ . . . ∧ P_k → False).

Modus Ponens Inference Rule with Horn Clause


P1 , . . . , Pk , ( P1 ∧ . . . ∧ Pk ) → q
q
Propositional Logic → Any legal combination of symbols.

Propositional logic with horn clause:

( P1 ∧ . . . ∧ Pk ) → q

Implication and Disjunction


A → C can be written as ¬ A ∨ C (implication → disjunction)

Also A ∧ B → C can be written as ¬ A ∨ ¬ B ∨ C

De Morgan’s Laws
1)¬( p ∧ q) = ¬ p ∨ ¬q, 2)¬( p ∨ q) = ¬ p ∧ ¬q
Literal: Either p or ¬ p are literals, where p is a propositional symbol.
Clause: Disjunction of literals.
Horn Clause: A clause with at most one positive literal.

Modus Ponens Rewritten with Disjunction


A, ¬ A ∨ C
C

General Clauses
Can have any number of literals: ¬ A ∨ ¬ B ∨ C ∨ ¬ D ∨ E ∨ · · ·

Resolution Inference Rule


P ∨ S, ¬S ∨ V
P∨V
In general, resolution inference rule:

f 1 ∨ . . . ∨ f h ∨ g, ¬ g ∨ v1 ∨ . . . ∨ vm
f 1 ∨ . . . ∨ f h ∨ v1 ∨ . . . ∨ v m

Conjunctive Normal Form (CNF)
A CNF formula is a conjunction of clauses:

( A ∨ B ∨ ¬C ) ∧ (¬ B ∨ D )

Each formula in the KB (knowledge base) is a clause. Every formula in propositional logic can be
converted to an equivalent CNF formula:

If M( f ) = M( f ′ ), then f ′ is the CNF of f

Example: Convert to CNF


Given: (m → s) → b. Convert it to CNF.

Solution:
(m → s) → b = ¬(m → s) ∨ b = ¬(¬m ∨ s) ∨ b
= (m ∧ ¬s) ∨ b = (m ∨ b) ∧ (¬s ∨ b) (by distributive law)
Answer:
(m ∨ b) ∧ (¬s ∨ b)

General Conversion Rules:

f ↔ g = ( f → g) ∧ ( g → f )
f → g = ¬f ∨ g
¬( f ∧ g) = ¬ f ∨ ¬ g
¬( f ∨ g) = ¬ f ∧ ¬ g
¬¬ f = f
f ∨ ( g ∧ h) = ( f ∨ g) ∧ ( f ∨ h)

Any propositional logic formula can be converted to a CNF formula.

KB ∪ {¬ f } is unsatisfiable ⇐⇒ KB |= f (proof by contradiction)
