
CS 188: Artificial Intelligence

Bayesian Networks

Instructor: Evgeny Pobachienko — UC Berkeley


[Slides credit: Dan Klein, Pieter Abbeel, Anca Dragan, Stuart Russell, Satish Rao, and many others]
Recall: Random Variables
o Recall: a random variable is some aspect of the world about which we (may) have uncertainty
o R = Is it raining?
o T = Is it hot?
o D = How long will it take to drive to work?

o Capital letters: Random variables


o Lowercase letters: values that the R.V. can take
o r ∈ {+r, –r}
o t ∈ {+t, –t}
o d ∈ [0, ∞)
Probability Distributions
o Associate a probability with each value

o Temperature:
T P
hot 0.5
cold 0.5

▪ Weather:
W P
sun 0.6
rain 0.1
fog 0.3
meteor 0.0
Joint Distributions
o A joint distribution over a set of random variables X_1, ..., X_n
specifies a probability for each assignment (or outcome): P(x_1, ..., x_n)

o Must obey:
  P(x_1, ..., x_n) ≥ 0 (non-negativity)
  Σ over all assignments of P(x_1, ..., x_n) = 1 (normalization)

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

o Size of distribution if n variables with domain sizes d? d^n entries
o For all but the smallest distributions, impractical to write out!
Probability
[Venn diagram: events h (hot) and s (sun) inside the sample space U]

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3
Marginal Distributions
o Marginal distributions are sub-tables which eliminate variables
o Marginalization (summing out): combine collapsed rows by adding

Joint P(T, W):
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

Marginal P(T):
T P
hot 0.5
cold 0.5

Marginal P(W):
W P
sun 0.6
rain 0.4
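As an illustration of summing out (my own sketch, not part of the slides), the joint table above can be stored as a Python dictionary and marginalized directly; the encoding is an assumption chosen to mirror the table:

```python
# Hypothetical encoding of the joint distribution P(T, W) from the table above.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def marginal(joint, axis):
    """Sum out all variables except the one at position `axis` of the key tuple."""
    result = {}
    for assignment, p in joint.items():
        key = assignment[axis]
        result[key] = result.get(key, 0.0) + p
    return result

print(marginal(joint, 0))  # P(T): hot 0.5, cold 0.5
print(marginal(joint, 1))  # P(W): sun ~0.6, rain ~0.4 (up to float rounding)
```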
Probability
[Venn diagrams: events h and s inside the sample space U]

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

P(h) = P(h, s) + P(h, ~s)

P(s | h) = P(s, h) / P(h)
Conditional Distributions
o Conditional distributions are probability distributions
over some variables given fixed values of others
Conditional Distributions
Conditional distributions P(W | T):

P(W | T = hot):
W P
sun 0.8
rain 0.2

P(W | T = cold):
W P
sun 0.4
rain 0.6

Joint Distribution P(T, W):
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3
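A small sketch (not from the slides) of conditioning the same joint table in Python; the helper name and encoding are my own:

```python
# Hypothetical helper: compute P(W | T = t) from the joint table P(T, W) above.
joint = {
    ("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

def condition_on_T(joint, t):
    selected = {w: p for (ti, w), p in joint.items() if ti == t}
    z = sum(selected.values())            # this is P(T = t)
    return {w: p / z for w, p in selected.items()}

print(condition_on_T(joint, "hot"))    # {'sun': 0.8, 'rain': 0.2}
print(condition_on_T(joint, "cold"))   # {'sun': 0.4, 'rain': 0.6}
```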
Normalization Trick

Joint P(T, W):
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

P(W | T = cold):
W P
sun 0.4
rain 0.6
Normalization Trick

Joint P(T, W):
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

SELECT the joint probabilities matching the evidence (T = cold):
T W P
cold sun 0.2
cold rain 0.3

NORMALIZE the selection (make it sum to one):
W P
sun 0.4
rain 0.6
To Normalize
o (Dictionary) To bring or restore to a normal condition

o Procedure: make all entries sum to ONE


o Step 1: Compute Z = sum over all entries
o Step 2: Divide every entry by Z

o Example

Before (Z = 0.2 + 0.3 = 0.5):
W P
sun 0.2
rain 0.3

After normalizing (divide each entry by Z):
W P
sun 0.4
rain 0.6
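The two-step procedure translates directly to code; this is a minimal sketch assuming the table is stored as a dict:

```python
def normalize(table):
    """Step 1: compute Z = sum over all entries; Step 2: divide every entry by Z."""
    z = sum(table.values())
    return {k: v / z for k, v in table.items()}

print(normalize({"sun": 0.2, "rain": 0.3}))   # Z = 0.5 -> {'sun': 0.4, 'rain': 0.6}
```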
Probabilistic Inference
o Probabilistic inference: compute a desired
probability from other known probabilities
(e.g. conditional from joint)

o Probabilities change with new evidence:


o P(on time | no accidents, 5 a.m.) = 0.95
o P(on time | no accidents, 5 a.m., raining) = 0.80
o Observing new evidence causes beliefs to be updated
Inference by Enumeration

o P(W)?

S T W P
summer hot sun 0.30
summer hot rain 0.05
summer cold sun 0.10
summer cold rain 0.05
winter hot sun 0.10
winter hot rain 0.05
winter cold sun 0.15
winter cold rain 0.20
Inference by Enumeration

o P(W)?

S T W P
summer hot sun 0.30
summer hot rain 0.05
summer cold sun 0.10
summer cold rain 0.05
winter hot sun 0.10
winter hot rain 0.05
winter cold sun 0.15
winter cold rain 0.20

P(sun) = .3 + .1 + .1 + .15 = .65
P(rain) = 1 - .65 = .35
Inference by Enumeration
o P(W | winter, hot)?

S T W P
summer hot sun 0.30
summer hot rain 0.05
summer cold sun 0.10
summer cold rain 0.05
winter hot sun 0.10
winter hot rain 0.05
winter cold sun 0.15
winter cold rain 0.20
Inference by Enumeration
o P(W | winter, hot)?

S T W P
summer hot sun 0.30
summer hot rain 0.05
summer cold sun 0.10
summer cold rain 0.05
winter hot sun 0.10
winter hot rain 0.05
winter cold sun 0.15
winter cold rain 0.20

P(sun | winter, hot) ∝ .1
P(rain | winter, hot) ∝ .05
P(sun | winter, hot) = 2/3
P(rain | winter, hot) = 1/3
Inference by Enumeration
o General case:
  o Evidence variables: E_1, ..., E_k = e_1, ..., e_k
  o Query* variable: Q
  o Hidden variables: H_1, ..., H_r
  (all variables together: Q, E_1 ... E_k, H_1 ... H_r)
▪ We want: P(Q | e_1, ..., e_k)

▪ Step 1: Select the entries consistent with the evidence
▪ Step 2: Sum out H to get joint of Query and evidence
▪ Step 3: Normalize
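A rough Python sketch of these three steps, applied to the season/temperature/weather table above; the function and encoding are my own, not course code:

```python
# Joint distribution over (S, T, W); the variable order in each key is assumed.
joint = {
    ("summer", "hot", "sun"): 0.30, ("summer", "hot", "rain"): 0.05,
    ("summer", "cold", "sun"): 0.10, ("summer", "cold", "rain"): 0.05,
    ("winter", "hot", "sun"): 0.10, ("winter", "hot", "rain"): 0.05,
    ("winter", "cold", "sun"): 0.15, ("winter", "cold", "rain"): 0.20,
}

def infer(joint, query_index, evidence):
    # Step 1: select the entries consistent with the evidence (index -> value).
    selected = {k: p for k, p in joint.items()
                if all(k[i] == v for i, v in evidence.items())}
    # Step 2: sum out the hidden variables, keeping only the query variable.
    summed = {}
    for k, p in selected.items():
        summed[k[query_index]] = summed.get(k[query_index], 0.0) + p
    # Step 3: normalize.
    z = sum(summed.values())
    return {v: p / z for v, p in summed.items()}

print(infer(joint, 2, {}))                       # P(W): sun 0.65, rain 0.35
print(infer(joint, 2, {0: "winter", 1: "hot"}))  # P(W | winter, hot): sun 2/3, rain 1/3
```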
Independence
Independence
o Two variables X and Y are independent if:
  P(x, y) = P(x) P(y)   for all x, y

o This says that their joint distribution factors into a product of two simpler distributions
o Another form:
  P(x | y) = P(x)   for all x, y

o We write: X ⊥ Y
o Independence is a simplifying modeling assumption


o Empirical joint distributions: at best “close” to independent
o What could we assume for {Weather, Traffic, Cavity, Toothache}?
Independence
[Grid diagrams: the sample space U split by h / ~h and by s / ~s]

P(s | h) = P(s, h) / P(h)

P(s, h) = P(s | h) · P(h)
Example: Independence?

P(T):
T P
hot 0.5
cold 0.5

P(W):
W P
sun 0.6
rain 0.4

Observed joint P(T, W):
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

Product P(T) P(W):
T W P
hot sun 0.3
hot rain 0.2
cold sun 0.3
cold rain 0.2
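One way to see that T and W are not independent here (an illustrative sketch with my own encoding, not from the slides) is to compare the observed joint against the product of the marginals:

```python
# Observed joint P(T, W) and its marginals, taken from the tables above.
joint = {
    ("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}
p_t = {"hot": 0.5, "cold": 0.5}
p_w = {"sun": 0.6, "rain": 0.4}

for (t, w), p in joint.items():
    print(t, w, p, "vs", p_t[t] * p_w[w])
# hot sun: 0.4 vs 0.3, hot rain: 0.1 vs 0.2, ... -> T and W are NOT independent here.
```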
Example: Independence
o N fair, independent coin flips:

H 0.5 H 0.5 H 0.5


T 0.5 T 0.5 T 0.5
Conditional Independence
Conditional Independence
o Unconditional (absolute) independence very rare (why?)

o Conditional independence is our most basic and robust


form of knowledge about uncertain environments.

o (X is conditionally independent of Y) given Z

if and only if:
  P(x, y | z) = P(x | z) P(y | z)   for all x, y, z

or, equivalently, if and only if
  P(x | z, y) = P(x | z)   for all x, y, z

Conditional Independence
o What about this domain:
o Traffic
o Umbrella
o Raining
Conditional Independence
o What about this domain:
o Fire
o Smoke
o Alarm
Conditional Independence and the Chain Rule
o Chain rule:
  P(x_1, x_2, ..., x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ...

o Trivial decomposition:
  P(Rain, Traffic, Umbrella) = P(Rain) P(Traffic | Rain) P(Umbrella | Rain, Traffic)

o With assumption of conditional independence (Umbrella ⊥ Traffic | Rain):
  P(Rain, Traffic, Umbrella) = P(Rain) P(Traffic | Rain) P(Umbrella | Rain)
o Bayesian Networks/graphical models help us express conditional independence


assumptions
Bayesian Networks: The Big Picture
Bayesian Networks: The Big Picture
o Two problems with using full joint distribution
tables as our probabilistic models:
o Unless there are only a few variables, the joint is WAY
too big to represent explicitly
o Hard to learn (estimate) anything empirically about
more than a few variables at a time

o Bayesian Networks: a technique for describing


complex joint distributions (models) using
simple, local distributions (conditional
probability tables, or CPTs)
o More properly called graphical models
o We describe how variables locally interact
o Local interactions chain together to give global, indirect
interactions
Example Bayes Net: Insurance
Graphical Model Notation

o Nodes: variables (with domains)


o Can be assigned (observed) or unassigned
(unobserved)

o Arcs: interactions
o MAY indicate influence between variables
o Formally: encode conditional independence
relationships (more later)

o For now: arrows mean that there may be a


causal relationship between the two
variables
Bayes Net Semantics
o A set of nodes, one per variable X
o A directed, acyclic graph
o A conditional distribution for each node
  o A collection of distributions over X, one for each combination of parents' values: P(X | A_1, ..., A_n)
  o CPT: conditional probability table
o Description of a potentially "causal" process

[Figure: parent nodes A_1, ..., A_n with arrows into node X]

A Bayes net = Topology (graph) + Local Conditional Probabilities


Example Bayes Net: Car
Example: Coin Flips
o N independent coin flips

X1 X2 Xn

o No interactions between variables: absolute independence


Example: Traffic
o Variables:
o R: It rains
o T: There is traffic

o Model 1: independence (R and T are separate nodes, no arc)
▪ Model 2: rain may cause traffic (arc R → T)
o Why is an agent using model 2 better?
Bayes Net: DAG + CPTs
Example: Alarm Network
o Variables
o B: Burglary
o A: Alarm goes off
o M: Mary calls
o J: John calls
o E: Earthquake!
[Network: Burglary → Alarm ← Earthquake; Alarm → John calls; Alarm → Mary calls]
Example: Humans
o G: human’s goal / human’s reward parameters
o S: state of the physical world
o A: human’s action

Example: Traffic II
o Variables
o T: Traffic
o R: It rains
o L: Low pressure
o D: Roof drips
o B: Ballgame
o C: Cavity
Bayesian Network Semantics
Probabilities in BNs
o Bayes nets implicitly encode joint distributions
o As a product of local conditional distributions
o To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together:
  P(x_1, x_2, ..., x_n) = Π_i P(x_i | parents(X_i))

o Example: see the sketch below.
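A minimal sketch of this product, under an assumed CPT encoding (the variable names, keys, and helper are illustrative, not the course's representation); the example numbers reuse the rain/traffic CPTs that appear on a later slide:

```python
# Sketch: joint probability of a full assignment as a product of local CPT entries.
# `cpts` maps each variable to (parents, table); tables are keyed by (value, parent_values).
def joint_probability(cpts, assignment):
    p = 1.0
    for var, (parents, table) in cpts.items():
        parent_vals = tuple(assignment[pa] for pa in parents)
        p *= table[(assignment[var], parent_vals)]
    return p

# Two-node network R -> T with P(+r) = 1/4, P(+t | +r) = 3/4, P(+t | -r) = 1/2.
cpts = {
    "R": ((), {("+r", ()): 0.25, ("-r", ()): 0.75}),
    "T": (("R",), {("+t", ("+r",)): 0.75, ("-t", ("+r",)): 0.25,
                   ("+t", ("-r",)): 0.50, ("-t", ("-r",)): 0.50}),
}
print(joint_probability(cpts, {"R": "+r", "T": "-t"}))  # 1/4 * 1/4 = 0.0625
```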
Probabilities in BNs
o Why are we guaranteed that setting
  P(x_1, x_2, ..., x_n) = Π_i P(x_i | parents(X_i))
results in a proper joint distribution?

o Chain rule (valid for all distributions):
  P(x_1, x_2, ..., x_n) = Π_i P(x_i | x_1, ..., x_{i-1})

o Assume conditional independences:
  P(x_i | x_1, ..., x_{i-1}) = P(x_i | parents(X_i))

→ Consequence:
  P(x_1, x_2, ..., x_n) = Π_i P(x_i | parents(X_i))
o Not every BN can represent every joint distribution
o The topology enforces certain conditional independencies
Example: Coin Flips

X1 X2 Xn

h 0.5 h 0.5 h 0.5


t 0.5 t 0.5 t 0.5

e.g. P(h, h, t, h) = P(h) P(h) P(t) P(h)

Only distributions whose variables are absolutely independent can be


represented by a Bayes’ net with no arcs.
Example: Traffic

P(R):
+r 1/4
-r 3/4

P(T | R):
+r +t 3/4
+r -t 1/4
-r +t 1/2
-r -t 1/2

Example entry of the joint: P(+r, -t) = P(+r) P(-t | +r) = 1/4 · 1/4 = 1/16
Example: Alarm Network
[Network: Burglary → Alarm ← Earthquake; Alarm → John calls; Alarm → Mary calls]

B P(B)          E P(E)
+b 0.001        +e 0.002
-b 0.999        -e 0.998

B E A P(A|B,E)
+b +e +a 0.95
+b +e -a 0.05
+b -e +a 0.94
+b -e -a 0.06
-b +e +a 0.29
-b +e -a 0.71
-b -e +a 0.001
-b -e -a 0.999

A J P(J|A)      A M P(M|A)
+a +j 0.9       +a +m 0.7
+a -j 0.1       +a -m 0.3
-a +j 0.05      -a +m 0.01
-a -j 0.95      -a -m 0.99

P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
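Reading one full assignment off these tables (a worked instance, not shown on the slide):

P(+b, -e, +a, +j, +m) = P(+b) P(-e) P(+a | +b, -e) P(+j | +a) P(+m | +a)
                      = 0.001 × 0.998 × 0.94 × 0.9 × 0.7 ≈ 5.9 × 10^-4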
Example: Traffic
o Causal direction

P(R):
+r 1/4
-r 3/4

P(T | R):
+r +t 3/4
+r -t 1/4
-r +t 1/2
-r -t 1/2

Joint P(R, T):
+r +t 3/16
+r -t 1/16
-r +t 6/16
-r -t 6/16
Example: Reverse Traffic
o Reverse causality?

P(T):
+t 9/16
-t 7/16

P(R | T):
+t +r 1/3
+t -r 2/3
-t +r 1/7
-t -r 6/7

Joint P(R, T):
+r +t 3/16
+r -t 1/16
-r +t 6/16
-r -t 6/16
Causality?
o When Bayes’ nets reflect the true causal patterns:
o Often simpler (nodes have fewer parents)
o Often easier to think about
o Often easier to elicit from experts

o BNs need not actually be causal


o Sometimes no causal net exists over the domain
(especially if variables are missing)
o E.g. consider the variables Traffic and Drips
o End up with arrows that reflect correlation, not causation

o What do the arrows really mean?


o Topology may happen to encode causal structure
o Topology really encodes conditional independence
Conditional Independence Assumptions
o Each node, given its parents, is conditionally independent of all its non-descendants in the graph

o Each node, given its Markov blanket, is conditionally independent of all other nodes in the graph

o The Markov blanket refers to the parents, children, and children's other parents.
Inference with Bayesian Networks
Inference
o Inference: calculating some useful quantity from a joint probability distribution

▪ Examples:
  ▪ Posterior probability: P(Q | E_1 = e_1, ..., E_k = e_k)
  ▪ Most likely explanation: argmax_q P(Q = q | E_1 = e_1, ...)
Inference by Enumeration
o General case:
  o Evidence variables: E_1, ..., E_k = e_1, ..., e_k
  o Query variable: Q
  o Hidden variables: H_1, ..., H_r
  (all variables together: Q, E_1 ... E_k, H_1 ... H_r)
▪ We want: P(Q | e_1, ..., e_k)

▪ Step 1: Select the entries consistent with the evidence
▪ Step 2: Sum out H to get joint of Query and evidence
▪ Step 3: Normalize
Inference by Enumeration in Bayes' Net

o Given unlimited time, inference in BNs is easy

[Network: B, E → A → J, M; joining nodes step by step collapses it into a single joint]

P(A | B, E) P(B) P(E) = P(A | B, E) P(B, E) = P(A, B, E)

P(J | A) P(M | A) P(A, B, E)
  = P(J, M | A) P(A, B, E)
  = P(J, M | A, B, E) P(A, B, E)
  = P(J, M, A, B, E)
Example: Traffic Domain

o Random Variables
  o R: Raining
  o T: Traffic
  o L: Late for class!

P(R):
+r 0.1
-r 0.9

P(T | R):
+r +t 0.8
+r -t 0.2
-r +t 0.1
-r -t 0.9

P(L | T):
+t +l 0.3
+t -l 0.7
-t +l 0.1
-t -l 0.9
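These CPTs can be written down directly and multiplied out with the chain rule; the following sketch (my own encoding, not course code) recovers P(+l) = 0.134, the same number the elimination steps later in the lecture produce:

```python
# Sketch: the three CPTs above as dictionaries, and the full joint P(R, T, L)
# obtained with the chain rule P(r, t, l) = P(r) P(t | r) P(l | t).
P_R = {"+r": 0.1, "-r": 0.9}
P_T_given_R = {("+r", "+t"): 0.8, ("+r", "-t"): 0.2,
               ("-r", "+t"): 0.1, ("-r", "-t"): 0.9}
P_L_given_T = {("+t", "+l"): 0.3, ("+t", "-l"): 0.7,
               ("-t", "+l"): 0.1, ("-t", "-l"): 0.9}

joint = {}
for r in P_R:
    for t in ("+t", "-t"):
        for l in ("+l", "-l"):
            joint[(r, t, l)] = P_R[r] * P_T_given_R[(r, t)] * P_L_given_T[(t, l)]

p_late = sum(p for (r, t, l), p in joint.items() if l == "+l")
print(p_late)   # ~0.134
```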
Inference by Enumeration: Procedural Outline
o Track objects called factors
o Initial factors are local CPTs (one per node)

P(R):            P(T | R):           P(L | T):
+r 0.1           +r +t 0.8           +t +l 0.3
-r 0.9           +r -t 0.2           +t -l 0.7
                 -r +t 0.1           -t +l 0.1
                 -r -t 0.9           -t -l 0.9

o Any known values are selected
o E.g. if we know L = +l, the initial factors are

P(R):            P(T | R):           P(+l | T):
+r 0.1           +r +t 0.8           +t +l 0.3
-r 0.9           +r -t 0.2           -t +l 0.1
                 -r +t 0.1
                 -r -t 0.9

o Procedure: Join all factors, then sum out all hidden variables
Operation 1: Join Factors
o First basic operation: joining factors
o Combining factors:
o Just like a database join
o Get all factors over the joining variable
o Build a new factor over the union of the variables
involved

o Example: Join on R

P(R):            P(T | R):           P(R, T):
+r 0.1           +r +t 0.8           +r +t 0.08
-r 0.9           +r -t 0.2           +r -t 0.02
                 -r +t 0.1           -r +t 0.09
                 -r -t 0.9           -r -t 0.81

o Computation for each entry: pointwise products
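A sketch of this join in Python (the factor representation is an assumption, chosen to mirror the tables above):

```python
# Factors map assignment tuples over their variables to numbers.
P_R = {("+r",): 0.1, ("-r",): 0.9}                       # factor over (R,)
P_T_given_R = {("+r", "+t"): 0.8, ("+r", "-t"): 0.2,     # factor over (R, T)
               ("-r", "+t"): 0.1, ("-r", "-t"): 0.9}

# Join on R: build a factor over (R, T) by pointwise products.
joined = {(r, t): P_R[(r,)] * P_T_given_R[(r, t)] for (r, t) in P_T_given_R}
print(joined)   # approximately {(+r,+t): 0.08, (+r,-t): 0.02, (-r,+t): 0.09, (-r,-t): 0.81}
```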
Example: Multiple Joins

Initial factors:

P(R):            P(T | R):           P(L | T):
+r 0.1           +r +t 0.8           +t +l 0.3
-r 0.9           +r -t 0.2           +t -l 0.7
                 -r +t 0.1           -t +l 0.1
                 -r -t 0.9           -t -l 0.9

Join R, giving P(R, T):
+r +t 0.08
+r -t 0.02
-r +t 0.09
-r -t 0.81

Join T, giving P(R, T, L):
+r +t +l 0.024
+r +t -l 0.056
+r -t +l 0.002
+r -t -l 0.018
-r +t +l 0.027
-r +t -l 0.063
-r -t +l 0.081
-r -t -l 0.729
Operation 2: Eliminate
o Second basic operation:
marginalization
o Take a factor and sum out a variable
o Shrinks a factor to a smaller one
o A projection operation

o Example: sum out R

P(R, T):            P(T):
+r +t 0.08          +t 0.17
+r -t 0.02          -t 0.83
-r +t 0.09
-r -t 0.81
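The corresponding sum-out in the same hypothetical representation as the join sketch above:

```python
# Sketch of summing out R from the joined factor P(R, T).
P_RT = {("+r", "+t"): 0.08, ("+r", "-t"): 0.02,
        ("-r", "+t"): 0.09, ("-r", "-t"): 0.81}

P_T = {}
for (r, t), p in P_RT.items():
    P_T[t] = P_T.get(t, 0.0) + p
print(P_T)   # {'+t': 0.17, '-t': 0.83}
```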
Multiple Elimination
P(R, T, L):
+r +t +l 0.024
+r +t -l 0.056
+r -t +l 0.002
+r -t -l 0.018
-r +t +l 0.027
-r +t -l 0.063
-r -t +l 0.081
-r -t -l 0.729

Sum out R → P(T, L):
+t +l 0.051
+t -l 0.119
-t +l 0.083
-t -l 0.747

Sum out T → P(L):
+l 0.134
-l 0.866
Thus Far: Multiple Join, Multiple Eliminate (= Inf by Enumeration)
Recap: Inference by Enumeration
* Works fine with multiple query variables, too

o General case:
  o Evidence variables: E_1, ..., E_k = e_1, ..., e_k
  o Query* variable: Q
  o Hidden variables: H_1, ..., H_r
  (all variables together: Q, E_1 ... E_k, H_1 ... H_r)
▪ We want: P(Q | e_1, ..., e_k)

▪ Step 1: Select the entries consistent with the evidence
▪ Step 2: Sum out H to get joint of Query and evidence
▪ Step 3: Normalize
Thus Far: Multiple Join, Multiple Eliminate (= Inference by
Enumeration)

▪ Compute joint ▪ Sum out hidden variables

▪ [Step 3: Normalize]
Inference by Enumeration vs. Variable Elimination
o Why is inference by enumeration slow?
  o You join up the whole joint distribution before you sum out the hidden variables

▪ Idea: interleave joining and marginalizing!
  ▪ Called "Variable Elimination"
  ▪ Still NP-hard, but usually much faster than inference by enumeration
Traffic Domain

[Network: R → T → L]

o Inference by Enumeration: Join on r, Join on t, Eliminate r, Eliminate t
▪ Variable Elimination: Join on r, Eliminate r, Join on t, Eliminate t

o Analogy: enumeration computes (5a) + (5b); variable elimination computes 5(a + b)
Marginalizing Early (Variable Elimination)
Variable Elimination
Evidence
o If evidence, start with factors that select that evidence
o No evidence uses these initial factors:

P(R):            P(T | R):           P(L | T):
+r 0.1           +r +t 0.8           +t +l 0.3
-r 0.9           +r -t 0.2           +t -l 0.7
                 -r +t 0.1           -t +l 0.1
                 -r -t 0.9           -t -l 0.9

o Computing P(L | +r), the initial factors become:

P(+r):           P(T | +r):          P(L | T):
+r 0.1           +r +t 0.8           +t +l 0.3
                 +r -t 0.2           +t -l 0.7
                                     -t +l 0.1
                                     -t -l 0.9

o We eliminate all vars other than query + evidence

Evidence
o Result will be a selected joint of query and evidence
o E.g. for P(L | +r), we would end up with:

P(+r, L):            Normalize → P(L | +r):
+r +l 0.026          +l 0.26
+r -l 0.074          -l 0.74

o To get our answer, just normalize this!

o That's it!
General Variable Elimination
o Query: P(Q | E_1 = e_1, ..., E_k = e_k)
o Start with initial factors:


o Local CPTs (but instantiated by evidence)

o While there are still hidden variables


(not Q or evidence):
o Pick a hidden variable H
o Join all factors mentioning H
o Eliminate (sum out) H

o Join all remaining factors and


normalize
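Putting the pieces together, here is a compact sketch of the loop above for the rain/traffic/late network, computing P(L | +r); the factor representation and helper functions are my own, not the course's implementation:

```python
from itertools import product

def join(f1, f2):
    """Pointwise product of two factors; a factor is (variables, table)."""
    (v1, t1), (v2, t2) = f1, f2
    out_vars = v1 + tuple(v for v in v2 if v not in v1)
    # Collect the values each variable actually takes in the two tables.
    domains = {}
    for vars_, table in (f1, f2):
        for key in table:
            for var, val in zip(vars_, key):
                domains.setdefault(var, set()).add(val)
    out = {}
    for assignment in product(*(sorted(domains[v]) for v in out_vars)):
        full = dict(zip(out_vars, assignment))
        k1 = tuple(full[v] for v in v1)
        k2 = tuple(full[v] for v in v2)
        if k1 in t1 and k2 in t2:
            out[assignment] = t1[k1] * t2[k2]
    return out_vars, out

def sum_out(var, factor):
    """Marginalize (sum out) one variable from a factor."""
    vars_, table = factor
    idx = vars_.index(var)
    out_vars = vars_[:idx] + vars_[idx + 1:]
    out = {}
    for key, p in table.items():
        new_key = key[:idx] + key[idx + 1:]
        out[new_key] = out.get(new_key, 0.0) + p
    return out_vars, out

# Initial factors with the evidence R = +r already selected.
f_r = (("R",), {("+r",): 0.1})
f_t = (("R", "T"), {("+r", "+t"): 0.8, ("+r", "-t"): 0.2})
f_l = (("T", "L"), {("+t", "+l"): 0.3, ("+t", "-l"): 0.7,
                    ("-t", "+l"): 0.1, ("-t", "-l"): 0.9})

# The only hidden variable is T: join the factors that mention T, sum T out,
# then join what remains and normalize.
f_no_t = sum_out("T", join(f_t, f_l))
vars_, table = join(f_r, f_no_t)
z = sum(table.values())
print({k: round(v / z, 3) for k, v in table.items()})
# {('+r', '+l'): 0.26, ('+r', '-l'): 0.74}
```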
Example

marginal can be obtained from joint by summing out


use Bayes’ net joint distribution expression
use x*(y+z) = xy + xz
joining on a, and then summing out gives f1
use x*(y+z) = xy + xz
joining on e, and then summing out gives f2

All we are doing is exploiting uwy + uwz + uxy + uxz + vwy + vwz + vxy +vxz = (u+v)(w+x)(y+z) to improve computational efficiency!
Example

Choose A
Example

Choose E

Finish with B

Normalize
Variable Elimination Example
Variable Elimination Ordering
o For the query P(Xn|y1,…,yn) work through the following two different
orderings as done in previous slide: Z, X1, …, Xn-1 and X1, …, Xn-1, Z.
What is the size of the maximum factor generated for each of the
orderings?

o Answer: 2^n versus 2 (assuming binary)


o In general: the ordering can greatly affect efficiency.
VE: Computational and Space Complexity
o The computational and space complexity of variable elimination is
determined by the largest factor

o The elimination ordering can greatly affect the size of the largest
factor.
o E.g., previous slide's example: 2^n vs. 2

o Does there always exist an ordering that only results in small


factors?
o No!
“Easy” Structures: Polytrees

o A polytree is a directed graph with no undirected cycles

o For poly-trees you can always find an ordering that is efficient


o Try it!!
