BAYESIAN NETWORKS
Bayesian Network Motivation
We want a representation and reasoning
system that is based on conditional
independence
Compact yet expressive representation
Efficient reasoning procedures
Bayesian Networks are such a representation
Named after Thomas Bayes (ca. 1702–1761)
Term coined in 1985 by Judea Pearl (1936– )
Their invention changed the focus of AI
from logic to probability!
Bayesian Networks
A Bayesian network specifies a joint distribution in a structured
form
Represent dependence/independence via a directed graph
Nodes = random variables
Edges = direct dependence
Structure of the graph encodes the conditional independence relations
Requires that graph is acyclic (no directed cycles)
Two components to a Bayesian network
The graph structure (conditional independence assumptions)
The numerical probabilities (for each variable given its parents)
Bayesian Networks
General form:
$P(X_1, X_2, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{parents}(X_i))$
(the left side is the full joint distribution; the right side is the graph-structured approximation)
Example of a simple Bayesian network
$P(X_1, X_2, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{parents}(X_i))$
Graph: A → C ← B
$P(A, B, C) = P(C \mid A, B)\, P(A)\, P(B)$
Probability model has simple factored form
Directed edges => direct dependence
Absence of an edge => conditional independence
Also known as belief networks, graphical models, causal
networks
Other formulations, e.g., undirected graphical models
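To make the factored form concrete, here is a minimal Python sketch that evaluates the joint for the A → C ← B network above; all CPT numbers are invented purely for illustration.

```python
# Minimal sketch of the factored joint P(A,B,C) = P(C|A,B) P(A) P(B)
# for the network A -> C <- B. All numbers are made up for illustration.
from itertools import product

p_A = {True: 0.3, False: 0.7}             # P(A)
p_B = {True: 0.6, False: 0.4}             # P(B)
p_C = {                                   # P(C = true | A, B), keyed by (a, b)
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}

def joint(a: bool, b: bool, c: bool) -> float:
    """Evaluate one entry of the joint via the Bayesian-network product."""
    p_c = p_C[(a, b)] if c else 1.0 - p_C[(a, b)]
    return p_A[a] * p_B[b] * p_c

# Sanity check: the eight entries of the joint must sum to 1.
assert abs(sum(joint(*v) for v in product([True, False], repeat=3)) - 1.0) < 1e-9
print(joint(True, True, True))  # P(A, B, C all true) = 0.3 * 0.6 * 0.9 = 0.162
```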
Examples of 3-way Bayesian Networks
Graph: A, B, C with no edges
Absolute independence:
p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Bayesian Networks
Graph: B ← A → C
Conditionally independent effects:
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A
e.g., A is a disease, and we model B and C as conditionally independent symptoms given A
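A quick numerical sanity check of this claim, with invented CPT numbers: B and C come out dependent marginally, but exactly independent once A is given.

```python
# Sketch: verify numerically that in the common-cause network B <- A -> C,
# B and C are dependent marginally but independent given A.
# All CPT numbers are made up for illustration.
p_A = {True: 0.2, False: 0.8}
p_B_given_A = {True: 0.7, False: 0.1}   # P(B = true | A)
p_C_given_A = {True: 0.6, False: 0.2}   # P(C = true | A)

def joint(a, b, c):
    pb = p_B_given_A[a] if b else 1 - p_B_given_A[a]
    pc = p_C_given_A[a] if c else 1 - p_C_given_A[a]
    return p_A[a] * pb * pc

vals = [True, False]
# Marginal check: is P(B,C) = P(B) P(C)?  (it is not)
p_BC = sum(joint(a, True, True) for a in vals)
p_B = sum(joint(a, True, c) for a in vals for c in vals)
p_C = sum(joint(a, b, True) for a in vals for b in vals)
print(p_BC, p_B * p_C)   # 0.1 vs 0.0616 -> marginally dependent
# Conditional check: P(B,C | a) = P(B | a) P(C | a) holds exactly.
for a in vals:
    assert abs(joint(a, True, True) / p_A[a]
               - p_B_given_A[a] * p_C_given_A[a]) < 1e-12
```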
Examples of 3-way Bayesian Networks
Graph: A → C ← B
Independent causes:
p(A,B,C) = p(C|A,B) p(A) p(B)
“Explaining away” effect:
A and B are independent but become dependent once C is known!
(we’ll come back to this later)
Examples of 3-way Bayesian Networks
Graph: A → B → C
Markov dependence:
p(A,B,C) = p(C|B) p(B|A) p(A)
The Alarm Example
You have a new burglar alarm installed
It is reliable at detecting burglaries, but also responds to minor
earthquakes
Two neighbors (John, Mary) promise to call you at work when
they hear the alarm
John always calls when he hears the alarm, but confuses the alarm
with the phone ringing (and calls then, too)
Mary likes loud music and sometimes misses the alarm!
Given evidence about who has and hasn’t called, estimate the
probability of a burglary
The Alarm Example
Represent problem using 5 binary variables:
B = a burglary occurs at your house
E = an earthquake occurs at your house
A = the alarm goes off
J = John calls to report the alarm
M = Mary calls to report the alarm
What is P(B | M, J) ?
We can use the full joint distribution to answer this question
◼ Requires 2^5 = 32 probabilities
Can we use prior domain knowledge to come up with a
Bayesian network that requires fewer probabilities?
Constructing a Bayesian Network: Step 1
Order the variables in terms of causality
(may be a partial order)
e.g., {E, B} -> {A} -> {J, M}
Use these assumptions to create the graph
structure of the Bayesian network
The Resulting Bayesian Network
Graph: B → A ← E, with A → J and A → M
Network topology reflects causal knowledge
Constructing a Bayesian Network: Step 2
Fill in conditional probability
tables (CPTs)
One for each node
2^p entries, where p is the number
of parents
Where do these
probabilities come from?
Expert knowledge
From data (relative frequency
estimates)
Or a combination of both
The Bayesian network
Shouldn’t these add up to 1?
No. Each row adds up to 1; we use this fact to show only half of each table. For example,
$P(\neg A \mid B, E) = 1 - P(A \mid B, E) = 1 - 0.95 = 0.05$
The Bayesian network
What is P(j, m, a, b, e)?
P(j, m, a, b, e) = P(j | a) P(m | a) P(a | b, e) P(b) P(e)
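A minimal sketch of this computation. Only P(A | B, E) = 0.95 appears earlier in these slides; every other number below is an assumption made purely for the illustration.

```python
# Sketch: evaluate one entry of the alarm network's joint distribution.
# Only P(A|B,E) = 0.95 comes from the slides; the other numbers are
# assumed here purely for illustration.
P_b, P_e = 0.001, 0.002        # assumed priors P(B), P(E)
P_a_be = 0.95                  # P(A | B, E), from the CPT shown earlier
P_j_a, P_m_a = 0.90, 0.70      # assumed P(J | A), P(M | A)

# P(j, m, a, b, e) = P(j|a) P(m|a) P(a|b,e) P(b) P(e)
p = P_j_a * P_m_a * P_a_be * P_b * P_e
print(p)   # 0.90 * 0.70 * 0.95 * 0.001 * 0.002 ≈ 1.2e-06
```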
Number of Probabilities in Bayesian Networks
(i.e., why Bayesian Networks are effective)
Consider n binary variables
An unconstrained joint distribution requires O(2^n) probabilities
If we have a Bayesian network with a maximum of k parents for any node, then we need only O(n · 2^k) probabilities
Example: 16 binary variables
The full joint distribution has 2^16 = 65,536 entries
How many probability values are required for the Bayes net?
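The slide leaves this count as an exercise; here is a quick sketch of both counts for a few assumed values of k (the maximum number of parents, which the slide does not fix):

```python
# Quick sketch of the parameter counts for n = 16 binary variables.
n = 16
print(2 ** n)                 # full joint: 65536 probabilities
for k in (2, 3, 4):           # k = assumed maximum number of parents
    print(k, n * 2 ** k)      # Bayes net bound: n * 2^k CPT entries
```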
Bayesian Networks from a Different Variable Ordering
Example for BN construction: Fire Diagnosis
You want to diagnose whether there is a fire
in a building
You receive a noisy report about whether
everyone is leaving the building
If everyone is leaving, this may have been
caused by a fire alarm
If there is a fire alarm, it may have been
caused by a fire or by tampering
If there is a fire, there may be smoke
Example for BN construction: Fire Diagnosis
First you choose the variables. In this case, all are Boolean:
Tampering is true when the alarm has been tampered with
Fire is true when there is a fire
Alarm is true when there is an alarm
Smoke is true when there is smoke
Leaving is true if there are lots of people leaving the
building
Report is true if the sensor reports that lots of people
are leaving the building
Let’s construct the Bayesian network for this
First, you choose a total ordering of the variables, let’s say:
Fire; Tampering; Alarm; Smoke; Leaving; Report.
• Using the total ordering of variables:
Let’s say Fire; Tampering; Alarm; Smoke; Leaving; Report.
• Now choose the parents for each variable by evaluating
conditional independencies
Fire is the first variable in the ordering. It does not have
parents.
Tampering is independent of Fire (learning that one is true would
not change your beliefs about the probability of the other)
Alarm depends on both Fire and Tampering: it could be caused
by either or both
Smoke is caused by Fire, and so is independent of Tampering and
Alarm given whether there is a Fire
Leaving is caused by Alarm, and thus is independent of the other
variables given Alarm
Report is caused by Leaving, and thus is independent of the
other variables given Leaving
Example for BN construction: Fire Diagnosis
• How many probabilities do we need to specify for this Bayesian network?
• 1 + 1 + 4 + 2 + 2 + 2 = 12
(Fire: 1, Tampering: 1, Alarm: 2^2 = 4, Smoke: 2, Leaving: 2, Report: 2; see the sketch below)
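A tiny sketch that recomputes this count from the parent sets chosen above, using the 2^p rule from Step 2:

```python
# Sketch: per-node CPT sizes (2^p entries each) for the fire-diagnosis
# network, using the parent sets chosen above.
parents = {
    "Fire": [], "Tampering": [],
    "Alarm": ["Fire", "Tampering"],
    "Smoke": ["Fire"],
    "Leaving": ["Alarm"],
    "Report": ["Leaving"],
}
sizes = {node: 2 ** len(ps) for node, ps in parents.items()}
print(sizes)                 # {'Fire': 1, 'Tampering': 1, 'Alarm': 4, ...}
print(sum(sizes.values()))   # 1 + 1 + 4 + 2 + 2 + 2 = 12
```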
Independence
Let’s define the symbol ⊥ to indicate
independence of two variables.
Graph: A, B, C with no edges
A ⊥ B
A ⊥ C
B ⊥ C
Independence
For the fire-diagnosis network, which of the following are true and which are false?
R ⊥ A
R ⊥ A | L
L ⊥ S
L ⊥ S | F
S ⊥ A | F
R ⊥ S | L
General rule of thumb:
A known (observed) variable makes everything below it
independent of everything above it.
Another (tricky) Example
For the alarm network, which of the following are true and which are false?
B ⊥ M
B ⊥ M | E
B ⊥ M | A
B ⊥ E
B ⊥ E | A
Explaining Away
The Earth doesn’t care whether your house is currently being
burgled: a priori, Earthquake and Burglary are independent.
While you are on vacation, one of
your neighbors calls and tells you
your home’s burglar alarm is
ringing.
But now suppose you learn that
there was a medium-sized
earthquake in your neighborhood.
Oh, whew! Probably not a burglar
after all.
The Earthquake “explains away” the
hypothetical burglar: learning about
the Earthquake, on top of the Alarm,
changes your estimate of Burglary.
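To see explaining away numerically, here is a sketch that computes P(Burglary | Alarm) with and without also observing Earthquake, by summing over the joint. Only P(A | B, E) = 0.95 comes from the CPT shown earlier; every other number is an assumption for the demo.

```python
# Sketch: a numerical illustration of explaining away in the alarm
# network. All CPT values except P(A|B,E) = 0.95 are assumed for the demo.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(A = true | B, E)
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return P_B[b] * P_E[e] * pa

tf = [True, False]
# P(B = true | A = true): sum out E, then normalize over B.
num = sum(joint(True, e, True) for e in tf)
den = sum(joint(b, e, True) for b in tf for e in tf)
print("P(B | alarm)             =", num / den)    # ≈ 0.37
# P(B = true | A = true, E = true): earthquake observed as well.
num2 = joint(True, True, True)
den2 = sum(joint(b, True, True) for b in tf)
print("P(B | alarm, earthquake) =", num2 / den2)  # ≈ 0.003
# The second probability is far smaller: the earthquake "explains away"
# the alarm, sharply lowering our belief in a burglary.
```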
Independence
Is there a principled way to determine all
these dependencies?
Yes! It’s called D-Separation – 3 specific rules.
◼ Some say D-separation rules are easy
◼ Our book: “rather complicated… we omit it”
◼ The truth: a mix of both… the rules are easy to state, but can
be tricky to apply. Talk to me if you want to know more.
Next class…
Inference using Bayes Nets