AI unit 5 notes

UNIT V

PROBABILISTIC REASONING

Syllabus: Acting under uncertainty – Bayesian inference – naïve Bayes models. Probabilistic reasoning – Bayesian networks – exact inference in BN – approximate inference in BN – causal networks.

Acting under uncertainty:

1. UNCERTAINTY
To act rationally under uncertainty we must be able to evaluate how likely certain things are.
With FOL a fact F is only useful if it is known to be true or false. But we need to be able to
evaluate how likely it is that F is true. By weighing likelihoods of events (probabilities) we
can develop mechanisms for acting rationally under uncertainty.
Dental Diagnosis example.
In FOL we might formulate
∀p symptom(p, toothache) → disease(p, cavity) ∨ disease(p, gumDisease) ∨ disease(p, foodStuck) ∨ ...

When do we stop?
Cannot list all possible causes.
We also want to rank the possibilities. We don’t want to start drilling for a cavity before
checking for more likely causes first.

Axioms Of Probability
Given a set U (universe), a probability function is a function defined over the subsets of U
that maps each subset to the real numbers and that satisfies the Axioms of Probability

1. Pr(U) = 1
2. Pr(A) ∈ [0, 1]
3. Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

Note: if A ∩ B = {} then Pr(A ∪ B) = Pr(A) + Pr(B).
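These axioms can be checked numerically on a toy universe. The sketch below uses a fair six-sided die; the events A and B are chosen arbitrarily for illustration:

```python
from fractions import Fraction

U = set(range(1, 7))                 # universe: outcomes of a fair die
def pr(event):                       # uniform probability function over U
    return Fraction(len(event), len(U))

A = {1, 2, 3}                        # "roll is at most 3"
B = {2, 4, 6}                        # "roll is even"

assert pr(U) == 1                                  # axiom 1
assert 0 <= pr(A) <= 1                             # axiom 2
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)      # axiom 3
print(pr(A | B))                     # → 5/6
```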
2. REVIEW OF PROBABILITY
· Natural way to represent uncertainty
· People have intuitive notions about probabilities
· Many of these are wrong or inconsistent

· Most people don’t get what probabilities mean


· Understanding Probabilities
· Initially, probabilities are “relative frequencies”
· This works well for dice and coin flips
· For more complicated events, this is problematic
· What is the probability that Obama will be reelected?
· This event only happens once
· We can’t count frequencies
· still seems like a meaningful question
· In general, all events are unique

Probabilities and Beliefs


• Suppose I have flipped a coin and hidden the outcome
• What is P(Heads)?
• Note that this is a statement about a belief, not a statement about the world
• The world is in exactly one state (at the macro level) and it is in that state with probability 1.
• Assigning truth values to probability statements is very tricky business
• Must reference the speaker's state of knowledge

Frequentism and Subjectivism


• Frequentists hold that probabilities must come from relative frequencies
• This is a purist viewpoint
• This is complicated by the fact that relative frequencies are often unobtainable
• Often requires complicated and convoluted assumptions to come up with probabilities
• Subjectivists: probabilities are degrees of belief
o Taints the purity of probabilities
o Often more practical
Types are:
1 Unconditional or prior probabilities
2 Conditional or posterior probabilities

3. PROBABILISTIC REASONING
· Representing knowledge in an uncertain domain
· A belief network is used to encode the meaningful dependencies between variables:
o Nodes represent random variables
o Arcs represent direct influence
o Each node has a conditional probability table that gives that variable's probability given the different states of its parents
o The network is a directed acyclic graph (DAG)

The Semantics of Belief Networks


· To construct net, think of as representing the joint probability distribution.
· To infer from net, think of as representing conditional independence statements.
· Calculate a member of the joint probability by multiplying individual
conditional probabilities:

o P(X1=x1, ..., Xn=xn) = P(X1=x1 | parents(X1)) * ... * P(Xn=xn | parents(Xn))
· Note: Only have to be given the immediate parents of Xi, not all other nodes:
o P(Xi|X(i-1),...X1) = P(Xi|parents(Xi))
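As a minimal sketch of this factorization (all numbers invented for illustration), consider a hypothetical three-node chain A → B → C, where each node's table conditions only on its parent:

```python
# CPTs for a hypothetical chain A -> B -> C; all numbers are made up.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}
P_C_given_B = {True: {True: 0.5, False: 0.5},
               False: {True: 0.05, False: 0.95}}

def joint(a, b, c):
    # P(A=a, B=b, C=c) = P(a) * P(b | a) * P(c | b)
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

print(round(joint(True, True, True), 2))   # 0.3 * 0.8 * 0.5 = 0.12

# the eight joint entries form a proper distribution
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(round(total, 6))
```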

· To incrementally construct a network:


1. Decide on the variables
2. Decide on an ordering of them
3. Do until no variables are left:

a. Pick a variable and make a node for it


b. Set its parents to the minimal set of pre-existing nodes
c. Define its conditional probability

· Often, the resulting conditional probability tables are much smaller than the
exponential size of the full joint
· If you don't order the nodes with "root causes" first, you get larger conditional probability tables
· Different tables may encode the same probabilities.
· Some canonical distributions that appear in conditional probability tables:

o deterministic logical relationship (e.g., AND, OR)
o deterministic numeric relationship (e.g., MIN)
o parametric relationship (e.g., weighted sum in a neural net)
o noisy logical relationship (e.g., noisy-OR, noisy-MAX)

Direction-dependent separation or D-separation:


· If all undirected paths between 2 nodes are d-separated given evidence node(s) E, then the 2 nodes are conditionally independent given E.
· Evidence node(s) E d-separate X and Y if every path between them contains a node Z such that:

o Z is in E and has an arrow in on the path from X and an arrow out on the path toward Y (or vice versa), or
o Z is in E and has arrows out along the path toward both X and Y, or
o Z has arrows in from both directions on the path (a converging node) and neither Z nor any of Z's descendants is in E.

Inference in Belief Networks


· Want to compute posterior probabilities of query variables given evidence
variables.
· Types of inference for belief networks:

o Diagnostic inference: from symptoms to causes
o Causal inference: from causes to symptoms
o Intercausal inference: between causes of a common effect (e.g., explaining away)
o Mixed inference: combinations of the above

Inference in Multiply Connected Belief Networks

· Multiply connected graphs have 2 nodes connected by more than one path
· Techniques for handling:
o Clustering: group some of the intermediate nodes into one meganode.
Pro: perhaps the best way to get an exact evaluation.
Con: conditional probability tables may increase exponentially in size.

o Cutset conditioning: obtain simpler polytrees by instantiating variables as constants.
Con: may produce an exponential number of simpler polytrees.
Pro: it may be safe to ignore trees with low probability (bounded cutset conditioning).

o Stochastic simulation: run through the net with randomly chosen values for each node (weighted by prior probabilities).

4. BAYESIAN NETWORK
Bayes’ nets:
A technique for describing complex joint distributions (models) using simple, local
distributions
(conditional probabilities)
More properly called graphical models
Local interactions chain together to give global indirect interactions
A Bayesian network is a graphical structure that allows us to represent and reason about an
uncertain domain. The nodes in a Bayesian network represent a set of random variables,

X = {X1, ..., Xi, ..., Xn}, from the domain. A set of directed arcs (or links) connects pairs of nodes, Xi → Xj, representing the direct dependencies between variables.

Assuming discrete variables, the strength of the relationship between variables is quantified
by conditional probability distributions associated with each node. The only constraint on the
arcs allowed in a BN is that there must not be any directed cycles: you cannot return to a
node simply by following directed arcs.

Such networks are called directed acyclic graphs, or simply dags. There are a number of steps
that a knowledge engineer must undertake when building a Bayesian network. At this stage
we will present these steps as a sequence; however it is important to note that in the real-
world the process is not so simple.

Nodes and values

First, the knowledge engineer must identify the variables of interest. This involves answering
the question: what are the nodes to represent and what values can they take, or what state can
they be in? For now we will consider only nodes that take discrete values. The values should
be both mutually exclusive and exhaustive , which means that the variable must take on
exactly one of these values at a time. Common types of discrete nodes include:

Boolean nodes, which represent propositions, taking the binary values true (T) and false (F).
In a medical diagnosis domain, the node Cancer would represent the proposition that a patient
has cancer.

Ordered values. For example, a node Pollution might represent a patient's pollution exposure and take the values low, medium, high.

Integral values. For example, a node called Age might represent a patient’s age and have
possible values from 1 to 120.

Even at this early stage, modeling choices are being made. For example, an alternative to representing a patient's exact age might be to group patients into different age ranges, such as baby, child, adolescent, young, middle-aged, old. The trick is to choose values that represent the domain efficiently.
1 Representation of joint probability distribution
2 Conditional independence relation in Bayesian network

5. INFERENCE IN BAYESIAN NETWORK


1 Tell
2 Ask
3 Kinds of inferences
4 Use of Bayesian network

· In general, the problem of Bayes Net inference is NP-hard (exponential in the


size of the graph).
· For singly-connected networks or polytrees in which there are no undirected
loops, there are linear time algorithms based on belief propagation.
· Each node sends local evidence messages to its children and parents.
· Each node updates its belief in each of its possible values based on incoming messages from its neighbors and propagates evidence on to its neighbors.
· There are approximations to inference for general networks based on loopy belief propagation, which iteratively refines probabilities until they converge to an (approximate) fixed point.
TEMPORAL MODELS
1 Monitoring or filtering
2 Prediction

Bayes' Theorem
Many of the methods used for dealing with uncertainty in expert systems are based on Bayes'
Theorem.

Notation:
P(A) — probability of event A
P(A ∧ B) — probability of events A and B occurring together
P(A | B) — conditional probability of event A given that event B has occurred
If A and B are independent, then P(A | B) = P(A).
Expert systems usually deal with events that are not independent, e.g. a disease and its
symptoms are not independent.

Theorem
P(A ∧ B) = P(A | B) * P(B) = P(B | A) * P(A), therefore P(A | B) = P(B | A) * P(A) / P(B)

Uses of Bayes' Theorem

In doing an expert task, such as medical diagnosis, the goal is to determine identifications
(diseases) given observations (symptoms). Bayes' Theorem provides such a relationship.
P(A | B) = P(B | A) * P(A) / P(B)

Suppose: A=Patient has measles, B =has a rash


Then: P(measles | rash) = P(rash | measles) * P(measles) / P(rash)

The desired diagnostic relationship on the left can be calculated based on the known
statistical quantities on the right.

Joint Probability Distribution

Given a set of random variables X1 ... Xn, an atomic event is an assignment of a particular
value to each Xi. The joint probability distribution is a table that assigns a probability to each
atomic event. Any question of conditional probability can be answered from the joint.
            Toothache    ¬Toothache
Cavity        0.04          0.06
¬Cavity       0.01          0.89

Problems:

The size of the table is combinatoric: the product of the number of possibilities for each
random variable. The time to answer a question from the table will also be combinatoric.
Lack of evidence: we may not have statistics for some table entries, even though those entries
are not impossible.
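Despite these problems, answering a query from a small joint table is straightforward. This sketch answers P(Cavity | Toothache) from the table above:

```python
# Joint distribution from the table above.
joint = {("cavity", "toothache"): 0.04,    ("cavity", "no toothache"): 0.06,
         ("no cavity", "toothache"): 0.01, ("no cavity", "no toothache"): 0.89}

# marginalize out Cavity, then condition on Toothache
p_toothache = sum(p for (c, t), p in joint.items() if t == "toothache")
p_cavity_given_toothache = joint[("cavity", "toothache")] / p_toothache

print(round(p_cavity_given_toothache, 2))   # 0.04 / 0.05 = 0.8
```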

Chain Rule

We can compute probabilities using the chain rule as follows:

P(A ∧ B ∧ C) = P(A | B ∧ C) * P(B | C) * P(C)

If some conditions C1 ∧ ... ∧ Cn are independent of other conditions U, we have:

P(A | C1 ∧ ... ∧ Cn ∧ U) = P(A | C1 ∧ ... ∧ Cn)

This allows a conditional probability to be computed more easily from smaller tables using
the chain rule.
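The chain rule can be verified numerically on any joint distribution. The sketch below builds a random joint over three binary variables and checks P(A ∧ B ∧ C) = P(A | B ∧ C) · P(B | C) · P(C):

```python
import itertools, random

random.seed(0)
outcomes = list(itertools.product([0, 1], repeat=3))       # (a, b, c) triples
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}  # normalized joint

def pr(pred):
    """Probability of the set of outcomes satisfying pred."""
    return sum(p for o, p in joint.items() if pred(o))

lhs = joint[(1, 1, 1)]                           # P(A ∧ B ∧ C)
p_c = pr(lambda o: o[2] == 1)                    # P(C)
p_bc = pr(lambda o: o[1] == 1 and o[2] == 1)     # P(B ∧ C)
p_a_given_bc = lhs / p_bc                        # P(A | B ∧ C)
p_b_given_c = p_bc / p_c                         # P(B | C)
rhs = p_a_given_bc * p_b_given_c * p_c

assert abs(lhs - rhs) < 1e-12
print("chain rule holds")
```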

Bayesian Networks

Bayesian networks, also called belief networks or Bayesian belief networks, express
relationships among variables by directed acyclic graphs with probability tables stored at the
nodes.[Example from Russell & Norvig.]

1 A burglary can set the alarm off


2 An earthquake can set the alarm off
3 The alarm can cause Mary to call
4 The alarm can cause John to call
Computing with Bayesian Networks

If a Bayesian network is well structured as a poly-tree (at most one path between any two
nodes), then probabilities can be computed relatively efficiently. One kind of algorithm, due
to Judea Pearl, uses a message-passing style in which nodes of the network compute
probabilities and send them to nodes they are connected to. Several software packages exist
for computing with belief networks.

A Hidden Markov Model (HMM) tagger chooses the tag for each word that maximizes:
[Jurafsky, op. cit.] P(word | tag) * P(tag | previous n tags)

For a bigram tagger, this is approximated as:


ti = argmaxj P( wi | tj ) P( tj | ti - 1 )

In practice, trigram taggers are most often used, and a search is made for the best set of tags
for the whole sentence; accuracy is about 96%.

6. HIDDEN MARKOV MODELS

A hidden Markov model (HMM) is an augmentation of the Markov chain to include


observations. Just like the state transition of the Markov chain, an HMM also includes
observations of the state. These observations can be partial in that different states can map to
the same observation and noisy in that the same state can stochastically map to different
observations at different times.

The assumptions behind an HMM are that the state at time t+1 only depends on the state at
time t, as in the Markov chain. The observation at time t only depends on the state at time t.
The observations are modeled using a variable Ot for each time t, whose domain is the set of possible observations. The belief network representation of an HMM is depicted in Figure.
Although the belief network is shown for four stages, it can proceed indefinitely.

A stationary HMM includes the following probability distributions:

P(S0) specifies initial conditions.


P(St+1|St) specifies the dynamics.
P(Ot|St) specifies the sensor model.

There are a number of tasks that are common for HMMs.

The problem of filtering or belief-state monitoring is to determine the current state based on
the current and previous observations, namely to determine P(Si|O0,...,Oi).

Note that all state and observation variables after Si are irrelevant because they are not
observed and can be ignored when this conditional distribution is computed.

The problem of smoothing is to determine a state based on past and future observations.
Suppose an agent has observed up to time k and wants to determine the state at time i for i<k;
the smoothing problem is to determine

P(Si|O0,...,Ok).

All of the state and observation variables for times after k can be ignored.
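The filtering task can be sketched with the forward algorithm. Everything below (the two states, the umbrella observation model, and all the numbers) is a made-up toy model for illustration, not from the text:

```python
states = ["rain", "dry"]
P_S0 = {"rain": 0.5, "dry": 0.5}                         # P(S0)
P_trans = {"rain": {"rain": 0.7, "dry": 0.3},            # P(S_{t+1} | S_t)
           "dry":  {"rain": 0.3, "dry": 0.7}}
P_obs = {"rain": {"umbrella": 0.9, "no umbrella": 0.1},  # P(O_t | S_t)
         "dry":  {"umbrella": 0.2, "no umbrella": 0.8}}

def forward_filter(observations):
    """Return P(S_t | O_0, ..., O_t) for the last time step t."""
    belief = None
    for t, obs in enumerate(observations):
        # predict: push the previous belief through the dynamics
        if t == 0:
            prior = P_S0
        else:
            prior = {s: sum(P_trans[sp][s] * belief[sp] for sp in states)
                     for s in states}
        # update: weight by the sensor model, then normalize
        unnorm = {s: P_obs[s][obs] * prior[s] for s in states}
        z = sum(unnorm.values())
        belief = {s: p / z for s, p in unnorm.items()}
    return belief

b = forward_filter(["umbrella", "umbrella"])
print(round(b["rain"], 3))   # → 0.883
```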

Bayesian inference

Bayes' theorem:

Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.

In probability theory, it relates the conditional probability and marginal probabilities of two
random events.

Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.

It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).

Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.

Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine
the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using product rule and conditional probability of event A with
known event B:

From the product rule we can write:

P(A ∧ B) = P(A|B) P(B)

Similarly, the probability of event B with known event A:

P(A ∧ B) = P(B|A) P(A)

Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B)    ......(a)

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most modern AI systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here,

P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypothesis A given that evidence B has occurred.

P(B|A) is called the likelihood: assuming the hypothesis is true, the probability of the evidence.

P(A) is called the prior probability: the probability of the hypothesis before considering the evidence.

P(B) is called the marginal probability: the probability of the evidence alone.

In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai); hence Bayes' rule can also be written as:

P(Ai|B) = P(B|Ai) P(Ai) / Σk P(Ak) P(B|Ak)

where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
Applying Bayes' rule:

Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is very useful in cases where we have good estimates of these three terms and want to determine the fourth one. Suppose we want to perceive the effect of some unknown cause and compute that cause; then Bayes' rule becomes:

P(cause | effect) = P(effect | cause) P(cause) / P(effect)

Example-1:

Question: What is the probability that a patient has the disease meningitis, given a stiff neck?

Given Data:

A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs
80% of the time. He is also aware of some more facts, which are given as follows:

o The Known probability that a patient has meningitis disease is 1/30,000.


o The Known probability that a patient has a stiff neck is 2%.

Let a be the proposition that the patient has a stiff neck and b the proposition that the patient has meningitis. We can then calculate as follows:

P(a|b) = 0.8

P(b) = 1/30000

P(a) = 0.02

Applying Bayes' rule:

P(b|a) = P(a|b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 = 1/750 ≈ 0.00133

Hence, we can conclude that roughly 1 patient in 750 presenting with a stiff neck has meningitis.
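The same computation as a short sketch:

```python
p_stiff_given_men = 0.8        # P(a|b)
p_men = 1 / 30000              # P(b)
p_stiff = 0.02                 # P(a)

# Bayes' rule: P(b|a) = P(a|b) P(b) / P(a)
p_men_given_stiff = p_stiff_given_men * p_men / p_stiff
print(round(1 / p_men_given_stiff))   # → 750, i.e. about 1 patient in 750
```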

Example-2:

Question: From a standard deck of playing cards, a single card is drawn. The
probability that the card is king is 4/52, then calculate posterior probability P(King|
Face), which means the drawn face card is a king card.
Solution:

P(king): probability that the card is King= 4/52= 1/13

P(face): probability that a card is a face card= 3/13

P(Face|King): probability of face card when we assume it is a king = 1

Putting these values into Bayes' rule:

P(King|Face) = P(Face|King) × P(King) / P(Face) = 1 × (1/13) / (3/13) = 1/3
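The same computation in code:

```python
p_king = 4 / 52                 # P(King) = 1/13
p_face = 12 / 52                # P(Face) = 3/13 (J, Q, K of each suit)
p_face_given_king = 1.0         # every king is a face card

# Bayes' rule: P(King|Face) = P(Face|King) P(King) / P(Face)
p_king_given_face = p_face_given_king * p_king / p_face
print(round(p_king_given_face, 3))   # → 0.333, i.e. 1/3
```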

Application of Bayes' theorem in Artificial intelligence:

Following are some applications of Bayes' theorem:

o It is used to calculate the next step of the robot when the already executed step is
given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.

naïve Bayes models:

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes


theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.

Why is it called Naïve Bayes?


The Naïve Bayes algorithm is composed of two words, Naïve and Bayes, which can be described as:

o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:


     Outlook    Play
0    Rainy      Yes
1    Sunny      Yes
2    Overcast   Yes
3    Overcast   Yes
4    Sunny      No
5    Rainy      Yes
6    Sunny      Yes
7    Overcast   Yes
8    Rainy      No
9    Sunny      No
10   Sunny      Yes
11   Rainy      No
12   Overcast   Yes
13   Overcast   Yes

Frequency table for the weather conditions:

Weather     Yes   No
Overcast     5     0
Rainy        2     2
Sunny        3     2
Total       10     4

Likelihood table for the weather conditions:

Weather    No            Yes            Total
Overcast   0             5              5/14 = 0.35
Rainy      2             2              4/14 = 0.29
Sunny      2             3              5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No)= 2/4= 0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculations, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
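The whole worked example can be reproduced from the 14-row dataset. Note that the exact value of P(No|Sunny) is 0.40; the 0.41 above comes from rounding the intermediate values:

```python
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Sunny", "No"), ("Rainy", "Yes"),
        ("Sunny", "Yes"), ("Overcast", "Yes"), ("Rainy", "No"),
        ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
n_yes = sum(1 for _, play in data if play == "Yes")        # 10
n_no = n - n_yes                                           # 4
n_sunny = sum(1 for o, _ in data if o == "Sunny")          # 5

p_sunny_given_yes = sum(1 for o, p in data
                        if o == "Sunny" and p == "Yes") / n_yes   # 3/10
p_sunny_given_no = sum(1 for o, p in data
                       if o == "Sunny" and p == "No") / n_no      # 2/4

# Bayes' rule with P(Yes) = 10/14, P(No) = 4/14, P(Sunny) = 5/14
p_yes_given_sunny = p_sunny_given_yes * (n_yes / n) / (n_sunny / n)
p_no_given_sunny = p_sunny_given_no * (n_no / n) / (n_sunny / n)

print(round(p_yes_given_sunny, 2))   # → 0.6
print(round(p_no_given_sunny, 2))    # → 0.4
```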

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.


o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks.
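As an illustration of the Gaussian variant, here is a minimal from-scratch sketch (not a library API; the toy data and class names are invented). Each feature is modeled per class by a normal distribution fitted from the training rows:

```python
import math

def fit(X, y):
    """Return, per class, the prior and per-feature (mean, variance)."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mu, var))
        model[c] = (prior, stats)
    return model

def predict(model, x):
    # naive Bayes: log prior + sum of per-feature Gaussian log-likelihoods
    def log_gauss(v, mu, var):
        return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
    scores = {c: math.log(prior)
                 + sum(log_gauss(v, mu, var)
                       for v, (mu, var) in zip(x, stats))
              for c, (prior, stats) in model.items()}
    return max(scores, key=scores.get)

# Toy data: one continuous feature, two classes.
X = [[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]]
y = ["low", "low", "low", "high", "high", "high"]
m = fit(X, y)
print(predict(m, [1.1]), predict(m, [5.1]))   # → low high
```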

Probabilistic reasoning:

Uncertainty:

So far, we have learned knowledge representation using first-order logic and propositional logic with certainty, which means we were sure about the predicates. With this kind of knowledge representation we might write A→B, meaning if A is true then B is true; but consider a situation where we are not sure whether A is true or not. Then we cannot express this statement; this situation is called uncertainty.

So to represent uncertain knowledge, where we are not sure about the predicates, we need
uncertain reasoning or probabilistic reasoning.

Causes of uncertainty:
Following are some leading causes of uncertainty to occur in the real world.

1. Information occurred from unreliable sources.


2. Experimental Errors
3. Equipment fault
4. Temperature variation
5. Climate change.

Probabilistic reasoning:

Probabilistic reasoning is a way of knowledge representation where we apply the concept of


probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine
probability theory with logic to handle the uncertainty.

We use probability in probabilistic reasoning because it provides a way to handle the


uncertainty that is the result of someone's laziness and ignorance.

In the real world, there are lots of scenarios, where the certainty of something is not
confirmed, such as "It will rain today," "behavior of someone for some situations," "A match
between two teams or two players." These are probable sentences for which we can assume
that it will happen but not sure about it, so here we use probabilistic reasoning.

Need of probabilistic reasoning in AI:

o When there are unpredictable outcomes.


o When the specifications or possibilities of predicates become too large to handle.
o When an unknown error occurs during an experiment.

In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:

o Bayes' rule
o Bayesian Statistics

As probabilistic reasoning uses probability and related terms, so before understanding


probabilistic reasoning, let's understand some common terms:

Probability: Probability can be defined as a chance that an uncertain event will occur. It is
the numerical measure of the likelihood that an event will occur. The value of probability
always remains between 0 and 1 that represent ideal uncertainties.

0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.

P(A) = 0 indicates total uncertainty in an event A.

P(A) = 1 indicates total certainty in an event A.

We can find the probability of an uncertain event using the formula below:

P(A) = (number of desired outcomes) / (total number of outcomes)

o P(¬A) = probability of event A not happening.
o P(¬A) + P(A) = 1.

Event: Each possible outcome of a variable is called an event.

Sample space: The collection of all possible events is called sample space.

Random variables: Random variables are used to represent the events and objects in the real
world.

Prior probability: The prior probability of an event is probability computed before


observing new information.

Posterior Probability: The probability that is calculated after all evidence or information has been taken into account. It is a combination of the prior probability and the new information.

Conditional probability:

Conditional probability is the probability of an event occurring given that another event has already happened.

Suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A under the conditions of B". It can be written as:

P(A|B) = P(A ∧ B) / P(B)

Where P(A ∧ B) is the joint probability of A and B, and P(B) is the marginal probability of B.

If the probability of A is given and we need to find the probability of B, then it will be given as:

P(B|A) = P(A ∧ B) / P(A)

This can be explained using a Venn diagram: once B has occurred, the sample space is reduced to the set B, so we can calculate event A given B by dividing the probability P(A ∧ B) by P(B).
Example:

In a class, 70% of the students like English and 40% of the students like both English and Mathematics. What percentage of the students who like English also like Mathematics?

Solution:

Let A be the event that a student likes Mathematics, and B the event that a student likes English.

P(A|B) = P(A ∧ B) / P(B) = 0.40 / 0.70 = 0.57

Hence, 57% of the students who like English also like Mathematics.
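The same one-step computation in code:

```python
p_english = 0.70              # P(B): students who like English
p_english_and_math = 0.40     # P(A ∧ B): students who like both subjects

# conditional probability: P(A|B) = P(A ∧ B) / P(B)
p_math_given_english = p_english_and_math / p_english
print(round(p_math_given_english, 2))   # → 0.57
```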

Bayesian networks

A Bayesian belief network is a key computer technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model.

Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.

A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:

o Directed Acyclic Graph


o Table of conditional probabilities.

The generalized form of Bayesian network that represents and solve decision problems under
uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:

o Each node corresponds to the random variables, and a variable can


be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional dependencies between random variables. These directed links connect pairs of nodes in the graph. A link means that one node directly influences the other; if there is no directed link, the nodes are independent of each other.

o In the above diagram, A, B, C, and D are random variables represented


by the nodes of the network graph.
o If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
o Node C is independent of node A.

The Bayesian network has mainly two components:

o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.

Bayesian network is based on Joint probability distribution and conditional probability. So


let's first understand the joint probability distribution:

Joint probability distribution:

If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.

P[x1, x2, x3, ..., xn] can be written in the following way in terms of conditional probabilities:

= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]

= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]

In general, for each variable Xi, we can write the equation as:

P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))


Explanation of Bayesian network:

Let's understand the Bayesian network through an example by creating a directed acyclic
graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to burglaries, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused with the phone ringing and calls then too. On the other hand, Sophia likes to listen to loud music, so sometimes she fails to hear the alarm. Here we would like to compute the probability of the burglary alarm.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia called Harry.
Solution:

o The Bayesian network for the above problem is given below. The network structure
shows that Burglary and Earthquake are the parent nodes of Alarm and directly
affect the probability of the alarm going off, whereas David's and Sophia's calls
depend only on the alarm.
o The network thus encodes our assumptions: the neighbors do not directly perceive
the burglary, do not notice minor earthquakes, and do not confer before calling.
o The conditional distribution for each node is given as a conditional probability table,
or CPT.
o Each row in a CPT must sum to 1 because its entries represent an exhaustive set of
cases for the variable.
o In a CPT, a boolean variable with k boolean parents has 2^k rows, one per
combination of parent values. Hence, if there are two parents, the CPT contains
four rows of probability values.

List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the events of the problem statement in the form of the probability
P[D, S, A, B, E], and rewrite it using the joint probability distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] P[S, A, B, E]

= P[D | S, A, B, E] P[S | A, B, E] P[A, B, E]

= P[D | A] P[S | A, B, E] P[A, B, E]

= P[D | A] P[S | A] P[A | B, E] P[B, E]

= P[D | A] P[S | A] P[A | B, E] P[B | E] P[E]

Since Burglary and Earthquake are independent, P[B | E] = P[B].

Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True)= 0.001, which is the probability of a minor earthquake.

P(E= False)= 0.999, which is the probability that no earthquake occurred.
We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The conditional probability of Alarm A depends on Burglary and Earthquake:

B      E      P(A= True)   P(A= False)
True   True   0.94         0.06
True   False  0.95         0.05
False  True   0.31         0.69
False  False  0.001        0.999

Conditional probability table for David Calls:

The conditional probability that David calls depends on the state of the Alarm.

A      P(D= True)   P(D= False)
True   0.91         0.09
False  0.05         0.95

Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node "Alarm".

A      P(S= True)   P(S= False)
True   0.75         0.25
False  0.02         0.98

From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:

P(S, D, A, ¬B, ¬E) = P(S|A) · P(D|A) · P(A|¬B, ¬E) · P(¬B) · P(¬E)


= 0.75* 0.91* 0.001* 0.998*0.999

= 0.00068045.

Hence, a Bayesian network can answer any query about the domain by using the joint
distribution.
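The calculation above can be checked with a short Python sketch. This is a hypothetical illustration (the variable names `P_B`, `P_A_true`, etc. are my own); the numbers simply transcribe the CPT values from the tables above.

```python
# CPT values transcribed from the tables above.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A_true = {(True, True): 0.94, (True, False): 0.95,
            (False, True): 0.31, (False, False): 0.001}  # P(A=True | B, E)
P_D_true = {True: 0.91, False: 0.05}   # P(D=True | A)
P_S_true = {True: 0.75, False: 0.02}   # P(S=True | A)

# P(S, D, A, ¬B, ¬E) = P(S|A) P(D|A) P(A|¬B,¬E) P(¬B) P(¬E)
p = (P_S_true[True] * P_D_true[True] * P_A_true[(False, False)]
     * P_B[False] * P_E[False])
print(round(p, 8))   # 0.00068045
```

Each factor in the product corresponds to one node of the network conditioned on its parents, exactly as in the factorization derived earlier.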

The semantics of Bayesian Network:

There are two ways to understand the semantics of a Bayesian network:

1. As a representation of the joint probability distribution. This view is helpful for
understanding how to construct the network.

2. As an encoding of a collection of conditional independence statements. This view is
helpful for designing inference procedures.

Exact inference in BN:

Bayesian networks are a type of probabilistic graphical model that represents a set of
variables and their conditional dependencies using a Directed Acyclic Graph (DAG). They use
Bayesian inference for probability computations. Bayesian networks aim to model conditional
dependence, and therefore causation, by representing conditional dependencies as edges in a
directed graph. Through these relationships, one can efficiently conduct inference on the
random variables in the graph through the use of factors.

The Bayesian Network

Using the relationships specified by our Bayesian network, we can obtain a compact,
factorized representation of the joint probability distribution by taking advantage of
conditional independence.
A Bayesian network is a directed acyclic graph in which each edge corresponds to a
conditional dependency, and each node corresponds to a unique random variable. Formally, if
an edge (A, B) exists in the graph connecting random variables A and B, it means that P(B|A)
is a factor in the joint probability distribution, so we must know P(B|A) for all values of A and
B in order to conduct inference. In the classic sprinkler example (with variables Cloudy,
Sprinkler, Rain, and WetGrass), since Rain has an edge going into WetGrass, P(WetGrass|Rain)
will be a factor, whose probability values are specified in a conditional probability table next
to the WetGrass node.

Bayesian networks satisfy the local Markov property, which states that a node is
conditionally independent of its non-descendants given its parents. In the sprinkler example,
this means that P(Sprinkler|Cloudy, Rain) = P(Sprinkler|Cloudy), since Sprinkler is
conditionally independent of its non-descendant, Rain, given Cloudy. This property allows us
to simplify the joint distribution, obtained in the previous section using the chain rule, to a
smaller form. After simplification, the joint distribution for a Bayesian network is equal to
the product of P(node|parents(node)) over all nodes:

P(X1, ..., Xn) = ∏i P(Xi | Parents(Xi))

In larger networks, this property allows us to greatly reduce the amount of required
computation, since generally, most nodes will have few parents relative to the overall size of
the network.
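To see the scale of this reduction, here is a small Python sketch (an assumed illustration, not from the notes; the function names are my own) comparing the number of independent probabilities in a full joint distribution over n boolean variables with an upper bound on the CPT entries of a network where every node has at most k parents:

```python
def full_joint_params(n):
    """Independent entries in a full joint distribution over n boolean variables."""
    return 2 ** n - 1

def bn_params(n, k):
    """Upper bound on CPT entries when each of n nodes has at most k parents."""
    return n * 2 ** k

# For 20 boolean variables with at most 3 parents each:
print(full_joint_params(20))   # 1048575
print(bn_params(20, 3))        # 160
```

The full joint grows exponentially in the number of variables, while the factorized representation grows only linearly in n (for bounded fan-in), which is what makes exact inference feasible on sparse networks.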

Inference
Inference is one key objective in a Bayesian network (BN), and it aims to estimate
the posterior distributions of state variables based on evidence (observations).

- Exact Inference

Inference over a Bayesian network can come in two forms.

The first is simply evaluating the joint probability of a particular assignment of values for each
variable (or a subset) in the network. For this, we already have a factorized form of the joint
distribution, so we simply evaluate that product using the provided conditional probabilities. If
we only care about a subset of variables, we will need to marginalize out the ones we are not
interested in. Multiplying many small probabilities can result in underflow, so it is common
to take the logarithm of the product, which is equivalent to adding up the individual
logarithms of each term.

The second, more interesting inference task is to find P(x|e): the probability of some
assignment of a subset of the variables (x) given assignments of other variables (our
evidence, e). In the sprinkler example, this could be to find P(Sprinkler, WetGrass | Cloudy),
where {Sprinkler, WetGrass} is our x and {Cloudy} is our e. To calculate this, we use the fact
that P(x|e) = P(x, e) / P(e) = αP(x, e), where α is a normalization constant, calculated at the
end, such that P(x|e) + P(¬x|e) = 1. In order to calculate P(x, e), we must marginalize the
joint probability distribution over the variables that do not appear in x or e, which we will
denote as Y.
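As a sketch of this second task, the following Python code computes P(B | D=true, S=true) for the burglary network above by enumerating the hidden variables E and A and normalizing at the end. The helper functions and CPT dictionaries are my own illustrative assumptions; the numbers come from the tables given earlier.

```python
from itertools import product

# CPT values transcribed from the burglary network tables.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A_true = {(True, True): 0.94, (True, False): 0.95,
            (False, True): 0.31, (False, False): 0.001}

def p_a(a, b, e):
    """P(A=a | B=b, E=e)."""
    return P_A_true[(b, e)] if a else 1 - P_A_true[(b, e)]

def p_d(d, a):
    """P(D=d | A=a)."""
    pt = 0.91 if a else 0.05
    return pt if d else 1 - pt

def p_s(s, a):
    """P(S=s | A=a)."""
    pt = 0.75 if a else 0.02
    return pt if s else 1 - pt

def joint(b, e, a, d, s):
    """Full joint as the product of each node given its parents."""
    return P_B[b] * P_E[e] * p_a(a, b, e) * p_d(d, a) * p_s(s, a)

# Marginalize the hidden variables E and A, then normalize over B.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e, a in product([True, False], repeat=2))
          for b in (True, False)}
alpha = 1 / sum(unnorm.values())           # normalization constant
posterior = {b: alpha * v for b, v in unnorm.items()}
print(posterior[True])   # ≈ 0.407
```

Note how small the posterior remains even after both neighbors call: the prior probability of burglary (0.002) keeps P(B=true | D, S) around 0.41.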

Approximate inference in BN:

Approximate Inference

What is approximate inference?

 It is a method of estimating probabilities in Bayesian networks by sampling; such
sampling-based methods are also called 'Monte Carlo' algorithms.

 We will discuss two types of algorithms: Direct sampling and Markov chain sampling.
Why use approximate inference?

 Exact inference becomes intractable for large multiply-connected networks

 Variable elimination can have exponential time and space complexity

 Exact inference is strictly HARDER than NP-complete problems (#P-hard)

Direct Sampling

In direct sampling we take samples of events. We expect the frequency of the samples to
converge on the probability of the event.

Rejection Sampling —

 Used to compute conditional probabilities P(X|e)

 Generate samples as before

 Reject samples that do not match evidence

 Estimate by counting how often event X occurs in the resulting samples
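The steps above can be sketched in Python for the burglary network, estimating P(B | D=true, S=true) by rejection sampling. This is an assumed illustration (function names are mine; CPT values are from the earlier tables), and the result is a noisy estimate, not an exact answer.

```python
import random

random.seed(0)

P_A_true = {(True, True): 0.94, (True, False): 0.95,
            (False, True): 0.31, (False, False): 0.001}

def prior_sample():
    """Sample (B, D, S) by sampling every variable in topological order."""
    b = random.random() < 0.002
    e = random.random() < 0.001
    a = random.random() < P_A_true[(b, e)]
    d = random.random() < (0.91 if a else 0.05)
    s = random.random() < (0.75 if a else 0.02)
    return b, d, s

accepted = hits = 0
for _ in range(1_000_000):
    b, d, s = prior_sample()
    if d and s:              # reject samples that contradict the evidence
        accepted += 1
        hits += b
estimate = hits / accepted   # noisy estimate of P(B | D=true, S=true)
print(estimate)
```

The inefficiency is visible here: the evidence D=true, S=true is rare (probability ≈ 0.003), so the vast majority of the million samples are thrown away.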

Likelihood Weighting —

 Avoid inefficiency of rejection sampling

 Fix values for evidence variables and only sample the remaining variables

 Weight each sample by how likely the evidence is given the sampled values
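A sketch of likelihood weighting for the same query, P(B | D=true, S=true), under the same assumed CPT values: the evidence variables D and S are never sampled; instead each sample is weighted by the likelihood of that evidence given the sampled Alarm value.

```python
import random

random.seed(0)

P_A_true = {(True, True): 0.94, (True, False): 0.95,
            (False, True): 0.31, (False, False): 0.001}

num = den = 0.0
for _ in range(200_000):
    # Sample only the non-evidence variables, in topological order.
    b = random.random() < 0.002
    e = random.random() < 0.001
    a = random.random() < P_A_true[(b, e)]
    # Evidence D=true, S=true is fixed; it contributes a weight equal
    # to its likelihood under the sampled Alarm value.
    w = (0.91 if a else 0.05) * (0.75 if a else 0.02)
    num += w * b
    den += w
estimate = num / den   # weighted estimate of P(B | D=true, S=true)
print(estimate)
```

No samples are rejected, which is the efficiency gain over rejection sampling; the cost is that most samples here carry tiny weights, so the estimate is dominated by the few high-weight ones.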

Markov Chain Sampling

 Generate events by making a random change to the preceding event

 This change is made using the Markov Blanket of the variable to be changed

 Markov Blanket = parents, children, and children's parents

 Tally and normalize results
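The same query can be answered with Markov chain (Gibbs) sampling, again as an assumed sketch on the burglary network: evidence D=true and S=true stays fixed, and each non-evidence variable is resampled in turn from its distribution given its Markov blanket.

```python
import random

random.seed(0)

P_B, P_E = 0.002, 0.001
P_A_true = {(True, True): 0.94, (True, False): 0.95,
            (False, True): 0.31, (False, False): 0.001}

def p_a(a, b, e):
    """P(A=a | B=b, E=e)."""
    return P_A_true[(b, e)] if a else 1 - P_A_true[(b, e)]

def flip(w_true, w_false):
    """Sample True with probability proportional to w_true."""
    return random.random() < w_true / (w_true + w_false)

# Evidence D=true, S=true is fixed; start B, E, A arbitrarily.
b, e, a = False, False, True
hits, n = 0, 200_000
for _ in range(n):
    # P(B | blanket) ∝ P(B) P(A | B, E)
    b = flip(P_B * p_a(a, True, e), (1 - P_B) * p_a(a, False, e))
    # P(E | blanket) ∝ P(E) P(A | B, E)
    e = flip(P_E * p_a(a, b, True), (1 - P_E) * p_a(a, b, False))
    # P(A | blanket) ∝ P(A | B, E) P(D=true | A) P(S=true | A)
    a = flip(p_a(True, b, e) * 0.91 * 0.75,
             p_a(False, b, e) * 0.05 * 0.02)
    hits += b
estimate = hits / n   # tally and normalize: estimate of P(B | D=true, S=true)
print(estimate)
```

Unlike direct sampling, consecutive states are correlated, so the chain needs many sweeps to mix; in the long run the fraction of states with B=true converges to the posterior.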


Causal networks:

Decisions based on machine learning (ML) are potentially advantageous over human
decisions, as they do not suffer from the same subjectivity, and can be more accurate and
easier to analyse. At the same time, data used to train ML systems often contain human and
societal biases that can lead to harmful decisions: extensive evidence in areas such as hiring,
criminal justice, surveillance, and healthcare suggests that ML decision systems can treat
individuals unfavorably (unfairly) on the basis of characteristics such as race, gender,
disabilities, and sexual orientation – referred to as sensitive attributes.

Currently, most fairness criteria used for evaluating and designing ML decision systems
focus on the relationships between the sensitive attribute and the system output. However,
the training data can display different patterns of unfairness depending on how and why the
sensitive attribute influences other variables. Using such criteria without fully accounting for
this could be problematic: it could, for example, lead to the erroneous conclusion that a
model exhibiting harmful biases is fair and, vice-versa, that a model exhibiting harmless
biases is unfair. The development of technical solutions to fairness also requires considering
the different, potentially intricate, ways in which unfairness can appear in the data.

Understanding how and why a sensitive attribute influences other variables in a dataset can
be a challenging task, requiring both a technical and sociological analysis. The visual, yet
mathematically precise, framework of Causal Bayesian networks (CBNs) represents a
flexible useful tool in this respect as it can be used to formalize, measure, and deal with
different unfairness scenarios underlying a dataset. A CBN (Figure 1) is a graph formed by
nodes representing random variables, connected by links denoting causal influence. By
defining unfairness as the presence of a harmful influence from the sensitive attribute in the
graph, CBNs provide us with a simple and intuitive visual tool for describing different
possible unfairness scenarios underlying a dataset. In addition, CBNs provide us with a
powerful quantitative tool to measure unfairness in a dataset and to help researchers develop
techniques for addressing it.

Causal Bayesian Networks as a Visual Tool

Characterising patterns of unfairness underlying a dataset

Consider a hypothetical college admission example (inspired by the Berkeley case) in which
applicants are admitted based on qualifications Q, choice of department D, and gender G;
and in which female applicants apply more often to certain departments (for simplicity’s
sake, we consider gender as binary, but this is not a necessary restriction imposed by the
framework).
Figure 1. CBN representing a hypothetical college admission process.
Definition: In a CBN, a path from node X to node Z is defined as a sequence of linked nodes
starting at X and ending at Z. X is a cause of (has an influence on) Z if there exists a causal
path from X to Z, namely a path whose links are pointing from the preceding nodes toward
the following nodes in the sequence. For example, in Figure 1, the path G→D→A is causal,
whilst the path G→D→A←Q is non-causal.

The admission process is represented by the CBN in Figure 1. Gender has a direct influence
on admission through the causal path G→A and an indirect influence through the causal
path G→D→A. The direct influence captures the fact that individuals with the same
qualifications who are applying to the same department might be treated differently based on
their gender. The indirect influence captures differing admission rates between female and
male applicants due to their differing department choices.

Whilst the direct influence of the sensitive attribute on admission is considered unfair for
social and legal reasons, the indirect influence could be considered fair or unfair depending
on contextual factors. In Figures 2a, 2b and 2c, we depict three possible scenarios, where
fully and partially red paths are used to indicate unfair and partially-unfair links,
respectively.
Figure 2a: In the first scenario, female applicants voluntarily apply to departments with low
acceptance rates, and therefore the path G→D is considered fair.

Figure 2b: In the second scenario, female applicants apply to departments with low
acceptance rates due to systemic historical or cultural pressures, and therefore the path
G→D is considered unfair (as a consequence, the path D→A becomes partially unfair).
Figure 2c: In the third scenario, the college lowers the admission rates for departments
voluntarily chosen more often by women. The path G→D is considered fair, but the path
D→A is partially unfair.
This simplified example shows how CBNs can provide us with a visual framework for
describing different possible unfairness scenarios. Understanding which scenario underlies a
dataset can be challenging or even impossible, and might require expert knowledge. It is
nevertheless necessary to avoid pitfalls when evaluating or designing a decision system.

As an example, let’s assume that a university uses historical data to train a decision system
to decide whether a prospective applicant should be admitted, and that a regulator wants to
evaluate its fairness. Two popular fairness criteria are statistical parity (requiring the
same admission rates among female and male applicants) and equal false positive or
negative rates (EFPRs/EFNRs, requiring the same error rates among female and male
applicants: i.e., the percentage of accepted applicants erroneously predicted as rejected, and
vice-versa). In other words, statistical parity and EFPRs/EFNRs require all the predictions
and the incorrect predictions to be independent of gender.

From the discussion above, we can deduce that whether such criteria are appropriate or not
strictly depends on the nature of the data pathways. Due to the presence of the unfair direct
influence of gender on admission, it would be inappropriate for the regulator to use
EFPRs/EFNRs to gauge fairness, because this criterion considers the influence that gender
has on admission in the data as legitimate. This means that it would be possible for the
system to be deemed fair, even if it carries the unfair influence: this would automatically be
the case for an error-free decision system. On the other hand, if the path G→D→A was
considered fair, it would be inappropriate to use statistical parity. In this case, it would be
possible for the system to be deemed unfair, even if it does not contain the unfair direct
influence of gender on admission through the path G→A and only contains the fair indirect
influence through the path G→D→A. In our first paper, we raise these concerns in the
context of the fairness debate surrounding the COMPAS pretrial risk assessment tool, which
has been central to the dialogue around the risks of using ML decision systems.

Causal Bayesian Networks as a Quantitative Tool

Path-specific (counterfactual) inference techniques for fairness

CBNs can also be used to quantify unfairness in a dataset and to design techniques for
alleviating unfairness in the case of complex relationships in the data.

Path-specific techniques enable us to estimate the influence that a sensitive attribute has on
other variables along specific sets of causal paths. This can be used to measure the degree of
unfairness on a given dataset in complex scenarios in which some causal paths are
considered unfair whilst other causal paths are considered fair. In the college admission
example in which the path G→D→A is considered fair, path-specific techniques would
enable us to measure the influence of G on A restricted to the direct path G→A over the
whole population, in order to obtain an estimate of the degree of unfairness contained in the
dataset.

Sidenote: It's worth noting that, in our simple example, we do not consider the presence of
confounders for the influence of G on A. In this case, as there are no unfair causal paths
from G to A except the direct one, the degree of unfairness could simply be obtained by
measuring the discrepancy between p(A | G=0, Q, D) and p(A | G=1, Q, D), where p(A |
G=0, Q, D) indicates the distribution of A conditioned on the candidate being male, their
qualifications, and department choice.

The additional use of counterfactual inference techniques would enable us to ask whether a
specific individual was treated unfairly, for example by asking whether a rejected female
applicant (G=1, Q=q, D=d, A=0) would have obtained the same decision in a counterfactual
world in which her gender had been male along the direct path G→A. In this simple example,
assuming that the admission decision is obtained as the deterministic function f of G, Q, and
D, i.e., A = f(G, Q, D), this corresponds to asking if f(G=0, Q=q, D=d) = 0, namely if a male
applicant with the same department choice and qualifications would have also been rejected.
We exemplify this in Figure 3 by re-computing the admission decision after changing the
female candidate's photo to a male one in the profile.

Figure 3. Counterfactual scenario


However, path-specific counterfactual inference is generally more complex to achieve if
some variables are unfairly influenced by G. Assume that G also has an influence on Q
through a direct path G→Q which is considered unfair. In this case, the CBN contains
variables that are both fairly and unfairly influenced by G. Path-specific counterfactual
inference would consist of computing a counterfactual correction q_0 of q, i.e., of the
variable that is unfairly influenced by G, and then computing the counterfactual decision as
f(G=0, Q=q_0, D=d). The counterfactual correction q_0 is obtained by first using the
information of the female applicant (G=1, Q=q, D=d, A=0) and knowledge of the CBN to
estimate the applicant-specific latent randomness, and then using this estimate to
re-compute the value of Q as if G=0 along G→Q.

In addition to answering questions of fairness in a dataset, path-specific counterfactual
inference could be used to design methods to alleviate the unfairness of an ML system. In
our second paper, we propose a method to perform path-specific counterfactual inference
and suggest that it can be used to post-process the unfair decisions of a trained ML system
by replacing them with counterfactual decisions. The resulting system is said to satisfy path-
specific counterfactual fairness.
