AI unit 5 notes
PROBABILISTIC REASONING
1. UNCERTAINTY
To act rationally under uncertainty we must be able to evaluate how likely certain things are.
With FOL a fact F is only useful if it is known to be true or false. But we need to be able to
evaluate how likely it is that F is true. By weighing likelihoods of events (probabilities) we
can develop mechanisms for acting rationally under uncertainty.
Dental Diagnosis example.
In FOL we might formulate
∀p. symptom(p, toothache) → disease(p, cavity) ∨ disease(p, gumDisease) ∨ disease(p, foodStuck) ∨ ...
When do we stop?
Cannot list all possible causes.
We also want to rank the possibilities. We don’t want to start drilling for a cavity before
checking for more likely causes first.
Axioms Of Probability
Given a set U (universe), a probability function is a function defined over the subsets of U
that maps each subset to the real numbers and that satisfies the Axioms of Probability
1. Pr(U) = 1
2. Pr(A) ∈ [0, 1]
3. Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
Note: if A ∩ B = {} then Pr(A ∪ B) = Pr(A) + Pr(B)
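As a quick sanity check, the axioms can be verified numerically for a finite universe. The sketch below uses a fair six-sided die as an assumed example and treats events as Python sets:

```python
from fractions import Fraction

# Universe: outcomes of a fair six-sided die, each with probability 1/6.
U = {1, 2, 3, 4, 5, 6}

def pr(event):
    """Probability of an event (a subset of U) under the uniform distribution."""
    return Fraction(len(event & U), len(U))

A = {1, 2, 3}      # "roll at most 3"
B = {2, 4, 6}      # "roll an even number"

assert pr(U) == 1                                 # Axiom 1
assert 0 <= pr(A) <= 1                            # Axiom 2
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)     # Axiom 3
# Disjoint special case: Pr(A ∪ B) = Pr(A) + Pr(B) when A ∩ B = {}
assert pr({1} | {6}) == pr({1}) + pr({6})
print(pr(A | B))  # 5/6
```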
2. REVIEW OF PROBABILITY
· Natural way to represent uncertainty
· People have intuitive notions about probabilities
· Many of these are wrong or inconsistent
3. PROBABILISTIC REASONING
· Representing Knowledge in an Uncertain Domain
· Belief network used to encode the meaningful dependence between variables.
o Nodes represent random variables
o Arcs represent direct influence
o Nodes have a conditional probability table that gives that variable's probability given the
different states of its parents
o Is a Directed Acyclic Graph (or DAG)
o P(X1=x1, ..., Xn=xn) = P(X1=x1|parents(X1)) * ... * P(Xn=xn|parents(Xn))
· Note: Only have to be given the immediate parents of Xi, not all other nodes:
o P(Xi|X(i-1),...,X1) = P(Xi|parents(Xi))
· Often, the resulting conditional probability tables are much smaller than the
exponential size of the full joint
· If we don't order nodes by "root causes" first, we get larger conditional probability
tables
· Different tables may encode the same probabilities.
· Some canonical distributions that appear in conditional probability tables:
o deterministic logical relationship (e.g. AND, OR)
o deterministic numeric relationship (e.g. MIN)
o parametric relationship (e.g. weighted sum in neural net)
o noisy logical relationship (e.g. noisy-OR, noisy-MAX)
· d-separation: X and Y are conditionally independent given evidence node Z if Z blocks every
undirected path between X and Y. An evidence node Z blocks a path if it:
o has an arrow in on the path leading from X and an arrow out on the path leading to Y (or
vice versa), or
o has arrows out leading to both X and Y, and
o does NOT have arrows in from both X and Y (nor do Z's children)
· Multiply connected graphs have 2 nodes connected by more than one path
· Techniques for handling:
o Clustering: Group some of the intermediate nodes into one meganode. Pro: perhaps the best
way to get an exact evaluation.
o Cutset conditioning: Instantiate a set of nodes so that the remaining network forms trees.
Pro: it may be safe to ignore trees with low probability (bounded cutset conditioning).
o Stochastic simulation: run through the net with randomly chosen values for each node
(weighted by prior probabilities).
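The factorization above, P(X1=x1, ..., Xn=xn) = P(X1=x1|parents(X1)) * ... * P(Xn=xn|parents(Xn)), can be made concrete with a toy two-node network; the CPT numbers below are illustrative only, not from any real dataset:

```python
# Tiny DAG: Rain -> WetGrass. Joint factorizes as P(R) * P(W | R).
# All probability values are made up for illustration.
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},     # P(W | R=True)
                    False: {True: 0.05, False: 0.95}}  # P(W | R=False)

def joint(rain, wet):
    """P(Rain=rain, WetGrass=wet) via the chain-rule factorization."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# The four atomic events must sum to 1.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
assert abs(total - 1.0) < 1e-12
print(round(joint(True, True), 2))  # P(R) * P(W|R) = 0.2 * 0.9 = 0.18
```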
4. BAYESIAN NETWORK
Bayes’ nets:
A technique for describing complex joint distributions (models) using simple, local
distributions
(conditional probabilities)
More properly called graphical models
Local interactions chain together to give global indirect interactions
A Bayesian network is a graphical structure that allows us to represent and reason about an
uncertain domain. The nodes in a Bayesian network represent a set of random variables,
X = {X1, ..., Xi, ..., Xn}, from the domain. A set of directed arcs (or links) connects pairs of nodes,
Xi → Xj, representing the direct dependencies between variables.
Assuming discrete variables, the strength of the relationship between variables is quantified
by conditional probability distributions associated with each node. The only constraint on the
arcs allowed in a BN is that there must not be any directed cycles: you cannot return to a
node simply by following directed arcs.
Such networks are called directed acyclic graphs, or simply dags. There are a number of steps
that a knowledge engineer must undertake when building a Bayesian network. At this stage
we will present these steps as a sequence; however it is important to note that in the real-
world the process is not so simple.
First, the knowledge engineer must identify the variables of interest. This involves answering
the question: what are the nodes to represent and what values can they take, or what state can
they be in? For now we will consider only nodes that take discrete values. The values should
be both mutually exclusive and exhaustive, which means that the variable must take on
exactly one of these values at a time. Common types of discrete nodes include:
Boolean nodes, which represent propositions, taking the binary values true (T) and false (F).
In a medical diagnosis domain, the node Cancer would represent the proposition that a patient
has cancer.
Ordered values. For example, a node Pollution might represent a patient’s pollution exposure
and take the values low, medium, high.
Integral values. For example, a node called Age might represent a patient’s age and have
possible values from 1 to 120.
Even at this early stage, modeling choices are being made. For example, an alternative to
representing a patient’s exact age might be to clump patients into different age groups, such
as baby, child, adolescent, young, middle-aged, old. The trick is to choose values that
represent the domain efficiently.
1 Representation of joint probability distribution
2 Conditional independence relation in Bayesian network
Bayes' Theorem
Many of the methods used for dealing with uncertainty in expert systems are based on Bayes'
Theorem.
Notation:
P(A) Probability of event A
P(A ∧ B) Probability of events A and B occurring together
P(A | B) Conditional probability of event A given that event B has occurred
If A and B are independent, then P(A | B) = P(A).
Expert systems usually deal with events that are not independent, e.g. a disease and its
symptoms are not independent.
Theorem
P(A ∧ B) = P(A | B) * P(B) = P(B | A) * P(A), therefore P(A | B) = P(B | A) * P(A) / P(B)
In doing an expert task, such as medical diagnosis, the goal is to determine identifications
(diseases) given observations (symptoms). Bayes' Theorem provides such a relationship.
P(A | B) = P(B | A) * P(A) / P(B)
The desired diagnostic relationship on the left can be calculated based on the known
statistical quantities on the right.
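A minimal numeric sketch of this diagnostic use of Bayes' theorem; the probability values here are assumed purely for illustration:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Illustrative numbers: P(symptom | disease) = 0.9,
# P(disease) = 0.01, P(symptom) = 0.1.
posterior = bayes(0.9, 0.01, 0.1)
print(round(posterior, 4))  # 0.09
```

The diagnostic probability P(disease | symptom) on the left is obtained entirely from the causal/statistical quantities on the right.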
Given a set of random variables X1 ... Xn, an atomic event is an assignment of a particular
value to each Xi. The joint probability distribution is a table that assigns a probability to each
atomic event. Any question of conditional probability can be answered from the joint.
            Toothache    ¬Toothache
Cavity        0.04          0.06
¬Cavity       0.01          0.89
Problems:
The size of the table is combinatorial: the product of the number of possibilities for each
random variable. The time to answer a question from the table will also be combinatorial.
Lack of evidence: we may not have statistics for some table entries, even though those entries
are not impossible.
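Despite these problems, any conditional probability can be read off a small joint table by summing and normalizing; a sketch using the toothache/cavity table above:

```python
# Full joint for the two Boolean variables, taken from the table above.
joint = {
    (True,  True):  0.04,  # Cavity, Toothache
    (True,  False): 0.06,  # Cavity, ¬Toothache
    (False, True):  0.01,  # ¬Cavity, Toothache
    (False, False): 0.89,  # ¬Cavity, ¬Toothache
}

# P(Cavity | Toothache) = P(Cavity ∧ Toothache) / P(Toothache)
p_toothache = joint[(True, True)] + joint[(False, True)]
p_cavity_given_toothache = joint[(True, True)] / p_toothache
print(round(p_cavity_given_toothache, 3))  # 0.04 / 0.05 = 0.8
```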
Chain Rule
We can compute probabilities using a chain rule as follows:
P(A ∧ B ∧ C) = P(A | B ∧ C) * P(B | C) * P(C)
If some conditions C1 ∧ ... ∧ Cn are independent of other conditions U, we will have:
P(A | C1 ∧ ... ∧ Cn ∧ U) = P(A | C1 ∧ ... ∧ Cn)
This allows a conditional probability to be computed more easily from smaller tables using
the chain rule.
Bayesian Networks
Bayesian networks, also called belief networks or Bayesian belief networks, express
relationships among variables by directed acyclic graphs with probability tables stored at the
nodes.[Example from Russell & Norvig.]
If a Bayesian network is well structured as a poly-tree (at most one path between any two
nodes), then probabilities can be computed relatively efficiently. One kind of algorithm, due
to Judea Pearl, uses a message-passing style in which nodes of the network compute
probabilities and send them to nodes they are connected to. Several software packages exist
for computing with belief networks.
A Hidden Markov Model (HMM) tagger chooses the tag for each word that maximizes:
[Jurafsky, op. cit.] P(word | tag) * P(tag | previous n tags)
In practice, trigram taggers are most often used, and a search is made for the best set of tags
for the whole sentence; accuracy is about 96%.
The assumptions behind an HMM are that the state at time t+1 only depends on the state at
time t, as in the Markov chain. The observation at time t only depends on the state at time t.
The observations are modeled using a variable for each time t whose domain is the set of
possible observations. The belief network representation of an HMM is depicted in Figure.
Although the belief network is shown for four stages, it can proceed indefinitely.
The problem of filtering or belief-state monitoring is to determine the current state based on
the current and previous observations, namely to determine P(Si|O0,...,Oi).
Note that all state and observation variables after Si are irrelevant because they are not
observed and can be ignored when this conditional distribution is computed.
The problem of smoothing is to determine a state based on past and future observations.
Suppose an agent has observed up to time k and wants to determine the state at time i for i<k;
the smoothing problem is to determine
P(Si|O0,...,Ok).
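The filtering distribution P(Si|O0,...,Oi) can be computed with the standard forward recursion (predict with the transition model, then weight by the observation likelihood and normalize). The two-state umbrella model below uses illustrative transition and sensor values, assumed for this sketch:

```python
# Forward (filtering) recursion for a two-state HMM.
# All model numbers are illustrative.
states = ("rain", "dry")
prior = {"rain": 0.5, "dry": 0.5}
trans = {"rain": {"rain": 0.7, "dry": 0.3},
         "dry":  {"rain": 0.3, "dry": 0.7}}
obs_model = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
             "dry":  {"umbrella": 0.2, "no_umbrella": 0.8}}

def filter_hmm(observations):
    """Return the belief P(S_i | O_0..O_i) after the last observation."""
    belief = dict(prior)
    for o in observations:
        # Predict: push the belief through the transition model.
        predicted = {s: sum(belief[p] * trans[p][s] for p in states) for s in states}
        # Update: weight by the observation likelihood, then normalize.
        unnorm = {s: obs_model[s][o] * predicted[s] for s in states}
        z = sum(unnorm.values())
        belief = {s: unnorm[s] / z for s in states}
    return belief

b = filter_hmm(["umbrella", "umbrella"])
print(round(b["rain"], 3))  # 0.883
```

Note that, as stated above, observations after time i never enter the recursion, so they can be ignored when filtering.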
Bayesian inference
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two
random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
Example: If the probability of cancer depends on a person's age, then by using Bayes' theorem
we can use age to determine the probability of cancer more accurately.
Bayes' theorem can be derived using the product rule and the conditional probability of event A
with known event B:
P(A | B) = P(B | A) * P(A) / P(B)    ...(a)
The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of
most modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as the posterior, which we need to calculate; it is read as the probability
of hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood: the probability of the evidence given that the hypothesis
is true.
P(A) is called the prior probability, probability of hypothesis before considering the
evidence
In equation (a), in general, we can write P(B) = Σi P(Ai) * P(B|Ai); hence Bayes' rule can
be written as:
P(Ai | B) = P(B | Ai) * P(Ai) / Σk P(Ak) * P(B|Ak)
where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.
Applying Bayes' rule:
Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A).
This is very useful in cases where we have a good probability of these three terms and want
to determine the fourth one. Suppose we want to compute an unknown cause from its observed
effect; then Bayes' rule becomes:
P(cause | effect) = P(effect | cause) * P(cause) / P(effect)
Example-1:
Question: what is the probability that a patient has diseases meningitis with a stiff
neck?
Given Data:
A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs
80% of the time. He is also aware of some more facts, which are given as follows:
Let a be the proposition that the patient has a stiff neck and b be the proposition that the
patient has meningitis, so we can calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
P(b|a) = P(a|b) * P(b) / P(a) = (0.8 × 1/30000) / 0.02 = 1/750 ≈ 0.0013
Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff
neck.
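The meningitis calculation above can be checked exactly using rational arithmetic:

```python
from fractions import Fraction

# Numbers from the example above.
p_a_given_b = Fraction(8, 10)    # P(stiff neck | meningitis) = 0.8
p_b = Fraction(1, 30000)         # P(meningitis)
p_a = Fraction(2, 100)           # P(stiff neck) = 0.02

# Bayes' rule: P(b | a) = P(a | b) * P(b) / P(a)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # 1/750
```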
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The
probability that the card is king is 4/52, then calculate posterior probability P(King|
Face), which means the drawn face card is a king card.
Solution:
P(King|Face) = P(Face|King) * P(King) / P(Face) = 1 × (4/52) / (12/52) = 1/3
Every king is a face card, so P(Face|King) = 1, and there are 12 face cards in the deck.
Hence, the probability that the drawn face card is a king is 1/3.
Application of Bayes' theorem in Artificial Intelligence:
o It is used to calculate the next step of the robot when the already executed step is
given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.
Naïve Bayes' Classifier:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple,
without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:
Where,
P(B|A) is the Likelihood probability: the probability of the evidence given that the
hypothesis is true.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we need to follow the below steps:
o Convert the given dataset into frequency tables.
o Generate a likelihood table by finding the probabilities of the given features.
o Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the Player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table weather condition:
Weather No Yes
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Yes) = 0.71
P(Sunny) = 0.35
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the Player can play the game.
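The hand computation above can be reproduced directly from the 14-row dataset; a minimal sketch:

```python
# The 14 (Outlook, Play) rows from the dataset above.
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
p_sunny = sum(1 for w, _ in data if w == "Sunny") / n          # 5/14
p_yes = sum(1 for _, p in data if p == "Yes") / n              # 10/14
p_sunny_given_yes = (sum(1 for w, p in data if w == "Sunny" and p == "Yes")
                     / sum(1 for _, p in data if p == "Yes"))  # 3/10

# Bayes' rule: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6
```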
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document classification problems, it
means a particular document belongs to which category such as Sports, Politics,
education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the
predictor variables are the independent Booleans variables. Such as if a particular
word is present or not in a document. This model is also famous for document
classification tasks.
Probabilistic reasoning:
Uncertainty:
So far, we have learned knowledge representation using first-order logic and propositional
logic with certainty, which means we were sure about the predicates. With this knowledge
representation, we might write A→B, which means if A is true then B is true. But consider a
situation where we are not sure whether A is true or not; then we cannot express this
statement. This situation is called uncertainty.
So to represent uncertain knowledge, where we are not sure about the predicates, we need
uncertain reasoning or probabilistic reasoning.
Causes of uncertainty:
Following are some leading causes of uncertainty to occur in the real world.
Probabilistic reasoning:
In the real world, there are lots of scenarios, where the certainty of something is not
confirmed, such as "It will rain today," "behavior of someone for some situations," "A match
between two teams or two players." These are probable sentences for which we can assume
that it will happen but not sure about it, so here we use probabilistic reasoning.
In probabilistic reasoning, there are two ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian Statistics
Probability: Probability can be defined as a chance that an uncertain event will occur. It is
the numerical measure of the likelihood that an event will occur. The value of probability
always lies between 0 and 1, where 0 indicates an impossible event and 1 a certain event.
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in the real
world.
Posterior Probability: The probability that is calculated after all evidence or information has
been taken into account. It is a combination of the prior probability and new information.
Conditional probability:
Conditional probability is the probability of an event occurring given that another event has
already happened.
Let's suppose we want to calculate the probability of event A when event B has already
occurred, "the probability of A under the conditions of B"; it can be written as:
P(A|B) = P(A ⋀ B) / P(B)
If the probability of A is given and we need to find the probability of B, then it will be given
as:
P(B|A) = P(A ⋀ B) / P(A)
It can be explained using the below Venn diagram, where B is the event that has occurred, so
the sample space is reduced to set B, and now we can only calculate event A, given that event B
has already occurred, by dividing the probability P(A⋀B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English and
mathematics. What percent of students who like English also like mathematics?
Solution:
P(Mathematics | English) = P(English ⋀ Mathematics) / P(English) = 0.4 / 0.7 = 0.57
Hence, 57% of the students who like English also like Mathematics.
Bayesian networks
A Bayesian belief network is a key computer technology for dealing with probabilistic events
and solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions. A
Bayesian network graph is made up of nodes and arcs (directed links), where each node
represents a random variable and each arc represents a direct (causal) influence between
variables. The network consists of two parts:
o Causal component (the graph structure)
o Actual numbers (the conditional probabilities)
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram.
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which quantifies the effect of the parents on that node.
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations
of x1, x2, x3, ..., xn are known as the joint probability distribution.
By the chain rule, the joint probability distribution P[x1, x2, x3, ..., xn] can be written as:
P[x1, x2, ..., xn] = P[x1 | x2, ..., xn] * P[x2 | x3, ..., xn] * ... * P[xn-1 | xn] * P[xn]
In general, for each variable Xi, we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Let's understand the Bayesian network through an example by creating a directed acyclic
graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
responds reliably to a burglary but also responds to minor earthquakes. Harry has
two neighbors, David and Sophia, who have taken responsibility for informing Harry at work
when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes
he gets confused with the phone ringing and calls then too. On the other hand, Sophia
likes to listen to loud music, so sometimes she fails to hear the alarm. Here we would like
to compute the probability of the Burglary Alarm.
Problem:
Calculate the probability that the alarm has sounded but neither a burglary nor an
earthquake has occurred, and both David and Sophia called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows
that burglary and earthquake are the parent nodes of the alarm and directly affect the
probability of the alarm going off, while David's and Sophia's calls depend on the
alarm probability.
o The network represents the assumptions that the neighbors do not directly perceive the
burglary, do not notice the minor earthquake, and do not confer before calling.
o The conditional distributions for each node are given as a conditional probability table,
or CPT.
o Each row in the CPT must sum to 1 because the entries in the table represent an
exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if
there are two parents, then the CPT will contain 4 probability values.
The list of events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of probability: P[D, S, A, B, E].
We can rewrite this probability statement using the joint probability distribution:
P[D, S, A, B, E] = P[D | A] * P[S | A] * P[A | B, E] * P[B] * P[E]
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B=True) = 0.002, the probability of a burglary.
P(B=False) = 0.998, the probability of no burglary.
P(E=True) = 0.001, the probability of a minor earthquake.
P(E=False) = 0.999, the probability that an earthquake has not occurred.
We can provide the conditional probabilities as per the below tables:
The Conditional probability of David that he will call depends on the probability of Alarm.
The Conditional probability of Sophia that she calls is depending on its Parent Node "Alarm."
From the formula of the joint distribution, we can write the problem statement in the form of a
probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B, ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
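The product can be checked with a few lines; the CPT values below (P(S|A) = 0.75, P(D|A) = 0.91, P(A|¬B,¬E) = 0.001, P(B) = 0.002, P(E) = 0.001) are the ones used in the standard version of this example and are consistent with the stated answer:

```python
# P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B,¬E) * P(¬B) * P(¬E)
p_s_given_a = 0.75
p_d_given_a = 0.91
p_a_given_not_b_not_e = 0.001
p_not_b = 0.998   # 1 - P(B) = 1 - 0.002
p_not_e = 0.999   # 1 - P(E) = 1 - 0.001

answer = (p_s_given_a * p_d_given_a * p_a_given_not_b_not_e
          * p_not_b * p_not_e)
print(round(answer, 8))  # 0.00068045
```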
There are two ways to understand the semantics of the Bayesian network, which are given
below:
1. To understand the network as a representation of the joint probability distribution.
2. To understand the network as an encoding of a collection of conditional independence
statements.
Bayesian networks are a type of probabilistic graphical model which represents a set of
variables and their conditional dependencies using a Directed Acyclic Graph (DAG). It uses
Bayesian inference for probability computations. Bayesian networks aim to model conditional
dependence, and therefore causation, by representing conditional dependence by edges in a
directed graph. Through these relationships, one can efficiently conduct inference on the
random variables in the graph through the use of factors.
Using the relationships specified by our Bayesian network, we can obtain a compact,
factorized representation of the joint probability distribution by taking advantage of
conditional independence.
A Bayesian network is a directed acyclic graph in which each edge corresponds to a
conditional dependency, and each node corresponds to a unique random variable. Formally, if
an edge (A, B) exists in the graph connecting random variables A and B, it means that P(B|A)
is a factor in the joint probability distribution, so we must know P(B|A) for all values of B and
A in order to conduct inference. In the above example, since Rain has an edge going into
WetGrass, it means that P(WetGrass|Rain) will be a factor, whose probability values are
specified next to the WetGrass node in a conditional probability table.
Bayesian networks satisfy the local Markov property, which states that a node is
conditionally independent of its non-descendants given its parents. In the above example, this
means that P(Sprinkler|Cloudy, Rain) = P(Sprinkler|Cloudy) since Sprinkler is conditionally
independent of its non-descendant, Rain, given Cloudy. This property allows us to simplify the
joint distribution, obtained in the previous section using the chain rule, to a smaller form. After
simplification, the joint distribution for a Bayesian network is equal to the product of P(node|
parents(node)) for all nodes, stated below:
In larger networks, this property allows us to greatly reduce the amount of required
computation, since generally, most nodes will have few parents relative to the overall size of
the network.
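For the Cloudy/Sprinkler/Rain/WetGrass network discussed above, the simplified joint is P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S, R). The sketch below uses the CPT values commonly quoted for this example; they are assumed here for illustration:

```python
import itertools

p_cloudy = 0.5
p_sprinkler_given = {True: 0.1, False: 0.5}      # P(Sprinkler | Cloudy)
p_rain_given = {True: 0.8, False: 0.2}           # P(Rain | Cloudy)
p_wet_given = {(True, True): 0.99, (True, False): 0.9,   # P(WetGrass | S, R)
               (False, True): 0.9, (False, False): 0.0}

def bern(p, value):
    """P(X = value) for a Boolean X with P(X = True) = p."""
    return p if value else 1.0 - p

def joint(c, s, r, w):
    """P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    return (bern(p_cloudy, c) * bern(p_sprinkler_given[c], s)
            * bern(p_rain_given[c], r) * bern(p_wet_given[(s, r)], w))

# All 16 atomic events must sum to 1.
total = sum(joint(*vals) for vals in itertools.product((True, False), repeat=4))
assert abs(total - 1.0) < 1e-12
print(round(joint(True, False, True, True), 3))  # 0.5 * 0.9 * 0.8 * 0.9 = 0.324
```

Because each node conditions only on its parents, four small tables (1 + 2 + 2 + 4 = 9 numbers) replace the 15 independent entries of the full joint.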
Inference
Inference is one key objective in a Bayesian network (BN), and it aims to estimate
the posterior distributions of state variables based on evidence (observations).
- Exact Inference
The first is simply evaluating the joint probability of a particular assignment of values for each
variable (or a subset) in the network. For this, we already have a factorized form of the joint
distribution, so we simply evaluate that product using the provided conditional probabilities. If
we only care about a subset of variables, we will need to marginalize out the ones we are not
interested in. In many cases, this may result in underflow, so it is common to take the
logarithm of that product, which is equivalent to adding up the individual logarithms of each
term in the product.
The second, more interesting inference task, is to find P(x|e), or, to find the probability of
some assignment of a subset of the variables (x) given assignments of other variables (our
evidence, e). In the above example, an example of this could be to find P(Sprinkler, WetGrass
| Cloudy), where {Sprinkler, WetGrass} is our x, and {Cloudy} is our e. In order to calculate
this, we use the fact that P(x|e) = P(x, e) / P(e) = αP(x, e), where α is a normalization constant
that we will calculate at the end such that P(x|e) + P(¬x | e) = 1. In order to calculate P(x, e),
we must marginalize the joint probability distribution over the variables that do not appear in x
or e, which we will denote as Y.
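The normalization trick P(x|e) = αP(x, e) can be sketched on a tiny two-node network (Rain → WetGrass, with illustrative CPTs assumed here), computing P(Rain | WetGrass=true):

```python
# Tiny network Rain -> WetGrass with illustrative CPTs.
p_rain = 0.2
p_wet_given_rain = {True: 0.9, False: 0.05}

def joint(rain, wet):
    """P(Rain=rain, WetGrass=wet) from the factorized form."""
    pr = p_rain if rain else 1 - p_rain
    pw = p_wet_given_rain[rain] if wet else 1 - p_wet_given_rain[rain]
    return pr * pw

# Unnormalized P(Rain=r, wet=True), then normalize with α = 1 / P(wet=True).
unnorm = {r: joint(r, True) for r in (True, False)}
alpha = 1.0 / sum(unnorm.values())
posterior = {r: alpha * unnorm[r] for r in (True, False)}
print(round(posterior[True], 3))  # 0.818
```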
Approximate Inference
We will discuss two types of algorithms: Direct sampling and Markov chain sampling.
Why use approximate inference? Exact inference is intractable (worst-case exponential time)
in large, multiply connected networks, so we settle for estimates obtained by sampling.
Direct Sampling
In direct sampling we take samples of events. We expect the frequency of the samples to
converge on the probability of the event.
Rejection Sampling — generate samples by direct sampling, then reject those that are
inconsistent with the evidence; estimate probabilities from the samples that remain.
Likelihood Weighting — fix the values of the evidence variables and sample only the remaining
variables, weighting each sample by the likelihood it assigns to the evidence.
Markov Chain Sampling — start from an arbitrary assignment and repeatedly change one
non-evidence variable at a time; each change is made by sampling from the distribution
conditioned on the Markov blanket of the variable to be changed.
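Rejection sampling can be sketched on the same tiny Rain → WetGrass network used earlier (CPT values assumed for illustration), estimating P(Rain | WetGrass=true):

```python
import random

random.seed(0)

# Tiny network Rain -> WetGrass with illustrative CPTs.
p_rain = 0.2
p_wet_given_rain = {True: 0.9, False: 0.05}

def sample_once():
    """Draw one (rain, wet) sample in topological order."""
    rain = random.random() < p_rain
    wet = random.random() < p_wet_given_rain[rain]
    return rain, wet

accepted, rain_count = 0, 0
for _ in range(100_000):
    rain, wet = sample_once()
    if wet:                 # reject samples inconsistent with the evidence
        accepted += 1
        rain_count += rain

print(round(rain_count / accepted, 2))  # close to the exact posterior 0.818...
```

The drawback visible here is that only about 22% of samples are kept; the rarer the evidence, the more samples are wasted, which is what motivates likelihood weighting.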
Causal Bayesian Networks and Fairness
Decisions based on machine learning (ML) are potentially advantageous over human
decisions, as they do not suffer from the same subjectivity, and can be more accurate and
easier to analyse. At the same time, data used to train ML systems often contain human and
societal biases that can lead to harmful decisions: extensive evidence in areas such as hiring,
criminal justice, surveillance, and healthcare suggests that ML decision systems can treat
individuals unfavorably (unfairly) on the basis of characteristics such as race, gender,
disabilities, and sexual orientation – referred to as sensitive attributes.
Currently, most fairness criteria used for evaluating and designing ML decision systems
focus on the relationships between the sensitive attribute and the system output. However,
the training data can display different patterns of unfairness depending on how and why the
sensitive attribute influences other variables. Using such criteria without fully accounting for
this could be problematic: it could, for example, lead to the erroneous conclusion that a
model exhibiting harmful biases is fair and, vice-versa, that a model exhibiting harmless
biases is unfair. The development of technical solutions to fairness also requires considering
the different, potentially intricate, ways in which unfairness can appear in the data.
Understanding how and why a sensitive attribute influences other variables in a dataset can
be a challenging task, requiring both a technical and sociological analysis. The visual, yet
mathematically precise, framework of Causal Bayesian networks (CBNs) represents a
flexible useful tool in this respect as it can be used to formalize, measure, and deal with
different unfairness scenarios underlying a dataset. A CBN (Figure 1) is a graph formed by
nodes representing random variables, connected by links denoting causal influence. By
defining unfairness as the presence of a harmful influence from the sensitive attribute in the
graph, CBNs provide us with a simple and intuitive visual tool for describing different
possible unfairness scenarios underlying a dataset. In addition, CBNs provide us with a
powerful quantitative tool to measure unfairness in a dataset and to help researchers develop
techniques for addressing it.
Consider a hypothetical college admission example (inspired by the Berkeley case) in which
applicants are admitted based on qualifications Q, choice of department D, and gender G;
and in which female applicants apply more often to certain departments (for simplicity’s
sake, we consider gender as binary, but this is not a necessary restriction imposed by the
framework).
Figure 1. CBN representing a hypothetical college admission process.
Definition: In a CBN, a path from node X to node Z is defined as a sequence of linked nodes
starting at X and ending at Z. X is a cause of (has an influence on) Z if there exists a causal
path from X to Z, namely a path whose links are pointing from the preceding nodes toward
the following nodes in the sequence. For example, in Figure 1, the path G→D→A is causal,
whilst the path G→D→A←Q is non-causal.
The admission process is represented by the CBN in Figure 1. Gender has a direct influence
on admission through the causal path G→A and an indirect influence through the causal
path G→D→A. The direct influence captures the fact that individuals with the same
qualifications who are applying to the same department might be treated differently based on
their gender. The indirect influence captures differing admission rates between female and
male applicants due to their differing department choices.
Whilst the direct influence of the sensitive attribute on admission is considered unfair for
social and legal reasons, the indirect influence could be considered fair or unfair depending
on contextual factors. In Figures 2a, 2b and 2c, we depict three possible scenarios, where
total or partial red paths are used to indicate unfair and partially-unfair links,
respectively.
Figure 2a: In the first scenario, female applicants voluntarily apply to departments with low
acceptance rates, and therefore the path G→D is considered fair.
Figure 2b: In the second scenario, female applicants apply to departments with low
acceptance rates due to systemic historical or cultural pressures, and therefore the path
G→D is considered unfair (as a consequence, the path D→A becomes partially unfair).
Figure 2c: In the third scenario, the college lowers the admission rates for departments
voluntarily chosen more often by women. The path G→D is considered fair, but the path
D→A is partially unfair.
This simplified example shows how CBNs can provide us with a visual framework for
describing different possible unfairness scenarios. Understanding which scenario underlies a
dataset can be challenging or even impossible, and might require expert knowledge. It is
nevertheless necessary to avoid pitfalls when evaluating or designing a decision system.
As an example, let’s assume that a university uses historical data to train a decision system
to decide whether a prospective applicant should be admitted, and that a regulator wants to
evaluate its fairness. Two popular fairness criteria are statistical parity (requiring the
same admission rates among female and male applicants) and equal false positive or
negative rates (EFPRs/EFNRs, requiring the same error rates among female and male
applicants: i.e., the percentage of accepted applicants erroneously predicted as rejected, and
vice-versa). In other words, statistical parity and EFPRs/EFNRs require all the predictions
and the incorrect predictions to be independent of gender.
From the discussion above, we can deduce that whether such criteria are appropriate or not
strictly depends on the nature of the data pathways. Due to the presence of the unfair direct
influence of gender on admission, it would be inappropriate for the regulator to use
EFPRs/EFNRs to gauge fairness, because this criterion considers the influence that gender
has on admission in the data as legitimate. This means that it would be possible for the
system to be deemed fair, even if it carries the unfair influence: this would automatically be
the case for an error-free decision system. On the other hand, if the path G→D→A was
considered fair, it would be inappropriate to use statistical parity. In this case, it would be
possible for the system to be deemed unfair, even if it does not contain the unfair direct
influence of gender on admission through the path G→A and only contains the fair indirect
influence through the path G→D→A. In our first paper, we raise these concerns in the
context of the fairness debate surrounding the COMPAS pretrial risk assessment tool, which
has been central to the dialogue around the risks of using ML decision systems.
CBNs can also be used to quantify unfairness in a dataset and to design techniques for
alleviating unfairness in the case of complex relationships in the data.
Path-specific techniques enable us to estimate the influence that a sensitive attribute has on
other variables along specific sets of causal paths. This can be used to measure the degree of
unfairness on a given dataset in complex scenarios in which some causal paths are
considered unfair whilst other causal paths are considered fair. In the college admission
example in which the path G→D→A is considered fair, path-specific techniques would
enable us to measure the influence of G on A restricted to the direct path G→A over the
whole population, in order to obtain an estimate of the degree of unfairness contained in the
dataset.
Sidenote: It's worth noting that, in our simple example, we do not consider the presence of
confounders for the influence of G on A. In this case, as there are no unfair causal paths
from G to A except the direct one, the degree of unfairness could simply be obtained by
measuring the discrepancy between p(A | G=0, Q, D) and p(A | G=1, Q, D), where p(A |
G=0, Q, D) indicates the distribution of A conditioned on the candidate being male, their
qualifications, and department choice.