
Lecture 1: introduction to inference and Bayes' rule

Ben Lambert
[email protected]
Imperial College London

Tuesday 5th March, 2019


Outline

1 Introduction
2 Course outline
3 The theory and practice of inference
  - A conceptual introduction to inference
  - Frequentist and Bayesian world views
4 Elements of Bayes' rule for inference
5 Posterior predictive distributions



Who am I?

- Researcher in epidemiology at Imperial College London (formerly here in Zoology).
- User of Bayesian statistics for the past X years.
- Born in the same town as Thomas Bayes (Tunbridge Wells).

Course timetable

Today:
- Lecture: 9.30am - 11am.
- Class: 11:30am - 1pm.
- Lecture: 2pm - 3.30pm.
- Class: 3.30pm - 5.15pm.
N.B. Usually I have 8-9 hours of lectures to teach this material. We have 3 hours!
Lecture notes: http://ben-lambert.com/bayesian-short-course/


Tangible benefits of Bayesian inference

- Simple and intuitive model building (unlike frequentist statistics, there is no need to remember lots of specific formulae).
- Exhaustive and creative model testing.
- The best predictions; for example, Nate Silver.
- Allows estimation of models that would be impossible in frequentist statistics.
- Dealing with "beliefs" that can be updated, rather than fixed "long-run frequencies", means Bayesian statistics has wider applications; for example, robot vision and navigation.
Why don't more people use Bayesian inference?

- Most existing texts put a strong emphasis on its (seemingly) complex mathematical basis.
- Poor explanation of why we need MCMC algorithms.
- Poor explanation of how these MCMC algorithms work, and how to implement them in practice.
- The view that Bayesian inference is more wishy-washy than frequentist inference.

Books I recommend
Course outcomes

By the end of this course you should:
- Understand the basic theory and motivation of Bayesian inference.
- Know how to critically assess a statistical model.
- Appreciate why we often need to use MCMC sampling in Bayesian inference.
- Be able to start coding up an MCMC sampler.
- Know how to perform inference on ODE models.
- Know what is meant by Approximate Bayesian Computation (ABC) and when to use it.


Lecture outcomes

By the end of this lecture you should:
1 Understand the motivation behind inference.
2 Appreciate the similarities and differences between Frequentist and Bayesian approaches to inference.
3 Understand the intuition behind Bayes' rule for inference.
4 Know what posterior predictive distributions are and why they are useful.


The Big world

1 Consider an observable characteristic we are trying to explain, for example the heights of 5 randomly chosen individuals.
2 Assume that there exists a true process T that generates the heights of all individuals in our sample.
3 There is variability in the observables output by T; this can either be ontological (for example, due to the inherent variability in picking our random sample) or epistemological (for example, because we lack knowledge of the genetic and environmental factors that affect growth).
4 Imagine the set of all conceivable processes that could result in our sample of height observations, which we call the "Big World".
The Big world

Figure: the Big World, B. Images adapted from "A Technical Introduction to Probability and Bayesian Inference for Stan Users", Stan Development Team, 2016.
What is inference?

Motivation: update our knowledge of T in light of data, and use the updated knowledge to estimate quantities of interest.
- In our height example we might want to estimate the mean height of the entire population having witnessed our sample of 5 individuals.
Method:
1 Find areas of the Big World that are closest to T; ideally we would find T itself!
2 Estimate quantities of interest using these subsets of the Small World.
The Small world

1 The infinity of the Big World is too large to be useful.
2 Instead we first consider a subset of possible data generating processes, which we call the "Small World", or Θ.
3 The Small World corresponds to a single probability model framework; in our height example we might suppose that H ∼ N(µ, σ), where µ is the mean height and σ is the standard deviation.
4 By varying our parameters θ = (µ, σ) we get different data generating processes.
5 The collection of probability distributions we get by varying θ ∈ Θ in the Small World is known as the Likelihood (see the sketch below).
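To make the "collection of models" idea concrete, here is a minimal sketch (assuming Python with numpy and scipy) that evaluates the N(µ, σ) likelihood of five made-up heights over a grid of (µ, σ) values; the data and grid are illustrative, not from the slides.

```python
# A minimal sketch of the "Small World" idea for the H ~ N(mu, sigma) height
# model; the five heights below are hypothetical illustrative data.
import numpy as np
from scipy.stats import norm

heights = np.array([1.62, 1.75, 1.68, 1.81, 1.70])  # hypothetical sample (metres)

# Each (mu, sigma) pair is one candidate data-generating process in Theta.
mus = np.linspace(1.5, 1.9, 81)
sigmas = np.linspace(0.02, 0.20, 91)

# The likelihood is the probability density of the observed sample,
# viewed as a function of the parameters theta = (mu, sigma).
loglik = np.array([[norm.logpdf(heights, mu, sd).sum() for sd in sigmas]
                   for mu in mus])

i, j = np.unravel_index(loglik.argmax(), loglik.shape)
print(f"highest-likelihood model: mu = {mus[i]:.3f} m, sigma = {sigmas[j]:.3f} m")
```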
The Big world

Figure: the Big World, B, containing the true process T.

An unlikely Small World

Figure: a Small World Θ within the Big World B, far from the true process T.

A Boxian Small World: "All models are wrong but some are useful"

Figure: a Small World Θ within the Big World B, close to (but not containing) the true process T.
The prior

1 The Small World is still too big for our purposes.
2 We usually have some knowledge about which areas of the Small World are nearest to T. For example, we don't believe that µ = 100m and µ = 1.5m are equally probable.
3 As such, in Bayesian inference we define a prior probability density that gives a weighting to all θ ∈ Θ, reflecting our beliefs.
4 Frequentist inference does not require us to specify a prior (this causes issues later on that we will discuss).
The prior

Figure: the Small World Θ within the Big World B, with the true process T and the candidate values µ = 1.5m and µ = 100m marked.
The data

1 Inference is the process of updating our prior knowledge in light of data.
2 In Bayesian inference, with a likelihood and our prior knowledge explicitly stated, we use Bayes' rule to find our posterior probability density over θ ∈ Θ.
3 The lack of a prior means that in Frequentist inference we generate posterior weightings approximately, using rules of thumb (more on this in a minute).
Before the data

Figure: the Small World Θ, the true process T and the Big World B, before the data arrive.

After the data

Figure: the same picture after the data arrive; the weighting over Θ has been updated.
Summary of the inference process

What is the whole (Bayesian) inference process?

The whole inference process

Figure: the four steps shown on the Big World / Small World diagram:
1 Define the observables: the Big World.
2 Specify a likelihood.
3 Specify a prior.
4 Input the data.


Example likelihood: frequency of lift malfunctioning [1]

- Imagine we want to create a model for the frequency with which a lift (elevator) breaks down in a given year, X.
- This model will be used to plan expenditure on lift repairs over the following few years.

An aside: how to survive a falling lift

Figure: Taken from www.npr.org

[1] Inspired by Prof. Philip Maini.
Example likelihood: frequency of lift malfunctioning

- Assume a range of unpredictable and uncorrelated factors (temperature, lift usage, etc.) affect the functioning of the lift.
- =⇒ X ∼ Poisson(θ), where θ is the mean number of times the lift breaks down in one year.
- By specifying that X is Poisson-distributed we define the boundaries of the Small World.
- Important: we don't a priori know the true value of θ =⇒ our model defines a collection of probability models, one for each value of θ.
- We call this collection of models the Likelihood.
Example likelihood: frequency of lift malfunctioning

Figure: Poisson pmfs over the number of times the lift breaks in one year, shown for X ∼ Poisson(5), X ∼ Poisson(10) and X ∼ Poisson(15), and then overlaid as three curves labelled θ = 5, θ = 10, θ = 15.
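A short sketch, assuming matplotlib and scipy are available, that reproduces the pmf curves summarised above for θ = 5, 10 and 15.

```python
# Plot the Poisson pmf for each candidate rate theta in the lift example.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

x = np.arange(0, 26)  # number of times the lift breaks in one year
for theta in (5, 10, 15):
    plt.plot(x, poisson.pmf(x, theta), "o-", label=f"theta = {theta}")

plt.xlabel("number of times lift breaks in one year")
plt.ylabel("pmf")
plt.legend()
plt.show()
```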
The aim of inference: inverting the likelihood

- Assume we find that the lift broke down 8 times in the past year.
- Our likelihood gives us an infinite number of possible ways in which this could have come about.
- Each of these ways corresponds to a unique value of θ.
The aim of inference: inverting the likelihood

Figure: the pmfs of X ∼ Poisson(5), X ∼ Poisson(10) and X ∼ Poisson(15), with the observation X = 8 marked on each.
The aim of inference: inverting the likelihood

- We know that any of these models, each corresponding to a different value of θ, could generate the data.
- In inference we want to use our prior knowledge and data to help us choose which of these models make most sense.
- Essentially we want to run the process in reverse.
The aim of inference: inverting the likelihood

Figure: start with the data: the observation X = 8 marked on the pmfs of X ∼ Poisson(5), Poisson(10) and Poisson(15).

The aim of inference: inverting the likelihood

Figure: infer the data generating process: from the observation X = 8 back to the candidate values of θ.
The aim of inference: inverting the likelihood

- Both Frequentists and Bayesians essentially invert: p(X|θ) → p(θ|X).
- This amounts to going from an "effect" back to a "cause".
- Their methods of inversion are different.
Frequentist inversion: null hypothesis testing

Frequentist inference considers a single hypothesis θ about the data generating process at a time:

H0 : the hypothesis θ is true    (1)
H1 : the hypothesis θ is false    (2)

Frequentists use a rule of thumb:
- If Pr(data as or more extreme than X | θ) < 0.05, then θ is false, =⇒ p(θ|X) = 0.
- If Pr(data as or more extreme than X | θ) ≥ 0.05, then θ could be true, =⇒ p(θ|X) = ?
Frequentist inversion: null hypothesis testing

- For X = 8 we can carry out a series of these hypothesis tests across a range of θ.
- For example, assume θ = 15: Pr(X ≤ 8 | θ = 15) ≃ 0.037 < 0.05 ∴ reject! (This is checked in the sketch below.)

Figure: the pmf of X ∼ Poisson(15) over the number of times the lift breaks in one year, with the observation X = 8 in its lower tail.
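The tail probability on the slide can be checked in a couple of lines, assuming scipy is available.

```python
# A sketch of the tail-probability calculation for the null hypothesis theta = 15.
from scipy.stats import poisson

x_obs, theta0 = 8, 15
p_lower_tail = poisson.cdf(x_obs, theta0)   # Pr(X <= 8 | theta = 15)
print(f"Pr(X <= 8 | theta = 15) = {p_lower_tail:.3f}")  # ~0.037 < 0.05 => reject
```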
Frequentist inversion: null hypothesis testing

If we carry out a series of similar hypothesis tests over the range of θ, we find the 90% confidence interval (90% because we have used two one-sided 5% test sizes; see the sketch below):

4.0 ≤ θ ≤ 14.4    (3)
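A sketch of how the two interval endpoints could be traced out numerically, again assuming scipy; each bound is the value of θ at which the corresponding one-sided 5% test sits exactly on the rejection boundary for X = 8.

```python
# Trace out the approximate 90% confidence interval for the lift example.
from scipy.optimize import brentq
from scipy.stats import poisson

x_obs = 8

# Lower bound: smallest theta not rejected, i.e. where Pr(X >= 8 | theta) = 0.05.
theta_low = brentq(lambda t: poisson.sf(x_obs - 1, t) - 0.05, 0.1, x_obs)
# Upper bound: largest theta not rejected, i.e. where Pr(X <= 8 | theta) = 0.05.
theta_high = brentq(lambda t: poisson.cdf(x_obs, t) - 0.05, x_obs, 40)

print(f"approx. 90% CI: {theta_low:.1f} <= theta <= {theta_high:.1f}")  # ~4.0 to ~14.4
```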


Bayesian inversion

Bayesians instead use a rule consistent with the rules of probability, known as Bayes' rule:

p(θ|X) = p(X|θ) × p(θ) / p(X)    (4)

This results in an accumulation of evidence (not a binary decision) across all potential hypotheses θ.
Bayesian inversion

Figure: built up in three steps over θ, the mean number of times the lift breaks in one year: first the prior pdf, then the data (likelihood), then the resulting posterior.
Bayesian inversion: finding summary intervals

- Often we are required to give summary intervals for estimated parameters.
- There are a number of choices here.
- These intervals are known as credible intervals, in contrast to the confidence intervals of Frequentism.
- They are found by finding an interval such that X% of the area under the pdf (probability mass) is contained within it.
Bayesian inversion

Figure: the posterior pdf over θ, the number of times the lift breaks in one year, with the central region containing 90% of the probability mass (area = 0.9) shaded.

=⇒ we find a 90% central posterior interval of 3.6 ≤ θ ≤ 12.4.
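A minimal sketch of this inversion, assuming a conjugate Gamma prior on θ; the slides do not state which prior was used, so the Gamma(3, rate = 0.5) below is purely illustrative and the printed interval will not match 3.6 ≤ θ ≤ 12.4 exactly.

```python
# Conjugate Gamma-Poisson update for the lift example (illustrative prior).
from scipy.stats import gamma

a0, b0 = 3.0, 0.5        # illustrative Gamma(shape, rate) prior on theta
x_obs, n_years = 8, 1    # lift broke down 8 times in one year

# Gamma prior + Poisson likelihood => Gamma posterior (conjugacy).
a_post, b_post = a0 + x_obs, b0 + n_years

# 90% central credible interval: cut 5% of posterior mass from each tail.
lo, hi = gamma.ppf([0.05, 0.95], a_post, scale=1.0 / b_post)
print(f"posterior mean: {a_post / b_post:.1f}")
print(f"90% central credible interval: {lo:.1f} <= theta <= {hi:.1f}")
```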
Frequentists versus Bayesians: summary

- All methods of inference attempt to invert the likelihood to make it a valid probability distribution.
- Frequentists: use a heuristic to do this: if the probability of obtaining data as or more extreme than the actual observation is low when conditioned on θ, then we reject θ.
- Bayesians: use Bayes' rule for inversion, which requires that we specify a prior distribution.


Bayes' rule in action: breast cancer screening

Suppose:
- The probability that a randomly chosen 40 year old woman has breast cancer is approximately 1/100.
- If a woman has breast cancer, the probability that they will test positive in a mammography is about 90%.
- However, there is a risk of about 8% of a false positive result of the test.
Question: given that a woman tests positive, what is the probability that they have breast cancer?
Bayes' rule in action: breast cancer screening

Answer: we want to find the probability the woman has cancer given she has tested positive, which we can do via Bayes' rule (it's the same for pmfs as it was for pdfs):

Pr(cancer | +) = Pr(+ | cancer) × Pr(cancer) / Pr(+)
Bayes' rule in action: breast cancer screening

In the numerator, Pr(+ | cancer) = 0.9 and Pr(cancer) = 0.01; the denominator Pr(+) is not yet known.

Marginalise out any cancer dependence via summation (the discrete equivalent of integration):

Pr(+) = Pr(+ | cancer) × Pr(cancer) + Pr(+ | no cancer) × Pr(no cancer)
      = 0.9 × 0.01 + 0.08 × 0.99
      ≈ 0.09
Bayes' rule in action: breast cancer screening

Putting this into Bayes' rule:

Pr(cancer | +) = (0.9 × 0.01) / 0.09 ≈ 0.1

Intuitively, the number of false positives dwarfs the number of true positives.
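The same calculation as a few lines of arithmetic:

```python
# The breast-cancer screening example via Bayes' rule.
p_cancer = 0.01               # Pr(cancer) for a randomly chosen 40-year-old woman
p_pos_given_cancer = 0.90     # probability of a positive test given cancer
p_pos_given_healthy = 0.08    # false-positive rate

# Marginalise over cancer status to get Pr(+).
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"Pr(+) = {p_pos:.3f}")                        # ~0.09
print(f"Pr(cancer | +) = {p_cancer_given_pos:.2f}")  # ~0.1
```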
Bayes' rule for inference

Take Bayes' rule for the probability density of A given B:

p(A|B) = p(B|A) × p(A) / p(B)    (5)
Bayes' rule for inference

Using a sleight of hand, replace A → θ and B → X, where θ is a parameter vector and X is a data sample:

p(θ|X) = p(X|θ) × p(θ) / p(X)    (6)

But what do these terms mean?
Likelihood summary

p(θ|X) = p(X|θ) × p(θ) / p(X)    (7)

- In our example θ is the rate of lift malfunctioning.
- Here X is the data.
- p(X|θ) represents the likelihood.
- Remember: viewed as a function of θ (with the data held fixed), it is not a probability distribution.
- It encapsulates many subjective judgements about the analysis.
Priors summary

p(θ|X) = p(X|θ) × p(θ) / p(X)    (8)

- p(θ) represents the prior.
- A valid probability distribution.
- Similar to the likelihood, it is also subjective.
No "objective" rule for priors

- Priors embody subjective assumptions about the state of the world.
- They essentially measure Pr(cause | pre-data knowledge).
- Since knowledge differs between subjects =⇒ different priors.
- Can be informed by pre-experimental data (for example, a previous study or a collection of previous studies).
Denominator summary

p(θ|X) = p(X|θ) × p(θ) / p(X)    (9)

- p(X) represents the denominator.
- Two different interpretations:
  - Before we collect X it is the prior predictive distribution.
  - When we have data (e.g. X = 2) it is simply a number (that normalises the posterior), known as the evidence or marginal likelihood.
- It is calculated from the numerator, by summing (or integrating) p(X|θ) × p(θ) over all θ.
- Source of some of the difficulty of exact Bayesian inference (we return to this later).
Posteriors summary

p(θ|X) = p(X|θ) × p(θ) / p(X)    (10)

- p(θ|X) represents the posterior.
- A valid probability distribution.
- The starting point for all further analysis in Bayesian inference.
Posterior point estimates

- Mathematical models and policy makers often require point estimates of parameters.
- In Bayesian inference there are several choices of estimate (see the sketch after the figure below):
  - Posterior mean.
  - Posterior median.
  - Maximum a posteriori (MAP); also known as the mode.
- (Statistical decision theory: under different loss functions each can be "optimal".)
- However, we generally prefer the posterior mean or median over the MAP:
  - The MAP ignores the measure by focusing solely on density.
  - (Linked) the MAP can lie a long way from the bulk of the probability mass.
Posterior point estimates

Figure: a skewed posterior pdf over θ with the MAP, mean and median marked; on a skewed posterior the three estimates differ.
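A sketch showing how the three estimates can differ on a skewed posterior; the Gamma posterior below is illustrative, not one derived in the slides.

```python
# Mean, median and MAP of an illustrative skewed (Gamma) posterior.
from scipy.stats import gamma

a, rate = 3.0, 0.5                 # illustrative Gamma(shape, rate) posterior
post = gamma(a, scale=1.0 / rate)

post_mean = post.mean()            # a / rate = 6.0
post_median = post.median()        # evaluated numerically by scipy
post_map = (a - 1) / rate          # mode of a Gamma(a, rate) when a > 1

print(f"mean = {post_mean:.2f}, median = {post_median:.2f}, MAP = {post_map:.2f}")
```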
Intuition behind Bayesian analyses

Bayes' rule:

p(θ|X) = p(X|θ) × p(θ) / p(X)    (11)

tells us that:

p(θ|X) ∝ p(X|θ) × p(θ)    (12)

because p(X) is independent of θ
=⇒ the posterior is essentially a weighted (geometric) mean of the prior and likelihood.
Example problem: paternal discrepancy

- Paternal discrepancy is the term given to a child who has a biological father different to their supposed biological father.
- Question: how common is it?
- Answer: a recent meta-analysis of studies of "paternal discrepancy" (PD) found a rate of ∼ 10% [2].
- Suppose we have data for a random sample of 10 children's presence/absence of PD.
- Aim: infer the prevalence of PD in the population (θ).
Paternal discrepancy

Assume individual samples are:
- Independent.
- Identically distributed.
Since the sample size is fixed at 10 =⇒ a binomial likelihood (a short code sketch follows).
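A minimal sketch of the resulting inference, assuming a conjugate Beta prior on θ; the slides do not state the prior, so Beta(2, 18), with mean 10%, is an illustrative choice motivated by the meta-analysis figure.

```python
# Conjugate Beta-Binomial update for the PD prevalence example.
from scipy.stats import beta

a0, b0 = 2.0, 18.0       # illustrative Beta prior on theta, mean = 0.10
k, n = 2, 10             # 2 of 10 sampled children show PD

# Beta prior + binomial likelihood => Beta posterior (conjugacy).
a_post, b_post = a0 + k, b0 + (n - k)

post = beta(a_post, b_post)
print(f"posterior mean prevalence: {post.mean():.2%}")
print(f"90% central credible interval: "
      f"{post.ppf(0.05):.2%} to {post.ppf(0.95):.2%}")
```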
Intuition behind Bayesian analyses: PD rate again

Consider a single sample of 10 children, 2 of which have PD.

Figure: the prior pdf, the likelihood and the resulting posterior pdf over θ (PD prevalence, %).
Intuition behind Bayesian analyses: PD rate again

Now holding the prior constant and varying the proportion with PD.

Figure: prior, likelihood and posterior over θ (PD prevalence, %); as the likelihood moves, the posterior moves with it.
Intuition behind Bayesian analyses: PD rate again

Constant prior and proportion with PD (20%); sample size ↑.

Figure: prior, likelihood and posterior over θ (PD prevalence, %); with more data the likelihood narrows and the posterior moves towards it.
An exception: zero priors (avoid these)

Figure: a prior that places zero density over part of the θ range (PD prevalence, %); the posterior is forced to zero there, regardless of the likelihood.
Intuition behind Bayesian analyses: summary

- The posterior is a weighted average of the prior and likelihood (data).
- Changes in the position of the prior or likelihood are reflected in the posterior.
- The weighting towards the likelihood increases as more data are collected =⇒ models with a lot of data are less dependent on their priors.
- The exception to this is "zero" priors.


Forecasting

- Consider a new data sample X̃.
- We want to find p(X̃|X): the probability of the new data sample given our current data X.
- We call p(X̃|X) the posterior predictive distribution, which can be used:
  - To forecast.
  - To check the model.
Posterior predictive distributions

To obtain p(X̃|X) we sample from the joint distribution:

p(X̃, θ|X) = p(X̃|θ, X) × p(θ|X)
           = p(X̃|θ) × p(θ|X)    (since X̃ is independent of X given θ)

where p(X̃|θ) is the sampling distribution and p(θ|X) is the posterior.

Again, we do this stepwise (sketched below):
1 Sample θi ∼ p(θ|X); i.e. from the posterior.
2 Sample X̃i ∼ p(X̃|θi); i.e. from the sampling distribution.
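A sketch of this two-step simulation, continuing the PD example with the illustrative Beta(4, 26) posterior from the earlier sketch (an assumption, not a result stated in the slides).

```python
# Two-step posterior predictive simulation for the PD example.
import numpy as np

rng = np.random.default_rng(1)
a_post, b_post, n_new = 4.0, 26.0, 10   # illustrative posterior; new sample of 10

# Step 1: draw theta_i from the posterior p(theta | X).
theta_draws = rng.beta(a_post, b_post, size=10_000)
# Step 2: draw X_tilde_i from the sampling distribution p(X_tilde | theta_i).
x_tilde = rng.binomial(n_new, theta_draws)

# Approximate posterior predictive pmf over the number of PD cases in the new sample.
counts = np.bincount(x_tilde, minlength=n_new + 1) / x_tilde.size
for k, p in enumerate(counts):
    print(f"Pr(X_tilde = {k}) ~ {p:.3f}")
```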
Posterior predictive distribution

Figure: step 1: sample θi from the posterior pdf over θ (PD prevalence, %); step 2: sample X̃i from the sampling distribution, giving the posterior predictive pmf over X̃, the number of PD cases in a new sample.

Posterior predictive distribution

Figure: a more concentrated posterior yields a narrower posterior predictive range.
Posterior predictive distribution: uses

Why should we estimate this distribution?
- Forecasts:
  - It is a valid probability distribution.
  - =⇒ no extra work is needed to obtain predictive intervals.
- Checking a model's suitability:
  - Use the posterior predictive distribution to obtain "simulated" data.
  - If the model fits the data =⇒ the simulated data should "look" like the real data.
  - This is an exhaustive and creative way of checking any aspect of a model (we come back to this next lecture).
The posterior predictive distribution: from "conceptual" to "observable" post-data world

Figure: the posterior, θ|X (over the conceptual parameter world, near T), alongside the posterior predictive, X̃|X (over the observable data world).
Summary

- All methods of inference involve the subjective decision of defining the boundaries of the Small World (the likelihood).
- Small World inference involves inversion of the likelihood.
- Frequentists and Bayesians differ in their approach to carrying out this inversion:
  - Frequentists use null hypothesis tests.
  - Bayesians use Bayes' rule, which requires us to specify a prior.
- Bayesian statistics requires us to be able to manipulate probability distributions.
Summary

- The posterior is essentially the weighted (geometric) average of the prior and likelihood =⇒ the more data you have, the greater the weighting of the posterior towards the likelihood.
- The posterior predictive distribution can be obtained by sampling and is useful for forecasting and for model checks.
Not sure I understand?

Bayesian statistics:

p(θ|D) = p(D|θ) × p(θ) / p(D)    (13)

Beigeian statistics:

p(θ|D) = p(D|θ) × p(θ) / p(D)    (14)
