Introduction to Probabilistic Inference
Probabilistic inference is the process of deducing the probabilities of certain outcomes or
parameters given observed data, within the framework of probability theory. It forms the
cornerstone of many fields such as statistics, machine learning, artificial intelligence, and data
science. By leveraging probabilistic models, we can make informed decisions, predict future
events, and understand underlying patterns in data.
Importance of Probabilistic Inference
In many real-world scenarios, we are faced with uncertainty due to incomplete or noisy data.
Probabilistic inference allows us to quantify this uncertainty and make predictions or decisions
accordingly. It provides a principled way to update our beliefs in light of new evidence, ensuring
that our conclusions are grounded in both prior knowledge and observed data.
Applications of Probabilistic Inference
Probabilistic inference is widely used across real-world problems that involve estimating
unobserved variables from observed data. Examples include:
Application | Observed variable | Unobserved variable
Climate science | earth observations | climate forecast
Autonomous driving | image pixel values | pedestrians and vehicles present
Movie recommendation | ratings of watched films | ratings of unwatched films
Medicine | genome (DNA) | susceptibility to genetic diseases
Fundamental Concepts
Sum Rule
The sum rule, also known as the marginalization rule, allows us to compute the marginal
probability of a random variable by summing (or integrating) over all possible values of another
variable:
For discrete variables:
p(x) = ∑_y p(x, y)
For continuous variables:
p(x) = ∫ p(x, y) dy
Product Rule
The product rule expresses the joint probability of two events as the product of a conditional
probability and a marginal probability:
p(x, y) = p(x) p(y|x) = p(y) p(x|y)
These two rules form the basis for all probabilistic reasoning and inference.
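As a minimal numerical sketch of both rules, consider a hypothetical joint distribution over two binary variables (the numbers below are arbitrary illustrative values):

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables.
# Rows index x, columns index y; the entries are arbitrary but sum to one.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

# Sum rule: marginalize out y to obtain p(x).
p_x = p_xy.sum(axis=1)                      # [0.4, 0.6]

# Product rule: p(x, y) = p(x) p(y|x), recovered from the marginal and conditional.
p_y_given_x = p_xy / p_x[:, None]           # conditional table p(y|x)
assert np.allclose(p_x[:, None] * p_y_given_x, p_xy)

print("p(x) =", p_x)
```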
Bayes' Theorem
Bayes' theorem is derived from the product rule and provides a way to update our beliefs about
the parameters or hypotheses in light of new data:
p(θ|D) = p(D|θ) p(θ) / p(D)
Posterior: p(θ ∣ D): Probability of parameters θ given data D.
Likelihood: p(D ∣ θ): Probability of data D given parameters θ.
Prior: p(θ): Initial probability of parameters θ.
Marginal Likelihood: p(D): Probability of data D.
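As a small sketch of how these four quantities relate, the snippet below applies Bayes' theorem to a toy problem with two candidate parameter values; the prior and likelihood numbers are assumed purely for illustration.

```python
import numpy as np

# Two candidate parameter values theta_0, theta_1 and a single observed dataset D.
prior = np.array([0.5, 0.5])          # p(theta): assumed uniform prior
likelihood = np.array([0.8, 0.2])     # p(D | theta): assumed likelihood of D under each value

evidence = np.sum(likelihood * prior)         # p(D), the marginal likelihood
posterior = likelihood * prior / evidence     # p(theta | D) via Bayes' theorem

print("p(D) =", evidence)             # 0.5
print("p(theta | D) =", posterior)    # [0.8, 0.2]
```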
Bayesian Inference
Bayesian inference is a statistical method that applies Bayes' theorem to update the probability
for a hypothesis as more evidence or information becomes available.
Learning: Parameter Estimation
In Bayesian learning, we estimate the parameters θ of a model m given observed data D:
p(θ|D, m) = p(D|θ, m) p(θ|m) / p(D|m)
Posterior: p(θ ∣ D, m): Updated belief about parameters after observing data.
Likelihood: p(D ∣ θ, m): Probability of observing data D given parameters θ.
Prior: p(θ ∣ m): Initial belief about parameters before observing data.
Evidence: p(D ∣ m): Probability of observing data under model m.
Explanation
Posterior represents what we know about the parameters after seeing the data.
Likelihood encapsulates what the data tells us about the parameters.
Prior reflects what we knew (or assumed) before observing the data.
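A concrete (if standard) instance of this update is the Beta-Bernoulli model, used here only as an illustration; the prior parameters and data below are made up.

```python
import numpy as np
from scipy import stats

# Model m: coin flips with unknown bias theta, prior p(theta | m) = Beta(a0, b0).
a0, b0 = 1.0, 1.0                        # uniform prior over theta (assumed)
data = np.array([1, 0, 1, 1, 0, 1])      # hypothetical observations D

# By conjugacy, the posterior p(theta | D, m) is Beta(a0 + #heads, b0 + #tails).
a_post = a0 + data.sum()
b_post = b0 + len(data) - data.sum()
posterior = stats.beta(a_post, b_post)

print("posterior mean of theta:", posterior.mean())   # 5/8 = 0.625
```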
Prediction: Predictive Distribution
Once we have the posterior distribution of the parameters, we can make predictions about new
data x*:
p(x*|D, m) = ∫ p(x*|θ, m) p(θ|D, m) dθ
Predictive Distribution: p(x* ∣ D, m): Probability of future observations given past data.
Integrating Over Parameters: We average over all possible parameter values, weighted
by their posterior probability.
Interpretation
We average all possible predictions p(x* ∣ θ, m), weighting each by how plausible θ is given the
observed data D. This approach naturally incorporates uncertainty in the parameter estimates
into our predictions.
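Continuing the hypothetical Beta-Bernoulli sketch above, the predictive probability of the next observation can be approximated by numerically averaging the likelihood over the posterior:

```python
import numpy as np
from scipy import stats

a_post, b_post = 5.0, 3.0                       # posterior parameters from the previous sketch
posterior = stats.beta(a_post, b_post)

# p(x* = 1 | D, m) = ∫ p(x* = 1 | theta) p(theta | D, m) dtheta, approximated on a grid.
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)
integrand = theta * posterior.pdf(theta)
p_next_is_one = np.sum(integrand) * (theta[1] - theta[0])

print(p_next_is_one)                            # ≈ a_post / (a_post + b_post) = 0.625
```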
Model Comparison
In Bayesian inference, we can compare different models to see which one explains the data
best:
p(m|D) = p(D|m) p(m) / p(D)
Posterior Probability of Model p(m ∣ D): Probability that model m is the correct model
given the data.
Model Evidence p(D ∣ m): Probability of the data under model m.
Prior Probability of Model p(m): Initial belief about the plausibility of model m.
Bayes Factors
The ratio of the posterior probabilities of two models is known as the Bayes factor, which
quantifies the evidence in favor of one model over another.
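The sketch below computes a Bayes factor for a toy comparison of two coin-flip models; the data (n flips, k heads) and both models are assumptions made for illustration.

```python
import numpy as np
from scipy.special import betaln

# m1: fair coin (theta fixed at 0.5); m2: unknown bias with a uniform Beta(1, 1) prior.
n, k = 20, 15                                   # assumed data: 15 heads in 20 flips

# Evidence for a particular sequence with k heads under each model.
log_evidence_m1 = n * np.log(0.5)               # p(D | m1) = 0.5^n
log_evidence_m2 = betaln(k + 1, n - k + 1)      # ∫ theta^k (1 - theta)^(n - k) dtheta

bayes_factor = np.exp(log_evidence_m2 - log_evidence_m1)
print("Bayes factor in favor of m2:", bayes_factor)   # ≈ 3.2
```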
Bayesian Decision Theory
Bayesian decision theory provides a framework for making optimal decisions under uncertainty
by maximizing expected utility (or reward).
Expected Reward
The expected reward R(a) for taking action a is calculated as:
R(a) = ∑_x R(a, x) p(x|D)
Reward R(a, x): Reward for taking action a when the true state of the world is x.
Posterior Probability p(x ∣ D): Probability of state x given data D.
Explanation
We choose the action a with the highest expected reward, considering all possible states of the
world. This approach separates inference from decision-making, allowing us to first infer
probabilities and then make decisions based on those probabilities.
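A minimal sketch of this procedure, with a made-up reward table and a made-up posterior over two world states:

```python
import numpy as np

# R(a, x): rows are actions, columns are world states (hypothetical values).
reward = np.array([[10.0, -5.0],
                   [ 0.0,  2.0]])
p_x_given_D = np.array([0.3, 0.7])        # assumed posterior p(x | D) over the two states

# R(a) = sum_x R(a, x) p(x | D): expected reward of each action.
expected_reward = reward @ p_x_given_D
best_action = int(np.argmax(expected_reward))

print("expected rewards:", expected_reward)   # [-0.5, 1.4]
print("chosen action:", best_action)          # action 1
```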
Flavours of Inference and Decision Problems
Machine learning and inference problems can generally be categorized into three main types:
1. Supervised Learning
Objective: Learn a mapping from inputs x to outputs y based on observed pairs (x_i, y_i).
Applications: Regression, classification, time series prediction.
2. Unsupervised Learning
Objective: Model the underlying structure or distribution in data without explicit
output labels.
Applications: Clustering, dimensionality reduction, density estimation.
3. Reinforcement Learning
Objective: Learn to make decisions by performing actions a_t in an environment to
maximize cumulative rewards r_t.
Applications: Robotics, game playing, adaptive control systems.
Example: The Radioactive Decay Problem
To illustrate the concepts of probabilistic inference, let's consider a classic problem in statistical
estimation: estimating the decay constant of a radioactive substance.
Problem Setup
Unstable particles decay at distances x from a source, following an exponential distribution
characterized by a decay constant λ:
p(x|λ) = (1/Z(λ)) exp(−x/λ)
Normalization Constant Z(λ): Ensures the probability density integrates to one over the
observed range.
We observe N decay events within a specific range (x_min, x_max). Our goal is to infer the value
of λ based on these observations.
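A small simulation of this setup is sketched below, assuming an illustrative window (x_min, x_max) = (1, 20) and a true decay constant of 5; for the truncated exponential the normalization constant works out to Z(λ) = λ (exp(−x_min/λ) − exp(−x_max/λ)).

```python
import numpy as np

rng = np.random.default_rng(0)
x_min, x_max = 1.0, 20.0          # assumed observation window
true_lambda = 5.0                 # assumed true decay constant

def Z(lam):
    """Normalization constant of exp(-x/lambda) over the window (x_min, x_max)."""
    return lam * (np.exp(-x_min / lam) - np.exp(-x_max / lam))

def sample_decays(lam, n):
    """Inverse-transform sampling of the truncated exponential on (x_min, x_max)."""
    u = rng.uniform(size=n)
    cdf_min, cdf_max = 1.0 - np.exp(-x_min / lam), 1.0 - np.exp(-x_max / lam)
    return -lam * np.log(1.0 - (cdf_min + u * (cdf_max - cdf_min)))

x = sample_decays(true_lambda, n=500)
print(len(x), x.min(), x.max())   # all N observations lie inside the window
```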
Heuristic Approaches
Before delving into Bayesian inference, let's explore two heuristic methods for estimating λ; a short numerical sketch of both appears at the end of this subsection.
1. Histogram-Based Approach
Method: Bin the observed decay distances into a histogram and perform linear
regression on the logarithm of the bin counts.
Assumption: The logarithm of the counts should decrease linearly with distance for
an exponential distribution.
Issues:
Bin Size Sensitivity: The choice of bin size can significantly affect the estimate.
Uncertainty Estimation: Does not provide a measure of uncertainty for λ.
Justification: Linear regression may not be the most appropriate method due to
the discrete nature of histogram counts.
2. Statistic-Based Approach
Method: Use the sample mean of the observed distances to estimate λ.
Formula: the mean of the truncated exponential is
μ = λ + [x_min exp(−x_min/λ) − x_max exp(−x_max/λ)] / [exp(−x_min/λ) − exp(−x_max/λ)]
Setting this equal to the sample mean and solving for λ gives the estimate.
Issues:
Sample Mean Limitations: The sample mean can exceed the largest mean attainable under
the truncated model (as λ → ∞, μ approaches (x_min + x_max)/2), in which case no value of
λ solves the equation.
Arbitrariness: The choice of using the mean is somewhat arbitrary and may not fully
utilize the information in the data.
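The sketch below tries both heuristics on simulated data (same assumed window and true λ = 5 as above); for brevity the statistic-based estimate is reported as the raw sample mean rather than the value of λ obtained by solving the corrected equation.

```python
import numpy as np

rng = np.random.default_rng(1)
x_min, x_max = 1.0, 20.0

# Simulated decay distances: draw from an exponential with lambda = 5 and keep
# only those that fall inside the observation window.
raw = rng.exponential(scale=5.0, size=5000)
x = raw[(raw > x_min) & (raw < x_max)]

# Histogram heuristic: for an exponential, log bin counts fall off with slope -1/lambda.
counts, edges = np.histogram(x, bins=10)
centers = 0.5 * (edges[:-1] + edges[1:])
keep = counts > 0                                  # avoid taking the log of empty bins
slope, _ = np.polyfit(centers[keep], np.log(counts[keep]), deg=1)
lam_hist = -1.0 / slope

# Statistic heuristic (uncorrected): the raw sample mean, which is biased by the truncation.
lam_mean = x.mean()

print("histogram estimate:", lam_hist)
print("raw sample-mean estimate:", lam_mean)
```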
Bayesian Inference Approach
A more principled method is to apply Bayesian inference to estimate λ.
Steps in Bayesian Inference
1. Specify the Likelihood Function:
The likelihood of observing the data {x_n}_{n=1}^N given λ is:
p({x_n}_{n=1}^N | λ) = ∏_{n=1}^N p(x_n|λ)
2. Choose a Prior Distribution:
We select a prior distribution p(λ) that reflects our initial beliefs about λ. For example, a
uniform prior over a reasonable range:
p(λ) = U(λ; λ_min, λ_max)
3. Compute the Posterior Distribution:
Applying Bayes' theorem:
p(λ|{x_n}_{n=1}^N) = p({x_n}_{n=1}^N | λ) p(λ) / p({x_n}_{n=1}^N)
Since p({x_n}_{n=1}^N) does not depend on λ, we can write:
p(λ|{x_n}_{n=1}^N) ∝ p(λ) ∏_{n=1}^N p(x_n|λ)
4. Simplify the Posterior Expression:
Substituting the exponential likelihood:
p(λ|{x_n}_{n=1}^N) ∝ p(λ) (1/Z(λ))^N exp(−(1/λ) ∑_{n=1}^N x_n)
The posterior depends on λ through the normalization constant Z(λ) and the exponential term.
5. Compute Sufficient Statistics:
Note that the data enter the posterior only through the sum S = ∑_{n=1}^N x_n and the number
of observations N. These are known as sufficient statistics; the numerical sketch below uses
only S and N.
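A grid-based numerical sketch of steps 1-5, again assuming the window (1, 20), simulated data with true λ = 5, and a uniform prior over the grid:

```python
import numpy as np

rng = np.random.default_rng(2)
x_min, x_max = 1.0, 20.0

# Simulated observations (true lambda = 5), as in the earlier sketches.
raw = rng.exponential(scale=5.0, size=2000)
x = raw[(raw > x_min) & (raw < x_max)]
S, N = x.sum(), len(x)                           # sufficient statistics

lam = np.linspace(0.5, 50.0, 2000)               # grid of candidate lambda values
log_prior = np.zeros_like(lam)                   # uniform prior on the grid

# log posterior up to a constant: log p(lambda) - N log Z(lambda) - S / lambda
log_Z = np.log(lam * (np.exp(-x_min / lam) - np.exp(-x_max / lam)))
log_post = log_prior - N * log_Z - S / lam
log_post -= log_post.max()                       # subtract the max for numerical stability

post = np.exp(log_post)
post /= np.sum(post) * (lam[1] - lam[0])         # normalize so the grid density integrates to one

print("posterior mode:", lam[np.argmax(post)])
```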
Understanding the Likelihood
The likelihood function p({x_n}_{n=1}^N ∣ λ) represents how probable the observed data are for
different values of λ. It typically peaks at the value of λ that makes the observed data most
probable.
Posterior Visualization
By plotting the posterior distribution p(λ ∣ {x_n}_{n=1}^N), we can visualize our updated beliefs
about λ after observing the data. The shape of the posterior reflects both the data and the prior.
Summarizing the Posterior
We can compute summaries of the posterior distribution, such as:
Mean: Expected value of λ under the posterior.
Variance: Measures the uncertainty in our estimate of λ.
Credible Intervals: Ranges within which λ lies with a certain probability (e.g., 95%
credible interval).
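These summaries are straightforward to compute from a posterior evaluated on a grid, such as the arrays lam and post in the sketch above; the helper below is checked on a discretized Gaussian purely so that it runs standalone.

```python
import numpy as np

def summarize(lam, post):
    """Posterior mean, variance, and central 95% credible interval from a grid posterior."""
    d = lam[1] - lam[0]
    mean = np.sum(lam * post) * d
    var = np.sum((lam - mean) ** 2 * post) * d
    cdf = np.cumsum(post) * d
    lo = lam[np.searchsorted(cdf, 0.025)]
    hi = lam[np.searchsorted(cdf, 0.975)]
    return mean, var, (lo, hi)

# Standalone check on a discretized Gaussian centred at 5 with standard deviation 0.5.
lam = np.linspace(0.0, 10.0, 4001)
post = np.exp(-0.5 * ((lam - 5.0) / 0.5) ** 2) / (0.5 * np.sqrt(2.0 * np.pi))
print(summarize(lam, post))    # roughly (5.0, 0.25, (4.02, 5.98))
```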
Predictive Distribution
With the posterior distribution in hand, we can make predictions about future decay events.
Computing the Predictive Distribution
The predictive distribution for a new observation x* is:
p(x*|{x_n}_{n=1}^N) = ∫ p(x*|λ) p(λ|{x_n}_{n=1}^N) dλ
This integral averages over all possible values of λ, weighted by their posterior probabilities.
Interpretation
The predictive distribution incorporates both the uncertainty in λ and the inherent randomness
of the decay process. It provides a full probabilistic description of where we expect future decay
events to occur.
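The integral can be approximated by summing over a grid posterior. The sketch below uses a stand-in Gaussian posterior over λ purely so that it runs on its own (in practice one would plug in the grid posterior computed earlier) and the same assumed window (1, 20).

```python
import numpy as np

x_min, x_max = 1.0, 20.0                          # assumed observation window

# Stand-in posterior over lambda on a grid (replace with the grid posterior computed above).
lam = np.linspace(0.5, 50.0, 2000)
post = np.exp(-0.5 * ((lam - 5.0) / 0.4) ** 2)
post /= np.sum(post) * (lam[1] - lam[0])

def likelihood(x_star, lam):
    """Truncated exponential density p(x* | lambda) on (x_min, x_max)."""
    Z = lam * (np.exp(-x_min / lam) - np.exp(-x_max / lam))
    return np.exp(-x_star / lam) / Z

# p(x* | data) = ∫ p(x* | lambda) p(lambda | data) dlambda, approximated as a grid sum.
x_star = np.linspace(x_min, x_max, 500)
d_lam = lam[1] - lam[0]
predictive = np.array([np.sum(likelihood(xs, lam) * post) * d_lam for xs in x_star])

print(np.sum(predictive) * (x_star[1] - x_star[0]))   # ≈ 1: the predictive density normalizes
```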
Maximum Likelihood Estimation (MLE)
As an alternative to Bayesian inference, we can use maximum likelihood estimation to find the
value of λ that maximizes the likelihood of the observed data.
MLE Formulation
λ_ML = argmax_λ p({x_n}_{n=1}^N | λ)
Comparison with Bayesian Approach
Point Estimate: MLE provides a single estimate of λ, whereas Bayesian inference
provides a full posterior distribution.
Uncertainty Quantification: Bayesian inference naturally accounts for uncertainty in λ,
while MLE does not.
Prior Information: Bayesian inference incorporates prior beliefs, which can be beneficial
when data are scarce.
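For comparison, the MLE for this model can be found by numerically minimizing the negative log-likelihood, again on simulated data with the same assumed window and true λ = 5:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x_min, x_max = 1.0, 20.0

# Simulated observations, as in the earlier sketches.
raw = rng.exponential(scale=5.0, size=2000)
x = raw[(raw > x_min) & (raw < x_max)]
S, N = x.sum(), len(x)

def neg_log_likelihood(lam):
    """Negative log-likelihood of the truncated exponential: N log Z(lambda) + S / lambda."""
    Z = lam * (np.exp(-x_min / lam) - np.exp(-x_max / lam))
    return N * np.log(Z) + S / lam

result = minimize_scalar(neg_log_likelihood, bounds=(0.5, 50.0), method="bounded")
print("lambda_ML:", result.x)
```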
Summary of the Radioactive Decay Problem
The Bayesian approach to the radioactive decay problem involves:
1. Model Specification: Assuming an exponential decay model p(x ∣ λ).
2. Prior Selection: Choosing a prior p(λ) that reflects prior beliefs.
3. Posterior Computation: Applying Bayes' theorem to compute p(λ ∣ {x_n}_{n=1}^N).
4. Prediction: Calculating the predictive distribution for future observations.
This approach provides a principled and coherent method for parameter estimation and
prediction, fully accounting for uncertainty.
Conclusion
Probabilistic inference is a powerful framework for reasoning under uncertainty. By leveraging
the fundamental rules of probability and Bayesian principles, we can:
1. Update our beliefs in light of new data.
2. Make predictions that account for parameter uncertainty.
3. Compare models in a principled way.
4. Make optimal decisions based on expected rewards.