
Probability Theory in Machine Learning

Last Updated : 08 May, 2025

Probability theory is the branch of mathematics that deals with uncertainty. It helps us understand how likely an event is to happen. It plays a central role in machine learning, since most real-world data is uncertain and may change over time: it is what lets models make predictions, classify data, and improve their accuracy.

What is Probability?

Probability is a measure of the chance of an event happening. It is a number between 0 and 1, where 0 means the event will never happen, and 1 means the event will definitely happen. If the probability is between 0 and 1, the event has some chance of occurring but is not guaranteed.

The probability of an event A is given by the formula:

P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}

For example, if we toss a fair coin, the probability of getting heads is:

P(H) = \frac{1}{2}
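As a quick sanity check, this value can be approximated by simulation. The sketch below is a minimal illustration (it assumes NumPy is available; the article itself ships no code): it flips a fair coin many times and compares the observed frequency of heads with the theoretical 1/2.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulate 100,000 fair coin flips: 1 = heads, 0 = tails
flips = rng.integers(0, 2, size=100_000)

# Empirical frequency of heads vs. the theoretical probability 0.5
print("Estimated P(H):", flips.mean())
print("Theoretical P(H):", 0.5)
```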

Basics of Probability Theory

Probability theory is built on a few core definitions: sample space, events, random experiments, and random variables. Together, these describe how probabilities are assigned to different events.

Sample Space

The sample space is the set of all possible outcomes of a random experiment. It is denoted by S. For example: In rolling a six-sided die, the sample space is S = {1, 2, 3, 4, 5, 6}.

Event

Any subset of the sample space is referred to as an event. It has one or more outcomes. For example: In rolling a die, the event "rolling an even number" consists of E = {2, 4, 6}.

Random Experiment

A random experiment is an experiment in which the outcome is uncertain and can vary each time the experiment is performed. Examples include rolling a die, flipping a coin, and selecting a random person from a population.

Random Variable

A random variable (RV) is a function that assigns a numerical value to each outcome in a sample space. Random variables can be classified into two types:

  • Discrete Random Variable: A discrete random variable can take only a finite or countable number of values. Examples: the number of heads in 10 coin flips or the number of defective products in a batch.
  • Continuous Random Variable: A continuous random variable can take any value within a range. Examples: the height of people in a city or the temperature recorded each day (a short sampling sketch follows this list).
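To make the distinction concrete, here is a minimal sampling sketch (it assumes NumPy; the specific parameters are illustrative, not taken from the article). It draws from a discrete random variable, the number of heads in 10 coin flips, and from a continuous one, heights modelled with a normal distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Discrete RV: number of heads in 10 fair coin flips (only values 0..10 are possible)
heads_in_10_flips = rng.binomial(n=10, p=0.5, size=5)
print("Discrete samples:", heads_in_10_flips)

# Continuous RV: heights in cm, modelled as Normal(mean=170, std=10), real-valued
heights = rng.normal(loc=170, scale=10, size=5)
print("Continuous samples:", np.round(heights, 2))
```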

Probability Rules

Probability follows some basic laws that enable the calculation of various probabilities.

1. Addition Rule (Union of Two Events): The probability that either event A or event B (or both) occurs is:

P(A \cup B) = P(A) + P(B) - P(A \cap B)

2. Multiplication Rule (Independent Events): If two events A and B are independent, meaning that the occurrence of one does not influence the occurrence of the other, then:

P(A \cap B) = P(A) \cdot P(B)

3. Complementary Rule: The probability that an event A does not happen is:

P(A') = 1 - P(A)

4. Conditional Probability: Conditional probability is the probability of an event occurring given that another event has already occurred.

P(A|B) = \frac{P(A \cap B)}{P(B)}

5. Bayes' Theorem: Bayes' theorem is an important rule in machine learning; it lets us update our probability estimates when new information becomes available.

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

This formula is used frequently in spam classification, medical diagnosis, and fraud detection, as in the small numerical sketch below.
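As a concrete illustration, the toy spam-filter calculation below applies Bayes' theorem step by step. All the numbers are made-up assumptions chosen only to show the mechanics of the update.

```python
# Toy spam-filter example of Bayes' theorem (all probabilities are illustrative assumptions)
p_spam = 0.20                 # P(spam): prior probability that a message is spam
p_word_given_spam = 0.60      # P("offer" | spam)
p_word_given_ham = 0.05       # P("offer" | not spam)

# Total probability of the word: P(word) = P(word|spam)P(spam) + P(word|ham)P(ham)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'offer') = {p_spam_given_word:.2f}")   # 0.12 / 0.16 = 0.75
```

Seeing the word raises the probability that the message is spam from the prior of 0.20 to 0.75, which is exactly the kind of belief update Bayes' theorem formalizes.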

Likelihood

Likelihood measures how well a probability model describes the observed data. Given a dataset X = {x_1, x_2, ..., x_n} and a probability distribution P(X | θ) parameterized by θ, the likelihood function is:

L(\theta | X) = P(X | \theta) = \prod_{i=1}^{n} P(x_i | \theta)

In machine learning, we often use Maximum Likelihood Estimation (MLE) to find the parameters θ that maximize the likelihood. In practice, we usually maximize the log-likelihood instead, since sums are easier to compute with than products:

\log L(\theta | X) = \sum_{i=1}^{n} \log P(x_i | \theta)

For example, in logistic regression, maximizing the likelihood yields the weights that best classify the data points.
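For a case with a closed-form answer, consider estimating the success probability p of a Bernoulli variable. The sketch below (the observations are made up, and NumPy usage is an assumption) evaluates the log-likelihood over a grid of candidate values of p and confirms that the maximum sits at the sample mean, which is the MLE for this model.

```python
import numpy as np

# Made-up Bernoulli observations (1 = success, 0 = failure)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def log_likelihood(p, data):
    """Log-likelihood of a Bernoulli(p) model: sum_i log P(x_i | p)."""
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Evaluate over a grid of candidate parameters and keep the maximiser
grid = np.linspace(0.01, 0.99, 99)
best_p = max(grid, key=lambda p: log_likelihood(p, x))

print("MLE from grid search:", round(best_p, 2))    # ~0.7
print("Sample mean (closed-form MLE):", x.mean())   # 0.7
```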

Entropy

Entropy is a measure of the amount of uncertainty or randomness in a probability distribution. Given a discrete random variable X with probabilities P(x), entropy is defined as:

H(X) = - \sum_{i} P(x_i) \log P(x_i)

In decision trees, entropy is used in information gain to decide how to split data. If entropy is high, the data is more uncertain, and if entropy is low, the data is more predictable. For example, in binary classification, if P(x)=0.5 for each class, entropy is maximum, meaning the dataset is most uncertain.
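The small sketch below (a NumPy-based illustration, not code from the article) computes the entropy of a binary distribution for several values of P(x = 1), showing that it peaks at 0.5 and drops toward 0 as the outcome becomes more certain.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(X) = -sum p * log2(p), skipping zero-probability terms."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return -np.sum(nonzero * np.log2(nonzero))

for p in [0.5, 0.8, 0.99, 1.0]:
    print(f"P = [{p}, {1 - p:.2f}] -> entropy = {entropy([p, 1 - p]):.3f} bits")
# Entropy is maximal (1 bit) at p = 0.5 and approaches 0 as p -> 1
```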

Confidence Intervals

A confidence interval gives a range in which we expect the true value of a parameter to lie with a certain level of confidence. For a population mean μ estimated from a sample mean x̄ with sample standard deviation s and sample size n, the confidence interval is:

\bar{x} \pm z \frac{s}{\sqrt{n}}

where z is the critical value from the standard normal distribution (z ≈ 1.96 for a 95% interval). For example, if a machine learning model estimates the weight of a watermelon to be 5 kg with a 95% confidence interval of ±0.5 kg, we can say with 95% confidence that the true weight lies between 4.5 kg and 5.5 kg.
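The sketch below shows how the formula is typically applied in code. The sample weights are made-up values and the use of SciPy for the critical value is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

# Made-up sample of watermelon weights in kg
weights = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])

n = len(weights)
x_bar = weights.mean()
s = weights.std(ddof=1)      # sample standard deviation

z = stats.norm.ppf(0.975)    # critical value for a 95% interval (about 1.96)
margin = z * s / np.sqrt(n)

print(f"95% CI for the mean: {x_bar:.2f} kg +/- {margin:.2f} kg")
```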

Probability Distributions

1. Bernoulli Distribution - A Bernoulli distribution represents a single trial that is either a success with probability p or a failure with probability (1 - p). It is the most common representation of binary data, such as a biased coin toss.

P(X = x) =\begin{cases}p, & \text{if } x = 1 \\1 - p, & \text{if } x = 0\end{cases}

2. Binomial Distribution - A Binomial distribution models the number of successes in n independent Bernoulli trials, where each trial has success probability p. It is commonly used to model repeated coin flips.

P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0,1,2, \dots, n

3. Geometric Distribution - A Geometric distribution models the number of trials needed until the first success occurs in a sequence of independent Bernoulli trials with probability p.

P(X = k) = p(1 - p)^{k-1}, \quad k = 1,2,3, \dots

4. Poisson Distribution - A Poisson distribution describes the probability of k events occurring in a fixed time interval, assuming that these events occur at an average rate λ. It is often used to model rare events such as earthquakes or web traffic.

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0,1,2, \dots

5. Uniform Distribution - A Uniform distribution assigns equal probability density to every value in an interval [a, b] on the real number line.

f(x) =\begin{cases}\frac{1}{b-a}, & a \leq x \leq b \\0, & \text{otherwise}\end{cases}

6. Exponential Distribution - An Exponential distribution models the time between independent events occurring at a constant average rate λ. It is often used to model waiting times, like the time between bus arrivals.

f(x) =\begin{cases}\lambda e^{-\lambda x}, & x \geq 0 \\0, & x < 0\end{cases}

7. Normal (Gaussian) Distribution - The normal distribution, also known as the Gaussian distribution, is one of the most important distributions in statistics. It models data that cluster around a mean μ with standard deviation σ. Many natural phenomena, such as heights, test scores, and measurement errors, follow a normal distribution.

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty
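All of the distributions above are available in scipy.stats, so their formulas can be checked numerically. The short sketch below evaluates a few of them; the parameter values are illustrative assumptions, not values from the article.

```python
from scipy import stats

# Binomial: P(X = 3) for n = 10 trials with success probability p = 0.5
print("Binomial   P(X=3):", stats.binom.pmf(k=3, n=10, p=0.5))

# Poisson: P(X = 2) for an average rate lambda = 4 events per interval
print("Poisson    P(X=2):", stats.poisson.pmf(k=2, mu=4))

# Exponential: density f(1) for rate lambda = 2 (SciPy uses scale = 1/lambda)
print("Exponential f(1):", stats.expon.pdf(1, scale=0.5))

# Normal: density at the mean for mu = 0, sigma = 1
print("Normal      f(0):", stats.norm.pdf(0, loc=0, scale=1))
```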

Machine Learning Algorithms Using Probability Theory

1. Naive Bayes Classifier: Naive Bayes is a family of simple probabilistic classifiers based on Bayes' Theorem, assuming independence between features. It is particularly effective for text classification and spam filtering; a minimal sketch follows this list.

2. Hidden Markov Models (HMMs): HMMs are used to model systems that are assumed to be a Markov process with hidden states. They are widely applied in speech recognition, bioinformatics, and sequence modeling.

3. Bayesian Networks: Bayesian Networks (or Belief Networks) are graphical models that represent probabilistic relationships among variables. They are used for causal reasoning, decision-making, and inference in uncertain domains.

4. Gaussian Mixture Models (GMMs): GMMs model a dataset as a mixture of several Gaussian distributions. Each component represents a cluster, and the model uses the Expectation-Maximization (EM) algorithm to estimate parameters.
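As a minimal sketch of the first algorithm above, the snippet below trains a Multinomial Naive Bayes classifier on a tiny made-up corpus. It assumes scikit-learn is available; the messages and labels are invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: 1 = spam, 0 = not spam
messages = [
    "win a free prize now",
    "limited offer claim your reward",
    "meeting scheduled for monday",
    "please review the project report",
]
labels = [1, 1, 0, 0]

# Convert the text to word-count features, then fit the Naive Bayes model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

# Classify a new, unseen message
test = vectorizer.transform(["claim your free prize"])
print("Predicted label:", model.predict(test)[0])            # likely 1 (spam)
print("P(spam):", round(model.predict_proba(test)[0][1], 3))
```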

