Mathematics For Machine Learning
Tanujit Chakraborty
@ Sorbonne
Webpage: https://www.ctanujit.org/
Lecture Outline
• Introduction and Notation
• Linear Algebra
Vectors
Matrices
Eigen decomposition
• Differential Calculus
• Optimization Algorithms
• Probability
Random variables
Probability distributions
• Information Theory
KL Divergence and Entropy
Maximum Likelihood Estimation
* Most of the slides are adapted from MML Book and several online resources that are acknowledged at the end of the presentation.
Notation
• 𝑎, 𝑏, 𝑐 Scalar (integer or real)
• 𝐱, 𝐲, 𝐳 Vector (bold-font, lower case)
• 𝐀, 𝐁, 𝐂 Matrix (bold-font, upper-case)
• A, B, C Tensor (bold-font, upper-case)
• 𝑋, 𝑌, 𝑍 Random variable (normal font, upper-case)
• 𝑎∈𝒜 Set membership: 𝑎 is member of set 𝒜
• |𝒜| Cardinality: number of items in set 𝒜
• ‖𝐯‖ Norm of vector 𝐯
• 𝐮 ∙ 𝐯 or ⟨𝐮, 𝐯⟩ Dot product of vectors 𝐮 and 𝐯
• ℝ Set of real numbers
• ℝⁿ n-dimensional real vector space
• 𝑦 = 𝑓(𝑥) or 𝑥 ↦ 𝑓(𝑥) Function (map): assigns a unique value 𝑓(𝑥) to each input value 𝑥
• 𝑓: ℝⁿ → ℝ Function (map): maps an n-dimensional vector to a scalar
Notation
• 𝐀⊙𝐁 Element-wise product of matrices A and B
• 𝐀† Pseudo-inverse of matrix A
• 𝑑ⁿ𝑓/𝑑𝑥ⁿ n-th derivative of function f with respect to x
• 𝛻𝐱𝑓(𝐱) Gradient of function f with respect to 𝐱
• 𝐇𝑓 Hessian matrix of function f
• 𝑋 ~ 𝑃 Random variable 𝑋 has distribution 𝑃
• 𝑃(𝑋|𝑌) Probability of 𝑋 given 𝑌
• 𝒩(𝜇, 𝜎²) Gaussian distribution with mean 𝜇 and variance 𝜎²
• 𝔼𝑋~𝑃[𝑓(𝑋)] Expectation of 𝑓(𝑋) with respect to 𝑃(𝑋)
• Var(𝑓(𝑋)) Variance of 𝑓(𝑋)
• Cov(𝑓(𝑋), 𝑔(𝑌)) Covariance of 𝑓(𝑋) and 𝑔(𝑌)
• corr(𝑋, 𝑌) Correlation coefficient for 𝑋 and 𝑌
• 𝐷𝐾𝐿(𝑃||𝑄) Kullback-Leibler divergence for distributions 𝑃 and 𝑄
• 𝐶𝐸(𝑃, 𝑄) Cross-entropy for distributions 𝑃 and 𝑄
Vectors
• Vector definition
Computer science: vector is a one-dimensional array of ordered real-valued scalars
Mathematics: vector is a quantity possessing both magnitude and direction, represented by an arrow
indicating the direction, and the length of which is proportional to the magnitude
• Vectors are written in column form or in row form
Denoted by bold-font lower-case letters
𝐱 = [1, 7, 0, 1]ᵀ (column form), or equivalently 𝐱ᵀ = [1 7 0 1] (row form)
• For a general vector with 𝑛 elements, the vector lies in the 𝑛-dimensional space, i.e., 𝐱 ∈ ℝⁿ
𝐱 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]ᵀ
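As a quick illustration (not part of the original slides), a minimal NumPy sketch of a vector as a one-dimensional array, using the example values from above:

```python
import numpy as np

# A vector as a one-dimensional array of ordered real-valued scalars, x in R^4
x = np.array([1.0, 7.0, 0.0, 1.0])

print(x.shape)   # (4,) -- the vector has n = 4 elements
```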
Geometry of Vectors
• First interpretation of a vector: point in space
E.g., in 2D we can visualize the data points with respect to a
coordinate origin
• Vector addition
We add the coordinates, and follow the directions given by the two
vectors that are added
Dot Product and Angles
• Dot product of vectors: 𝐮 ∙ 𝐯 = 𝐮ᵀ𝐯 = Σᵢ 𝑢ᵢ𝑣ᵢ
It is also referred to as inner product, or scalar product of vectors
The dot product 𝐮 ∙ 𝐯 is also often denoted by 𝐮, 𝐯
𝐮 ∙ 𝐯 = ‖𝐮‖ ‖𝐯‖ cos 𝜃, and therefore cos 𝜃 = (𝐮 ∙ 𝐯) / (‖𝐮‖ ‖𝐯‖)
• For 𝑝 = 1, we have the ℓ₁ norm: ‖𝐱‖₁ = Σᵢ |𝑥ᵢ|
It uses the absolute values of the elements, and discriminates between zero and non-zero elements
• The scalar projection of 𝐲 onto 𝐱 is 𝐩𝐫𝐨𝐣𝐱(𝐲) = (𝐱 ∙ 𝐲) / ‖𝐱‖ = ‖𝐲‖ cos 𝜃
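A small NumPy sketch (illustrative only, with arbitrarily chosen vectors u and v) of the dot product, the angle between two vectors, the ℓ₁ norm, and the scalar projection:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])

dot = u @ v                                         # u . v = sum_i u_i * v_i
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))
theta = np.degrees(np.arccos(cos_theta))            # angle between u and v

l1_norm = np.linalg.norm(u, ord=1)                  # ||u||_1 = sum of absolute values

proj_v_on_u = dot / np.linalg.norm(u)               # scalar projection of v onto u

print(dot, theta, l1_norm, proj_v_on_u)             # 3.0, ~53.13 deg, 7.0, 0.6
```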
Hyperplanes
• Hyperplane is a subspace whose dimension is one less than that of its ambient space
In a 2D space, a hyperplane is a straight line (i.e., 1D)
In a 3D, a hyperplane is a plane (i.e., 2D)
In a d-dimensional vector space, a hyperplane has 𝑑 − 1 dimensions, and divides the space into two half-
spaces
• Hyperplane is a generalization of a concept of plane in high-dimensional space
• In ML, hyperplanes are decision boundaries used for linear classification
Data points falling on either sides of the hyperplane are attributed to different classes
Matrices
• Each element 𝑎ᵢⱼ belongs to the ith row and jth column
The elements are denoted 𝑎ᵢⱼ, [𝐀]ᵢⱼ, or 𝐀(𝑖, 𝑗)
• Scalar multiplication: (𝑐𝐀)ᵢⱼ = 𝑐 ∙ 𝐀ᵢⱼ
• Transpose: (𝐀ᵀ)ᵢⱼ = 𝐀ⱼᵢ
• Identity matrix ( In ): has ones on the main diagonal, and zeros elsewhere
E.g., the identity matrix of size 3×3 is 𝐈₃ = [1 0 0; 0 1 0; 0 0 1]
Matrices
• The trace of a square matrix is the sum of its diagonal elements: Tr(𝐀) = Σᵢ 𝑎ᵢᵢ
• The matrix-vector product is a column vector of length m, whose ith element is the dot product 𝐚𝑇𝑖 𝐱
• Size: 𝐀 𝑛 × 𝑘 ∙ 𝐁 𝑘 × 𝑚 = 𝐂 𝑛 × 𝑚
Linear Dependence
• For the following matrix
𝐁 = [2 −1; 4 −2]
• Notice that for the two columns 𝐛₁ = [2, 4]ᵀ and 𝐛₂ = [−1, −2]ᵀ, we can write 𝐛₁ = −2 ∙ 𝐛₂
This means that the two columns are linearly dependent
• A collection of vectors 𝐯₁, 𝐯₂, …, 𝐯ₖ is linearly dependent if there exist coefficients 𝑎₁, 𝑎₂, …, 𝑎ₖ, not all equal to zero, such that 𝑎₁𝐯₁ + 𝑎₂𝐯₂ + ⋯ + 𝑎ₖ𝐯ₖ = 𝟎
• For an 𝑛 × 𝑚 matrix, the rank of the matrix is the largest number of linearly independent columns
• The matrix B from the previous example has 𝑟𝑎𝑛𝑘 𝐁 = 1, since the two columns are linearly
dependent
𝐁 = [2 −1; 4 −2]
• The matrix C below has 𝑟𝑎𝑛𝑘 𝐂 = 2, since it has two linearly independent columns
i.e., 𝐜4 = −1 ∙ 𝐜1 , 𝐜5 = −1 ∙ 𝐜3 , 𝐜2 = 3 ∙ 𝐜1 +3 ∙ 𝐜3
𝐂 = [1 3 0 −1 0; −1 0 1 1 −1; 0 3 1 0 −1; 2 3 −1 −2 1]
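To check the rank claims numerically, a short NumPy sketch using the matrices B and C from above (np.linalg.matrix_rank is one way to compute the rank):

```python
import numpy as np

B = np.array([[2, -1],
              [4, -2]])

C = np.array([[ 1, 3,  0, -1,  0],
              [-1, 0,  1,  1, -1],
              [ 0, 3,  1,  0, -1],
              [ 2, 3, -1, -2,  1]])

print(np.linalg.matrix_rank(B))   # 1 -- the two columns are linearly dependent
print(np.linalg.matrix_rank(C))   # 2 -- only two linearly independent columns
```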
Inverse of a Matrix
• For a square 𝑛 × 𝑛 matrix A with rank 𝑛, 𝐀−𝟏 is its inverse matrix if their product is an identity matrix I
𝐀⁻¹𝐀 = 𝐀𝐀⁻¹ = 𝐈
• Properties of inverse matrices: (𝐀𝐁)⁻¹ = 𝐁⁻¹𝐀⁻¹
• If det 𝐴 = 0 (i.e., rank 𝐴 < 𝑛), then the inverse does not exist
A matrix that is not invertible is called a singular matrix
• If the inverse of a matrix is equal to its transpose, the matrix is said to be an orthogonal matrix, i.e., 𝐀⁻¹ = 𝐀ᵀ
Pseudo-Inverse of a Matrix
• Pseudo-inverse of a matrix
Also known as Moore-Penrose pseudo-inverse
• For matrices that are not square, the inverse does not exist
Therefore, a pseudo-inverse is used
• If 𝑚 > 𝑛, then the pseudo-inverse is 𝐀† = (𝐀ᵀ𝐀)⁻¹𝐀ᵀ and 𝐀†𝐀 = 𝐈
• If 𝑚 < 𝑛, then the pseudo-inverse is 𝐀† = 𝐀ᵀ(𝐀𝐀ᵀ)⁻¹ and 𝐀𝐀† = 𝐈
E.g., for a matrix 𝐗 of dimension 2×3, a pseudo-inverse 𝐗† of size 3×2 can be found, so that 𝐗₂×₃ 𝐗†₃×₂ = 𝐈₂×₂
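A minimal NumPy sketch of the Moore-Penrose pseudo-inverse (the 2×3 matrix X below is an arbitrary example chosen for illustration):

```python
import numpy as np

# A non-square matrix with m = 2 rows and n = 3 columns (full row rank)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

X_pinv = np.linalg.pinv(X)            # Moore-Penrose pseudo-inverse, shape (3, 2)

# Since m < n, X @ X_pinv recovers the 2x2 identity (up to rounding error)
print(np.round(X @ X_pinv, 6))
```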
Tensors
• Tensors are n-dimensional arrays of scalars
Vectors are first-order tensors, 𝐯 ∈ ℝ𝑛
Matrices are second-order tensors, 𝐀 ∈ ℝ𝑚×𝑛
E.g., a fourth-order tensor is T ∈ ℝ𝑛1 ×𝑛2×𝑛3 ×𝑛4
Similarly, manifolds can be informally imagined as a generalization of the concept of a surface to high-dimensional spaces
• To begin with an intuitive explanation, the surface of the Earth is an example of a two-dimensional
manifold embedded in a three-dimensional space
This is true because the Earth looks locally flat, so on a small scale it is like a 2-D plane
oThis means that Earth is not really flat, it only looks locally like
a Euclidean plane, but at large scales it folds up on itself,
and has a different global structure than a flat plane
Manifolds
• Manifolds are studied in mathematics under topological spaces
• An n-dimensional manifold is defined as a topological space with the property that each point has a
neighborhood that is homeomorphic to the Euclidean space of dimension n
This means that a manifold locally resembles Euclidean space near each point
Informally, a Euclidean space is locally smooth, it does not have holes, edges, or other sudden changes, and it does not
have intersecting neighborhoods
Although manifolds can have a very complex structure on a large scale, their resemblance to Euclidean space on a small scale allows standard math concepts to be applied
Upper figure: a circle is a 1-D manifold embedded in 2-D, where each arc of the circle locally resembles a line segment
Lower figures: other examples of 1-D manifolds
Note that a number 8 figure is not a manifold because it has an intersecting
point (it is not Euclidean locally)
E.g., in ML, let’s assume we have a training set of images with size 224 ×
224 × 3 pixels
Learning an arbitrary function in such high-dimensional space would be
intractable
Despite that, all images of the same class (“cats” for example) might lie on a
low-dimensional manifold
This allows function learning and image classification
The data points have 3 dimensions (left figure), i.e., the input space of the data is 3-dimensional
The data points lie on a 2-dimensional manifold, shown in the right figure
Most ML algorithms extract lower-dimensional data features that make it possible to distinguish between various classes of high-dimensional input data
oThe low-dimensional representations of the input data are called embeddings
Eigen Decomposition
• Eigen decomposition is decomposing a matrix into a set of eigenvalues and eigenvectors
• Eigenvalues of a square matrix 𝐀 are scalars 𝜆 and eigenvectors are non-zero vectors 𝐯 that satisfy
𝐀𝐯 = 𝜆𝐯
• Eigenvalues are found by solving the following equation
det 𝐀 − 𝜆𝐈 = 0
• If a matrix 𝐀 has n linearly independent eigenvectors 𝐯1 , … , 𝐯 𝑛 with corresponding eigenvalues 𝜆1 , … , 𝜆𝑛 , the eigen
decomposition of 𝐀 is given by
𝐀 = 𝐕𝚲𝐕 −1
Columns of the matrix 𝐕 are the eigenvectors, i.e., 𝐕 = 𝐯1 , … , 𝐯 𝑛
𝚲 is a diagonal matrix of the eigenvalues, i.e., 𝚲 = diag(𝜆₁, …, 𝜆ₙ)
• Decomposing a matrix into eigenvalues and eigenvectors allows us to analyze certain properties of the matrix
If all eigenvalues are positive, the matrix is positive definite
If all eigenvalues are positive or zero-valued, the matrix is positive semidefinite
If all eigenvalues are negative or zero-valued, the matrix is negative semidefinite
oPositive semidefinite matrices are interesting because they guarantee that ∀𝐱, 𝐱 𝑇 𝐀𝐱 ≥ 0
• E.g., this is used for dimensionality reduction with PCA (principal component analysis) where the eigenvectors
corresponding to the largest eigenvalues are used for extracting the most important data dimensions
• For a non-square matrix 𝐀, the squares of the singular values 𝜎𝑖 are the eigenvalues 𝜆𝑖 of 𝐀𝑻 𝐀, i.e., 𝜎𝑖2 = 𝜆𝑖 for 𝑖 =
1,2, … , 𝑛
• Applications of SVD include computing the pseudo-inverse of non-square matrices, matrix approximation,
determining the matrix rank
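A brief NumPy sketch of eigen decomposition and of the relation between singular values and the eigenvalues of 𝐀ᵀ𝐀 (the matrices A and M are arbitrary illustrative examples):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Eigen decomposition: columns of V are eigenvectors, lam holds the eigenvalues
lam, V = np.linalg.eig(A)
A_rec = V @ np.diag(lam) @ np.linalg.inv(V)        # A = V Lambda V^{-1}
print(np.allclose(A, A_rec))                       # True

# For a non-square matrix M, sigma_i^2 equals the non-zero eigenvalues of M^T M
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
sigma = np.linalg.svd(M, compute_uv=False)         # singular values, descending
eig_MtM = np.linalg.eigvalsh(M.T @ M)[::-1]        # eigenvalues of M^T M, descending
print(np.allclose(sigma**2, eig_MtM[:2]))          # True
```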
Matrix Norms
• Frobenius norm – calculates the square root of the summed squares of the elements of matrix 𝐗:
‖𝐗‖𝐹 = √(Σᵢ Σⱼ 𝑥ᵢⱼ²)
This norm is similar to the Euclidean norm of a vector
• 𝑳₂,₁ norm – the sum of the Euclidean norms of the columns of matrix 𝐗:
‖𝐗‖₂,₁ = Σⱼ √(Σᵢ 𝑥ᵢⱼ²)
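A two-line NumPy check of these two matrix norms on a small illustrative matrix:

```python
import numpy as np

X = np.array([[1.0, -2.0],
              [3.0,  4.0]])

frobenius = np.linalg.norm(X, ord='fro')    # sqrt(1 + 4 + 9 + 16) = sqrt(30)
l21 = np.sum(np.linalg.norm(X, axis=0))     # sum of column norms: sqrt(10) + sqrt(20)

print(frobenius, l21)
```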
Derivatives
• The derivative of a function 𝑓 at a point 𝑥 is defined as
𝑓′(𝑥) = lim_{ℎ→0} [𝑓(𝑥 + ℎ) − 𝑓(𝑥)] / ℎ
• If 𝑓 ′ 𝑎 exists, f is said to be differentiable at a
• If 𝑓 is differentiable at every 𝑐 ∈ (𝑎, 𝑏), then 𝑓 is differentiable on this interval
We can also interpret the derivative 𝑓′(𝑥) as the instantaneous rate of change of 𝑓(𝑥) with respect to x
I.e., for a small change in x, what is the rate of change of 𝑓(𝑥)
• Given 𝑦 = 𝑓(𝑥), where x is an independent variable and y is a dependent variable, the following expressions are
equivalent:
𝑓′(𝑥) = 𝑓′ = 𝑑𝑦/𝑑𝑥 = 𝑑𝑓/𝑑𝑥 = (𝑑/𝑑𝑥)𝑓(𝑥) = 𝐷𝑓(𝑥) = 𝐷ₓ𝑓(𝑥)
• The symbols 𝑑/𝑑𝑥, 𝐷, and 𝐷ₓ are differentiation operators that indicate the operation of differentiation
Differential Calculus
• The following rules are used for computing the derivatives of explicit functions
Higher Order Derivatives
• The derivative of the first derivative of a function 𝑓 𝑥 is the second derivative of 𝑓 𝑥
𝑑²𝑓/𝑑𝑥² = (𝑑/𝑑𝑥)(𝑑𝑓/𝑑𝑥)
E.g., in physics, if the function describes the displacement of an object, the first derivative gives the velocity of the object
(i.e., the rate of change of the position)
The second derivative gives the acceleration of the object (i.e., the rate of change of the velocity)
• If we apply the differentiation operation any number of times, we obtain the n-th derivative of 𝑓 𝑥
𝑓⁽ⁿ⁾(𝑥) = 𝑑ⁿ𝑓/𝑑𝑥ⁿ = (𝑑/𝑑𝑥)ⁿ 𝑓(𝑥)
Taylor Series
• Taylor series provides a method to approximate any function 𝑓(𝑥) at a point 𝑥₀ if we have the first n derivatives 𝑓⁽¹⁾(𝑥₀), 𝑓⁽²⁾(𝑥₀), …, 𝑓⁽ⁿ⁾(𝑥₀)
𝑓(𝑥) ≈ 𝑓(𝑥₀) + (𝑑𝑓/𝑑𝑥)|_{𝑥₀} (𝑥 − 𝑥₀) + (1/2)(𝑑²𝑓/𝑑𝑥²)|_{𝑥₀} (𝑥 − 𝑥₀)² + ⋯
• For example, the figure shows the first-order, second-order, and fifth-order
polynomial of the exponential function 𝑓(𝑥) = 𝑒 𝑥 at the point 𝑥0 = 0
Picture from: http://d2l.ai/chapter_appendix-mathematics-for-deep-learning/single-variable-calculus.html
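A small Python sketch of the Taylor polynomials of f(x) = eˣ around x₀ = 0, mirroring the example in the figure:

```python
import math
import numpy as np

def taylor_exp(x, order):
    """n-th order Taylor polynomial of e^x around x0 = 0."""
    return sum(x**k / math.factorial(k) for k in range(order + 1))

x = 1.0
for order in (1, 2, 5):
    print(order, taylor_exp(x, order))   # 2.0, 2.5, 2.7166...
print(np.exp(x))                         # 2.71828... -- the approximation improves with order
```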
Geometric Interpretation
• To provide a geometric interpretation of the derivatives, let’s consider a first-order Taylor series approximation
of 𝑓 𝑥 at 𝑥 = 𝑥0
𝑓(𝑥) ≈ 𝑓(𝑥₀) + (𝑑𝑓/𝑑𝑥)|_{𝑥₀} (𝑥 − 𝑥₀)
• The expression approximates the function 𝑓(𝑥) by a line which passes through the point (𝑥₀, 𝑓(𝑥₀)) and has slope (𝑑𝑓/𝑑𝑥)|_{𝑥₀} (i.e., the value of 𝑑𝑓/𝑑𝑥 at the point 𝑥₀)
• Therefore, the first derivative of a function is also the slope of the tangent
line to the curve of the function
Partial Derivatives
• For a multivariate function 𝑓(𝐱), the partial derivative with respect to the variable 𝑥ᵢ can be written in the following equivalent notations:
∂𝑦/∂𝑥ᵢ = ∂𝑓/∂𝑥ᵢ = (∂/∂𝑥ᵢ)𝑓(𝐱) = 𝑓_{𝑥ᵢ} = 𝑓ᵢ = 𝐷ᵢ𝑓 = 𝐷_{𝑥ᵢ}𝑓
Gradient
• We can concatenate partial derivatives of a multivariate function with respect to all its input variables to
obtain the gradient vector of the function
• The gradient of the multivariate function 𝑓(𝐱) with respect to the n-dimensional input vector
𝐱 = 𝑥1 , 𝑥2 , … , 𝑥𝑛 𝑇 , is a vector of n partial derivatives
𝛻𝐱𝑓(𝐱) = [∂𝑓(𝐱)/∂𝑥₁, ∂𝑓(𝐱)/∂𝑥₂, …, ∂𝑓(𝐱)/∂𝑥ₙ]ᵀ
• When there is no ambiguity, the notations 𝛻𝑓 𝐱 or 𝛻𝐱 𝑓 are often used for the gradient instead of 𝛻𝐱 𝑓 𝐱
The symbol for the gradient is the Greek letter 𝛻 (pronounced “nabla”), although 𝛻𝐱𝑓(𝐱) is more often read as “gradient of f with respect to x”
• In ML, the gradient descent algorithm moves in the direction opposite to the gradient of the loss function ℒ with respect to the model parameters 𝜃, i.e., 𝛻_𝜃 ℒ, in order to minimize the loss function
Adversarial examples can be created by adding a perturbation in the direction of the gradient of the loss ℒ with respect to the input examples 𝑥, i.e., 𝛻ₓℒ, in order to maximize the loss function
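As an illustration of using the gradient direction for minimization, a minimal gradient-descent sketch on a least-squares loss (the matrix A, vector b, learning rate, and iteration count are arbitrary choices for this example):

```python
import numpy as np

# Minimize L(theta) = ||A theta - b||^2; its gradient is grad L = 2 A^T (A theta - b)
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

theta = np.zeros(2)
lr = 0.005                                  # learning rate
for _ in range(5000):
    grad = 2 * A.T @ (A @ theta - b)        # gradient of the loss w.r.t. theta
    theta -= lr * grad                      # step opposite to the gradient

print(theta)                                   # close to the least-squares solution
print(np.linalg.lstsq(A, b, rcond=None)[0])    # [0.0, 0.5]
```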
Hessian Matrix
• To calculate the second-order partial derivatives of multivariate functions, we need to calculate the derivatives
for all combination of input variables
• That is, for a function 𝑓(𝐱) with an n-dimensional input vector 𝐱 = 𝑥1 , 𝑥2 , … , 𝑥𝑛 𝑇 , there are 𝑛2 second partial
derivatives for any choice of i and j
∂²𝑓 / (∂𝑥ᵢ ∂𝑥ⱼ) = (∂/∂𝑥ᵢ)(∂𝑓/∂𝑥ⱼ)
• The second partial derivatives are assembled in a matrix called the Hessian
𝐇𝑓 = [ ∂²𝑓/∂𝑥₁∂𝑥₁ ⋯ ∂²𝑓/∂𝑥₁∂𝑥ₙ ; ⋮ ⋱ ⋮ ; ∂²𝑓/∂𝑥ₙ∂𝑥₁ ⋯ ∂²𝑓/∂𝑥ₙ∂𝑥ₙ ]
• Computing and storing the Hessian matrix for functions with high-dimensional inputs can be computationally
prohibitive
E.g., the loss function for a ResNet50 model with approximately 23 million parameters has a Hessian with 23 M × 23 M = 529 T (trillion) elements
Jacobian Matrix
• The concept of derivatives can be further generalized to vector-valued functions (or, vector fields) 𝑓: ℝ𝑛 → ℝ𝑚
• For an n-dimensional input vector 𝐱 = 𝑥1 , 𝑥2 , … , 𝑥𝑛 𝑇 ∈ ℝ𝑛 , the vector of functions is given as
𝐟(𝐱) = [𝑓₁(𝐱), 𝑓₂(𝐱), …, 𝑓ₘ(𝐱)]ᵀ ∈ ℝᵐ
• The matrix of first-order partial derivatives of the vector-valued function 𝐟(𝐱) is an 𝑚 × 𝑛 matrix called a Jacobian
𝐉 = [ ∂𝑓₁(𝐱)/∂𝑥₁ ⋯ ∂𝑓₁(𝐱)/∂𝑥ₙ ; ⋮ ⋱ ⋮ ; ∂𝑓ₘ(𝐱)/∂𝑥₁ ⋯ ∂𝑓ₘ(𝐱)/∂𝑥ₙ ]
For example, in robotics a robot Jacobian matrix gives the partial derivatives of the translational and angular velocities of
the robot end-effector with respect to the joints (i.e., axes) velocities
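A short numerical sketch of a Jacobian computed with central finite differences (the function f below is an arbitrary example of a map from ℝ² to ℝ³):

```python
import numpy as np

def f(x):
    """A vector-valued function f: R^2 -> R^3."""
    return np.array([x[0]**2, x[0] * x[1], np.sin(x[1])])

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x with central differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))
# Analytic Jacobian at (1, 2): [[2, 0], [2, 1], [0, cos(2)]]
```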
Integral Calculus
• For a function 𝑓(𝑥) defined on the domain [𝑎, 𝑏], the definite integral of the function is denoted
∫ₐᵇ 𝑓(𝑥) 𝑑𝑥
• Geometric interpretation of the integral is the area between the horizontal axis and the graph of 𝑓(𝑥) between
the points a and b
In this figure, the integral is the sum of blue areas (where 𝑓 𝑥 > 0) minus the pink area (where 𝑓 𝑥 < 0)
• Optimization is concerned with optimizing an objective function — finding the value of an argument that minimizes or maximizes the function
Most optimization algorithms are formulated in terms of minimizing a function 𝑓(𝑥)
Maximization is accomplished via minimizing the negative of the objective function (e.g., minimize −𝑓(𝑥))
In minimization problems, the objective function is often referred to as a cost function, loss function, or error function
• For a given empirical function g (dashed purple curve), optimization algorithms attempt to find the point of
minimum empirical risk (error on the training dataset)
• The minimum and maximum points are collectively known as extremum points
• Note also that the point of a function at which the sign of the curvature changes is called an inflection point
An inflection point (𝑓′′(𝑥) = 0) can also be a saddle point, but it does not have to be
• A function of a single variable is concave if every line segment joining two points on its graph does not lie above
the graph at any point
• Symmetrically, a function of a single variable is convex if every line segment joining two points on its graph
does not lie below the graph at any point
• The Danish mathematician Johan Jensen showed that this can be generalized for all 𝛼𝑖 that are non-negative real
numbers and Σᵢ 𝛼ᵢ = 1, to the following:
𝛼₁𝑓(𝑥₁) + 𝛼₂𝑓(𝑥₂) + ⋯ + 𝛼ₙ𝑓(𝑥ₙ) ≥ 𝑓(𝛼₁𝑥₁ + 𝛼₂𝑥₂ + ⋯ + 𝛼ₙ𝑥ₙ)
Convex Sets
• A set 𝒳 in a vector space is a convex set if for any 𝑎, 𝑏 ∈ 𝒳 the line segment connecting a and b is also in 𝒳
• For all 𝜆 ∈ [0,1], we have
𝜆 ∙ 𝑎 + 1 − 𝜆 ∙ 𝑏 ∈ 𝒳 for all 𝑎, 𝑏 ∈ 𝒳
• The optimization problem that involves a set of constraints which need to be satisfied to optimize the objective
function is called constrained optimization
• E.g., for a given objective function 𝑓(𝐱) and a set of constraint functions 𝑐𝑖 𝐱
minimize_𝐱 𝑓(𝐱)
subject to 𝑐ᵢ(𝐱) ≤ 0 for all 𝑖 ∈ {1, 2, …, 𝑁}
• The points that satisfy the constraints form the feasible region
• Various optimization algorithms have been developed for handling optimization problems based on whether
the constraints are equalities, inequalities, or a combination of equalities and inequalities
Lagrange Multipliers
• One approach to solving optimization problems is to substitute the initial problem with optimizing another
related function
• The Lagrange function for optimization of the constrained problem on the previous page is defined as
𝐿(𝐱, 𝛼) = 𝑓(𝐱) + Σᵢ 𝛼ᵢ 𝑐ᵢ(𝐱), where 𝛼ᵢ ≥ 0
• The variables 𝛼𝑖 are called Lagrange multipliers and ensure that the constraints are properly enforced
They are chosen to ensure that 𝑐𝑖 𝐱 ≤ 0 for all 𝑖 ∈ 1, 2, … , 𝑁
• This is a saddle-point optimization problem where one wants to minimize 𝐿 𝐱, 𝛼 with respect to 𝐱 and
simultaneously maximize 𝐿 𝐱, 𝛼 with respect to 𝛼𝑖
The saddle point of 𝐿 𝐱, 𝛼 gives the optimal solution to the original constrained optimization problem
Projections
• E.g., gradient clipping in NNs can require that the norm of the gradient is bounded by a constant value c
• Approach:
At each iteration during training
If the norm of the gradient ‖𝑔‖ ≥ 𝑐, then the update is 𝑔_new ← 𝑐 ∙ 𝑔_old / ‖𝑔_old‖
If the norm of the gradient ‖𝑔‖ < 𝑐, then the update is 𝑔_new ← 𝑔_old
• Note that since 𝑔_old / ‖𝑔_old‖ is a unit vector (i.e., it has norm = 1), the vector 𝑐 ∙ 𝑔_old / ‖𝑔_old‖ has norm = 𝑐
• Such clipping is the projection of the gradient g onto the ball of radius c
Projection on the unit ball is for 𝑐 = 1
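A minimal sketch of gradient clipping by norm, following the update rule above (the gradient g and threshold c are illustrative values):

```python
import numpy as np

def clip_gradient(g, c):
    """Project the gradient g onto the ball of radius c (clipping by norm)."""
    norm = np.linalg.norm(g)
    if norm >= c:
        return c * g / norm      # rescale g to have norm exactly c
    return g                     # leave the gradient unchanged

g = np.array([3.0, 4.0])         # norm = 5
print(clip_gradient(g, c=1.0))   # [0.6, 0.8] -- the unit-norm direction of g
print(clip_gradient(g, c=10.0))  # unchanged, since its norm is already below c
```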
Projections
• The projection of a vector 𝐱 onto a set 𝒳 is defined as
Proj_𝒳(𝐱) = argmin_{𝐱′∈𝒳} ‖𝐱 − 𝐱′‖₂
• This means that the vector 𝐱 is projected onto the closest vector 𝐱′ that belongs to the set 𝒳
• For example, in the figure, the blue circle represents a convex set 𝒳
The points outside the circle project to the closest point inside the circle
o E.g., 𝐱 is the yellow vector, and its closest point 𝐱′ in the set 𝒳 is the red vector
o Among all vectors in the set 𝒳, the red vector 𝐱′ has the smallest distance ‖𝐱 − 𝐱′‖₂ to 𝐱
• First-order optimization algorithms use the gradient of a function for finding the extrema points
• Second-order optimization algorithms use the Hessian matrix of a function for finding the extrema points
This is since the Hessian matrix holds the second-order partial derivatives
Methods: Newton’s method, conjugate gradient method, Quasi-Newton method, Gauss-Newton method, BFGS
(Broyden-Fletcher-Goldfarb-Shanno) method, Levenberg-Marquardt method, Hessian-free method
The second-order derivatives can be thought of as measuring the curvature of the loss function
Recall also that the second-order derivative can be used to determine whether a stationary point is a maximum (𝑓′′(𝑥) < 0) or a minimum (𝑓′′(𝑥) > 0)
This information is richer than the information provided by the gradient
Disadvantage: computing the Hessian matrix is computationally expensive, and even prohibitive for high-dimensional
data
Lower Bound and Infimum
• Lower bound of a subset 𝒮 from a partially ordered set 𝒳 is an element 𝑎 of 𝒳, such that 𝑎 ≤ 𝑠 for all 𝑠 ∈ 𝒮
E.g., for the subset 𝒮 = {2, 3, 6, 8} of the natural numbers ℕ, the lower bounds are the numbers 2, 1, and 0, i.e., all natural numbers ≤ 2
• Infimum of a subset 𝒮 from a partially ordered set 𝒳 is the greatest lower bound in 𝒳, denoted inf𝑠∈𝒮 𝑠
It is the maximal quantity ℎ such that ℎ ≤ 𝑠 for all 𝑠 ∈ 𝒮
E.g., the infimum of the set 𝒮 = 2, 3, 6, 8 is ℎ =2, since it is the greatest lower bound
• Example: consider the subset of positive real numbers (excluding zero) ℝ_{>0} = {𝑥 ∈ ℝ: 𝑥 > 0}
The subset ℝ_{>0} does not have a minimum, because for every small positive number there is another, even smaller positive number
On the other hand, all negative real numbers and 0 are lower bounds on the subset ℝ_{>0}
0 is the greatest of all lower bounds, and therefore, the infimum of ℝ_{>0} is 0
Upper Bound and Supremum
• Upper bound of a subset 𝒮 from a partially ordered set 𝒳 is an element 𝑏 of 𝒳, such that 𝑏 ≥ 𝑠 for all 𝑠 ∈ 𝒮
E.g., for the subset 𝒮 = 2, 3, 6, 8 from the natural numbers ℕ, upper bounds are the numbers 8, 9, 40, and all other
natural numbers ≥ 8
• Supremum of a subset 𝒮 from a partially ordered set 𝒳 is the least upper bound in 𝒳, denoted sup𝑠∈𝒮 𝑠
It is the minimal quantity 𝑔 such that g ≥ 𝑠 for all 𝑠 ∈ 𝒮
E.g., the supremum of the subset 𝒮 = 2, 3, 6, 8 is 𝑔 = 8, since it is the least upper bound
• Example: for the subset of negative real numbers (excluding zero) ℝ_{<0} = {𝑥 ∈ ℝ: 𝑥 < 0}
All positive real numbers and 0 are upper bounds
0 is the least upper bound, and therefore, the supremum of ℝ_{<0} is 0
Lipschitz Function
• A function 𝑓 𝑥 is a Lipschitz continuous function if a constant 𝜌 > 0 exists, such that for all points 𝑥1 , 𝑥2
|𝑓(𝑥₁) − 𝑓(𝑥₂)| ≤ 𝜌 |𝑥₁ − 𝑥₂|
• Such a function is also called a 𝜌-Lipschitz function
• Similarly, a function 𝑓(𝑥) has a 𝜌-Lipschitz continuous gradient if ‖𝛻𝑓(𝑥₁) − 𝛻𝑓(𝑥₂)‖ ≤ 𝜌 ‖𝑥₁ − 𝑥₂‖
• For a function 𝑓 𝑥 with a 𝜌-Lipschitz gradient, the second derivative 𝑓′′ 𝑥 is bounded everywhere by 𝜌
𝑓 𝑥 = 𝑥 2 is not a Lipschitz continuous function, since 𝑓 ′ (𝑥) = 2𝑥, so when 𝑥 → ∞ then 𝑓 ′ (𝑥) → ∞, i.e., the derivative is
not bounded everywhere
Since 𝑓 ′′ (𝑥) = 2, therefore the gradient 𝑓 ′ (𝑥) is 2-Lipschitz everywhere, since the second derivative is bounded
everywhere by 2
Probability
• Intuition:
In a process, several outcomes are possible
When the process is repeated many times, each outcome occurs with a relative frequency, or probability
If a particular outcome occurs more often, we say it is more probable
• Solving machine learning problems requires dealing with uncertain quantities, as well as with stochastic (non-deterministic) quantities
Probability theory provides a mathematical framework for representing and quantifying uncertain quantities
Incomplete observability
oEven deterministic systems can appear stochastic when we cannot observe all the variables that drive the behavior of the system
Incomplete modeling
oWhen we use a model that must discard some of the information we have observed, the discarded information results in uncertainty
in the model’s predictions
oE.g., discretization of real-numbered values, dimensionality reduction, etc.
Random variables
• A random variable 𝑋 is a variable that can take on different values
Example: 𝑋 = rolling a die
oPossible values of 𝑋 comprise the sample space, or outcome space, 𝒮 = 1, 2, 3, 4, 5, 6
oWe denote the event of “seeing a 5” as {𝑋 = 5}
oThe probability of the event is 𝑃(𝑋 = 5) or 𝑃({𝑋 = 5})
oAlso, 𝑃(5) can be used to denote the probability that 𝑋 takes the value of 5
• A probability distribution is a description of how likely a random variable is to take on each of its possible
states
A compact notation is common, where 𝑃 𝑋 is the probability distribution over the random variable 𝑋
oAlso, the notation X~𝑃 𝑋 can be used to denote that the random variable 𝑋 has probability distribution 𝑃 𝑋
• The probability of a random variable 𝑃 𝑋 must obey the axioms of probability over the possible
values in the sample space 𝒮
Discrete Vs. Continuous Variables
• Next, we will study probability distributions defined over multiple random variables
These include joint, conditional, and marginal probability distributions
• The individual random variables can also be grouped together into a random vector, because they represent
different properties of an individual statistical unit
• Given any values x and y of two random variables 𝑋 and 𝑌, what is the probability that 𝑋 = x and 𝑌 = y
simultaneously?
𝑃(𝑋 = 𝑥, 𝑌 = 𝑦) denotes the joint probability
We may also write 𝑃(𝑥, 𝑦) for brevity
Marginal Probability Distribution
• The marginal probability of 𝑋 is obtained by summing the joint probability over the values of 𝑌: 𝑃(𝑋 = 𝑥) = Σ_𝑦 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)
• Note that the conditional probability is 𝑃(𝑋 = 𝑥 | 𝑌 = 𝑦) = 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦) / 𝑃(𝑌 = 𝑦)
Bayes’ Theorem
• Bayes’ theorem – allows to calculate conditional probabilities for one variable when conditional probabilities for
another variable are known
𝑃(𝑋|𝑌) = 𝑃(𝑌|𝑋) 𝑃(𝑋) / 𝑃(𝑌)
• Two random variables 𝑋 and 𝑌 are conditionally independent given another random variable 𝑍 if and only if
𝑃 𝑋, 𝑌|𝑍 = 𝑃 𝑋|𝑍 𝑃 𝑌|𝑍
This is denoted as 𝑋 ⊥ 𝑌|𝑍
Continuous Multivariate Distributions
• Same concepts of joint, marginal, and conditional probabilities apply for continuous random variables
• The probability distributions use integration of continuous random variables, instead of summation of discrete
random variables
Expected Value
• The expected value of a continuous random variable 𝑋 with distribution 𝑃(𝑋) is
𝔼_{𝑋~𝑃}[𝑋] = ∫ 𝑋 𝑃(𝑋) 𝑑𝑋
When the identity of the distribution is clear from the context, we can simply write 𝔼[𝑋]
E.g., for a sample of 𝑁 equally likely observations: 𝜇 = 𝔼[𝑋] = Σᵢ 𝑃(𝑋ᵢ) ∙ 𝑋ᵢ = (1/𝑁) Σᵢ 𝑋ᵢ
• By analogy, the expected value of a function 𝑓(𝑋) of a discrete random variable 𝑋 with respect to a probability
distribution 𝑃 𝑋 is:
𝔼_{𝑋~𝑃}[𝑓(𝑋)] = Σ_𝑋 𝑓(𝑋) 𝑃(𝑋)
Variance
• Variance of a random variable 𝑋 gives the measure of how much the values of 𝑋 deviate from the expected
value as we sample 𝑋 from 𝑃 𝑋
Var(𝑋) = 𝔼[(𝑋 − 𝔼[𝑋])²]
• When the variance is low, the values of 𝑋 cluster near the expected value
• Variance is commonly denoted with 𝜎 2
The above equation is similar to the expected value of the function 𝑓(𝑋) = (𝑋 − 𝜇)²
We can write:
𝜎² = 𝔼[(𝑋 − 𝜇)²] = Σᵢ (𝑋ᵢ − 𝜇)² ∙ 𝑃(𝑋ᵢ)
Similarly, the variance of a sample of observations can be calculated as:
𝜎² = (1/𝑁) Σᵢ (𝑋ᵢ − 𝜇)²
• The covariance measures the tendency for 𝑋 and 𝑌 to deviate from their means in the same (or opposite) directions at the same time:
Cov(𝑋, 𝑌) = 𝔼[(𝑋 − 𝔼[𝑋])(𝑌 − 𝔼[𝑌])]
(Figure: scatter plots of 𝑋 vs. 𝑌 illustrating no covariance and high covariance)
Correlation
• Correlation coefficient is the covariance normalized by the standard deviations of the two variables
corr(𝑋, 𝑌) = Cov(𝑋, 𝑌) / (𝜎_𝑋 ∙ 𝜎_𝑌)
• For a random vector 𝐗 = [𝑋₁, 𝑋₂, …, 𝑋ₙ]ᵀ, the covariance matrix collects the pairwise covariances, i.e.,
Cov(𝐗) = [ Cov(𝑋₁, 𝑋₁) Cov(𝑋₁, 𝑋₂) ⋯ Cov(𝑋₁, 𝑋ₙ) ; Cov(𝑋₂, 𝑋₁) ⋯ Cov(𝑋₂, 𝑋ₙ) ; ⋮ ⋱ ⋮ ; Cov(𝑋ₙ, 𝑋₁) Cov(𝑋ₙ, 𝑋₂) ⋯ Cov(𝑋ₙ, 𝑋ₙ) ]
• The diagonal elements of the covariance matrix are the variances of the elements of the random vector 𝐗
Cov 𝑋𝑖 , 𝑋𝑖 = Var 𝑋𝑖
• Also note that the covariance matrix is symmetric, since Cov 𝑋𝑖 , 𝑋𝑗 = Cov 𝑋𝑗 , 𝑋𝑖
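A quick NumPy illustration of covariance and correlation on synthetic data (the linear relation between X and Y is an arbitrary choice made to produce high covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 2.0 * X + rng.normal(scale=0.5, size=1000)    # Y deviates together with X

cov_matrix = np.cov(X, Y)          # 2x2 covariance matrix; the diagonal holds the variances
corr = np.corrcoef(X, Y)[0, 1]     # Cov(X, Y) / (sigma_X * sigma_Y)

print(cov_matrix)
print(corr)                        # close to 1 for strongly correlated variables
```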
Probability Distributions
• Bernoulli distribution
A binary random variable 𝑋 ∈ {0, 1}, with 𝑃(𝑋 = 1) = 𝑝 and 𝑃(𝑋 = 0) = 1 − 𝑝
Notation: 𝑋 ∼ Bernoulli(𝑝); figure: 𝑝 = 0.3
• Uniform distribution
The probability of each value 𝑖 ∈ {1, 2, …, 𝑛} is 𝑝ᵢ = 1/𝑛
Notation: 𝑋 ∼ 𝑈 𝑛
Figure: 𝑛 = 5, 𝑝 = 0.2
• Binomial distribution
Performing a sequence of 𝑛 independent experiments, each of which has probability 𝑝 of succeeding, where 𝑝 ∈ [0, 1] (figure: 𝑛 = 10, 𝑝 = 0.2)
oE.g., tossing a coin 100 times, with head probability 0.5
The probability of getting 𝑘 successes in 𝑛 trials is
𝑃(𝑋 = 𝑘) = (𝑛 choose 𝑘) 𝑝ᵏ (1 − 𝑝)ⁿ⁻ᵏ
Notation: 𝑋 ∼ 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙 𝑛, 𝑝
• Poisson distribution
The number of events occurring independently in a fixed interval of time with a known rate 𝜆 (figure: 𝜆 = 5)
oE.g., the number of patients arriving in an ER
A discrete random variable 𝑋 with states 𝑘 ∈ {0, 1, 2, …} has probability
𝑃(𝑋 = 𝑘) = 𝜆ᵏ 𝑒^(−𝜆) / 𝑘!
The rate 𝜆 is the average number of occurrences of the event
Notation: 𝑋 ∼ 𝑃𝑜𝑖𝑠𝑠𝑜𝑛 𝜆
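A short sketch of sampling from these distributions with NumPy, using the same parameters as in the figures above:

```python
import numpy as np

rng = np.random.default_rng(42)

bernoulli = rng.binomial(n=1, p=0.3, size=10_000)    # Bernoulli(p = 0.3)
binomial  = rng.binomial(n=10, p=0.2, size=10_000)   # Binomial(n = 10, p = 0.2)
poisson   = rng.poisson(lam=5, size=10_000)          # Poisson(lambda = 5)

# Empirical means approach the theoretical values p, n*p, and lambda
print(bernoulli.mean(), binomial.mean(), poisson.mean())   # ~0.3, ~2.0, ~5.0
```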
o A categorical random variable is a discrete variable with more than two possible outcomes (such as the roll of
a die)
For example, in multi-class classification in machine learning, we have a set of data examples 𝐱1 , 𝐱 2 , … , 𝐱 𝑛 , and
corresponding to the data example 𝐱 𝑖 is a k-class label 𝐲𝑖 = 𝑦𝑖1 , 𝑦𝑖2 , … , 𝑦𝑖𝑘 representing one-hot encoding
o One-hot encoding is also called 1-of-k vector, where one element has the value 1 and all other elements have
the value 0
o Let’s denote the probabilities for assigning the class labels to a data example by 𝑝1 , 𝑝2 , … , 𝑝𝑘
o We know that 0 ≤ 𝑝ⱼ ≤ 1 and Σⱼ 𝑝ⱼ = 1 for the different classes 𝑗 = 1, 2, …, 𝑘
o The multinoulli probability of the data example 𝐱ᵢ is 𝑃(𝐱ᵢ) = 𝑝₁^(𝑦ᵢ₁) ∙ 𝑝₂^(𝑦ᵢ₂) ⋯ 𝑝ₖ^(𝑦ᵢₖ) = Πⱼ 𝑝ⱼ^(𝑦ᵢⱼ)
o Similarly, we can calculate the probability of all data examples as Πᵢ Πⱼ 𝑝ⱼ^(𝑦ᵢⱼ)
Information Theory
• Information theory studies encoding, decoding, transmitting, and manipulating information
It is a branch of applied mathematics that revolves around quantifying how much information is present in different
signals
• As such, information theory provides fundamental language for discussing the information processing in
computer systems
E.g., machine learning applications use the cross-entropy loss, derived from information theoretic considerations
• A seminal work in this field is the paper A Mathematical Theory of Communication by Claude E. Shannon, which introduced the concept of information entropy for the first time
Information theory was originally invented to study sending messages over a noisy channel, such as communication via
radio transmission
Self-information
• The basic intuition behind information theory is that learning that an unlikely event has occurred is more
informative than learning that a likely event has occurred
E.g., a message saying “the sun rose this morning” is so uninformative that it is unnecessary to be sent
But a message saying “there was a solar eclipse this morning” is very informative
𝐼(𝑋) = −log 𝑃(𝑋)
𝐼(𝑋) is the self-information, and 𝑃(𝑋) is the probability of the event 𝑋
• The self-information outputs the bits of information received for the event 𝑋
For example, if we want to send the code “0010” over a channel
The event “0010” is a series of codes of length 𝑛 (in this case, the length is 𝑛 = 4)
Each code is a bit (0 or 1), and occurs with probability 1/2; for this event 𝑃 = 1/2ⁿ
𝐼(“0010”) = −log₂ 𝑃(“0010”) = −log₂ (1/2⁴) = 4 bits
Entropy
• For a discrete random variable 𝑋 that follows a probability distribution 𝑃 with a probability mass function 𝑃(𝑋),
the expected amount of information through entropy (or Shannon entropy) is
𝐻(𝑋) = − Σ_𝑋 𝑃(𝑋) log 𝑃(𝑋)
• If 𝑋 is a continuous random variable that follows a probability distribution 𝑃 with a probability density function
𝑃(𝑋), the entropy is
𝐻(𝑋) = − ∫_𝑋 𝑃(𝑋) log 𝑃(𝑋) 𝑑𝑋
For continuous random variables, the entropy is also called differential entropy
Entropy
• Intuitively, we can interpret the self-information (𝐼 𝑋 = −log 𝑃(𝑋) ) as the amount of surprise we have at
seeing a particular outcome
We are less surprised when seeing a more frequent event
• Similarly, we can interpret the entropy (𝐻 𝑋 = 𝔼𝑋~𝑃 𝐼 𝑋 ) as the average amount of surprise from observing a
random variable 𝑋
Therefore, distributions that are closer to a uniform distribution have high entropy
This is because all outcomes are nearly equally likely, so on average each draw carries a large amount of surprise; in contrast, a nearly deterministic distribution has low entropy, since its outcomes offer little surprise
Kullback–Leibler Divergence
• Kullback-Leibler (KL) divergence (or relative entropy) provides a measure of how different two probability distributions are
• For two probability distributions 𝑃(𝑋) and 𝑄 𝑋 over the same random variable 𝑋, the KL divergence is
𝐷_𝐾𝐿(𝑃||𝑄) = 𝔼_{𝑋~𝑃}[log (𝑃(𝑋) / 𝑄(𝑋))]
• For discrete random variables, this formula is equivalent to
𝐷_𝐾𝐿(𝑃||𝑄) = Σ_𝑋 𝑃(𝑋) log (𝑃(𝑋) / 𝑄(𝑋))
• When base 2 logarithm is used, 𝐷𝐾𝐿 provides the amount of information in bits
In machine learning, the natural logarithm is used (with base e): the amount of information is provided in nats (natural
unit of information)
• KL divergence can be considered as the amount of information lost when the distribution 𝑄 is used to
approximate the distribution 𝑃
E.g., in GANs, 𝑃 is the distribution of true data, 𝑄 is the distribution of synthetic data
Kullback–Leibler Divergence
• KL divergence is non-negative: 𝐷𝐾𝐿 𝑃||𝑄 ≥ 0
• 𝐷𝐾𝐿 𝑃||𝑄 = 0 if and only if 𝑃(𝑋) and 𝑄 𝑋 are the same distribution
• The most important property of KL divergence is that it is non-symmetric, i.e., 𝐷_𝐾𝐿(𝑃||𝑄) ≠ 𝐷_𝐾𝐿(𝑄||𝑃)
• Because 𝐷𝐾𝐿 is non-negative and measures the difference between distributions, it is often considered as a
“distance metric” between two distributions
However, KL divergence is not a true distance metric, because it is not symmetric
The asymmetry means that there are important consequences to the choice of whether to use 𝐷𝐾𝐿 𝑃||𝑄 or 𝐷𝐾𝐿 𝑄||𝑃
• An alternative divergence which is non-negative and symmetric is the Jensen-Shannon divergence, defined as
𝐷_𝐽𝑆(𝑃||𝑄) = (1/2) 𝐷_𝐾𝐿(𝑃||𝑀) + (1/2) 𝐷_𝐾𝐿(𝑄||𝑀)
In the above, 𝑀 is the average of the two distributions, 𝑀 = (1/2)(𝑃 + 𝑄)
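A compact NumPy sketch of entropy, KL divergence, and the Jensen-Shannon divergence (natural logarithm, so the results are in nats; the distributions p and q are arbitrary examples with strictly positive entries):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); assumes p, q > 0 everywhere."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(entropy(p))                                   # average surprise of P
print(kl_divergence(p, q), kl_divergence(q, p))     # asymmetric: the two values differ
print(js_divergence(p, q), js_divergence(q, p))     # symmetric: the two values match
```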
Cross-entropy
• Cross-entropy is closely related to the KL divergence, and it is defined as the summation of the entropy 𝐻 𝑃
and KL divergence 𝐷𝐾𝐿 𝑃||𝑄
𝐶𝐸 𝑃, 𝑄 = 𝐻 𝑃 + 𝐷𝐾𝐿 𝑃||𝑄
• Alternatively, the cross-entropy can be written as 𝐶𝐸(𝑃, 𝑄) = − Σ_𝑋 𝑃(𝑋) log 𝑄(𝑋)
• In ML, we want to find a model with parameters 𝜃 that maximize the probability that the data is assigned the correct class,
i.e., argmax𝜃 𝑃 model | data
For the classification problem from previous page, we want to find parameters 𝜃 so that for the data examples 𝑥1 , 𝑥2 , … , 𝑥𝑛 the
probability of outputting class labels 𝑦1 , 𝑦2 , … , 𝑦𝑛 is maximized
oI.e., for some data examples the predicted class ŷⱼ will be different from the true class 𝑦ⱼ, but the goal is to find 𝜃 that results in an overall maximum probability
• From Bayes’ theorem, argmax 𝑃 model | data is proportional to argmax 𝑃 data | model
𝑃(𝜃 | 𝑥₁, 𝑥₂, …, 𝑥ₙ) = 𝑃(𝑥₁, 𝑥₂, …, 𝑥ₙ | 𝜃) 𝑃(𝜃) / 𝑃(𝑥₁, 𝑥₂, …, 𝑥ₙ)
This is true since 𝑃 𝑥1 , 𝑥2 , … , 𝑥𝑛 does not depend on the parameters 𝜃
Also, we can assume that we have no prior assumption on which set of parameters 𝜃 are better than any others
• Recall that 𝑃 data|model is the likelihood, therefore, the maximum likelihood estimate of 𝜃 is based on solving
argmax_𝜃 𝑃(𝑥₁, 𝑥₂, …, 𝑥ₙ | 𝜃)
Maximum Likelihood
• For a total number of n observed data examples 𝑥1 , 𝑥2 , … , 𝑥𝑛 , the predicted class labels for the data example 𝑥𝑖
is 𝐲𝑖
Using the multinoulli distribution, the probability of predicting the true class label 𝐲ᵢ = [𝑦ᵢ₁, 𝑦ᵢ₂, …, 𝑦ᵢₖ] is 𝒫(𝑥ᵢ|𝜃) = Πⱼ ŷᵢⱼ^(𝑦ᵢⱼ), where 𝑗 ∈ {1, 2, …, 𝑘} and ŷᵢⱼ are the predicted class probabilities
E.g., for a problem with 3 classes [car, house, tree] and an image of a car 𝑥ᵢ, the true label is 𝐲ᵢ = [1, 0, 0]; let's assume a predicted label ŷᵢ = [0.7, 0.1, 0.2], then the probability is
𝒫(𝑥ᵢ|𝜃) = Πⱼ ŷᵢⱼ^(𝑦ᵢⱼ) = 0.7¹ ∙ 0.1⁰ ∙ 0.2⁰ = 0.7 ∙ 1 ∙ 1 = 0.7
• Assuming that the data examples are independent, the likelihood of the data given the model parameters 𝜃 can
be written as
𝒫(𝑥₁, 𝑥₂, …, 𝑥ₙ|𝜃) = 𝒫(𝑥₁|𝜃) ⋯ 𝒫(𝑥ₙ|𝜃) = Πⱼ ŷ₁ⱼ^(𝑦₁ⱼ) ∙ Πⱼ ŷ₂ⱼ^(𝑦₂ⱼ) ⋯ Πⱼ ŷₙⱼ^(𝑦ₙⱼ) = Πᵢ Πⱼ ŷᵢⱼ^(𝑦ᵢⱼ)
• Log-likelihood is often used because it simplifies numerical calculations, since it transforms a product with many terms into a summation, e.g., log(𝑎₁^(𝑏₁) ∙ 𝑎₂^(𝑏₂)) = 𝑏₁ log 𝑎₁ + 𝑏₂ log 𝑎₂
log 𝒫(𝑥₁, 𝑥₂, …, 𝑥ₙ|𝜃) = log Πᵢ Πⱼ ŷᵢⱼ^(𝑦ᵢⱼ) = Σᵢ Σⱼ 𝑦ᵢⱼ log ŷᵢⱼ
The negative of the log-likelihood allows us to use minimization approaches, i.e.,
−log 𝒫(𝑥₁, 𝑥₂, …, 𝑥ₙ|𝜃) = − Σᵢ Σⱼ 𝑦ᵢⱼ log ŷᵢⱼ = 𝐶𝐸(𝐲, ŷ)
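A small NumPy sketch of the last identity: the negative log-likelihood of one-hot labels under the predicted class probabilities equals the cross-entropy (the labels and predictions below are arbitrary illustrative values):

```python
import numpy as np

# One-hot true labels y and predicted class probabilities y_hat for 3 examples, 3 classes
y     = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]], dtype=float)
y_hat = np.array([[0.7, 0.1, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.1, 0.8]])

# Cross-entropy = negative log-likelihood of the data
cross_entropy = -np.sum(y * np.log(y_hat))

# The likelihood itself is the product of the per-example probabilities of the true class
likelihood = np.prod(np.sum(y * y_hat, axis=1))      # 0.7 * 0.6 * 0.8

print(cross_entropy)
print(np.isclose(cross_entropy, -np.log(likelihood)))   # True
```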
References
1. A. Zhang, Z. C. Lipton, M. Li, A. J. Smola, Dive into Deep Learning, https://d2l.ai, 2020.
2. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2017.
3. Aleksandar (Alex) Vakanski – Machine Learning Math Essentials Presentation
4. Brian Keng – Manifolds: A Gentle Introduction blog
5. Martin J. Osborne – Mathematical Methods for Economic Theory (link)
Video Lectures
https://mml-book.github.io/