Mathematics Preliminaries for Machine Learning
CS 445/545
Outline
• Linear Algebra Overview
• Statistics/Probability Overview
Linear Algebra: Vectors & Matrices
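vector: $\mathbf{x} \in \mathbb{R}^{d} \rightarrow \mathbf{x} = [x_1, \ldots, x_d]$
matrix: $A \in \mathbb{R}^{m \times n} \rightarrow A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}$
tensor: $M \in \mathbb{R}^{m \times n \times p} \rightarrow$ a stack of $p$ matrices, each of size $m \times n$, with entries $a_{ijk}$ for $1 \le i \le m$, $1 \le j \le n$, $1 \le k \le p$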
Vector & Matrix Operations
dot product: $\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{d} x_i y_i$ results in a scalar; $\mathbf{x}, \mathbf{y}$ are orthogonal if $\mathbf{x} \cdot \mathbf{y} = 0$
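matrix multiplication: $A_{m \times n} B_{n \times p} = C_{m \times p}$
$$AB = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} b_{11} & \cdots & b_{1p} \\ \vdots & \ddots & \vdots \\ b_{n1} & \cdots & b_{np} \end{bmatrix} \rightarrow C_{ij} = \mathbf{a}_{\text{row } i} \cdot \mathbf{b}_{\text{col } j}$$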
• Matrix multiplication involves a sequence of dot products; element Cij in
the resultant matrix is equal to the dot product of row i (from the left matrix)
and column j (from the right matrix).
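The row-by-column rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than an efficient implementation; the helper names `dot` and `matmul` and the example matrices are chosen here for illustration only.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


def matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B (both lists of rows)."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions must agree")
    cols_of_B = list(zip(*B))  # column j of B as a tuple
    # C[i][j] is the dot product of row i of A with column j of B
    return [[dot(row, col) for col in cols_of_B] for row in A]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
AB = matmul(A, B)  # [[2, 1], [4, 3]]
BA = matmul(B, A)  # [[3, 4], [1, 2]]
```

For this particular pair the two products differ, so the order of the factors matters.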
Vector & Matrix Operations
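scalar multiplication (vector): $c\mathbf{x} = c[x_1, \ldots, x_d] = [cx_1, \ldots, cx_d]$
scalar multiplication (matrix): $cA = c\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} = \begin{bmatrix} ca_{11} & \cdots & ca_{1n} \\ \vdots & \ddots & \vdots \\ ca_{m1} & \cdots & ca_{mn} \end{bmatrix}$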
• Matrix multiplication is associative: A(BC)= (AB)C (always holds),
but not commutative: AB ≠ BA (in general).
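vector transpose: $\mathbf{x} = [x_1, \ldots, x_d]_{1 \times d} \rightarrow \mathbf{x}^T = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}_{d \times 1}$
matrix transpose: $A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \rightarrow A^T = \begin{bmatrix} a_{11} & \cdots & a_{m1} \\ \vdots & \ddots & \vdots \\ a_{1n} & \cdots & a_{mn} \end{bmatrix}$
• A matrix is called “symmetric” if: $A^T = A$.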
Vector Norms
• Norms convey the notion of the “magnitude” (i.e. size) of a vector; note that the
equivalence (in magnitude) of two vectors is relative to the choice of norm.
• There are many types – even “families” – of norms relevant to ML/data science.
Here are several of the most commonly used norms in ML:
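(1) L2 norm (i.e. “Euclidean norm”)
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2} \qquad \text{Note: } \|\mathbf{x}\|_2^2 = \sum_{i=1}^{d} x_i^2 = \mathbf{x} \cdot \mathbf{x}$$
(2) L1 norm (i.e. Manhattan distance)
$$\|\mathbf{x}\|_1 = \sum_{i=1}^{d} |x_i|$$
(3) ∞ norm
$$\|\mathbf{x}\|_\infty = \max_i |x_i|$$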
*For an ML practitioner, the “choice” of a norm is oftentimes a crucial part of feature engineering and the ML problem
formulation process itself; one can think of the different norm choices as striking a balance between “precision” and
computational overhead.
*There exist equivalent norms applied to matrices; the above norms are examples from the family of p-norms.
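As a small illustration of the three norms and of the p-norm family they belong to, here is a sketch in plain Python. The function names and the example vectors are illustrative choices, not part of the course material.

```python
import math


def l2_norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def l1_norm(x):
    return sum(abs(xi) for xi in x)


def linf_norm(x):
    return max(abs(xi) for xi in x)


def p_norm(x, p):
    """General p-norm for p >= 1; p = 1 and p = 2 recover the norms above."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


x = [3.0, -4.0]
y = [5.0, 0.0]
print(l2_norm(x), l1_norm(x), linf_norm(x))  # 5.0 7.0 4.0
print(l2_norm(y), l1_norm(y), linf_norm(y))  # 5.0 5.0 5.0
```

The vectors `x` and `y` have the same magnitude under the L2 norm but not under the L1 or ∞ norm, which is what it means for equivalence in magnitude to depend on the choice of norm.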
Dot Product
$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{d} x_i y_i$$
• The dot product has important geometric properties that are useful in ML.
(*) The dot product can be defined equivalently:
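$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \cos(\theta), \quad \text{where } \theta \text{ is the angle between } \mathbf{x} \text{ and } \mathbf{y}$$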
From this equivalent definition of the dot product, we can show that the dot product
quantifies the “similarity” between two vectors. Consider (3) cases:
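(i) Vectors x and y are “out of alignment” and meet at a 90 degree angle (i.e. they are orthogonal); in this case:
$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \cos(90^\circ) = 0$$
(ii) Vectors x and y are “perfectly aligned” (i.e. parallel to one another):
$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \cos(0^\circ) = \|\mathbf{x}\|_2 \|\mathbf{y}\|_2$$
(iii) Vectors x and y are “oppositely aligned” (i.e. they are anti-parallel):
$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \cos(180^\circ) = -\|\mathbf{x}\|_2 \|\mathbf{y}\|_2$$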
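Dividing the dot product by the two L2 norms isolates cos(θ), which is commonly called the cosine similarity. The sketch below evaluates it for the three cases; the function name and example vectors are illustrative assumptions.

```python
import math


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (||x||_2 ||y||_2); undefined for the zero vector."""
    norm_x = math.sqrt(dot(x, x))
    norm_y = math.sqrt(dot(y, y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot(x, y) / (norm_x * norm_y)


x = [1.0, 2.0]
orthogonal = cosine_similarity(x, [-2.0, 1.0])       # case (i): 0
parallel = cosine_similarity(x, [2.0, 4.0])          # case (ii): about +1
anti_parallel = cosine_similarity(x, [-1.0, -2.0])   # case (iii): about -1
```

The parallel and anti-parallel values can differ from ±1 by floating-point rounding, so comparisons should use a tolerance.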
Special Matrices
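• The identity matrix I is a square ($n \times n$) matrix with ones on the main diagonal and zeros elsewhere; the identity matrix multiplied by any matrix A (appropriately shaped) results in the matrix A:
$$I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}_{n \times n} \qquad AI = IA = A$$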
• A matrix A is said to be symmetric if it equals its transpose:
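$$A^T = A, \quad \text{e.g.,} \quad \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}^T = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$$
Note: $(A + B)^T = A^T + B^T$, $(AB)^T = B^T A^T$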
• For a diagonal matrix, all off-diagonal entries are zero (note that diagonal entries
are permitted to be zero).
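$$D = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_m \end{bmatrix}$$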
Special Matrices
• An upper-triangular matrix has zero elements below the main diagonal. Note that
Gaussian Elimination (from elementary linear algebra) yields an upper-triangular
matrix.
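$$U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{mn} \end{bmatrix}$$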
• A lower-triangular matrix has zero elements above the main diagonal.
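$$L = \begin{bmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{m1} & l_{m2} & \cdots & l_{mn} \end{bmatrix}$$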
• An orthogonal matrix is a square matrix with orthonormal rows and columns; equivalently, the inverse of an orthogonal matrix is its transpose.
$$Q^T Q = QQ^T = I$$
Special Matrices
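• A square matrix $A \in \mathbb{R}^{n \times n}$ is Positive Definite if:
$$\forall \mathbf{x} \in \mathbb{R}^n \setminus \{\mathbf{0}\}, \quad \mathbf{x}^T A \mathbf{x} > 0$$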
• Analogously, a square matrix $A \in \mathbb{R}^{n \times n}$ is Positive semi-Definite (e.g. a covariance matrix) if:
$$\forall \mathbf{x} \in \mathbb{R}^n, \quad \mathbf{x}^T A \mathbf{x} \ge 0$$
• We say that a square matrix $A \in \mathbb{R}^{n \times n}$ is invertible (i.e. non-singular) if there exists a matrix $A^{-1} \in \mathbb{R}^{n \times n}$, where:
$$AA^{-1} = A^{-1}A = I$$
Properties: $(A^T)^{-1} = (A^{-1})^T = A^{-T}$, $\quad (AB)^{-1} = B^{-1}A^{-1}$ (if A, B non-singular)
Linear Systems
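• Commonly we encode, and subsequently solve, systems of linear equations:
$$\begin{aligned} a_{11}x_1 + \cdots + a_{1n}x_n &= b_1 \\ a_{21}x_1 + \cdots + a_{2n}x_n &= b_2 \\ &\;\;\vdots \\ a_{m1}x_1 + \cdots + a_{mn}x_n &= b_m \end{aligned} \quad \rightarrow \quad A\mathbf{x} = \mathbf{b} \quad \text{(Canonical Matrix Form)}$$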
• When the coefficient matrix A is non-singular, the linear system has a unique solution:
$$A\mathbf{x} = \mathbf{b} \rightarrow \mathbf{x} = A^{-1}\mathbf{b}$$
* Note that inverting an $n \times n$ matrix requires on the order of $O(n^3)$ arithmetic operations.
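In practice a square system is usually solved by elimination rather than by forming the inverse explicitly. The following is a minimal sketch of Gaussian elimination with partial pivoting followed by back substitution; the function name `solve`, the singularity tolerance `1e-12`, and the example system are assumptions made for illustration.

```python
def solve(A, b):
    """Solve Ax = b for a square, non-singular A via Gaussian elimination
    with partial pivoting, followed by back substitution."""
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are not modified
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k to row k
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        if abs(M[pivot][k]) < 1e-12:  # illustrative tolerance
            raise ValueError("matrix is singular (or numerically close to it)")
        M[k], M[pivot] = M[pivot], M[k]
        # Eliminate the entries below the pivot
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    # Back substitution on the resulting upper-triangular system
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


A = [[2, 1], [1, 3]]
b = [3, 5]
x = solve(A, b)  # approximately [0.8, 1.4]
```

The three nested loops over rows and columns are where the cubic operation count comes from.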
Matrix Factorizations
• Matrix factorizations are immensely useful for identifying an underlying, inherent
structure in a matrix (i.e. data).
Here are several important examples:
LU Factorization
A = LU
• This factorization encodes the result of the Gaussian Elimination (GE) procedure (note that not all matrices admit an LU factorization). L denotes a lower-triangular matrix of “multipliers” used in GE; U denotes the upper-triangular (i.e. echelon form) matrix resulting from GE.
PALU Factorization (Permuted LU factorization) PA = LU
• This technique is similar to LU Factorization, except that we perform a pivoting operation first (i.e. permute the rows of A via a permutation matrix, P). LU factorization is subsequently performed; every square matrix admits such a factorization.
Matrix Factorizations
Here are several important examples:
QR Factorization
A = QR
• Q is an orthogonal matrix and R is upper-triangular -- commonly used for solving both
regression problems and linear dynamical systems.
Eigendecomposition
$$A = V \Sigma V^T$$
• This is one of the most useful and commonly-used of all matrix factorizations. The primary use of an eigendecomposition in ML is to perform dimensionality reduction; as such, this technique is closely related to PCA (principal component analysis) and SVD (singular value decomposition; see below) methods. Σ is a diagonal matrix consisting of the eigenvalues of A, and V is the matrix of corresponding eigenvectors. The form above, with orthogonal V, holds for symmetric A; a general diagonalizable matrix factors as $A = V \Sigma V^{-1}$.
Matrix Factorizations
Here are several important examples:
Cholesky Factorization
$$A = LL^T$$
• L is lower-triangular; Cholesky can be used to numerically solve linear systems; every symmetric positive-definite matrix admits a Cholesky factorization.
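A textbook Cholesky routine fits in a few lines. This is a sketch that assumes the input is symmetric (it reads only the lower triangle) and raises if a non-positive pivot shows the matrix is not positive definite; the function name and the example matrix are illustrative.

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L L^T for a symmetric
    positive-definite matrix A (list of rows)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Contribution of the already-computed columns
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


A = [[4.0, 2.0], [2.0, 5.0]]
L = cholesky(A)  # [[2.0, 0.0], [1.0, 2.0]]
```

Once L is known, Ax = b can be solved with one forward substitution (Ly = b) and one back substitution (Lᵀx = y).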
SVD (Singular Value Decomposition) $A = U \Sigma V^T$
• SVD is one of the most essential matrix factorizations for applications of ML. U and V are orthogonal matrices, and Σ is a diagonal matrix containing the “singular values” (i.e. the square roots of the eigenvalues of $A^T A$); the columns of V are the eigenvectors of $A^T A$. SVD has many applications, including dimensionality reduction and compression. All matrices admit a singular value decomposition.
Determinants
• Geometrically, the determinant of a square matrix A (written |A|) quantifies the factor by which the linear transformation defined by A scales volume (note that matrix multiplication defines a linear transformation); a negative sign indicates that the transformation reverses orientation.
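Determinants can be computed through recursion; the general formula for a determinant (cofactor expansion along any fixed column j) is:
$$|A| = \sum_{i=1}^{n} a_{ij} (-1)^{i+j} M_{ij}$$
where $M_{ij}$ is the "ij-minor" of A, i.e. the determinant of the matrix obtained by deleting row i and column j of A.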
Note that $|A| = 0$ if and only if A is singular.
Some Properties of Determinants:
$$|AB| = |A||B| \qquad |A^T| = |A|$$
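The recursive definition translates directly into code. The sketch below expands along the first column and then checks the listed properties on small integer matrices; the helper names and the matrices are illustrative. Cofactor expansion takes on the order of n! operations, so it is only sensible for very small matrices.

```python
def det(A):
    """Determinant of a square matrix (list of rows) by cofactor expansion
    along the first column."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for i in range(n):
        # Submatrix for the minor: delete row i and the first column
        sub = [row[1:] for r, row in enumerate(A) if r != i]
        # With 0-based i and the first column, the sign (-1)^(i+j) reduces to (-1)^i
        total += A[i][0] * (-1) ** i * det(sub)
    return total


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


S = [[1, 2], [2, 4]]  # second row is twice the first, so S is singular
B = [[1, 2], [3, 4]]
C = [[0, 1], [1, 1]]
print(det(S))                               # 0
print(det(matmul(B, C)), det(B) * det(C))   # 2 2
print(det(transpose(B)), det(B))            # -2 -2
```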
Eigenvalues
• The eigenvalues λ and eigenvectors v of a matrix A satisfy:
$$A\mathbf{v} = \lambda \mathbf{v} \quad (\mathbf{v} \neq \mathbf{0})$$
• This means that the eigenvectors of a matrix A are precisely the vectors for which multiplication by A is tantamount to scalar multiplication by λ.
• Determining the exact eigenvalues of a matrix $A \in \mathbb{R}^{n \times n}$ requires solving the so-called characteristic equation: $|A - \lambda I| = 0$, which is a degree-n polynomial equation in the variable λ.
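For a 2×2 matrix the characteristic equation is the quadratic λ² − tr(A)λ + |A| = 0, which can be solved in closed form. The sketch below handles only the real-eigenvalue case; the function name and the example matrix are illustrative assumptions.

```python
import math


def eigenvalues_2x2(A):
    """Real eigenvalues of a 2x2 matrix from |A - lambda I| = 0, i.e.
    lambda^2 - trace(A) * lambda + det(A) = 0."""
    (a, b), (c, d) = A
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex for this matrix")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


A = [[2, 1], [1, 2]]
lam1, lam2 = eigenvalues_2x2(A)  # 3.0 and 1.0
# [1, 1] and [1, -1] are eigenvectors: A v equals lambda v for each pair
print(matvec(A, [1, 1]), matvec(A, [1, -1]))  # [3, 3] [1, -1]
```

For larger n, roots of the characteristic polynomial are generally not available in closed form, so numerical libraries use iterative methods instead.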
Linear Independence, Span and Basis
• A set of vectors is called linearly independent if the set contains “no redundancy”; formally:
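Def. A set of vectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is linearly independent if:
$$\alpha_1 \mathbf{v}_1 + \cdots + \alpha_n \mathbf{v}_n = \mathbf{0} \ \text{ implies } \ \alpha_i = 0, \quad \forall\, 1 \le i \le n$$
• The span of a set is defined as the set of all linear combinations of the set of vectors.
Def. $\text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_n\} = \{\alpha_1 \mathbf{v}_1 + \cdots + \alpha_n \mathbf{v}_n \mid \alpha_i \in \mathbb{R}\}$
• A basis is a set of linearly independent vectors that spans the parent vector space.
e.g., $\left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right\}$ is the "standard" basis set for $\mathbb{R}^3$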
The Four Fundamental Subspaces
• The Four Fundamental Subspaces of a matrix $A \in \mathbb{R}^{m \times n}$:
1. (Column Space) Col(A): the span of the column vectors of A.
2. (Null Space) Null(A): the set of all vectors that satisfy $A\mathbf{x} = \mathbf{0}$.
3. (Row Space) Row(A): the span of the row vectors of A.
4. (Null Space of $A^T$) Null($A^T$): the set of all vectors that satisfy $A^T\mathbf{x} = \mathbf{0}$.
Overview of Statistics/Probability
• We use statistics and probability to quantify and summarize our beliefs about a “state
of the world” in the face of incomplete or partial knowledge.
• Denote a random event E; the sample space S consists of the set of all possible outcomes
associated with E (e.g. if E=“coin flip”, S={H,T}).
• A random variable (e.g. X, Y) is a variable that is assigned a number based on the outcome
of the random event E.
• Random variables are either Discrete (e.g., 0/1) or Continuous (e.g., height, time).
Overview of Statistics/Probability
Probability Distributions
• A probability distribution summarizes our total knowledge about the random event E,
via the random variable X.
• For a discrete random variable, the probability distribution of X is called a probability
mass function (pmf); a pmf satisfies the following properties, with |S|=k:
1. $0 \le p(X_i) \le 1, \quad \forall\, 1 \le i \le k$
2. $\sum_{i=1}^{k} p(X_i) = 1$
• Similarly, for a continuous random variable, the probability distribution of X is called a
probability density function (pdf); a pdf satisfies the following properties, with |S|=∞:
1. $p(x) \ge 0$ for all $x \in S$ (unlike a pmf, a density value is not a probability and may exceed 1)
2. $\int_S p(x)\,dx = 1$
Overview of Statistics/Probability
Probability Distributions
• A cumulative distribution function (cdf) is defined as the cumulative probability up to a given value of a random variable:
$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} p(u)\,du$$
• Percentiles and quartiles can be defined in a natural way with respect to a cdf:
$$F_X(Q_1) = 0.25, \qquad F_X(Q_2) = 0.5 \ \text{(median)}, \qquad F_X(Q_3) = 0.75$$
• Note that due to the Fundamental Theorem of Calculus, it follows that the derivative of the cdf is the pdf:
$$\frac{d}{dx} F_X(x) = p(x)$$
Overview of Statistics/Probability
Properties of Probability Distributions
• Two random events E1 and E2 are disjoint (i.e. mutually exclusive) if they share no outcomes: $E_1 \cap E_2 = \emptyset$.
• If two events E1 and E2 are disjoint, then: $P(E_1 \cup E_2) = P(E_1) + P(E_2)$
• More generally, the addition rule of probability states that, for any two events E1 and E2:
$$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$$
Overview of Statistics/Probability
Conditional Probability
• Def. Conditional Probability (the probability of "A given B"), for $P(B) > 0$:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
• From this definition, we can derive the multiplication rule of probability:
$$P(A \cap B) = P(A \mid B) P(B)$$
• Equivalently,
$$P(A \cap B) = P(B \mid A) P(A)$$
Overview of Statistics/Probability
Independence
• We say that events A & B are independent if the outcome of A has no bearing on B (and
vice versa); more formally the joint probability distribution p(A,B) factors.
• Def. A & B are independent if: $P(A \cap B) = P(A) P(B)$
• Equivalently, if A & B are independent, it also follows that:
$$P(A \mid B) = P(A), \qquad P(B \mid A) = P(B)$$
Thus, in summary, if A & B are independent: $P(A \cap B) = P(A \mid B) P(B) = P(A) P(B)$
*Independence is commonly denoted: A⊥B
Overview of Statistics/Probability
• Two major theorems in elementary statistics: (1) the Law of Large Numbers and (2) the
Central Limit Theorem.
• The Law of Large Numbers states (paraphrasing): Experimental (i.e. empirical) probabilities converge to their associated theoretical probabilities as the number of trials tends to infinity.
Overview of Statistics/Probability
The Central Limit Theorem (a conceptual pillar of statistics)
In words: given a sufficiently large sample size n from a population (with finite variance), the sample mean will be approximately equal to the mean of the population. Furthermore, across repeated samples, the sample means will follow an approximately normal distribution, with variance approximately equal to the variance of the population divided by the sample size.
In a picture (figure not reproduced): whatever the form of the population distribution, the sampling distribution of the mean tends to a Gaussian, and its dispersion is given by the Central Limit Theorem.
In a theorem: Suppose $\{X_1, X_2, \ldots\}$ is a sequence of i.i.d. random variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2 < \infty$. Then as n approaches infinity, the centered and scaled sample mean converges in distribution to a normal:
$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) \xrightarrow{d} N(0, \sigma^2)$$
Informally, for large n: $\frac{1}{n}\sum_{i=1}^{n} X_i - \mu \approx N\left(0, \frac{\sigma^2}{n}\right)$.
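A quick simulation makes the statement concrete. The sketch below repeatedly draws samples of size n from a Uniform(0, 1) population (which is far from Gaussian) and compares the mean and variance of the resulting sample means with μ and σ²/n. The population, the sample size, the number of trials, and the seed are all arbitrary choices made for illustration.

```python
import random
import statistics


def sample_means(draw, n, trials, rng):
    """Draw `trials` independent samples of size n and return their means."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n, trials = 30, 5000
means = sample_means(lambda r: r.random(), n, trials, rng)  # Uniform(0, 1) population

mu, sigma2 = 0.5, 1.0 / 12.0  # mean and variance of Uniform(0, 1)
print(statistics.fmean(means), mu)              # the two values should be close
print(statistics.pvariance(means), sigma2 / n)  # the two values should be close
```

A histogram of `means` would show the familiar bell shape even though each individual draw is uniform.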
Probability Distributions
• Here are some (but certainly not all) of the essential probability distributions for ML and
applied statistics:
1-D Gaussian (i.e. Normal)
• When μ=0 and σ=1 (i.e. N(0,1)) we call this the standard Normal model.
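The density formula for this slide did not survive in the text. The standard univariate normal density is $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, and the sketch below evaluates it; the function name and the crude Riemann-sum check are illustrative assumptions.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x; requires sigma > 0."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Crude Riemann-sum check that the standard normal density integrates to about 1
step = 0.001
area = sum(normal_pdf(-8.0 + i * step) * step for i in range(16000))

peak_standard = normal_pdf(0.0)                   # 1 / sqrt(2 pi), about 0.399
peak_narrow = normal_pdf(0.0, mu=0.0, sigma=0.1)  # about 3.99: a density can exceed 1
```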
Probability Distributions
Multivariate Gaussian (i.e. MVN)
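The density for this slide is likewise missing from the text. For reference, the standard form of the multivariate normal density is written out here, under the assumptions that $\mathbf{x} \in \mathbb{R}^D$, $\boldsymbol{\mu} \in \mathbb{R}^D$ is the mean vector, and $\Sigma$ is a symmetric positive-definite $D \times D$ covariance matrix (the symbol D for the dimension is a notational choice made here):

```latex
p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}
  \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```

With D = 1 and $\Sigma = \sigma^2$ this reduces to the 1-D Gaussian density.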
Probability Distributions
Bernoulli & Binomial Distributions
• The Bernoulli distribution is a single variable, discrete distribution, describing a random
variable with two discrete states (e.g. heads/tails for a single coin flip). The Bernoulli
distribution forms the basis of the Binomial distribution, which models repetitions of
independent Bernoulli trials (n total).
Bernoulli pmf
$$p(X = 1) = \theta, \qquad p(X = 0) = 1 - \theta, \qquad 0 \le \theta \le 1$$
Binomial pmf (the probability of "k successes in n trials")
$$p(X = k) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}$$
• For example, with a biased coin (θ = 0.6), we have:
$$p(\text{exactly 7 H in 10 flips}) = \binom{10}{7}\, 0.6^7\, (1 - 0.6)^{10-7}$$
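The biased-coin example can be evaluated numerically with the binomial pmf. This is a minimal sketch using `math.comb` from the standard library; the function name `binomial_pmf` is an illustrative choice.

```python
import math


def binomial_pmf(k, n, theta):
    """P(X = k) for X ~ Binomial(n, theta)."""
    if not 0 <= k <= n:
        return 0.0
    return math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)


p7 = binomial_pmf(7, 10, 0.6)                             # about 0.215
total = sum(binomial_pmf(k, 10, 0.6) for k in range(11))  # a pmf sums to 1
```

Setting n = 1 recovers the Bernoulli pmf: `binomial_pmf(1, 1, theta)` is θ and `binomial_pmf(0, 1, theta)` is 1 − θ.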
Summary Statistics for Random Variables
Expectation and Variance of a Random Variable
• The Expected Value of a random variable X summarizes its outcome in a single number: “if the trial were repeated many times, this is the numerical value we would expect for X on average”; E[X] is accordingly the (probability-weighted) mean of the random variable, i.e. E[X]=μ.
$$E[X] = \sum_{i=1}^{k} x_i P(X = x_i) \ \text{(Discrete RV)} \qquad E[X] = \int_S x\,p(x)\,dx \ \text{(Continuous RV)}$$
For example, to compute the expected number of heads X in 10 flips of a fair coin (X~Binomial) we have:
$$E[X] = \sum_{k=0}^{10} k \binom{10}{k} \left(\frac{1}{2}\right)^{k} \left(\frac{1}{2}\right)^{10-k} = 5$$
Summary Statistics for Random Variables
Properties of Expected Value and Variance
• Expected Value is a linear operator (as are matrix multiplication, limits, differentiation and
integration, among other common mathematical operators) -- meaning that it obeys the
following two linearity properties:
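1. For any two random variables X, Y: $E[X + Y] = E[X] + E[Y]$
2. For any $c \in \mathbb{R}$: $E[cX] = cE[X]$
The following corollary is also useful: $Var[X] = E[X^2] - (E[X])^2$
Proof.
$$\begin{aligned}
Var[X] &= E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] \\
&= E[X^2] - E[2\mu X] + E[\mu^2] && \text{(by linearity of } E\text{)} \\
&= E[X^2] - 2\mu E[X] + \mu^2 && \text{(by linearity of } E\text{; } E[\text{const}] = \text{const)} \\
&= E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - \mu^2 = E[X^2] - (E[X])^2
\end{aligned}$$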
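The identity can be checked exactly on a small discrete distribution. The sketch below uses the number of heads in 10 fair coin flips and exact rational arithmetic via `fractions.Fraction`; the helper name `expectation` and the dictionary representation of the pmf are illustrative choices.

```python
from fractions import Fraction
from math import comb


def expectation(pmf, f=lambda x: x):
    """E[f(X)] for a discrete pmf given as {value: probability}."""
    return sum(f(x) * p for x, p in pmf.items())


# X ~ Binomial(10, 1/2): number of heads in 10 flips of a fair coin
half = Fraction(1, 2)
pmf = {k: comb(10, k) * half ** k * half ** (10 - k) for k in range(11)}

mean = expectation(pmf)                                  # 5
var_def = expectation(pmf, lambda x: (x - mean) ** 2)    # E[(X - mu)^2] = 5/2
var_alt = expectation(pmf, lambda x: x * x) - mean ** 2  # E[X^2] - E[X]^2 = 5/2
scaled = expectation(pmf, lambda x: 3 * x)               # E[3X] = 3 E[X] = 15
```

Because the arithmetic is exact, the two variance expressions agree as fractions rather than only up to rounding.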
Covariance
• Covariance is a measure of the linear relationship between two random variables, X and Y.
If Cov(X,Y) > 0, this indicates a positive linear relationship between the random variables
(i.e. as X increases, Y increases; as X decreases, Y decreases); when Cov(X,Y) < 0 the
variables share a negative linear relationship; Cov(X,Y) = 0 indicates the absence of a linear
relationship.
Def. $Cov(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$
Lemma
If $X \perp Y$ (i.e. if X and Y are independent), then $Cov(X, Y) = 0$
* Note that the converse of the lemma above fails; in other words Cov(X,Y)=0 need not imply
that X and Y are independent.
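A standard counterexample for the converse takes X uniform on {−1, 0, 1} and Y = X². Y is completely determined by X, yet the covariance is zero. The sketch below verifies this exactly; the joint-pmf representation and helper name are illustrative choices.

```python
from fractions import Fraction

# Joint pmf of (X, Y) with X uniform on {-1, 0, 1} and Y = X^2
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}


def expect(joint, f):
    """E[f(X, Y)] under a discrete joint pmf {(x, y): probability}."""
    return sum(f(x, y) * p for (x, y), p in joint.items())


ex = expect(joint, lambda x, y: x)                     # 0
ey = expect(joint, lambda x, y: y)                     # 2/3
cov = expect(joint, lambda x, y: (x - ex) * (y - ey))  # 0

# Independence would require P(X=1, Y=1) == P(X=1) * P(Y=1)
p_x1 = sum(p for (x, _), p in joint.items() if x == 1)  # 1/3
p_y1 = sum(p for (_, y), p in joint.items() if y == 1)  # 2/3
p_joint = joint[(1, 1)]                                 # 1/3, whereas the product is 2/9
```

The relationship between X and Y here is purely nonlinear (quadratic), which is why covariance fails to detect it.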
Covariance
• The Covariance Matrix (Σ) for a set of random variables {X1,…,XN} is defined as the
matrix of pairwise covariances:
Def. Let $\mathbf{X} = [X_1, \ldots, X_N]$; then $\Sigma_{ij} = Cov(X_i, X_j) = E\left[(X_i - E[X_i])(X_j - E[X_j])\right]$
$$\Sigma = \begin{bmatrix} Var[X_1] & Cov(X_1, X_2) & \cdots & Cov(X_1, X_N) \\ Cov(X_2, X_1) & Var[X_2] & \cdots & Cov(X_2, X_N) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_N, X_1) & Cov(X_N, X_2) & \cdots & Var[X_N] \end{bmatrix}$$
Note that Σ is symmetric and positive semi-definite. The covariance matrix is used to
parameterize the MVN (multivariate normal distribution); the covariance matrix can likewise
be computed for a dataset.
Bayes’ Theorem
• Bayes’ Theorem is a vital (yet simple) conditional probability formula; today its use is
omnipresent across ML.
Def. $$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$
Derivation: $$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A) P(A)}{P(B)}$$
(the first equality holds by the definition of conditional probability; the second by the multiplication rule)
• More importantly, Bayes’ Theorem can be generalized to encapsulate the whole of the
inductive element of the scientific method. To this end, consider H (hypothesis) and D
(data):
$$P(H \mid D) = \frac{P(D \mid H) P(H)}{P(D)}$$
• In this case, Bayes’ Theorem yields a natural mechanism for updating our belief about the world/the plausibility of a hypothesis (H) given an observation (D). P(H|D) is referred to as the posterior probability of H, P(H) is called the prior probability of H, P(D|H) defines the likelihood of the data, and P(D) is the evidence (i.e. the marginal probability of the data).
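A single belief update can be sketched as follows. The numbers describe a purely hypothetical diagnostic test (1% prior, 95% true-positive rate, 5% false-positive rate), and P(D) is expanded with the law of total probability, P(D) = P(D|H)P(H) + P(D|¬H)P(¬H), which is an addition not spelled out on the slide.

```python
def posterior(prior_h, likelihood_d_given_h, likelihood_d_given_not_h):
    """P(H | D) via Bayes' Theorem, with P(D) expanded by the law of
    total probability over H and not-H."""
    evidence = (likelihood_d_given_h * prior_h
                + likelihood_d_given_not_h * (1.0 - prior_h))
    return likelihood_d_given_h * prior_h / evidence


# Hypothetical numbers: H = "condition present", D = "test is positive"
p = posterior(prior_h=0.01, likelihood_d_given_h=0.95, likelihood_d_given_not_h=0.05)
print(round(p, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays modest because the prior is small; feeding the posterior back in as the next prior models the arrival of further evidence.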
Bayesian and Frequentist Statistics
• There exist two general paradigms for modern statistics: the frequentist and Bayesian
approaches.
Frequentists: Generally consider model parameters (θ) as fixed; data are drawn from some objective distribution, defined by θ. There exist various well-known pathologies associated with frequentism, including the “problem of induction” (Hume), the Black Swan Paradox, limited exact solutions and a heavy reliance upon long-term frequencies.
Bayesians: (Observed) data are fixed; data are observed from a realized sample; we encode prior beliefs, and parameter values are described probabilistically.
• Frequentists use the Maximum Likelihood Estimate (MLE) for point estimates of parameters θ:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)$$
• Bayesians instead use the Maximum A Posteriori (MAP) estimate for parameter estimates (P(D) does not depend on θ, so it can be dropped from the maximization):
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta) P(\theta)$$
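For coin-flip data the two estimates have closed forms, which makes the difference easy to see. The sketch below assumes i.i.d. Bernoulli observations and, for the MAP estimate, a Beta(a, b) prior on θ with a, b > 1; the choice of a Beta prior, the function names, and the grid search used as a sanity check are illustrative assumptions rather than part of the slides.

```python
import math


def bernoulli_mle(heads, flips):
    """argmax_theta P(D | theta) for i.i.d. Bernoulli data: the sample frequency."""
    return heads / flips


def bernoulli_map(heads, flips, a, b):
    """argmax_theta P(D | theta) P(theta) under a Beta(a, b) prior with a, b > 1,
    i.e. the mode of the Beta(heads + a, flips - heads + b) posterior."""
    return (heads + a - 1) / (flips + a + b - 2)


def log_unnormalised_posterior(theta, heads, flips, a, b):
    """log of P(D | theta) P(theta), dropping terms that do not depend on theta."""
    return ((heads + a - 1) * math.log(theta)
            + (flips - heads + b - 1) * math.log(1.0 - theta))


heads, flips = 7, 10
theta_mle = bernoulli_mle(heads, flips)                # 0.7
theta_map = bernoulli_map(heads, flips, a=2.0, b=2.0)  # 8/12, pulled toward the prior mode 0.5

# Sanity check: maximise the unnormalised log-posterior on a coarse grid
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_unnormalised_posterior(t, heads, flips, 2.0, 2.0))
```

As the number of flips grows, the data term dominates the prior and the MAP estimate approaches the MLE.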
(Very Brief) Information Theory
• The entropy of a discrete random variable X (equivalently: the entropy of the pmf
associated with X) is defined:
$$H(X) = -\sum_{i} p(X = x_i) \log p(X = x_i)$$
• The differential entropy of a continuous random variable is defined analogously:
$$H(X) = -\int_S p(x) \log p(x)\,dx$$
• Entropy quantifies disorder/“surprise”; the Principle of Insufficient Reason (PIR) states (paraphrasing) that in the absence of compelling evidence, one should adopt a maximum entropy probability distribution. The uniform distribution is a maximum entropy distribution (over a finite set of outcomes); the Gaussian distribution is likewise a maximum entropy distribution (among distributions with a given mean and variance, i.e. up to second moments). Discrete entropy is minimized (i.e. zero) for deterministic events, where all probability mass sits on a single outcome (the discrete analogue of a Dirac delta function).
(Very Brief) Information Theory
Ex. The entropy of a Bernoulli random variable X is given by:
$$\begin{aligned}
H(X) &= -\sum_{i} p(X = x_i) \log p(X = x_i) \\
&= -\left( p(X = 1) \log p(X = 1) + p(X = 0) \log p(X = 0) \right) \\
&= -\left( \theta \log \theta + (1 - \theta) \log (1 - \theta) \right)
\end{aligned}$$
* Notice that entropy is maximized in this case when θ = 0.5, which corresponds with a binary uniform distribution; conversely, entropy is minimized when either θ = 0 or θ = 1 (using the convention 0 log 0 = 0), in which case the event is deterministic.
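The Bernoulli entropy is easy to tabulate. The sketch below uses natural logarithms (so entropy is measured in nats) and skips zero-probability terms, following the convention 0 log 0 = 0; the function names are illustrative.

```python
import math


def entropy(pmf):
    """Shannon entropy in nats of a discrete pmf given as a list of probabilities.
    Terms with p = 0 are skipped, following the convention 0 * log 0 = 0."""
    return sum(-p * math.log(p) for p in pmf if p > 0.0)


def bernoulli_entropy(theta):
    return entropy([theta, 1.0 - theta])


for theta in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(theta, bernoulli_entropy(theta))
```

The printed values rise from 0 at θ = 0 to log 2 (about 0.693 nats) at θ = 0.5 and fall back to 0 at θ = 1.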
(Very Brief) Information Theory
• The Kullback-Leibler Divergence quantifies the difference between two probability
distributions, p(x) and q(x).
Def. (discrete) $$KL(p \,\|\, q) = \sum_{i} p(X = x_i) \log \frac{p(X = x_i)}{q(X = x_i)}$$
(continuous) $$KL(p \,\|\, q) = \int_S p(x) \log \frac{p(x)}{q(x)}\,dx$$
The Information Inequality states:
$$KL(p \,\|\, q) \ge 0 \quad \text{and} \quad KL(p \,\|\, q) = 0 \iff p = q$$
(Very Brief) Information Theory
(discrete) $$KL(p \,\|\, q) = \sum_{i} p(X = x_i) \log \frac{p(X = x_i)}{q(X = x_i)}$$
(continuous) $$KL(p \,\|\, q) = \int_S p(x) \log \frac{p(x)}{q(x)}\,dx$$
*Recall that covariance/correlation are inherently measures of the linear relationship between two random variables. Using KL-divergence, we can develop a more general notion of dependence, called mutual information.
Def. (discrete) $$MI(X, Y) = KL\left(p(X, Y) \,\|\, p(X) p(Y)\right) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$
(continuous) $$MI(X, Y) = KL\left(p(X, Y) \,\|\, p(X) p(Y)\right) = \int_{S_X} \int_{S_Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}\,dx\,dy$$
*From the information inequality, it follows that:
$$MI(X, Y) \ge 0 \quad \text{and} \quad MI(X, Y) = 0 \iff p(X, Y) = p(X) p(Y)$$
(i.e. $MI(X, Y) = 0 \iff X \perp Y$)
Thus, MI can be seen as a more general measure of statistical dependence than covariance.
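Both quantities are straightforward to compute for discrete distributions. The sketch below represents a pmf as a list and a joint pmf as a 2-D list (rows index X, columns index Y), works in nats, and assumes q is positive wherever p is; these representations and the example distributions are illustrative choices.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete pmfs given as equal-length lists.
    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_information(joint):
    """MI(X, Y) = KL(p(X, Y) || p(X) p(Y)) for a joint pmf given as a 2-D list."""
    px = [sum(row) for row in joint]        # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (columns)
    flat_joint = [pxy for row in joint for pxy in row]
    flat_product = [a * b for a in px for b in py]
    return kl_divergence(flat_joint, flat_product)


independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y are independent fair bits
copied = [[0.5, 0.0], [0.0, 0.5]]           # Y is always equal to X
print(mutual_information(independent))      # 0.0
print(mutual_information(copied))           # log 2, about 0.693

# KL divergence is not symmetric in its arguments
p, q = [0.5, 0.5], [0.9, 0.1]
forward, backward = kl_divergence(p, q), kl_divergence(q, p)
```

In the `copied` case the mutual information equals the entropy of a fair bit, since knowing X removes all uncertainty about Y.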
Fin