Math Prelims

The document provides an overview of mathematical concepts essential for machine learning, focusing on linear algebra and statistics/probability. It covers topics such as vectors, matrices, operations, special matrices, matrix factorizations, determinants, eigenvalues, and the fundamentals of probability distributions. This foundational knowledge is crucial for understanding and applying machine learning techniques effectively.

Uploaded by

harshitaj2022


Mathematics Preliminaries for Machine Learning

CS 445/545
Outline
• Linear Algebra Overview

• Statistics/Probability Overview
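Linear Algebra: Vectors & Matrices

vector: $\mathbf{x} \in \mathbb{R}^{d} \;\rightarrow\; \mathbf{x} = [x_1, \ldots, x_d]$

matrix: $A \in \mathbb{R}^{m \times n} \;\rightarrow\; A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}$

tensor: $M \in \mathbb{R}^{m \times n \times p} \;\rightarrow\; M = \left( \begin{bmatrix} a^{1}_{11} & \cdots & a^{1}_{1n} \\ \vdots & \ddots & \vdots \\ a^{1}_{m1} & \cdots & a^{1}_{mn} \end{bmatrix}, \; \ldots, \; \begin{bmatrix} a^{p}_{11} & \cdots & a^{p}_{1n} \\ \vdots & \ddots & \vdots \\ a^{p}_{m1} & \cdots & a^{p}_{mn} \end{bmatrix} \right)$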
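Vector & Matrix Operations
dot product: $\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{d} x_i y_i$ results in a scalar; x, y are orthogonal if $\mathbf{x} \cdot \mathbf{y} = 0$.

matrix multiplication: $A_{m \times n} \, B_{n \times p} = C_{m \times p}$

$$AB = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} b_{11} & \cdots & b_{1p} \\ \vdots & \ddots & \vdots \\ b_{n1} & \cdots & b_{np} \end{bmatrix} \;\rightarrow\; C_{ij} = \mathbf{a}_{\text{row } i} \cdot \mathbf{b}_{\text{col } j}$$

• Matrix multiplication involves a sequence of dot products; element $C_{ij}$ in the resultant matrix is equal to the dot product of row i (from the left matrix) and column j (from the right matrix).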
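The row-by-column rule can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names `dot` and `matmul` and the two small matrices are assumptions made for the example, not part of the slides.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


def matmul(A, B):
    """Multiply an m x n matrix by an n x p matrix (both given as lists of rows)."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must agree")
    cols_of_B = list(zip(*B))  # column j of B as a tuple
    # C[i][j] is the dot product of row i of A and column j of B.
    return [[dot(row, col) for col in cols_of_B] for row in A]


# Hypothetical example matrices.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
AB = matmul(A, B)  # [[2, 1], [4, 3]]
BA = matmul(B, A)  # [[3, 4], [1, 2]]
```

For this pair the two products differ, so the order of the factors matters.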
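Vector & Matrix Operations
scalar multiplication (vector): $c\mathbf{x} = c[x_1, \ldots, x_d] = [cx_1, \ldots, cx_d]$

scalar multiplication (matrix): $cA = c \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} = \begin{bmatrix} ca_{11} & \cdots & ca_{1n} \\ \vdots & \ddots & \vdots \\ ca_{m1} & \cdots & ca_{mn} \end{bmatrix}$

• Matrix multiplication is associative: A(BC) = (AB)C (always holds), but not commutative: AB ≠ BA (in general).

vector transpose: $\mathbf{x}_{1 \times d} = [x_1, \ldots, x_d] \;\rightarrow\; \mathbf{x}^T_{d \times 1} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$

matrix transpose: $A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \;\rightarrow\; A^T = \begin{bmatrix} a_{11} & \cdots & a_{m1} \\ \vdots & \ddots & \vdots \\ a_{1n} & \cdots & a_{mn} \end{bmatrix}$

• A matrix is called “symmetric” if: $A^T = A$.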

Vector Norms
• Norms convey the notion of the “magnitude” (i.e. size) of a vector; note that the
equivalence (in magnitude) of two vectors is relative to the choice of norm.

• There are many types – even “families” – of norms relevant to ML/data science.
Here are several of the most commonly used norms in ML:

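(1) L2 norm (i.e. “Euclidean norm”)

$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2} \qquad \text{Note: } \|\mathbf{x}\|_2^2 = \sum_{i=1}^{d} x_i^2 = \mathbf{x} \cdot \mathbf{x}$$

(2) L1 norm (i.e. “Manhattan norm”)

$$\|\mathbf{x}\|_1 = \sum_{i=1}^{d} |x_i|$$

(3) ∞ norm

$$\|\mathbf{x}\|_\infty = \max_{1 \le i \le d} |x_i|$$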
*For an ML practitioner, the “choice” of a norm is oftentimes a crucial part of feature engineering and the ML problem
formulation process itself; one can think of the different norm choices as striking a balance between “precision” and
computational overhead.
*Analogous norms are defined for matrices; the above norms are examples from the family of p-norms, $\|\mathbf{x}\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}$.
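As a quick numerical check of the three norms listed above, here is a small sketch in plain Python. The function names and the test vector are illustrative assumptions.

```python
import math


def l2_norm(x):
    """Euclidean norm: square root of the sum of squared entries."""
    return math.sqrt(sum(xi * xi for xi in x))


def l1_norm(x):
    """Sum of absolute values of the entries."""
    return sum(abs(xi) for xi in x)


def linf_norm(x):
    """Largest absolute value among the entries."""
    return max(abs(xi) for xi in x)


x = [3.0, -4.0]  # hypothetical example vector
norms = (l1_norm(x), l2_norm(x), linf_norm(x))  # (7.0, 5.0, 4.0)
```

The same vector has a different "size" under each norm, which is the sense in which magnitude is relative to the choice of norm.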
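Dot Product

$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{d} x_i y_i$$

• The dot product has important geometric properties that are useful in ML.

(*) The dot product can be defined equivalently, with θ the angle between x and y:

$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2 \cos(\theta)$$

From this equivalent definition of the dot product, we can show that the dot product quantifies the “similarity” between two vectors. Consider three cases:

(i) Vectors x and y are “out of alignment” and meet at a 90 degree angle; in this case:
$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2 \cos(90^\circ) = 0$$

(ii) Vectors x and y are “perfectly aligned” (i.e. parallel to one another):
$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2 \cos(0^\circ) = \|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2$$

(iii) Vectors x and y are “oppositely aligned” (i.e. they are anti-parallel):
$$\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2 \cos(180^\circ) = -\|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2$$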
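The three cases can be checked numerically by solving the geometric definition for cos(θ). This is a minimal sketch; the function name `cosine` and the example vectors are assumptions chosen for illustration.

```python
import math


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def cosine(x, y):
    """cos(theta) between x and y, from x.y = ||x|| ||y|| cos(theta)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


orthogonal = cosine([1.0, 0.0], [0.0, 2.0])        # case (i)
parallel = cosine([1.0, 2.0], [2.0, 4.0])          # case (ii)
anti_parallel = cosine([1.0, 2.0], [-1.0, -2.0])   # case (iii)
```

Up to floating-point rounding, the three values are 0, 1 and -1 respectively.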
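Special Matrices
• The identity matrix I is a square (n×n) matrix; the identity matrix multiplied by any matrix A (appropriately shaped) results in the matrix A:

$$I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}_{n \times n} \qquad AI = IA = A$$

• A matrix A is said to be symmetric if it equals its transpose:

$$A^T = A, \quad \text{e.g., } \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}^T = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$$

Note: $(A + B)^T = A^T + B^T$, $(AB)^T = B^T A^T$

• For a diagonal matrix, all off-diagonal entries are zero (note that diagonal entries are permitted to be zero).

$$D = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_m \end{bmatrix}$$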
Special Matrices
• An upper-triangular matrix has zero elements below the main diagonal. Note that Gaussian Elimination (from elementary linear algebra) yields an upper-triangular matrix.

$$U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{nn} \end{bmatrix}$$

• A lower-triangular matrix has zero elements above the main diagonal.

$$L = \begin{bmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{bmatrix}$$

• An orthogonal matrix is a square matrix with orthonormal rows and columns; equivalently, the inverse of an orthogonal matrix is its transpose.

$$Q^T Q = QQ^T = I$$
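Special Matrices
• A square matrix A is Positive Definite if:

$$\forall \mathbf{x} \in \mathbb{R}^n \setminus \{0\}, \quad \mathbf{x}^T A \mathbf{x} > 0$$

• Analogously, a square matrix A is Positive semi-Definite (e.g. a covariance matrix) if:

$$\forall \mathbf{x} \in \mathbb{R}^n, \quad \mathbf{x}^T A \mathbf{x} \ge 0$$

• We say that a square matrix $A_{n \times n}$ is invertible (i.e. non-singular) if there exists $A^{-1}_{n \times n}$, where:

$$AA^{-1} = A^{-1}A = I$$
Properties:

$$(A^T)^{-1} = (A^{-1})^T = A^{-T}, \qquad (AB)^{-1} = B^{-1}A^{-1} \quad \text{(if } A, B \text{ non-singular)}$$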
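Linear Systems
• Commonly we encode, and subsequently solve, systems of linear equations:

$$\begin{aligned} a_{11}x_1 + \cdots + a_{1n}x_n &= b_1 \\ a_{21}x_1 + \cdots + a_{2n}x_n &= b_2 \\ &\;\;\vdots \\ a_{m1}x_1 + \cdots + a_{mn}x_n &= b_m \end{aligned} \quad\rightarrow\quad A\mathbf{x} = \mathbf{b} \quad \text{(Canonical Matrix Form)}$$

• When the coefficient matrix A is square and non-singular, the linear system has a unique solution:

$$A\mathbf{x} = \mathbf{b} \;\rightarrow\; \mathbf{x} = A^{-1}\mathbf{b}$$

* Note that inverting an n×n matrix requires on the order of $O(n^3)$ arithmetic operations.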
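In practice a system Ax = b is usually solved by elimination rather than by forming the inverse explicitly. The sketch below is one possible implementation, assuming a square, non-singular A; the function name `solve` and the 2×2 example are illustrative, and exact fractions are used only to keep the arithmetic easy to verify by hand.

```python
from fractions import Fraction


def solve(A, b):
    """Solve Ax = b for square, non-singular A by Gaussian elimination
    with row swaps, using exact rational arithmetic."""
    n = len(A)
    # Augmented matrix [A | b].
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    # Back substitution on the resulting upper-triangular system.
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


# Hypothetical system: 2*x1 + x2 = 3 and x1 + 3*x2 = 5.
x = solve([[2, 1], [1, 3]], [3, 5])  # [Fraction(4, 5), Fraction(7, 5)]
```

A singular coefficient matrix, such as `[[1, 2], [2, 4]]`, makes `solve` raise `ValueError`, matching the statement that a unique solution requires a non-singular A.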
Matrix Factorizations
• Matrix factorizations are immensely useful for identifying an underlying, inherent
structure in a matrix (i.e. data).

Here are several important examples:

LU Factorization
A = LU
• This factorization encodes the result of the Gaussian Elimination (GE) procedure (note that not all matrices admit an LU factorization). L denotes a lower-triangular matrix of “multipliers” used in GE. U denotes an upper-triangular (i.e. echelon form) matrix resulting from GE.

PALU Factorization (Permuted LU factorization) PA = LU

• This technique is similar to LU Factorization, except that we perform a pivoting operation first (i.e. permute the rows of A via a permutation matrix, P). LU factorization is subsequently performed; every square matrix admits such a factorization.
Matrix Factorizations
Here are several important examples:

QR Factorization
A = QR
• Q is an orthogonal matrix and R is upper-triangular; QR is commonly used for solving both regression (least-squares) problems and linear dynamical systems.

Eigendecomposition
$A = V \Sigma V^T$

• This is one of the most useful and commonly-used of all matrix factorizations. The primary use of an eigendecomposition in ML is to perform dimensionality reduction; as such, this technique is closely related to PCA (principal component analysis) and SVD (singular value decomposition) methods. Σ is a diagonal matrix consisting of the eigenvalues of A, and V is the matrix of corresponding eigenvectors. The form $V \Sigma V^T$ with orthogonal V holds when A is symmetric; a general diagonalizable matrix factors as $A = V \Sigma V^{-1}$.
Matrix Factorizations
Here are several important examples:

Cholesky Factorization
$A = LL^T$
• L is lower-triangular; Cholesky can be used to numerically solve linear systems; every symmetric positive-definite matrix admits a Cholesky factorization.

SVD (Singular Value Decomposition) $A = U \Sigma V^T$

• SVD is one of the most essential matrix factorizations for applications of ML. U and V are orthogonal matrices, and Σ is a diagonal matrix containing the “singular values” (i.e. the square roots of the eigenvalues of $A^T A$); the columns of V are the eigenvectors of $A^T A$. SVD has many applications, including dimensionality reduction and compression. All matrices admit a singular value decomposition.
Determinants
• Geometrically, the absolute value of the determinant of a square matrix A (written |A|) is the factor by which the linear transformation defined by A scales volume (note that matrix multiplication defines a linear transformation).

Determinants can be computed through recursion; the general formula for a determinant, expanding along any fixed column j, is:

$$|A| = \sum_{i=1}^{n} a_{ij} \, (-1)^{i+j} \, M_{ij}$$

where $M_{ij}$, the "ij-minor" of A, is the determinant of the submatrix obtained by deleting row i and column j of A.

Note that |A| = 0 if and only if A is singular.

Some Properties of Determinants:

$$|AB| = |A| \, |B| \qquad |A^T| = |A|$$
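The recursive formula translates directly into code. The sketch below expands along the first column; the names `minor_matrix` and `det` and the example matrices are assumptions for illustration. This approach is exponential in n, so it is only meant for small matrices.

```python
def minor_matrix(A, i, j):
    """Submatrix of A with row i and column j deleted (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by cofactor expansion along the first column."""
    n = len(A)
    if n == 1:
        return A[0][0]
    # With 0-based i and column 0, the sign (-1)^(i+j) reduces to (-1)^i.
    return sum((-1) ** i * A[i][0] * det(minor_matrix(A, i, 0)) for i in range(n))


# Hypothetical example matrices.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
AB = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
A_T = [list(col) for col in zip(*A)]
singular = [[1, 2], [2, 4]]  # second row is twice the first
```

Here `det(A)` is -2, `det(B)` is -1 and `det(AB)` is 2, consistent with the product rule, while `det(singular)` is 0.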
Eigenvalues
• The eigenvalues λ and eigenvectors v of a matrix A satisfy:

$$A\mathbf{v} = \lambda \mathbf{v} \qquad (\mathbf{v} \ne 0)$$

• This means that the eigenvectors of a matrix A are precisely the vectors for which multiplication by A is tantamount to scalar multiplication by λ.

• Determining the exact values of the set of eigenvalues for a matrix $A_{n \times n}$ requires solving the so-called characteristic equation: |A − λI| = 0, which is an n-degree polynomial equation in the variable λ.
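For a 2×2 matrix with rows (a, b) and (c, d), the characteristic equation is the quadratic λ² − (a + d)λ + (ad − bc) = 0, which can be solved in closed form. The sketch below handles only the real-eigenvalue case; the function name and the example matrix are illustrative assumptions.

```python
import math


def eigenvalues_2x2(A):
    """Real eigenvalues of a 2x2 matrix from |A - lambda*I| = 0."""
    (a, b), (c, d) = A
    trace, determinant = a + d, a * d - b * c
    disc = trace * trace - 4 * determinant
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return ((trace - root) / 2, (trace + root) / 2)


A = [[2, 1], [1, 2]]  # hypothetical symmetric example
lam_small, lam_large = eigenvalues_2x2(A)  # (1.0, 3.0)

# v = [1, 1] is an eigenvector for the larger eigenvalue: A v = 3 v.
v = [1, 1]
Av = [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]
```

For larger matrices the polynomial has no general closed-form roots beyond degree four, so numerical libraries use iterative methods instead.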
Linear Independence, Span and Basis
• A set of vectors is called linearly independent if the set contains “no redundancy”; formally:

Def. A set of vectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is linearly independent if:

$$\alpha_1 \mathbf{v}_1 + \cdots + \alpha_n \mathbf{v}_n = 0 \;\text{ implies }\; \alpha_i = 0 \;\; \forall \, 1 \le i \le n$$

• The span of a set of vectors is defined as the set of all linear combinations of those vectors.

Def. $\mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_n\} = \{\alpha_1 \mathbf{v}_1 + \cdots + \alpha_n \mathbf{v}_n \mid \alpha_i \in \mathbb{R}\}$

• A basis is a set of linearly independent vectors that spans the parent vector space.

e.g., $\left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right\}$ is the "standard" basis set for $\mathbb{R}^3$.
The Four Fundamental Subspaces
• The Four Fundamental subspaces of a matrix $A_{m \times n}$:

1. (Column Space) Col(A): the span of the column vectors of A.
2. (Null Space) Null(A): the set of all vectors x that satisfy Ax = 0.
3. (Row Space) Row(A): the span of the row vectors of A.
4. (Null Space of $A^T$) Null($A^T$): the set of all vectors x that satisfy $A^T$x = 0.
Overview of Statistics/Probability
• We use statistics and probability to quantify and summarize our beliefs about a “state
of the world” in the face of incomplete or partial knowledge.

• Denote a random event E; the sample space S consists of the set of all possible outcomes
associated with E (e.g. if E=“coin flip”, S={H,T}).

• A random variable (e.g. X, Y) is a variable that is assigned a number based on the outcome
of the random event E.

• Random variables are either Discrete (e.g., 0/1) or Continuous (e.g., height, time).
Overview of Statistics/Probability
Probability Distributions

• A probability distribution summarizes our total knowledge about the random event E,
via the random variable X.

• For a discrete random variable, the probability distribution of X is called a probability mass function (pmf); a pmf satisfies the following properties, with |S| = k:

1. $0 \le p(X = x_i) \le 1, \quad 1 \le i \le k$
2. $\sum_{i=1}^{k} p(X = x_i) = 1$

• Similarly, for a continuous random variable, the probability distribution of X is called a probability density function (pdf); a pdf satisfies the following properties, with S uncountably infinite:

1. $p(x) \ge 0$ for all x in S (a density, unlike a probability, may exceed 1)
2. $\int_S p(x) \, dx = 1$
Overview of Statistics/Probability
Probability Distributions

• A cumulative distribution function (cdf) is defined as the cumulative probability up to a given value of a random variable:

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} p(u) \, du$$

• Percentiles and quartiles can be defined in a natural way with respect to a cdf:

$$F_X(Q_1) = 0.25 \qquad F_X(Q_2) = 0.5 \;\text{(median)} \qquad F_X(Q_3) = 0.75$$

• Note that due to the Fundamental Theorem of Calculus, it follows that the derivative of the cdf is the pdf:

$$\frac{d}{dx} F_X(x) = p(x)$$
Overview of Statistics/Probability
Properties of Probability Distributions

• Two random events $E_1$ and $E_2$ are disjoint if they share no outcomes: $E_1 \cap E_2 = \emptyset$.

• If two events $E_1$ and $E_2$ are disjoint, then: $P(E_1 \cup E_2) = P(E_1) + P(E_2)$

• More generally, the addition rule of probability states that for any two events $E_1$ and $E_2$:

$$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$$
Overview of Statistics/Probability
Conditional Probability

• Def. Conditional Probability (the probability of "A given B"):

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

• From this definition, we can derive the multiplication rule of probability:

$$P(A \cap B) = P(A \mid B) \, P(B)$$

• Equivalently,

$$P(A \cap B) = P(B \mid A) \, P(A)$$
Overview of Statistics/Probability
Independence

• We say that events A & B are independent if the outcome of A has no bearing on B (and vice versa); more formally, the joint probability distribution p(A,B) factors.

• Def. A & B are independent if: $P(A \cap B) = P(A) \, P(B)$

• Equivalently, if A & B are independent, it also follows that:

$$P(A \mid B) = P(A) \qquad P(B \mid A) = P(B)$$

Thus, in summary, if A & B are independent: $P(A \cap B) = P(A \mid B) \, P(B) = P(A) \, P(B)$

*Independence is commonly denoted: A⊥B
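Conditional probability and independence can be verified by brute-force enumeration of a small sample space. The sketch below uses two fair dice; the events A, B and C and all helper names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
S = list(product(range(1, 7), repeat=2))


def P(event):
    """Probability of an event, given as a predicate on outcomes."""
    return Fraction(sum(1 for s in S if event(s)), len(S))


def A(s):
    return s[0] % 2 == 0        # first die is even


def B(s):
    return s[1] > 4             # second die shows 5 or 6


def C(s):
    return s[0] + s[1] >= 10    # total is at least 10


def both(E, F):
    """Intersection of two events."""
    return lambda s: E(s) and F(s)


p_A_given_B = P(both(A, B)) / P(B)  # 1/2, equal to P(A): A and B are independent
p_A_given_C = P(both(A, C)) / P(C)  # 2/3, not equal to P(A): A and C are dependent
```

Events about different dice factor as P(A ∩ B) = P(A)P(B), whereas knowing that the total is large changes the probability that the first die is even.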

Overview of Statistics/Probability
• Two major theorems in elementary statistics: (1) the Law of Large Numbers and (2) the
Central Limit Theorem.

• The Law of Large Numbers states (paraphrasing): Experimental (i.e. empirical) probabilities converge to their associated theoretical probabilities as the number of trials tends to infinity.
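A short simulation illustrates the Law of Large Numbers for a coin flip. This is a sketch under stated assumptions: the function name, the fixed seed and the trial counts are arbitrary choices for the example.

```python
import random


def empirical_frequency(n_trials, p=0.5, seed=0):
    """Fraction of simulated flips that come up heads, for a coin with P(H) = p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(1 for _ in range(n_trials) if rng.random() < p)
    return heads / n_trials


# Empirical P(H) for an increasing number of trials.
frequencies = {n: empirical_frequency(n) for n in (10, 1_000, 100_000)}
```

With only 10 flips the empirical frequency can be far from 0.5; with 100,000 flips it is typically within a fraction of a percent.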
Overview of Statistics/Probability
The Central Limit Theorem (a conceptual pillar of statistics)

In words: given a sufficiently large sample size n from a population with finite variance, the distribution of the sample mean is approximately normal, centered at the population mean, with variance approximately equal to the population variance divided by the sample size.

In a picture:

[Figure: whatever the form of the population distribution, the sampling distribution tends to a Gaussian, and its dispersion is given by the Central Limit Theorem.]

In a theorem: Suppose {X1, X2, …} is a sequence of i.i.d. random variables with E[Xi] = μ and Var[Xi] = σ² < ∞. Then as n approaches infinity, the centered and scaled sample mean converges in distribution to a normal N(0, σ²):

$$\sqrt{n} \left( \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \right) \xrightarrow{\;d\;} N(0, \sigma^2)$$

Equivalently, for large n the sample mean (1/n)(X1 + … + Xn) is approximately distributed as N(μ, σ²/n).
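The σ²/n scaling of the sample mean can be checked by simulation. The sketch below draws from a Uniform(0, 1) population, which has mean 1/2 and variance 1/12; the sample size, number of repetitions and seed are arbitrary assumptions.

```python
import random
import statistics


def sample_means(n, n_samples, seed=0):
    """Means of n_samples independent samples, each of size n, from Uniform(0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) / n for _ in range(n_samples)]


n = 30
means = sample_means(n, n_samples=5_000)
observed_mean = sum(means) / len(means)
observed_var = statistics.pvariance(means)
predicted_var = (1 / 12) / n  # population variance divided by the sample size
```

The observed mean of the sample means should sit near 0.5 and their variance near 1/360, even though the underlying population is uniform rather than Gaussian.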
Probability Distributions
• Here are some (but certainly not all) of the essential probability distributions for ML and
applied statistics:

1-D Gaussian (i.e. Normal)

• When μ=0 and σ=1 (i.e. N(0,1)) we call this the standard Normal model.
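For reference, the standard form of the 1-D Gaussian density with mean μ and variance σ² is given below. It is supplied here as the textbook formula, not recovered from the slide itself.

```latex
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```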
Probability Distributions
Multivariate Gaussian (i.e. MVN)
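For reference, the standard form of the multivariate Gaussian density in d dimensions, with mean vector μ and covariance matrix Σ (assumed positive definite), is given below. It is supplied here as the textbook formula, not recovered from the slide itself.

```latex
p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} \, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```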
Probability Distributions
Bernoulli & Binomial Distributions

• The Bernoulli distribution is a single-variable, discrete distribution describing a random variable with two discrete states (e.g. heads/tails for a single coin flip). The Bernoulli distribution forms the basis of the Binomial distribution, which models repetitions of independent Bernoulli trials (n total).

Bernoulli pmf

$$p(X = 1) = \theta \qquad p(X = 0) = 1 - \theta \qquad 0 \le \theta \le 1$$

Binomial pmf (the probability of "k successes in n trials")

$$p(X = k) = \binom{n}{k} \theta^{k} (1 - \theta)^{n-k}$$

• For example, with a biased coin (θ = 0.6), we have:

$$p(\text{exactly 7 H in 10 flips}) = \binom{10}{7} \, 0.6^{7} \, (1 - 0.6)^{10-7} \approx 0.215$$
Summary Statistics for Random Variables
Expectation and Variance of a Random Variable

• The Expected Value of a random variable X summarizes its outcome: “if the trial were repeated many times, this is the numerical value we would observe for X on average”; E[X] accordingly computes the probability-weighted arithmetic mean of a random variable, i.e. E[X] = μ.

$$E[X] = \sum_{i=1}^{k} x_i \, P(X = x_i) \;\;\text{(Discrete RV)} \qquad E[X] = \int_S x \, p(x) \, dx \;\;\text{(Continuous RV)}$$

For example, to compute the expected number of heads X in 10 flips of a fair coin (X~Binomial) we have:

$$E[X] = \sum_{k=0}^{10} k \binom{10}{k} \left(\frac{1}{2}\right)^{k} \left(\frac{1}{2}\right)^{10-k} = 5$$
Summary Statistics for Random Variables
Properties of Expected Value and Variance

• Expected Value is a linear operator (as are matrix multiplication, limits, differentiation and integration, among other common mathematical operators), meaning that it obeys the following two linearity properties:

1. For any two random variables X, Y: $E[X + Y] = E[X] + E[Y]$

2. For any $c \in \mathbb{R}$: $E[cX] = cE[X]$

The following corollary is also useful: $\mathrm{Var}[X] = E[X^2] - E[X]^2$

Proof.

$$\begin{aligned} \mathrm{Var}[X] &= E\left[(X - \mu)^2\right] = E\left[X^2 - 2\mu X + \mu^2\right] \\ &= E[X^2] - E[2\mu X] + E[\mu^2] && \text{(by linearity of } E\text{)} \\ &= E[X^2] - 2\mu E[X] + \mu^2 && \text{(by linearity of } E\text{; } E[\text{const}] = \text{const)} \\ &= E[X^2] - 2\mu^2 + \mu^2 \\ &= E[X^2] - \mu^2 = E[X^2] - E[X]^2 \end{aligned}$$
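The identity Var[X] = E[X²] − E[X]² can be confirmed exactly on a small pmf. The sketch below uses a fair six-sided die as a hypothetical example; the helper name `expectation` is an assumption.

```python
from fractions import Fraction

# pmf of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(g, pmf):
    """E[g(X)] for a discrete random variable with the given pmf."""
    return sum(g(x) * p for x, p in pmf.items())


mu = expectation(lambda x: x, pmf)                           # 7/2
var_definition = expectation(lambda x: (x - mu) ** 2, pmf)   # E[(X - mu)^2]
var_shortcut = expectation(lambda x: x * x, pmf) - mu ** 2   # E[X^2] - E[X]^2
```

Both routes give 35/12, with no rounding because the arithmetic is done in exact fractions.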
Covariance
• Covariance is a measure of the linear relationship between two random variables, X and Y.
If Cov(X,Y) > 0, this indicates a positive linear relationship between the random variables
(i.e. as X increases, Y increases; as X decreases, Y decreases); when Cov(X,Y) < 0 the
variables share a negative linear relationship; Cov(X,Y) = 0 indicates the absence of a linear
relationship.

Def. $\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$

Lemma

If X ⊥ Y (i.e. if X and Y are independent), then Cov(X, Y) = 0.

* Note that the converse of the lemma above fails; in other words Cov(X,Y)=0 need not imply
that X and Y are independent.
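A standard counterexample shows why zero covariance does not imply independence: take X uniform on {−1, 0, 1} and Y = X². The sketch below computes the covariance exactly; the variable and helper names are illustrative assumptions.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X**2, so Y is fully determined by X.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}


def E(g):
    """Expectation of g(X, Y) under the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


ex = E(lambda x, y: x)   # 0
ey = E(lambda x, y: y)   # 2/3
cov = E(lambda x, y: (x - ex) * (y - ey))

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = joint[(0, 0)]
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
```

The covariance is exactly 0, yet P(X=0, Y=0) = 1/3 while P(X=0)P(Y=0) = 1/9, so X and Y are dependent. The relationship is purely nonlinear, which covariance cannot detect.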
Covariance
• The Covariance Matrix (Σ) for a set of random variables {X1,…,XN} is defined as the matrix of pairwise covariances:

Def. Let $X = [X_1, \ldots, X_N]$; then $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E\big[(X_i - E[X_i])(X_j - E[X_j])\big]$, giving the matrix of covariances:

$$\Sigma = \begin{bmatrix} \mathrm{Var}[X_1] & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_N) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}[X_2] & \cdots & \mathrm{Cov}(X_2, X_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_N, X_1) & \mathrm{Cov}(X_N, X_2) & \cdots & \mathrm{Var}[X_N] \end{bmatrix}$$

Note that Σ is symmetric and positive semi-definite. The covariance matrix is used to
parameterize the MVN (multivariate normal distribution); the covariance matrix can likewise
be computed for a dataset.
Bayes’ Theorem
• Bayes’ Theorem is a vital (yet simple) conditional probability formula; today its use is omnipresent across ML.

Def. $P(A \mid B) = \dfrac{P(B \mid A) \, P(A)}{P(B)}$

Derivation: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ (by definition of conditional probability) $= \dfrac{P(B \mid A) \, P(A)}{P(B)}$ (by the multiplication rule)

• More importantly, Bayes’ Theorem can be generalized to encapsulate the whole of the inductive element of the scientific method. To this end, consider H (hypothesis) and D (data):

$$P(H \mid D) = \frac{P(D \mid H) \, P(H)}{P(D)}$$

• In this case, Bayes’ Theorem yields a natural mechanism for updating our belief about the world/the plausibility of a hypothesis (H) given an observation (D). P(H|D) is referred to as the posterior probability of H, P(H) is called the prior probability of H, P(D|H) defines the likelihood of the data, and P(D) is the data prior (also called the evidence or marginal likelihood).
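A small numerical sketch shows the update from prior to posterior. The numbers are hypothetical (a 1% prior, a 95% chance of the observation when H is true, a 5% chance when it is false), and the evidence P(D) is expanded with the law of total probability, P(D) = P(D|H)P(H) + P(D|¬H)P(¬H), which the slide does not spell out.

```python
from fractions import Fraction


def posterior(prior, likelihood, likelihood_if_not_h):
    """P(H | D) from P(H), P(D | H) and P(D | not H)."""
    # Evidence via the law of total probability.
    evidence = likelihood * prior + likelihood_if_not_h * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs: P(H) = 1/100, P(D | H) = 19/20, P(D | not H) = 1/20.
p_h_given_d = posterior(Fraction(1, 100), Fraction(19, 20), Fraction(1, 20))  # 19/118
```

Even with a likelihood of 95%, the posterior is only about 16% because the prior is so small; the observation raises the plausibility of H substantially but does not make it probable.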
Bayesian and Frequentist Statistics
• There exist two general paradigms for modern statistics: the frequentist and Bayesian
approaches.

Frequentists: Generally consider model parameters (θ) as fixed; data are drawn from some objective distribution, defined by θ. There exist various well-known pathologies associated with frequentism, including the “problem of induction” (Hume), the Black Swan Paradox, limited exact solutions and a heavy reliance upon long-term frequencies.

Bayesians: (Observed) data are fixed; data are observed from a realized sample; we encode prior beliefs, and parameter values are described probabilistically.

• Frequentists use the Maximum Likelihood Estimate (MLE) for point estimates of parameters θ:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)$$

• Bayesians instead use the Maximum A Posteriori (MAP) estimate for parameter estimates:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta) \, P(\theta)$$
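For a coin with unknown bias θ, both estimates have closed forms, which makes the difference easy to see. The sketch below assumes k heads in n flips and, for MAP, a Beta(a, b) prior with a, b > 1; the prior family and the function names are assumptions introduced for the example, not taken from the slides.

```python
from fractions import Fraction


def bernoulli_mle(k, n):
    """Maximizer of the likelihood theta^k (1 - theta)^(n - k)."""
    return Fraction(k, n)


def bernoulli_map(k, n, a, b):
    """Mode of the Beta(k + a, n - k + b) posterior; assumes a, b > 1."""
    return Fraction(k + a - 1, n + a + b - 2)


# Three flips, all heads (the Black Swan situation in miniature).
mle = bernoulli_mle(3, 3)            # 1: tails is judged impossible
map_ = bernoulli_map(3, 3, 2, 2)     # 4/5: the prior keeps tails plausible
```

With little data the prior pulls the MAP estimate away from the extreme value; as n grows, the two estimates approach each other.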

 
(Very Brief) Information Theory
• The entropy of a discrete random variable X (equivalently: the entropy of the pmf associated with X) is defined:

$$H(X) = -\sum_{i} p(X = x_i) \log p(X = x_i)$$

• The differential entropy of a continuous random variable is defined analogously:

$$H(X) = -\int_S p(x) \log p(x) \, dx$$

• Entropy quantifies disorder/“surprise”; the Principle of Insufficient Reason (PIR) states (paraphrasing) that in the absence of compelling evidence, one should adopt a maximum entropy probability distribution. The uniform distribution is a maximum entropy distribution; the Gaussian distribution is likewise a maximum entropy distribution (for a fixed mean and variance). Entropy is minimized (i.e. zero) for deterministic events, e.g. Dirac delta function.
(Very Brief) Information Theory
Ex. The entropy of a Bernoulli random variable X is given by:

$$\begin{aligned} H(X) &= -\sum_{i} p(X = x_i) \log p(X = x_i) \\ &= -\big( p(X = 1) \log p(X = 1) + p(X = 0) \log p(X = 0) \big) \\ &= -\big( \theta \log \theta + (1 - \theta) \log(1 - \theta) \big) \end{aligned}$$

* Notice that entropy is maximized in this case when θ = 0.5, which corresponds with a binary uniform distribution; conversely, entropy is minimized when either θ = 0 or θ = 1, in which case the event is deterministic.
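The Bernoulli entropy formula can be evaluated directly to confirm where it peaks and where it vanishes. The sketch uses natural logarithms (nats) and the usual convention 0·log 0 = 0; the function name is an assumption.

```python
import math


def bernoulli_entropy(theta):
    """Entropy of a Bernoulli(theta) variable in nats, with 0 * log(0) taken as 0."""
    return -sum(p * math.log(p) for p in (theta, 1 - theta) if p > 0)


h_fair = bernoulli_entropy(0.5)      # log(2), the maximum
h_biased = bernoulli_entropy(0.9)    # smaller than log(2)
h_certain = bernoulli_entropy(1.0)   # 0: the outcome is deterministic
```

Using base-2 logarithms instead would give the same shape with a maximum of exactly 1 bit at θ = 0.5.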
(Very Brief) Information Theory
• The Kullback-Leibler Divergence quantifies the difference between two probability distributions, p(x) and q(x).

Def. $KL(p \,\|\, q) = \sum_{i} p(X = x_i) \log \dfrac{p(X = x_i)}{q(X = x_i)}$ (discrete)

$KL(p \,\|\, q) = \int_S p(x) \log \dfrac{p(x)}{q(x)} \, dx$ (continuous)

The Information Inequality states:

$$KL(p \,\|\, q) \ge 0 \quad\text{and}\quad KL(p \,\|\, q) = 0 \iff p = q$$
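The discrete definition and the Information Inequality can be checked on two small pmfs. The function name `kl_divergence` and the example distributions are assumptions for this sketch, which uses natural logarithms.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two pmfs given as equal-length sequences.

    Terms with p_i = 0 contribute 0; q_i must be positive wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)  # 0.0: a distribution has zero divergence from itself
```

Note that `kl_pq` and `kl_qp` differ: KL divergence is not symmetric, so it is not a distance in the metric sense.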
(Very Brief) Information Theory

$KL(p \,\|\, q) = \sum_{i} p(X = x_i) \log \dfrac{p(X = x_i)}{q(X = x_i)}$ (discrete)

$KL(p \,\|\, q) = \int_S p(x) \log \dfrac{p(x)}{q(x)} \, dx$ (continuous)

*Recall that covariance/correlation are inherent measures of the linear relationship between two random variables. Using KL-divergence, we can develop a more general measure of dependence, called mutual information.

Def. $MI(X, Y) = KL\big(p(X, Y) \,\|\, p(X)p(Y)\big) = \sum_{X} \sum_{Y} p(X, Y) \log \dfrac{p(X, Y)}{p(X) \, p(Y)}$ (discrete)

$MI(X, Y) = KL\big(p(X, Y) \,\|\, p(X)p(Y)\big) = \int_{S_X} \int_{S_Y} p(x, y) \log \dfrac{p(x, y)}{p(x) \, p(y)} \, dx \, dy$ (continuous)

*From the information inequality, it follows that:

$$MI(X, Y) \ge 0 \quad\text{and}\quad MI(X, Y) = 0 \iff p(X, Y) = p(X) \, p(Y)$$
(i.e. $MI(X, Y) = 0 \iff X \perp Y$)
Thus, MI can be seen as a more general measure of statistical dependence than covariance.
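A short sketch makes the contrast with covariance concrete. It computes MI for two hypothetical joint pmfs: a pair of independent fair bits, and X uniform on {−1, 0, 1} with Y = X², a pair whose covariance is zero even though Y is a function of X. The function name and data layout are assumptions.

```python
import math


def mutual_information(joint):
    """MI(X, Y) in nats for a joint pmf given as a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p  # marginal of X
        py[y] = py.get(y, 0.0) + p  # marginal of Y
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
dependent = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}

mi_independent = mutual_information(independent)  # 0 up to rounding
mi_dependent = mutual_information(dependent)      # strictly positive
```

Mutual information registers the nonlinear dependence in the second example, which a covariance of zero would miss.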
Fin
