Machine Learning (机器学习)

Lecture 1.5: Supplementary Prerequisite Mathematics (前置数学知识补遗)
盛律 / School of Software
2024 Fall/Winter Semester

Outline

- Linear Algebra
- Probability
- Information Theory

Please refer to the appendices of Pattern Recognition and Machine Learning (C. M. Bishop) and
Machine Learning (Zhihua Zhou) for more details.

Linear Algebra

Matrix

A = [a_{ij}]_{m \times n} =
    \begin{bmatrix}
      a_{11} & a_{12} & \cdots & a_{1n} \\
      a_{21} & a_{22} & \cdots & a_{2n} \\
      \vdots & \vdots  & \ddots & \vdots \\
      a_{m1} & a_{m2} & \cdots & a_{mn}
    \end{bmatrix}
  = [a_1, a_2, \cdots, a_n]

- Diagonal matrix:
    \mathrm{diag}(a_{11}, a_{22}, \cdots, a_{nn}) =
    \begin{bmatrix}
      a_{11} & 0      & \cdots & 0 \\
      0      & a_{22} & \cdots & 0 \\
      \vdots & \vdots & \ddots & \vdots \\
      0      & 0      & \cdots & a_{nn}
    \end{bmatrix}
- Identity matrix: I = \mathrm{diag}(1, 1, \cdots, 1)
- Trace: \mathrm{tr}(A) = \sum_{j=1}^{n} a_{jj}
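A minimal NumPy sketch (added here for illustration; not part of the original slides) showing the diagonal matrix, identity matrix, and trace defined above:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

D = np.diag([1.0, 5.0, 9.0])   # diag(a11, a22, a33)
I = np.eye(3)                  # identity matrix I = diag(1, 1, 1)
print(np.trace(A))             # tr(A) = 1 + 5 + 9 = 15
```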
Matrix Addition/Subtraction

- If C = A \pm B, then [c_{ij}] = [a_{ij}] \pm [b_{ij}]
- Commutative: A + B = B + A
- Associative: (A + B) + C = A + (B + C)

Multiply a Vector by a Matrix

Ax = y

    \begin{bmatrix}
      a_{11} & a_{12} & \cdots & a_{1n} \\
      a_{21} & a_{22} & \cdots & a_{2n} \\
      \vdots & \vdots  & \ddots & \vdots \\
      a_{m1} & a_{m2} & \cdots & a_{mn}
    \end{bmatrix}
    \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
    =
    \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
    \quad \text{and} \quad y_i = \sum_{j=1}^{n} a_{ij} x_j

- Write A = [a_1, a_2, \dots, a_n]; then y = \sum_{j=1}^{n} x_j a_j
  - y can be written as a weighted sum of A's column vectors
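A short NumPy check (an illustrative addition, not from the original slides) that the product Ax equals the weighted sum of A's columns:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # m x n matrix with columns a_1, a_2
x = np.array([10.0, -1.0])

y = A @ x                                   # y_i = sum_j a_ij x_j
y_cols = x[0] * A[:, 0] + x[1] * A[:, 1]    # weighted sum of A's columns
print(np.allclose(y, y_cols))               # True
```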

Matrix Multiplication

- If C_{m \times n} = A_{m \times p} B_{p \times n}, then [c_{ij}] = \sum_{k=1}^{p} a_{ik} b_{kj}
- In general, non-commutative: AB \neq BA
- Associative: (AB)C = A(BC)
- Distributive: (A + B)C = AC + BC

Transpose

- If A^\top = B, then b_{ij} = a_{ji}
- (A^\top)^\top = A, (AB)^\top = B^\top A^\top, (A + B)^\top = A^\top + B^\top
- Symmetric matrix: a_{ij} = a_{ji}, or A = A^\top
- Matrix A is orthogonal if A^\top A is diagonal
- Matrix A is orthonormal if A^\top = A^{-1}
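A quick numerical sanity check (illustrative sketch, not part of the original slides) of the transpose rule and of non-commutativity:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
print(np.allclose((A @ B).T, B.T @ A.T))   # (AB)^T = B^T A^T

C = rng.standard_normal((4, 4))
D = rng.standard_normal((4, 4))
print(np.allclose(C @ D, D @ C))           # generally False: AB != BA
```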
Determinant

- If A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, then |A| = a_{11} a_{22} - a_{21} a_{12}
- In general, |A| = \sum_{j=1}^{n} a_{ij} \, \mathrm{cof}(a_{ij}), where \mathrm{cof}(a_{ij}) is the cofactor of element a_{ij}
- Properties:
  - Determinant is a scalar quantity
  - If |A| = 0, then A is singular, otherwise non-singular
  - |A^\top| = |A|
  - |AB| = |BA| = |A||B|

Inverse

    A^{-1} A = I \quad \text{and} \quad A^{-1} = \frac{[\mathrm{cof}(A)]^\top}{|A|}

- (A^{-1})^{-1} = A
- (AB)^{-1} = B^{-1} A^{-1}
- (A^\top)^{-1} = (A^{-1})^\top
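The following NumPy sketch (illustration added during editing, with arbitrary example matrices) exercises the determinant and inverse properties listed above:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.det(A))                    # |A| = 4*6 - 2*7 = 10
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # A A^{-1} = I

print(np.allclose(np.linalg.det(A @ B),
                  np.linalg.det(A) * np.linalg.det(B)))   # |AB| = |A||B|
print(np.allclose(np.linalg.inv(A @ B),
                  np.linalg.inv(B) @ np.linalg.inv(A)))   # (AB)^{-1} = B^{-1} A^{-1}
```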

Inner Product, Outer Product

- The inner product of two vectors x, y \in R^n is a scalar:
    \langle x, y \rangle = x^\top y = y^\top x = \sum_{i=1}^{n} x_i y_i
  - If \langle x, y \rangle = 0, then x and y are orthogonal
- The outer product of two vectors x \in R^m and y \in R^n is a matrix:
    x \otimes y = x y^\top =
    \begin{bmatrix}
      x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\
      x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\
      \vdots  & \vdots  & \ddots & \vdots  \\
      x_m y_1 & x_m y_2 & \cdots & x_m y_n
    \end{bmatrix}

Gradient Vector (the notation here is not fully rigorous)

- Given: f(x) is a real-valued function
    \nabla_x f(x) = \frac{\partial f}{\partial x} = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right]
  - First order derivatives
- Extension: how about f(x) is a vector?
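A small NumPy illustration (not from the original slides) of the inner and outer products:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(np.dot(x, y))        # inner product <x, y> = 32, a scalar
print(np.outer(x, y))      # outer product x y^T, a 3x3 matrix

z = np.array([2.0, -1.0, 0.0])
print(np.dot(x, z))        # 0: x and z are orthogonal
```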
Gradient Vector: Properties

- \nabla_x (x^\top y) = \nabla_x (y^\top x) = y
- \nabla_x (x^\top x) = 2x
- \nabla_x (x^\top A y) = A y
- \nabla_x (y^\top A x) = A^\top y
- \nabla_x (x^\top A x) = A x + A^\top x, which equals 2Ax if A is symmetric

Hessian Matrix

- Second order derivatives:
    H(x) = \frac{\partial^2 f(x)}{\partial x \partial x^\top} = \left[ \frac{\partial^2 f(x)}{\partial x_i \partial x_j} \right]
    =
    \begin{bmatrix}
      \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\
      \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\
      \vdots & \vdots & \ddots & \vdots \\
      \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2}
    \end{bmatrix}
- Always symmetric!!
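A hedged numerical sketch (added for illustration; the finite-difference check is an editorial choice, not from the slides) verifying the quadratic-form gradient identity above:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))          # not necessarily symmetric
x = rng.standard_normal(3)

f = lambda v: v @ A @ v                  # f(x) = x^T A x

# Central finite-difference gradient of f at x
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

grad_analytic = A @ x + A.T @ x          # = 2 A x when A is symmetric
print(np.allclose(grad_fd, grad_analytic, atol=1e-5))   # True

# The Hessian of x^T A x is A + A^T, which is symmetric
H = A + A.T
print(np.allclose(H, H.T))               # True
```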

Eigenvalue and Eigenvector

    A v = \lambda v
    (A - \lambda I) v = 0
    |A - \lambda I| = 0   (characteristic equation)

- Solutions \lambda of the characteristic equation are called eigenvalues, and the corresponding v are eigenvectors
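A small NumPy example (illustrative, not from the original slides) checking Av = \lambda v for each eigenpair:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)       # columns of eigvecs are eigenvectors
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))    # A v = lambda v for each pair
print(eigvals)                            # eigenvalues 3 and 1 for this matrix
```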

Probability
Motivation

- We have 25 PhD and 15 MPhil students. If a student is randomly picked from these 2 groups, which group will you guess (s)he is from?
  - 2 classes: \omega_1 = PhD, \omega_2 = MPhil
- The state of nature is unpredictable
- Use probability!!

Axioms for Probabilities

- All probabilities are between 0 and 1: 0 \le P(A) \le 1
- The certain event has probability 1
- The impossible event has probability 0
- If A and B are any two events, P(A \cup B) = P(A) + P(B) - P(A \cap B)

Mutually Exclusive Events

- Two events are mutually exclusive if they cannot occur at the same time
- A single card is chosen at random from a standard deck of 52 playing cards
  - E1: the card chosen is a five; E2: the card chosen is a king
  - Mutually exclusive?

Conditional Probability

- Let A and B be two events such that P(A) > 0, P(B) > 0
- P(B|A): probability of B given that A has occurred

    P(B|A) = \frac{P(A \cap B)}{P(A)}

    P(A \cap B) = P(A) P(B|A)
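A brief Python sketch (an editorial illustration; the "face card" conditional is an added example, not from the slides) that checks mutual exclusivity and computes a conditional probability by counting:

```python
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))        # 52 equally likely cards

E1 = {c for c in deck if c[0] == '5'}     # card is a five
E2 = {c for c in deck if c[0] == 'K'}     # card is a king
print(len(E1 & E2) == 0)                  # True: mutually exclusive

A = {c for c in deck if c[0] in ('J', 'Q', 'K')}   # face card
print(len(E2 & A) / len(A))               # P(king | face card) = 4/12 = 1/3
```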
Conditional Probability

- For any three events A_1, A_2, A_3:
    P(A_1 \cap A_2 \cap A_3) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 \cap A_2)
- If A_1, \dots, A_n are mutually exclusive with \sum_{i=1}^{n} P(A_i) = 1, then
    P(B) = P(A_1) P(B|A_1) + \cdots + P(A_n) P(B|A_n) = \sum_{i=1}^{n} P(B|A_i) P(A_i)

Independence

- Two events A and B are independent if P(B|A) = P(B) or P(A|B) = P(A)
  - The probability of B occurring is not affected by the occurrence or non-occurrence of A
  - Knowledge about A contains no information about B
  - This is also equivalent to P(A \cap B) = P(A) P(B)
- If n Boolean variables (A_1, A_2, \dots, A_n) are independent:
    P(A_1 \cap A_2 \cap \cdots \cap A_n) = \prod_{i=1}^{n} P(A_i)
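A tiny worked example of the total-probability formula above (the urn numbers are hypothetical, chosen only for illustration):

```python
# Two urns A1, A2 chosen with probabilities 0.6 and 0.4;
# B = "draw a red ball", with P(B|A1) = 0.3 and P(B|A2) = 0.8.
P_A = [0.6, 0.4]
P_B_given_A = [0.3, 0.8]

P_B = sum(pa * pb for pa, pb in zip(P_A, P_B_given_A))
print(P_B)   # 0.6*0.3 + 0.4*0.8 = 0.5
```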

Bayes Theorem or Rule

    P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A) P(A)}{P(B)}

Bayes Theorem or Rule

    P(\omega_i | x) = \frac{P(x | \omega_i) P(\omega_i)}{P(x)}

- P(\omega_i): prior probability of \omega_i
  - Initial probability for \omega_i, before observing the training data
- P(\omega_i | x): posterior probability for \omega_i after observing x
- P(x | \omega_i): likelihood of observing x given class \omega_i
- P(x): probability that training data x will be observed
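A small sketch (added for illustration) applying Bayes rule to the earlier PhD/MPhil motivation; the likelihood values are made up solely for this example:

```python
# Priors from the motivation slide: 25 PhD and 15 MPhil students.
P_omega = {'PhD': 25 / 40, 'MPhil': 15 / 40}

# Hypothetical likelihoods P(x | omega_i) of some observed feature x.
P_x_given_omega = {'PhD': 0.2, 'MPhil': 0.6}

# Evidence P(x) via the law of total probability.
P_x = sum(P_x_given_omega[w] * P_omega[w] for w in P_omega)

# Posterior P(omega_i | x) via Bayes rule.
posterior = {w: P_x_given_omega[w] * P_omega[w] / P_x for w in P_omega}
print(posterior)   # {'PhD': ~0.357, 'MPhil': ~0.643}
```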
Discrete Probability Distributions

- X: discrete random variable
- Probability function or probability distribution: P(X = x)
- Cumulative distribution function (or distribution function): F(x) = P(X \le x)

Example: Uniform Distribution

- Outcome of throwing a fair die:
    P(X = 1) = P(X = 2) = \cdots = P(X = 6) = \frac{1}{6}

Example: Binomial Distribution

- Given: the probability of getting a head is p; X = number of heads when the biased coin is tossed n times

    P(X = x) = \mathrm{Bi}(x; n, p) = \binom{n}{x} p^x (1 - p)^{n - x}

Continuous Probability Distributions

- X: continuous random variable
- The probability that X takes on any one value is in general zero
- The probability that X lies between two different values is more meaningful:

    P(a < X < b) = \int_a^b p(x) \, dx    where p(x) is the probability density function (PDF)

- Cumulative distribution function (or distribution function):

    F(x) = P(X \le x) = \int_{-\infty}^{x} p(u) \, du, \qquad \frac{dF(x)}{dx} = p(x)
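A minimal sketch of the binomial probability function above (illustrative numbers; the helper name binom_pmf is an editorial choice):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 3 heads in 10 tosses with P(head) = 0.4
print(binom_pmf(3, 10, 0.4))                          # ~0.215
print(sum(binom_pmf(x, 10, 0.4) for x in range(11)))  # pmf sums to 1
```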
Example: Uniform Distribution

    p(x) =
    \begin{cases}
      \frac{1}{b - a} & \text{if } a \le x \le b \\
      0 & \text{otherwise}
    \end{cases}
    \qquad
    F(x) =
    \begin{cases}
      0 & \text{if } x < a \\
      \frac{x - a}{b - a} & \text{if } a \le x \le b \\
      1 & \text{if } x > b
    \end{cases}

Example: Normal (Gaussian) Distribution

    p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]

Joint Distribution: Discrete

- If X and Y are two discrete random variables, we define the joint probability function of X and Y by
    P(X = x, Y = y) = p(x, y)
  where p(x, y) \ge 0 and \sum_x \sum_y p(x, y) = 1
- Marginal probability function: P(X = x) = \sum_j p(x, y_j)
- Joint distribution function: F(x, y) = P(X \le x, Y \le y) = \sum_{u \le x} \sum_{v \le y} p(u, v)

Joint Distribution: Continuous

- If X and Y are continuous random variables with joint density function p(x, y):
    P(a < X < b, c < Y < d) = \int_{x=a}^{b} \int_{y=c}^{d} p(x, y) \, dx \, dy
  where p(x, y) \ge 0 and \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x, y) \, dx \, dy = 1
- Marginal density function: p(x) = \int_{-\infty}^{\infty} p(x, v) \, dv
Joint Distribution: Continuous

- Joint distribution function:
    F(x, y) = P(X \le x, Y \le y) = \int_{u=-\infty}^{x} \int_{v=-\infty}^{y} p(u, v) \, du \, dv
    \qquad \frac{\partial^2 F(x, y)}{\partial x \partial y} = p(x, y)
- Marginal distribution function (the distribution function of X):
    P(X \le x) = \int_{u=-\infty}^{x} \int_{v=-\infty}^{\infty} p(u, v) \, dv \, du

Example

- Random vector: X = [X_1, X_2, \dots, X_n]^\top
- Multivariate Gaussian: X \sim N(\mu, \Sigma)

    p(X) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (X - \mu)^\top \Sigma^{-1} (X - \mu) \right]
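A minimal NumPy sketch (editorial illustration; the function name mvn_pdf and the example parameters are assumptions) evaluating the multivariate Gaussian density above:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.2, -0.3]), mu, Sigma))
```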

Mathematical Expectation

- Expected value, expectation, or mean of a random variable X
- Discrete:
    E(X) = \sum_{j=1}^{n} x_j P(X = x_j)
- Continuous:
    E(X) = \int_{-\infty}^{\infty} x \, p(x) \, dx

Moments

- rth moment: E(X^r); the mean \mu = E(X) is the 1st moment
- rth central moment: \mu_r = E[(X - \mu)^r]; \mu_0 = 1, \mu_1 = 0, \mu_2 = variance
- For a multivariate random vector X:
  - 2nd central moment: covariance matrix
      \Sigma = \mathrm{cov}(X) = E[(X - \mu)(X - \mu)^\top]
Covariance Matrix

- For a 2-D vector X = [X_1, X_2]^\top:

    \Sigma = E\left[ \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{bmatrix} \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{bmatrix}^\top \right]
           = E\begin{bmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 \end{bmatrix}
           = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}
           = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}
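A short NumPy sketch (added for illustration, with arbitrary example parameters) estimating the covariance matrix from samples via E[(X - \mu)(X - \mu)^\top]:

```python
import numpy as np

rng = np.random.default_rng(2)
# 1000 samples of a 2-D random vector X = [X1, X2]^T
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[1.0, 0.6], [0.6, 2.0]],
                            size=1000)

mu = X.mean(axis=0)
Sigma_hat = (X - mu).T @ (X - mu) / (len(X) - 1)   # sample covariance matrix
print(Sigma_hat)
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))   # matches np.cov
```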
Information Theory

Entropy: Intuitive Notion

- Measures the impurity, uncertainty, irregularity, surprise
- Suppose we have two discrete classes
  - S: a set of training examples
  - p+: proportion of positive examples in S
  - p-: proportion of negative examples in S
- Optimal purity (impurity/uncertainty = 0): p+ = 1, p- = 0 or p+ = 0, p- = 1
- Least pure (maximum impurity/uncertainty): p+ = 0.5, p- = 0.5

Entropy: Formal Definition

- X: discrete random variable with alphabet \mathcal{X} = \{x_1, x_2, \dots, x_n\} and probability mass function p(x) = P(X = x), x \in \mathcal{X}

    Entropy: H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x)

- X: continuous random variable; the corresponding quantity is the differential entropy

    Entropy: h(X) = -\int_{x \in \mathcal{X}} p(x) \log_b p(x) \, dx
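A tiny sketch (illustrative addition; the helper name binary_entropy is an editorial choice) connecting the intuition above to the formal definition for two classes:

```python
import numpy as np

def binary_entropy(p_pos):
    """Entropy in bits of a two-class set with positive proportion p_pos."""
    probs = np.array([p_pos, 1.0 - p_pos])
    probs = probs[probs > 0]          # 0 log 0 is taken to be 0
    return -np.sum(probs * np.log2(probs))

print(binary_entropy(1.0))   # 0: optimal purity (may print as -0.0)
print(binary_entropy(0.5))   # 1.0: maximum impurity/uncertainty
```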
Entropy: Formal Definition

- The unit of entropy depends on the base b of the log operation
  - b = e: nats; b = 2: bits (usually adopted)
- Entropy can be changed from one base to another: H_b(X) = (\log_b a) H_a(X)
- In general, when X can take n values:
  - H(X) \ge 0, with H(X) = 0 if there is an x_k with p(x_k) = 1
  - H(X) \le \log n, with H(X) = \log n if p(x) = 1/n

Example: Coding

    x      a     b     c      d
    P(X)   0.5   0.25  0.125  0.125

- Code 1:
    Code   00    01    10     11
  Expected length to encode one symbol from X: 2 bits
- Code 2:
    Code   0     10    110    111
  Expected length: 0.5x1 + 0.25x2 + 0.125x3 + 0.125x3 = 1.75 bits
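A minimal check (editorial illustration) that the entropy of this distribution matches the expected length of Code 2:

```python
import numpy as np

P = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
code2 = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

H = -sum(p * np.log2(p) for p in P.values())            # entropy in bits
expected_len = sum(P[s] * len(code2[s]) for s in P)     # expected code length
print(H, expected_len)   # both 1.75 bits: Code 2 matches the entropy
```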

Relationship with Coding

- An optimal-length code assigns -\log_2 p bits to a message having probability p
  - Expected number of bits to encode + or - of a random member of S:
      p_+ (-\log_2 p_+) + p_- (-\log_2 p_-) = H(S)
  - Entropy = expected number of bits needed to encode the class of a randomly drawn member of S under the optimal, shortest-length code

Joint Entropy

- Joint entropy of a pair of discrete random variables X and Y with a joint distribution p(x, y):

    H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)

- If X and Y are independent, then H(X, Y) = H(X) + H(Y)
Conditional Entropy

    H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)
           = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x)
           = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)

- Uncertainty about Y, given that we know X

Conditional Entropy

- In general, H(Y|X) \neq H(X|Y)
- Chain rule: H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
- When X and Y are independent, H(Y|X) = H(Y)
- For multiple variables:
    H(X_1, X_2, \dots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1, \dots, X_{n-1})
  - When X_1, \dots, X_n are i.i.d., H(X_1, X_2, \dots, X_n) = n H(X_1)
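A short NumPy sketch (illustrative; the joint table is a made-up example) verifying the chain rule H(X, Y) = H(X) + H(Y|X):

```python
import numpy as np

# A small joint distribution p(x, y): X indexes rows, Y indexes columns
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_XY = H(p_xy.ravel())                 # joint entropy H(X, Y)
H_X = H(p_xy.sum(axis=1))              # marginal entropy H(X)
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
H_Y_given_X = -np.sum(p_xy * np.log2(p_y_given_x))   # conditional entropy H(Y|X)

print(np.isclose(H_XY, H_X + H_Y_given_X))   # True: chain rule holds
```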

Kullback-Leibler Divergence

- Motivation
  - Suppose there is a r.v. with true distribution p
  - However, we do not know p; instead we assume that the distribution of the r.v. is q
  - The code would need more bits to represent the r.v., and the difference in the number of bits is denoted as KL(p||q)
- KL-divergence from p(x) to q(x), also known as relative entropy:

    KL(p||q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = -\sum_{x \in \mathcal{X}} p(x) \log q(x) + \sum_{x \in \mathcal{X}} p(x) \log p(x)

Kullback-Leibler Divergence

- Information inequality: KL(p||q) \ge 0, with equality if and only if p(x) = q(x), \forall x
- KL-divergence is often used as a "distance" measure between distributions, but:
  - Not symmetric: KL(p||q) \neq KL(q||p)
  - Does not satisfy the triangle inequality: KL(p||q) \le KL(p||r) + KL(r||q) does not hold in general
  - So it is not a true distance between distributions
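A small sketch (editorial illustration; the distributions and the helper name kl are assumptions) showing non-negativity and asymmetry of the KL divergence:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)), in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [1/3, 1/3, 1/3]
print(kl(p, q), kl(q, p))   # ~0.43 and ~0.47: non-negative, but not symmetric
print(kl(p, p))             # 0.0: equality iff p = q
```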
Mutual Information

- How much information does one random variable X tell about another one Y?
- Given: two random variables X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y)
- Mutual information I(X; Y):

    I(X; Y) = KL(p(x, y) \,||\, p(x) p(y)) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

  - The KL-divergence between the joint distribution and the product distribution

Mutual Information

    I(X; Y) = \sum_{x, y} p(x, y) \log \left( \frac{1}{p(x)} \cdot \frac{p(x, y)}{p(y)} \right)
            = \sum_{x, y} p(x, y) \log \frac{p(x|y)}{p(x)}
            = -\sum_{x, y} p(x, y) \log p(x) + \sum_{x, y} p(x, y) \log p(x|y)
            = H(X) - H(X|Y)

- MI is the reduction in the uncertainty of X due to the knowledge of Y

Mutual Information

- I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)   (symmetric)
- I(X; Y) = H(X) + H(Y) - H(X, Y)
  - Information that X tells about Y = uncertainty in X + uncertainty in Y - uncertainty in both X and Y
- I(X; X) = H(X) - H(X|X) = H(X), the entropy itself
- I(X; Y) \ge 0, with equality if and only if X and Y are independent, i.e., H(X, Y) \le H(X) + H(Y)
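A closing NumPy sketch (editorial illustration with a made-up joint table) computing mutual information both as a KL divergence and via the entropy identity above:

```python
import numpy as np

p_xy = np.array([[0.3, 0.1],     # joint distribution p(x, y)
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)

# I(X; Y) = KL( p(x, y) || p(x) p(y) )
I = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Same value via I(X; Y) = H(X) + H(Y) - H(X, Y)
I_alt = H(p_x.ravel()) + H(p_y.ravel()) - H(p_xy.ravel())
print(np.isclose(I, I_alt), I >= 0)   # True True
```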
