Overview of Prerequisite Mathematical Knowledge for Machine Learning
Lecture 1.5: Supplementary Prerequisite Mathematics
Information Theory
盛律 / School of Software
2024 Fall/Winter Semester

Please refer to the appendices of Pattern Recognition and Machine Learning (C. M. Bishop) and Machine Learning (Zhihua Zhou) for more details.
Linear Algebra

Matrix

$$A = [a_{ij}]_{m \times n} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = [\mathbf{a}_1, \mathbf{a}_2, \cdots, \mathbf{a}_n]$$

- Diagonal matrix:
$$\operatorname{diag}(a_{11}, a_{22}, \cdots, a_{nn}) = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{bmatrix}$$
- Identity matrix: $I = \operatorname{diag}(1, 1, \cdots, 1)$
- Trace: $\operatorname{tr}(A) = \sum_{j=1}^{n} a_{jj}$
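A minimal NumPy sketch of these objects; the entries are arbitrary examples, not from the slides:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # A = [a_ij] with m = 2, n = 3

D = np.diag([1.0, 2.0, 3.0])           # diagonal matrix diag(a11, a22, a33)
I = np.eye(3)                          # identity matrix diag(1, 1, 1)

S = np.array([[2.0, 1.0],
              [0.0, 3.0]])             # square matrix for the trace
print(np.trace(S))                     # tr(S) = 2 + 3 = 5.0
```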
Matrix Addition/Subtraction
- If $C = A \pm B$, then $[c_{ij}] = [a_{ij}] \pm [b_{ij}]$
- Commutative: $A + B = B + A$
- Associative: $(A + B) + C = A + (B + C)$

Multiply a Vector by a Matrix

$$Ax = y: \quad \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} \quad \text{and} \quad y_i = \sum_{j=1}^{n} a_{ij} x_j$$

- Write $A = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]$; then $y = \sum_{j=1}^{n} x_j \mathbf{a}_j$
  - $y$ can be written as a weighted sum of $A$'s column vectors (see the sketch below)
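A minimal NumPy check of the column-weighted-sum view; the matrix and vector are arbitrary examples:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])             # columns a_1 and a_2
x = np.array([10.0, -1.0])

y = A @ x                              # y_i = sum_j a_ij * x_j
# Same result as the weighted sum of A's columns: y = x_1 a_1 + x_2 a_2
y_cols = x[0] * A[:, 0] + x[1] * A[:, 1]
print(np.allclose(y, y_cols))          # True
```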
Matrix Multiplication
- In general, non-commutative: $AB \neq BA$
- Associative: $(AB)C = A(BC)$
- Distributive: $(A + B)C = AC + BC$

Special Matrices
- Symmetric matrix: $a_{ij} = a_{ji}$, or $A = A^\top$
- Matrix $A$ is orthogonal if $A^\top A$ is diagonal
- Matrix $A$ is orthonormal if $A^\top = A^{-1}$
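A small NumPy sketch of non-commutativity and the orthonormal condition; the 2×2 matrices and the rotation angle are illustrative choices:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(np.allclose(A @ B, B @ A))       # False: AB != BA in general

theta = 0.3                            # a 2-D rotation matrix is orthonormal
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q.T, np.linalg.inv(Q)))  # True: Q^T = Q^{-1}
```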
Determinant
- If $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$, then $|A| = a_{11}a_{22} - a_{21}a_{12}$
- In general, $|A| = \sum_{j=1}^{n} a_{ij} \operatorname{cof}(a_{ij})$, where $\operatorname{cof}(a_{ij})$ is the cofactor of element $a_{ij}$
- Properties:
  - Determinant is a scalar quantity
  - If $|A| = 0$, then $A$ is singular; otherwise non-singular
  - $|A^\top| = |A|$
  - $|AB| = |BA| = |A||B|$

Inverse

$$A^{-1}A = I \quad \text{and} \quad A^{-1} = \frac{[\operatorname{cof}(A)]^\top}{|A|}$$

- $(A^{-1})^{-1} = A$
- $(AB)^{-1} = B^{-1}A^{-1}$
- $(A^\top)^{-1} = (A^{-1})^\top$
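These identities can be spot-checked numerically; a minimal NumPy sketch with random matrices (which are almost surely non-singular):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))        # random 3x3 matrices are almost
B = rng.standard_normal((3, 3))        # surely non-singular

print(np.isclose(np.linalg.det(A.T), np.linalg.det(A)))     # |A^T| = |A|
print(np.isclose(np.linalg.det(A @ B),
                 np.linalg.det(A) * np.linalg.det(B)))      # |AB| = |A||B|
print(np.allclose(np.linalg.inv(A @ B),
                  np.linalg.inv(B) @ np.linalg.inv(A)))     # (AB)^-1 = B^-1 A^-1
print(np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T))  # (A^T)^-1 = (A^-1)^T
```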
Eigenvalues and Eigenvectors

$$Av = \lambda v$$
$$(A - \lambda I)v = 0$$
$$|A - \lambda I| = 0 \quad \text{(characteristic equation)}$$
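A minimal NumPy sketch verifying $Av = \lambda v$ for each computed eigenpair; the symmetric 2×2 matrix is an arbitrary example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)          # solves Av = lambda * v

for lam, v in zip(vals, vecs.T):       # each column of vecs is an eigenvector
    print(np.allclose(A @ v, lam * v)) # True for every eigenpair
```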
Probability
Motivation
- We have 25 PhD and 15 MPhil students. If a student is randomly picked from these 2 groups, which group will you guess (s)he is from?
- 2 classes: $\omega_1 = \text{PhD}$, $\omega_2 = \text{MPhil}$
- The state of nature is unpredictable
- Use probability!!

Axioms for Probabilities
- All probabilities are between 0 and 1: $0 \le P(A) \le 1$
- The certain event has probability 1
- The impossible event has probability 0
- If $A$ and $B$ are any two events, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

[Venn diagram of events $A$ and $B$ with overlap $A \cap B$]
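A tiny enumeration check of the inclusion-exclusion axiom on a fair die; the events (even outcomes, outcomes above 3) are illustrative choices, not from the slides:

```python
# Sample space of a fair six-sided die and two illustrative events.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}                          # even outcomes
B = {4, 5, 6}                          # outcomes greater than 3

P = lambda E: len(E) / len(omega)      # uniform probability on omega
print(P(A | B))                        # 4/6 ~= 0.667
print(P(A) + P(B) - P(A & B))          # 3/6 + 3/6 - 2/6 = 4/6
```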
Discrete Probability Distributions
- $X$: discrete random variable
- Probability function or probability distribution: $P(X = x)$

Example: Uniform Distribution
- Outcome of throwing a fair die:
$$P(X = 1) = P(X = 2) = \cdots = P(X = 6) = \frac{1}{6}$$
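A quick NumPy simulation suggesting the empirical frequencies of a fair die approach 1/6; the sample size and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # fair six-sided die

for face in range(1, 7):
    print(face, np.mean(rolls == face))    # each close to 1/6 ~ 0.1667
```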
Entropy
- $H_b(X) = (\log_b a) H_a(X)$ (converting entropy from base $a$ to base $b$)
- Expected length to encode one symbol from $X$: 2 bits
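A minimal sketch of entropy in different log bases, assuming the 2-bit figure refers to a uniform distribution over four symbols (an assumption; the original worked example is not shown here):

```python
import numpy as np

def entropy(p, base=2):
    """H(X) = -sum_i p_i * log_base(p_i), skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p) / np.log(base))

p = [0.25, 0.25, 0.25, 0.25]           # assumed: 4 equally likely symbols
print(entropy(p, base=2))              # 2.0 bits per symbol
print(entropy(p, base=4))              # 1.0; check H_2 = (log_2 4) * H_4 = 2.0
```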
Mutual Information
- $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$ (symmetric)
- $I(X;Y) = H(X) + H(Y) - H(X,Y)$
  - Information that $X$ tells about $Y$ = uncertainty in $X$ + uncertainty in $Y$ - uncertainty in both $X$ and $Y$
- $I(X;X) = H(X) - H(X|X) = H(X)$ is the entropy itself
- $I(X;Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent, i.e., $H(X,Y) = H(X) + H(Y)$
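A minimal NumPy sketch computing $I(X;Y) = H(X) + H(Y) - H(X,Y)$; the joint distribution is a hypothetical example, not from the slides:

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))     # entropy in bits

# Hypothetical joint distribution P(X, Y); rows index x, columns index y.
pxy = np.array([[0.25, 0.25],
                [0.00, 0.50]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I = H(px) + H(py) - H(pxy.ravel())     # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(I)                               # > 0: X and Y are dependent here

indep = np.outer(px, py)               # product of marginals => independence
print(H(px) + H(py) - H(indep.ravel()))  # ~ 0, as the last bullet states
```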