MLSlides1 Selected Shared
(BITS F464)
Dr. Paresh Saxena
BITS Pilani, Dept. of Computer Science & Information Systems
Email: [email protected]
Hyderabad Campus
Course Handout Discussion
Lecturer:
Paresh Saxena – [email protected], https://psaxena86.github.io/
Announcements:
via CMS, in-class
Text Books:
T1. Christopher Bishop: Pattern Recognition and Machine Learning, Springer, 1st ed. 2006.
T2. Tom M. Mitchell: Machine Learning, The McGraw-Hill International Edition, 1997.
• Supervised Learning
• Unsupervised Learning
Supervised Learning: the UTKFace dataset
• 20,000 images
• Ages 0 to 116 years
• Labeled (age, gender, and ethnicity)
Ref: https://susanqq.github.io/UTKFace/
Task: given a new image, identify the age, or identify the gender!
• Traditional rule-based approaches are hard to design for this kind of prediction.
• Supervised learning can use the labeled dataset to learn an accurate predictor (a minimal sketch follows).
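A minimal, hypothetical sketch (not from the slides) of this supervised-learning workflow, assuming scikit-learn is available and that the images have already been flattened into feature vectors; random placeholder arrays stand in for the real UTKFace data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Placeholder data: each row is a flattened image, each target is an age label.
X = np.random.rand(500, 64)             # stand-in for flattened UTKFace images
t = np.random.randint(0, 117, 500)      # stand-in for age labels (0 to 116)

X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2)
model = LinearRegression().fit(X_train, t_train)   # learn a predictor from labeled data
print("Predicted ages for 5 unseen images:", model.predict(X_test[:5]))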
Unsupervised Learning
Link: https://data.sfgov.org/Transportation/Air-Traffic-Passenger-Statistics/rkru-6vcg
Customers' purchases (1 = bought, 0 = not bought):

               Customers
Products     #1  #2  #3  #4  ……..  #100000
Coffee        1   0   0   1  ……..     0
Tea           0   1   0   0  ……..     1
Milk          1   1   0   1  ……..     1
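A minimal sketch (not from the slides) of how an unsupervised method could group such customers by their purchase vectors; scikit-learn's KMeans is assumed, and a random binary matrix stands in for the real purchase data.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
purchases = rng.integers(0, 2, size=(1000, 3))   # rows = customers, columns = Coffee/Tea/Milk

# Group customers with similar buying behaviour; no labels are required.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(purchases)
print("Cluster assignments of the first 10 customers:", clusters[:10])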
Regression vs Classification
Linear Models for Regression: Notations
• N observations {xn}, n=1,2,…,N
• Target values {tn} (labels)
• D = 3 (features/attributes)

Example (N = 5; four of the rows are shown):

    xn (D = 3)      tn
    11   2   3      87
     5   4   2      45
    10   4   2      23
    20   8   4      35
Predict: Inputs → Regressor → Real Number
• Given: the N observations {xn} with target values {tn}.
• Linear Regression (the standard linear basis-function model):

    y(x, 𝒘) = w0 + w1 φ1(x) + … + wM−1 φM−1(x) = 𝒘ᵀφ(x)

  where w0 is the bias, 𝒘 = (w0, …, wM−1)ᵀ are the parameters, and the φj(x) are (possibly non-linear) basis functions.
Other choices of basis function (standard forms of the first two are sketched below):
• Gaussian basis function
• Sigmoidal basis function, Radial Basis Function (RBF), Wavelets, etc.
• Identity function: φ(x) = x, i.e. linear regression directly on the input features
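For reference, the usual textbook forms of the Gaussian and sigmoidal basis functions (not reproduced on the original slide):

    Gaussian:   φj(x) = exp( −(x − μj)² / (2 s²) ),   centred at μj with length scale s
    Sigmoidal:  φj(x) = σ( (x − μj) / s ),   where σ(a) = 1 / (1 + e⁻ᵃ) is the logistic sigmoid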
Polynomial Fitting
Error Function (sum-of-squares):

    E(𝒘) = ½ Σn ( tn − 𝒘ᵀφ(xn) )²

Minimizing E(𝒘) gives the least-squares solution

    𝒘 = (ΦᵀΦ)⁻¹ Φᵀ 𝒕

where Φ is the N×M Design Matrix with entries Φnj = φj(xn).
The inverse is computationally expensive!!
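A minimal sketch (not from the slides; NumPy and toy data assumed) of polynomial least-squares fitting via the design matrix; np.linalg.solve is applied to the normal equations rather than forming the inverse explicitly.

import numpy as np

x = np.linspace(0, 1, 20)                                 # toy inputs
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)     # noisy toy targets

M = 4                                                     # number of polynomial basis functions
Phi = np.vander(x, M, increasing=True)                    # design matrix, Phi[n, j] = x[n]**j
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)               # normal equations: (Phi^T Phi) w = Phi^T t
print("Fitted weights:", w)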
Ridge Regression
• To counter overfitting, use regularization.
• Remember from the previous lectures (overfitting): a flexible model can fit the noise in the training data, so large weights are penalized:

    E(𝒘) = ½ Σn ( tn − 𝒘ᵀφ(xn) )² + (λ/2) 𝒘ᵀ𝒘

  with closed-form solution 𝒘 = (λI + ΦᵀΦ)⁻¹ Φᵀ 𝒕 (a sketch follows).
Drawback: Training Complexity!! (the solution still requires solving an M×M linear system)
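A minimal sketch (not from the slides; NumPy and toy data assumed) of the ridge closed-form solution, reusing the polynomial design matrix from the previous sketch.

import numpy as np

x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
Phi = np.vander(x, 10, increasing=True)                   # 10 polynomial basis functions
lam = 1e-3                                                # regularization strength lambda
w_ridge = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
print("Ridge weights:", w_ridge)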
Take the log on both sides (mathematically handy), with the aim of maximizing the likelihood. To obtain a corresponding loss function, take the negative logarithm (also known as the cross-entropy error function):
    E(𝒘) = − Σn { tn ln σ(𝒘ᵀφn) + (1 − tn) ln[ 1 − σ(𝒘ᵀφn) ] }

Gradient of the error function:

    ∇E(𝒘) = Σn ( yn − tn ) φn ,   where yn = σ(𝒘ᵀφn)

Derivation (per data point, dropping the subscript n):
let z1 = t ln σ(𝒘ᵀφ) and z2 = (1 − t) ln[ 1 − σ(𝒘ᵀφ) ].
Using dσ/da = σ(1 − σ) and d(ln x)/dx = 1/x:

    dz1/d𝒘 = t σ(𝒘ᵀφ)[ 1 − σ(𝒘ᵀφ) ] φ / σ(𝒘ᵀφ) = t [ 1 − σ(𝒘ᵀφ) ] φ
    dz2/d𝒘 = (1 − t) σ(𝒘ᵀφ)[ 1 − σ(𝒘ᵀφ) ] (−φ) / [ 1 − σ(𝒘ᵀφ) ] = −(1 − t) σ(𝒘ᵀφ) φ

Therefore

    −d( z1 + z2 )/d𝒘 = ( σ(𝒘ᵀφ) − t ) φ

Contribution to the gradient by data point n is the error between target tn and prediction yn = σ(𝒘ᵀφn), times the basis vector φn: Error × Feature Vector.
Gradient descent update: 𝒘𝑡+1 = 𝒘𝑡 − 𝜂∇𝐸(𝒘)
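A minimal sketch (not from the slides; NumPy and toy data assumed) of this batch gradient-descent update for the cross-entropy error, computing ∇E(𝒘) as the sum of (yn − tn) φn over the data points.

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))                           # toy feature vectors phi_n
t = (Phi @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # toy binary targets t_n

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)
eta = 0.01                                                # learning rate
for _ in range(1000):
    y = sigmoid(Phi @ w)                                  # predictions y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)                                # sum_n (y_n - t_n) phi_n
    w = w - eta * grad                                    # w_{t+1} = w_t - eta * grad E(w)
print("Learned weights:", w)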