
Machine Learning

(BITS F464)
Dr. Paresh Saxena
BITS Pilani, Hyderabad Campus
Dept. of Computer Science & Information Systems
Email: [email protected]
Course Handout Discussion

Lecturer:
Paresh Saxena – [email protected], https://psaxena86.github.io/

Announcements:
via CMS, in-class

Text Books:

T1. Christopher Bishop: Pattern Recognition and Machine Learning, Springer, 1st ed. 2006.
T2. Tom M. Mitchell: Machine Learning, The McGraw-Hill International Edition, 1997.



Course Evaluation

Component | Weightage | Date & Time | Mode
Mid-Term exam | 35% | As announced in the timetable | Closed Book
Course Project (with final presentation/viva) | 25% (5% evaluated before the mid-sem) | Details will be announced in September | Open Book
Comprehensive | 40% | As announced in the timetable | Closed Book

* Slides will not be shared.



Course Plan

Lecture No. (each lecture 1 hour) | Learning objectives | Topics to be covered | Chapter in the Text Book
1–3 | To introduce several relevant materials to understand ML algorithms | ML overview, Python and ML frameworks | Lecture Notes
3–8 | To understand linear models for regression | Gradient Descent, Bias-Variance, Bayesian Regression, Bayesian Model Comparison | T1 – Ch. 3
9–14 | To understand linear models for classification | Discriminant Functions, Probabilistic Generative and Discriminative Models, Bayesian Logistic Regression | T1 – Ch. 4
15–22 | To understand Neural networks | Feed-forward Network Functions, Network Training, Backpropagation, Regularization | T1 – Ch. 5
23–32 | To understand Kernel methods and Sparse Kernel Machines | Radial basis function networks, Gaussian processes, SVMs, Multiclass SVMs | T1 – Ch. 6 and 7
32–40 | To develop the understanding of Mixture models and combining models | K-means Clustering, Mixture Models, EM, Bagging, Boosting, Decision Trees | T1 – Ch. 9 and 14



Subfields of ML

• Supervised Learning

• Unsupervised Learning

• Reinforcement Learning (not covered in this course; join CS F317 in odd semesters)

• Other familiar terms: Online Learning, Query Learning, Semi-Supervised Learning, Anomaly Detection.



Supervised Learning

• 20,000 images
• Ages 0 to 116 years
• Labeled with age, gender, and ethnicity

Ref: https://susanqq.github.io/UTKFace/

Task: given a new image, identify the age, or, given a new image, identify the gender.
• It is hard to hand-design traditional rule-based approaches for this kind of prediction.
• Supervised learning can use the labeled dataset to build an accurate predictor.
Unsupervised Learning

Link: https://data.sfgov.org/Transportation/Air-Traffic-Passenger-Statistics/rkru-6vcg

Supermarket data (products × customers):

Products | #1 | #2 | #3 | #4 | …… | #100000
Coffee | 1 | 0 | 0 | 1 | …… | 0
Tea | 0 | 1 | 0 | 0 | …… | 1
Milk | 1 | 1 | 0 | 1 | …… | 1
Soap | 0 | 1 | 1 | 1 | …… | 0
Aspirin | 1 | 0 | 1 | 0 | …… | 0
… | … | … | … | … | …… | …
Perfume | 0 | 1 | 1 | 0 | …… | 1

Objective: find some common patterns in the data. Example: if someone buys milk, they are also likely to buy coffee.
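
As a minimal sketch of finding such a co-occurrence pattern, the snippet below builds a tiny binary product-by-customer matrix (hypothetical data, not the actual supermarket table) and estimates P(coffee | milk) with NumPy:

```python
import numpy as np

# Hypothetical binary purchase matrix: rows = products, columns = customers.
products = ["Coffee", "Tea", "Milk", "Soap", "Aspirin", "Perfume"]
X = np.array([
    [1, 0, 0, 1, 1, 0],   # Coffee
    [0, 1, 0, 0, 0, 1],   # Tea
    [1, 1, 0, 1, 1, 1],   # Milk
    [0, 1, 1, 1, 0, 0],   # Soap
    [1, 0, 1, 0, 0, 0],   # Aspirin
    [0, 1, 1, 0, 1, 1],   # Perfume
])

milk = X[products.index("Milk")]
coffee = X[products.index("Coffee")]

# Empirical estimate of P(buys coffee | buys milk): among customers who
# bought milk, what fraction also bought coffee?
p_coffee_given_milk = (milk & coffee).sum() / milk.sum()
print(f"P(coffee | milk) = {p_coffee_given_milk:.2f}")
```
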
Regression vs Classification

Linear Models for Regression:
Notations
• N observations {x_n}, n = 1, 2, …, N
• Target values {t_n} (labels)

Example: N = 5 observations, D = 3 features/attributes; the inputs x_n are the first three columns and the targets t_n are the house prices.

House Age (years) | Distance from the Center (km) | Number of Rooms | House Price (in lakhs)
4 | 2 | 3 | 68
11 | 2 | 3 | 87
5 | 4 | 2 | 45
10 | 4 | 2 | 23
20 | 8 | 4 | 35

[Diagram: Inputs → Regressor → predicted real number]

• Goal: predict t for a new value of x.
• Solution (linear models): find a function y(x) that will give a value of t for a new value of x.
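
A minimal sketch of fitting such a linear model to the five-row house table above with NumPy least squares (the query house at the end is purely illustrative):

```python
import numpy as np

# Inputs x_n: age (years), distance from the center (km), number of rooms.
X = np.array([
    [4, 2, 3],
    [11, 2, 3],
    [5, 4, 2],
    [10, 4, 2],
    [20, 8, 4],
], dtype=float)
t = np.array([68, 87, 45, 23, 35], dtype=float)  # house price (lakhs)

# Prepend a column of ones so w[0] acts as the bias term.
Phi = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares fit: find w minimising sum_n (t_n - w^T phi(x_n))^2.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Predict the price of a new house (8 years old, 3 km from the center, 3 rooms).
x_new = np.array([1.0, 8.0, 3.0, 3.0])
print("w =", w)
print("predicted price =", x_new @ w)
```
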
Linear Regression: History



Linear Models for Regression

• Given: N observations {x_n} with corresponding target values {t_n}.

• Linear regression:
  y(x, w) = w_0 + w_1 x_1 + … + w_D x_D

• Extension with basis functions:
  y(x, w) = w_0 + Σ_{j=1}^{M−1} w_j φ_j(x) = w^T φ(x)
  where w_0 is the bias, w = (w_0, …, w_{M−1})^T are the parameters, and the φ_j(x) are (non-linear) basis functions with φ_0(x) = 1.



Linear Models for Regression
(Polynomial Basis Functions)

For polynomial regression:

• single input variable x,
• φ_j(x) = x^j, and so
• y = w_0 x^0 + w_1 x^1 + w_2 x^2 + … + w_{M−1} x^{M−1}

Other basis functions:
• Gaussian basis functions
• Sigmoidal basis functions, Radial Basis Functions (RBF), wavelets, etc.
• Identity function: φ(x) = x (recovers plain linear regression)



Polynomial Curve Fitting

Determine the values of the coefficients w from the training data.

Error function (sum of squared errors between predictions and targets):

E(w) = Σ_{n=1}^{N} ( y(x_n, w) − t_n )^2

Minimize the error and find w.
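
A small sketch of this fit, assuming noisy samples of a sine curve as the training data (the data-generating function is only an illustrative assumption): build the design matrix with φ_j(x) = x^j and minimise E(w) by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: noisy samples of sin(2*pi*x) (illustrative assumption).
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

M = 4  # number of polynomial coefficients w_0 ... w_{M-1}

# Design matrix: Phi[n, j] = x_n ** j  (polynomial basis functions).
Phi = np.vander(x, M, increasing=True)

# Minimise E(w) = sum_n (t_n - w^T phi(x_n))^2 via least squares.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E = np.sum((t - Phi @ w) ** 2)
print("coefficients w:", w)
print("training error E(w):", E)
```
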


Model Selection (order of M)

• Compare the errors on the training data set and the test data set as the order M increases.
• The best fit turns into overfitting for large M, and the magnitudes of the coefficients grow very large as M increases.
• Overfitting can be resolved with more data (here shown for M = 9).
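
To see this effect numerically, the sketch below (same toy sine-data assumption as before) compares training and test RMS error, and the largest coefficient magnitude, as the number of coefficients M grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Toy data: noisy samples of sin(2*pi*x) (illustrative assumption).
    x = rng.uniform(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, t

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def rms_error(x, t, w):
    # RMS error of the polynomial with coefficients w on data (x, t).
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((t - Phi @ w) ** 2))

for M in [1, 2, 4, 10]:  # number of coefficients (polynomial order M-1)
    Phi = np.vander(x_train, M, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    print(f"M={M:2d}  train RMS={rms_error(x_train, t_train, w):.3f}  "
          f"test RMS={rms_error(x_test, t_test, w):.3f}  max|w|={np.abs(w).max():.1f}")
```

With only 10 training points, M = 10 fits the training data almost perfectly but the test error and the coefficient magnitudes blow up, which is the overfitting pattern described above.
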



Minimizing the Squared Error
(Maximum Likelihood)

• Assume we have N observations; t_n is the single output for the nth observation.
• The error is given by the sum of squared differences between the observed outputs and the predictions:

E(w) = Σ_{n=1}^{N} ( t_n − y(x_n, w) )^2 = Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )^2

Differentiate with respect to w and equate it to zero:

w_ML = (Φ^T Φ)^{−1} Φ^T t

where Φ is the design matrix with elements Φ_{nj} = φ_j(x_n), and (Φ^T Φ)^{−1} Φ^T is the Moore-Penrose pseudo-inverse of Φ. Computing the inverse directly is computationally expensive.
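
A minimal sketch of this closed-form solution with NumPy, using np.linalg.pinv for the Moore-Penrose pseudo-inverse (the toy data set is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression problem (illustrative assumption): 20 points, cubic polynomial basis.
x = rng.uniform(-1.0, 1.0, 20)
t = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.1, size=20)

# Design matrix Phi[n, j] = phi_j(x_n) with polynomial basis phi_j(x) = x^j.
Phi = np.vander(x, 4, increasing=True)

# Maximum-likelihood solution w_ML = (Phi^T Phi)^{-1} Phi^T t (normal equations).
w_normal_eq = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Equivalent, and numerically more robust: apply the Moore-Penrose pseudo-inverse.
w_pinv = np.linalg.pinv(Phi) @ t

print("normal equations:", w_normal_eq)
print("pseudo-inverse:  ", w_pinv)
```
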
Ridge Regression

• To counter overfitting, use regularization.
• Recall from the previous lectures (overfitting): comparing errors on the training and test data sets shows the best fit turning into overfitting, with coefficient values growing very large as M increases; overfitting can also be resolved with more data (M = 9).
• Ridge regression adds a quadratic penalty on the weights to the sum-of-squares error, E(w) = Σ_{n=1}^{N} ( t_n − w^T φ(x_n) )^2 + λ w^T w, giving the solution w = (λI + Φ^T Φ)^{−1} Φ^T t.
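
A short sketch, under the same toy sine-data assumption as before, of the regularised solution w = (λI + Φ^T Φ)^{−1} Φ^T t and how the penalty shrinks the coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data (illustrative assumption): 10 noisy samples of sin(2*pi*x),
# fitted with a 10-coefficient polynomial that badly overfits without regularisation.
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
Phi = np.vander(x, 10, increasing=True)

def ridge_fit(Phi, t, lam):
    # Regularised least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t.
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

for lam in [1e-6, 1e-2, 1.0]:
    w = ridge_fit(Phi, t, lam)
    print(f"lambda={lam:g}  max|w|={np.abs(w).max():.2f}")

# Smaller lambda leaves the coefficients large (close to the unregularised,
# overfitted solution); larger lambda shrinks them and smooths the fit.
```
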


Training, Testing, Validation
and Cross-Validation

• Overfitting motivates using a validation set in addition to the training set (candidate models are compared on the validation set).
• Many rounds of model selection with a limited data set can in turn overfit the validation set, so a separate test set is also required.
• With limited data and a small validation set, cross-validation is one solution: partition the data into S folds, train on S−1 folds, validate on the held-out fold, and average the results over the S runs.

Drawback: training complexity (the model has to be trained once per fold).
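
A minimal sketch of S-fold cross-validation used to choose the polynomial order (toy data assumed, no external libraries); note the drawback above: the model is refit S times for every candidate M.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data set (illustrative assumption).
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)

def cv_error(x, t, M, S=5):
    """Average held-out squared error of an M-coefficient polynomial over S folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, S)
    errors = []
    for k in range(S):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(S) if j != k])
        Phi_tr = np.vander(x[train], M, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
        Phi_val = np.vander(x[val], M, increasing=True)
        errors.append(np.mean((t[val] - Phi_val @ w) ** 2))
    return np.mean(errors)

for M in [2, 4, 6, 10]:
    print(f"M={M:2d}  cross-validation error={cv_error(x, t, M):.3f}")
```
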



Loss Function: Likelihood

• Data set: {φ(x_n), t_n}, where t_n ∈ {0, 1} and n = 1, 2, …, N.
• The likelihood function can be written as

p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1 − t_n},  where y_n = σ(w^T φ_n).

• The likelihood function is used as a loss function: it prefers parameters under which the correct class labels of the training examples are more likely.



Continue - 1
The likelihood function is:

p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1 − t_n}

Take the logarithm of both sides (mathematically convenient), with the aim of maximizing it. To obtain a corresponding loss function, take the negative logarithm (also known as the cross-entropy error function):

E(w) = −ln p(t | w) = −Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) }

where y_n = σ(a_n) and a_n = w^T φ_n. Here, σ is the sigmoid function, σ(x) = 1 / (1 + e^{−x}).

Minimize the error E(w): take the derivative of E(w) and set it to zero.
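
A tiny numeric check, with made-up labels and activations, that the negative log of the Bernoulli likelihood equals the cross-entropy sum above:

```python
import numpy as np

# Made-up example: three training points with targets t_n and activations a_n.
t = np.array([1, 0, 1])
a = np.array([2.0, -1.0, 0.5])            # a_n = w^T phi_n (values assumed for illustration)
y = 1.0 / (1.0 + np.exp(-a))              # y_n = sigma(a_n)

# Likelihood p(t | w) = prod_n y_n^{t_n} (1 - y_n)^{1 - t_n}
likelihood = np.prod(y ** t * (1 - y) ** (1 - t))

# Cross-entropy error E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
cross_entropy = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(-np.log(likelihood), cross_entropy)  # the two values agree
```
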



Continue – 2: Gradient of the Error Function

Error function:

E(w) = −ln p(t | w) = −Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) },  where y_n = σ(w^T φ_n)

Using the derivative of the logistic sigmoid, dσ/da = σ(1 − σ), the gradient of the error function is:

∇E(w) = Σ_{n=1}^{N} ( y_n − t_n ) φ_n

The contribution to the gradient from data point n is the error between target t_n and prediction y_n = σ(w^T φ_n), times the basis vector φ_n: "error × feature vector".

Proof of the gradient expression (for a single data point n, writing φ for φ_n and t for t_n):

Let E_n = −z with z = z_1 + z_2, where z_1 = t ln σ(w^T φ) and z_2 = (1 − t) ln[1 − σ(w^T φ)].

Using dσ/da = σ(1 − σ) and d ln u / dw = (du/dw) / u:

dz_1/dw = t σ(w^T φ)[1 − σ(w^T φ)] φ / σ(w^T φ) = t [1 − σ(w^T φ)] φ
dz_2/dw = (1 − t) σ(w^T φ)[1 − σ(w^T φ)] (−φ) / [1 − σ(w^T φ)] = −(1 − t) σ(w^T φ) φ

Therefore dE_n/dw = −(dz_1/dw + dz_2/dw) = (σ(w^T φ) − t) φ = (y_n − t_n) φ_n, and summing over n gives ∇E(w).
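
A small numeric sanity check of the "error × feature vector" gradient: compare ∇E(w) = Σ_n (y_n − t_n) φ_n, computed as Φ^T (y − t), with a finite-difference approximation (random toy features and targets assumed):

```python
import numpy as np

rng = np.random.default_rng(5)

N, D = 8, 3
Phi = rng.normal(size=(N, D))             # toy basis vectors phi_n (assumed)
t = rng.integers(0, 2, size=N)            # toy binary targets
w = rng.normal(size=D)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def E(w):
    # Cross-entropy error for the current weights.
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Analytic gradient: sum_n (y_n - t_n) phi_n  ==  Phi^T (y - t).
y = sigmoid(Phi @ w)
grad_analytic = Phi.T @ (y - t)

# Finite-difference gradient for comparison.
eps = 1e-6
grad_numeric = np.array([
    (E(w + eps * np.eye(D)[j]) - E(w - eps * np.eye(D)[j])) / (2 * eps)
    for j in range(D)
])

print(grad_analytic)
print(grad_numeric)   # should agree to roughly six decimal places
```
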



Gradient Descent

• Find the optimal weights that minimize the error function.
• For logistic regression, the loss function is convex, hence it has just one minimum.

Gradient descent update (η is the learning rate):

w_{t+1} = w_t − η ∇E(w_t)
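
A minimal sketch of the update rule w_{t+1} = w_t − η ∇E(w_t) applied to a toy logistic-regression problem (the data, label noise, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy binary classification data (illustrative assumption): 2 features plus a bias term,
# with 10% label noise so that the optimum stays finite.
N = 100
X = rng.normal(size=(N, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)
flip = rng.random(N) < 0.1
t[flip] = 1.0 - t[flip]
Phi = np.hstack([np.ones((N, 1)), X])       # phi_0(x) = 1 handles the bias

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

w = np.zeros(Phi.shape[1])
eta = 0.01                                   # learning rate (assumed value)

for step in range(500):
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (y - t)                   # gradient of the cross-entropy error
    w = w - eta * grad                       # w_{t+1} = w_t - eta * grad E(w_t)

y = sigmoid(Phi @ w)
E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
accuracy = np.mean((y > 0.5) == (t == 1.0))
print("weights:", w, " cross-entropy:", round(E, 2), " train accuracy:", accuracy)
```

Because the cross-entropy is convex in w, this batch update converges to the single minimum for any sufficiently small learning rate.
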

