
5.1 Loss Function, Optimization, Gradient Descent

The document outlines the syllabus for a course on Fundamentals of Machine Learning, focusing on loss functions, optimization, and gradient descent techniques. It explains the importance of loss functions in measuring model performance and the role of optimization algorithms like Gradient Descent in minimizing these loss functions. Additionally, it discusses various types of gradient descent, including batch, stochastic, and mini-batch, emphasizing the significance of hyperparameter tuning for model accuracy and efficiency.


MIT School of Computing

Department of Information Technology

Third Year Engineering

21BTIT504- Fundamentals of Machine Learning


Class - T.Y. (SEM-V)
Unit - V

AY 2024-2025 SEM-V


Unit-V Syllabus

Role of Loss Functions and Optimization, Gradient Descent and Perceptron/Delta Learning, Regularization, Early Stopping.

Role of Loss Function and Optimization
• In machine learning, loss functions and
optimization work together to improve a
model's performance by finding the best
parameters for a given data set
Loss function
• The loss function is a measure of the distance between the model's prediction and the correct answer (the label); it answers the question of how well the model is doing.
• The loss function is what we need to minimize in order to get the best model parameters (it is an optimization problem).
• The loss function's output is higher when predictions are off and lower when they are good.
• The choice of loss function is an important factor: there are many loss functions, and which one to use depends on the problem being solved.
Loss function
Regression:
• MSE (Mean Squared Error)
• MAE (Mean Absolute Error)
• RMSE (Root Mean Squared Error)
Classification:
• Hinge loss
• Log loss (negative log-likelihood)
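
The slides do not include code, but as an illustrative sketch (not from the original material), these losses can be written in NumPy as follows; the function names and toy inputs are my own:

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared prediction errors
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average of the absolute prediction errors
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root Mean Squared Error: square root of the MSE
    return np.sqrt(mse(y_true, y_pred))

def hinge_loss(y_true, scores):
    # Hinge loss for labels in {-1, +1} and raw classifier scores
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def log_loss(y_true, probs, eps=1e-12):
    # Log loss (negative log-likelihood) for labels in {0, 1}
    probs = np.clip(probs, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(probs) + (1 - y_true) * np.log(1 - probs))

For example, mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])) returns 0.25, while predictions that are further off return a larger value, as described above.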
Linear Regression
• A simple regression model of life satisfaction:
• life_satisfaction = θ0 + θ1 × GDP_per_capita
• A linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term).
• Linear Regression model prediction: ŷ = θ0 + θ1·x1 + θ2·x2 + … + θn·xn
• Vectorized form: ŷ = hθ(x) = θ · x
• MSE cost function for a Linear Regression model:
  MSE(θ) = (1/m) Σ (θᵀx(i) − y(i))², summed over the m training instances i = 1, …, m
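
To make the vectorized prediction and the MSE cost concrete, here is a small NumPy sketch; the feature values and parameter vector are made-up illustrations, not taken from the slides:

import numpy as np

# Toy training set: m = 4 instances, n = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([6.0, 4.5, 8.0, 12.0])

# Prepend x0 = 1 to every instance so that theta[0] plays the role of the bias term
X_b = np.c_[np.ones((X.shape[0], 1)), X]

theta = np.array([0.5, 2.0, 0.3])   # [bias, weight_1, weight_2], arbitrary values

# Vectorized prediction: y_hat = X_b . theta
y_hat = X_b @ theta

# MSE cost: (1/m) * sum of squared errors
mse_cost = np.mean((y_hat - y) ** 2)
print(y_hat, mse_cost)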
• Optimization
Uses an algorithm to minimize the loss
function and updates the model’s parameters.
Optimization algorithms like Gradient Descent
typically use the gradient of the loss function.
• Gradient Descent is a very generic
optimization algorithm capable of finding
optimal solutions to a wide range of problems.
• The general idea of Gradient Descent is to
tweak parameters iteratively in order to
minimize a cost function.
a) Start by filling θ with random values (this is called random initialization).
b) Improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.
An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

If the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
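
The effect of the learning rate can be seen on a one-dimensional toy cost function; this sketch and its cost J(w) = w² are my own illustration, not taken from the slides:

import numpy as np

def gradient_descent_1d(learning_rate, n_steps=50, start=10.0):
    # Minimize the toy cost J(w) = w**2, whose derivative is dJ/dw = 2*w
    w = start
    for _ in range(n_steps):
        grad = 2 * w
        w = w - learning_rate * grad   # one gradient descent step
    return w

print(gradient_descent_1d(0.001))  # too small: after 50 steps w has barely moved from 10
print(gradient_descent_1d(0.1))    # reasonable: w converges very close to the minimum at 0
print(gradient_descent_1d(1.1))    # too high: w oscillates with growing magnitude and diverges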
Not all cost functions look like nice regular bowls. There may be holes, ridges,
plateaus, and all sorts of irregular terrains, making convergence to the minimum
very difficult.
Some challenges with Gradient Descent: if the random initialization starts the
algorithm on the left, then it will converge to a local minimum, which is not as good
as the global minimum. If it starts on the right, then it will take a very long time to
cross the plateau, and if you stop too early you will never reach the global minimum.
• Fortunately, the MSE cost function for a Linear Regression
model happens to be a convex function, which means that if
you pick any two points on the curve, the line segment
joining them never crosses the curve.
• This implies that there are no local minima, just one global
minimum.
• It is also a continuous function with a slope that never
changes abruptly.
• These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
The cost function has the shape of a bowl, but it can be an elongated bowl if the
features have very different scales.
Figure shows Gradient Descent on a training set where features 1 and 2 have the
same scale (on the left), and on a training set where feature 1 has much smaller
values than feature 2 (on the right).

Note : When using Gradient Descent, you should ensure that all features have a
similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much
longer to converge
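
As a minimal illustration of that note, the snippet below standardizes two features with very different scales using Scikit-Learn's StandardScaler; the array values are arbitrary:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])   # feature 2 has a much larger scale than feature 1

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each feature now has zero mean and unit variance
print(X_scaled)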
Optimization Algorithms
• In Gradient Descent, the gradient is the derivative: we calculate the derivative of the loss function with respect to the weights, dL/dw, and from that we get the direction towards the minimum. Besides the direction, we also need the step size.

• The step size is determined by the learning rate, which is one of the most important hyperparameters that we tune during the training process.

• If the learning rate is too large there will be divergence and we will not reach the minimum; if we choose the learning rate to be very small, the learning process will be too slow. So the learning rate is a hyperparameter that we need to tune.

• To find the new weight value:

updated_w = current_w - learning_rate * dL/dw   (for each parameter)

• We put a minus sign because the derivative gives us the uphill direction and we need the descent direction.
Variants of Gradient Descent

• Batch Gradient Descent: the entire training set is used to compute the gradients at every step, so it is computationally expensive.

• When using Batch Gradient Descent there is a chance of getting stuck in a local minimum rather than reaching the global minimum, so other variants of Gradient Descent are used, such as Stochastic Gradient Descent, where one randomly selected data point is used at each step to calculate the gradients.

• Mini-Batch Gradient Descent: the training data is divided into n mini-batches; we take one mini-batch at a time, calculate the derivatives, update the weights, then select the next mini-batch, and so on (see the sketch below).
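
A rough sketch of these variants for a linear model with the MSE gradient is given below; this is my own illustration, and the learning rate, epoch count, and batch size are arbitrary choices. Setting batch_size equal to the number of instances reproduces Batch Gradient Descent, and batch_size = 1 reproduces Stochastic Gradient Descent:

import numpy as np

def minibatch_gradient_descent(X, y, learning_rate=0.1, n_epochs=50, batch_size=16):
    # X: (m, n) matrix whose first column is all 1s, so theta[0] is the bias term
    m, n = X.shape
    theta = np.random.randn(n)               # random initialization
    for _ in range(n_epochs):
        indices = np.random.permutation(m)   # shuffle the training data each epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            # Gradient of the MSE on this mini-batch: (2/b) * X_b^T (X_b theta - y_b)
            gradients = 2 / len(batch) * X_b.T @ (X_b @ theta - y_b)
            theta = theta - learning_rate * gradients
    return theta

# Example call on generated data: y is roughly 4 + 3x plus a little noise
X = np.c_[np.ones((100, 1)), np.random.rand(100, 1)]
y = 4 + 3 * X[:, 1] + 0.1 * np.random.randn(100)
print(minibatch_gradient_descent(X, y))      # approximately [4, 3]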
Gradient Descent
• Gradient Descent is the workhorse behind most of Machine Learning.

• When you fit a machine learning method to a training dataset, you are probably using Gradient Descent.

• It can optimize parameters in a wide variety of settings.
Gradient Descent
Step 1: Take the derivative of the loss function with respect to each parameter in it [take the gradient of the loss function].
Step 2: Pick random values for the parameters.
Step 3: Plug the parameter values into the derivatives [the gradient].
Step 4: Calculate the step sizes: Step Size = Slope × Learning Rate.
Step 5: Calculate the new parameters: New Parameter = Old Parameter − Step Size.
Step 6: Repeat from Step 3 until the step size is very small or you reach the maximum number of steps (a direct code translation of these steps follows below).
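
Translating these steps into code, a generic sketch might look like this; the stopping thresholds, learning rate, and the example cost function J(w) = (w0 − 3)² + (w1 + 1)² are my own assumptions:

import numpy as np

def gradient_descent(grad_fn, n_params, learning_rate=0.01,
                     max_steps=1000, min_step_size=1e-6):
    # Step 2: pick random values for the parameters
    params = np.random.randn(n_params)
    for _ in range(max_steps):
        # Step 3: plug the current parameter values into the gradient
        slopes = grad_fn(params)
        # Step 4: step size = slope * learning rate
        step_sizes = slopes * learning_rate
        # Step 5: new parameter = old parameter - step size
        params = params - step_sizes
        # Step 6: stop when every step size is very small
        if np.all(np.abs(step_sizes) < min_step_size):
            break
    return params

# Step 1 is done by hand here: the gradient of J(w) = (w0 - 3)^2 + (w1 + 1)^2
example_gradient = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
print(gradient_descent(example_gradient, n_params=2))   # approaches [3, -1]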
Gradient Descent
Let's consider that we have height and weight data. If we fit a line to the data, we can answer questions such as: what is the corresponding height of a person whose weight is x?
Gradient Descent
[Worked-example figures: a line is fit to the height/weight data and the SSE (sum of squared errors) is computed, starting with an intercept of 0.]

Now, let's plot this SSE on a graph, where the SSE is on the y-axis and the intercept is on the x-axis.

Note: the plotted point on the graph is for an intercept of 0.

What if we change the value of the intercept, calculate the SSE over all the data points, and plot them again?
Gradient Descent
Now we can take the derivative of this function and determine the slope at any value of the intercept.
1. Using the derivative, find the slope.
2. Using the slope, find the new step size.
3. Using the step size, find the new intercept.

Iterate through this until the optimal solution is found (a small code sketch of this loop follows below).
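
The following sketch carries out this intercept-only loop in code. The data points, the fixed slope of 0.64, and the learning rate of 0.1 are my own assumptions, chosen so that the first update gives an intercept near 0.57 and the final estimate settles near 0.95, matching the values quoted on the next slides:

import numpy as np

# Made-up height/weight data
weight = np.array([0.5, 2.3, 2.9])
height = np.array([1.4, 1.9, 3.2])

slope = 0.64            # assume the slope is already known; only the intercept is learned
intercept = 0.0         # start with an intercept of 0
learning_rate = 0.1

for step in range(1, 101):
    predicted = intercept + slope * weight
    # Derivative of SSE = sum((height - predicted)^2) with respect to the intercept
    d_intercept = -2 * np.sum(height - predicted)
    step_size = d_intercept * learning_rate
    intercept = intercept - step_size
    print(step, round(intercept, 2))
    if abs(step_size) < 0.001:      # stop when the step size is very close to 0
        break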
Gradient Descent
You can see the shift in the line's equation when the intercept is 0.57 compared to when the intercept is 0.
Gradient Descent
After 6 steps, the Gradient Descent estimate for the intercept is 0.95.

Gradient Descent stops when the step size is very close to 0.
Optimization
• Hyperparameters are set by the designer of the model and may include elements like the learning rate, the structure of the model, or the number of clusters used to classify data.

• This is different from the parameters learned during machine learning training, such as the weights, which change relative to the input training data.

• Hyperparameter optimization means the machine learning model can solve the problem it was designed to solve as efficiently and effectively as possible.

• Optimizing the hyperparameters is an important part of achieving the most accurate model. The process is described as hyperparameter tuning or optimization.

• The aim is to achieve maximum accuracy and efficiency, and minimum errors.
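
One common way to carry out such tuning is a grid search over candidate hyperparameter values. The sketch below uses Scikit-Learn's GridSearchCV with an SGD-based linear regressor; the dataset is generated and the candidate grid is an arbitrary illustration, not part of the lecture:

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Toy regression data (generated, not from the lecture)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# Candidate values for two hyperparameters of SGD-based linear regression
param_grid = {
    "eta0": [0.001, 0.01, 0.1],   # initial learning rate
    "penalty": ["l2", "l1"],      # type of regularization
}

search = GridSearchCV(SGDRegressor(max_iter=1000, random_state=42),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)        # hyperparameter combination with the lowest validation error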
