Gradient Descent in AI Optimization
Learning Objectives
● Gain a clear understanding of optimization and its significance in AI and machine learning.
● Comprehend the concept of gradient descent, its mathematical foundation, and how it helps in
minimizing functions.
● Analyze how gradient descent is applied in training AI models and optimizing machine learning
algorithms.
● Develop basic Python skills to implement the gradient descent algorithm and apply it to solve
AI-related problems.
Case Study: Google’s RankBrain
One specific application of Gradient Descent is Google’s RankBrain algorithm, an AI system Google uses to understand complex search queries. Gradient Descent plays a vital role in training RankBrain to analyze and
interpret patterns in vast amounts of search data, gradually improving the model’s predictive
accuracy. Google employs Mini-Batch Gradient Descent, where small subsets of data (rather than
entire datasets) help the model update its parameters more frequently and efficiently, balancing
computational load and precision.
Through this approach, Google has significantly enhanced the user experience, making search results
faster and more relevant. This case study exemplifies how Gradient Descent, applied in a high-stakes, real-time environment like search optimization, can support continuous improvements in AI, refining the accuracy and relevancy of predictions and ultimately meeting user expectations more effectively.
Discussion Questions
● Evaluate the effectiveness of Gradient Descent in handling search ranking challenges such
as relevancy and accuracy. How could other optimization methods compare in terms of
performance and computational efficiency?
● Analyze how Google’s choice of Mini-Batch Gradient Descent might influence the speed and
accuracy of its search ranking model. What potential trade-offs could arise from using mini-
batches instead of the full dataset?
Optimization
Optimization is the process of finding the best solution from a set of possible choices by maximizing
or minimizing a specific objective function. In simpler terms, it involves selecting the most efficient or
effective option based on desired outcomes, often within certain constraints.
In mathematical terms, optimization aims to find values for variables that maximize or minimize an
objective function, subject to given conditions. This process is widely used in fields like engineering,
economics, and machine learning to improve decision-making and performance.
Optimization is essentially about making things as good as possible. In everyday life, it’s like planning
the quickest route home, trying to spend the least money, or finding the best way to solve a problem.
In these cases, you’re making choices to achieve a particular goal—whether that’s saving time,
cutting costs, or reaching the best outcome.
In machine learning and AI, optimization is a crucial part of training models. We use it to find the best
settings or parameters that help the model perform as accurately as possible. The better the
optimization, the better the model becomes at making accurate predictions or decisions. The process
of optimization involves adjusting the model’s parameters (like weights and biases) to reduce the error
or cost as much as possible. In machine learning, we commonly use algorithms like Gradient Descent
to perform this optimization. Gradient Descent is a technique that helps us navigate towards the
minimum value of the objective function by:
● Calculating the gradient (slope) of the objective function with respect to each parameter.
● Moving the parameters in the direction that reduces the error (down the slope) until we reach the minimum.
The ultimate goal of optimization is to find the values of the parameters that minimize the error in our model’s predictions. This way, the model becomes more accurate and reliable in making predictions.
1. Objective Function f(x):
○ This is the function we want to either maximize (get the highest possible value) or minimize (get the lowest possible value).
○ For example, in AI, f(x) could be the error rate of a model. In this case, we’d aim to minimize it to make the model more accurate.
2. Variables x:
○ These are the parameters or inputs we can adjust to achieve the best outcome.
○ For instance, in machine learning, x could be the weights of a neural network, which
are adjusted to minimize error during training.
3. Constraints:
○ These are conditions or limits that the variables must satisfy. Mathematically, they are expressed as conditions on x that must be met (e.g., x ≥ 0).
In a machine learning context, we aim to minimize a function that represents the error of the model
(known as the loss function) or maximize a function that represents accuracy or performance. During
training, optimization algorithms adjust the model’s parameters to find the “best” values—those that
make the error as small as possible or the accuracy as high as possible.
2. Stochastic Optimization: In machine learning, we often deal with large datasets, so instead
of computing exact changes, we use random samples (like in stochastic gradient descent) to
approximate the best direction to update parameters.
In mathematics and machine learning, optimization is all about finding the best possible outcome or
parameters for a given problem. Think of it like a quest to improve or perfect a model's performance.
Usually, this means adjusting certain parameters or values to either maximize or minimize a specific
outcome, known as the objective function.
● Unconstrained Optimization: This involves optimizing a function with no restrictions on the variables.
Example: A simple math function like f(x) = x² can be optimized without any constraints. We can freely adjust x to get the lowest value of f(x) (in this case, it’s zero when x = 0).
● Constrained Optimization: This involves optimizing a function, but with specific conditions or
boundaries on the variables. Imagine you’re climbing a hill again, but this time there’s a fence
around part of it, so you can’t explore beyond that fence. Constraints could be things like
limits on resources, budgets, or physical boundaries.
Minimizing production costs (the function to optimize) while staying within a specific budget
constraint. Or, maximizing profits but ensuring the investment doesn’t exceed a certain risk level.
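The unconstrained example above (minimizing f(x) = x²) can be sketched in a few lines of Python. This is a minimal illustration, not part of the lesson's own code; the starting point and learning rate are arbitrary choices:

```python
# Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x
x = 5.0              # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * x               # slope of f at the current x
    x -= learning_rate * gradient  # take a small step downhill

print(x)  # x ends up very close to 0, the minimizer of f
```

Each iteration shrinks x by a constant factor (1 − 2 × learning rate), so the value glides toward the minimum at x = 0.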
● Single-Objective Optimization: Here, we optimize for a single goal.
Example:
When training a machine learning model, we may want to minimize the error rate, focusing only on getting the lowest error. This is single-objective optimization because our sole concern is model accuracy.
● Multi-Objective Optimization: Here, we aim to balance multiple objectives that often conflict
with each other. Think of it as trying to achieve the best trade-off among several goals. This
situation often arises when there are competing demands or when trying to satisfy different
requirements simultaneously.
Example :
In designing a neural network, we might want both high accuracy and low computational cost.
Optimizing for both can be tricky since increasing accuracy might mean using a more complex
model, which could also increase computation time. So, we need to find a balance that meets both
needs reasonably well.
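One common way to strike such a balance is scalarization: combining the objectives into a single weighted score and optimizing that. The sketch below uses hypothetical accuracy and cost numbers (not from the lesson) and an arbitrary weight:

```python
# Hypothetical candidate models: (accuracy, computational cost)
candidates = {
    "small":  (0.85, 1.0),
    "medium": (0.90, 3.0),
    "large":  (0.93, 9.0),
}

# Weighted objective: reward accuracy, penalize cost
# (the 0.02 weight is an arbitrary illustrative choice)
def score(accuracy, cost):
    return accuracy - 0.02 * cost

best = max(candidates, key=lambda name: score(*candidates[name]))
print(best)  # the model with the best accuracy/cost trade-off
```

With these numbers, the largest model loses despite its higher accuracy, because its extra cost outweighs the accuracy gain under this weighting.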
1. Model Training
During model training, the model learns from data by adjusting internal parameters to minimize errors
and make accurate predictions. This process involves an optimization algorithm that adjusts these
parameters to minimize a loss function, which essentially measures the difference between predicted
and actual results.
For instance, gradient descent is one of the most common optimization algorithms used here. In
gradient descent, the algorithm updates parameters iteratively, moving them toward values that
minimize the loss function. This is done step-by-step, like finding the quickest path downhill by taking
small, calculated steps toward the lowest point, which corresponds to the least error.
2. Hyperparameter Tuning
Hyperparameters are like settings that define how the training process should be conducted, including
factors like the learning rate (how big each step in gradient descent should be) and batch size (how
much data the model processes at a time). By optimizing these hyperparameters, we can help the
model learn better and avoid issues like overfitting (where the model performs well on training data
but poorly on new data).
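Hyperparameter tuning can be as simple as trying several candidate values and keeping the best. The sketch below (illustrative values, not from the lesson) searches over learning rates for minimizing f(x) = (x − 3)²:

```python
# Minimize f(x) = (x - 3)^2 with different learning rates
def run_gd(learning_rate, steps=50):
    x = 0.0
    for _ in range(steps):
        x -= learning_rate * 2 * (x - 3)  # gradient of (x-3)^2 is 2(x-3)
    return (x - 3) ** 2                   # final error after training

rates = [0.001, 0.01, 0.1, 0.5]
errors = {lr: run_gd(lr) for lr in rates}
best_lr = min(errors, key=errors.get)
print(best_lr, errors[best_lr])
```

Very small rates leave a large residual error after 50 steps, while well-chosen rates drive the error essentially to zero; this is exactly the kind of trade-off hyperparameter tuning navigates.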
1. Weight Optimization: Neural networks adjust weights through optimization algorithms like
gradient descent, ensuring the model accurately captures patterns in the data. This process is
repeated across each layer of neurons, creating a network that improves with each step.
1. The Main Road might have fewer turns but often has more traffic.
2. The Side Roads could have fewer cars but more stop signs or slower speed limits.
3. The Shortcut Through the Park may be shorter but only works if it’s a dry day.
Here, the optimization problem is to find the route that minimizes the time it takes to get to school.
So, you might consider factors like distance, traffic, weather, and time of day. If it’s a rainy morning,
the shortcut might not work because it gets muddy, while the main road could be more reliable, albeit
longer. Without realizing it, you’re using optimization to weigh all these factors and pick the best
option.
In the end, you choose the route that has the best balance of distance, time, and ease. By doing this,
you’re solving an optimization problem, aiming to minimize time while still getting to school.
You might start by listing the items you need, say paint, brushes, and paper, and then check prices
online or visit a few stores. You see that Store A has the cheapest brushes, Store B offers a discount
on paint, and Store C has lower prices on paper.
Now, you have an optimization problem: find the combination of items from different stores that will give you everything you need for the lowest cost.
The result? You’ve spent the least amount possible while getting all the necessary supplies. This is a
classic example of cost minimization, where you’re finding the combination that lets you save the most
while meeting your needs.
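That shopping decision can be written as a tiny exhaustive search over every item-to-store assignment. The store names and prices below are made up for illustration:

```python
from itertools import product

# Hypothetical prices per item at each store
prices = {
    "paint":   {"A": 12.0, "B": 9.0, "C": 11.0},
    "brushes": {"A": 4.0,  "B": 6.0, "C": 5.0},
    "paper":   {"A": 3.0,  "B": 3.5, "C": 2.5},
}
items = list(prices)           # ["paint", "brushes", "paper"]
stores = ["A", "B", "C"]

def plan_cost(combo):
    # combo assigns one store to each item, in order
    return sum(prices[item][store] for item, store in zip(items, combo))

# Try every assignment and keep the cheapest
best_plan = min(product(stores, repeat=len(items)), key=plan_cost)
best_cost = plan_cost(best_plan)
print(best_plan, best_cost)
```

Because each item's price is independent of the others, the optimum here is simply the cheapest store per item; exhaustive search becomes interesting when choices interact (e.g., delivery fees per store).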
In machine learning, we usually start with a model that isn’t very accurate, and we need to "train" it.
Training involves adjusting the model’s parameters (like weights) to reduce the error between the
model’s predictions and actual outcomes. Gradient Descent is an algorithm that helps us make these
adjustments efficiently, so the model improves step by step.
Imagine standing at the top of a mountain and wanting to reach the lowest point. You can't see the
bottom directly, so you look around to find the steepest downhill direction. You take a small step in
that direction, then reassess and take another step, repeating this process until you reach the
lowest point.
In Gradient Descent, the mountain represents the "error function" (a measure of how far off our
predictions are), and the bottom of the mountain is the minimum error we aim to achieve. Each step
downhill represents an adjustment to the model's parameters, guided by the steepness (or
"gradient") of the slope.
Gradient Descent is widely used because it’s efficient and works well in a variety of machine learning
models, especially where functions are too complex to solve directly. Gradient Descent is an
optimization algorithm, meaning it helps find the best (or "optimal") solution to a problem. In machine
learning, this often means finding the best model parameters that minimize error. The algorithm
doesn’t reach the best solution in one go. Instead, it iteratively adjusts the parameters, moving closer to the minimum with each step. Two key ideas drive this process:
● Gradient: A mathematical term that represents the slope or rate of change of a function. In
machine learning, it shows how much the error changes with each parameter.
● Learning Rate: This is the size of each step we take downhill. A large learning rate means
bigger steps (faster, but riskier), while a small learning rate means smaller steps (slower, but
safer).
● Initialize Parameters: Start with some initial values for the model’s parameters (like weights
in a neural network). These values can be random or set based on prior knowledge.
● Calculate Gradient: At each step, calculate the gradient (essentially, the slope) of the error
function. The gradient tells you the direction and rate at which the error is increasing or
decreasing. It’s like checking the steepness of the mountain to decide which way to step.
● Update Parameters: Using the gradient, adjust the parameters slightly in the opposite
direction of the gradient (downhill) to reduce the error. This step is like taking a small step
down the mountain.
● Repeat: Continue this process of calculating the gradient and updating the parameters until
the error can’t be reduced any further, or it reaches an acceptable low level. This is like
reaching the bottom of the mountain, where there’s no more slope to go down.
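These four steps map directly onto code. Here is a minimal sketch (toy one-feature data, made up for illustration) that fits y ≈ w·x by repeating the gradient-and-update loop:

```python
import numpy as np

# Toy data generated from y = 2x, so the ideal weight is 2
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * X

w = 0.0               # 1. Initialize the parameter
learning_rate = 0.05

for _ in range(200):
    predictions = w * X
    # 2. Calculate the gradient of the mean squared error w.r.t. w
    gradient = 2 * np.mean((predictions - y) * X)
    # 3. Update the parameter in the opposite direction of the gradient
    w -= learning_rate * gradient
# 4. Repeat until the error stops shrinking; w should now be close to 2
print(w)
```

Printing the error inside the loop would show it falling steadily toward zero, mirroring the walk down the mountain described above.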
● Benefits: Since it uses the entire dataset, the gradient is calculated very accurately in each
step, making it a highly reliable approach for finding the global minimum. Batch Gradient
Descent is also stable, producing consistent updates that lead the model towards better
accuracy.
● Downsides: The main downside is speed. For large datasets, computing gradients for every
data point at each iteration can be very slow and computationally expensive, requiring
significant memory and processing power. This can make it challenging to use Batch Gradient
Descent in real-world applications with large datasets, especially without specialized
hardware like GPUs.
Batch Gradient Descent: Uses the entire dataset for each update, providing accuracy and stability
but often slow for large datasets.
● Downsides: The downside of SGD is that it introduces noise. The path toward the minimum
is less stable, with the model parameters fluctuating around the ideal solution instead of
following a smooth path. This fluctuation can lead to a longer time to converge, though often
methods like learning rate decay or momentum are applied to counterbalance the instability.
Stochastic Gradient Descent (SGD): Updates after each data point, fast and efficient for large
datasets, but introduces noise and less stability.
● Benefits: Mini-Batch Gradient Descent combines the strengths of both approaches. It’s faster
than Batch Gradient Descent because it doesn’t need the entire dataset to update. It’s also
more stable than SGD, as it reduces the noise that comes with using just one data point,
providing smoother convergence. Many modern machine learning frameworks and neural
network training use Mini-Batch Gradient Descent by default for its efficiency and reliability.
● Downsides: Choosing the right batch size can be a challenge. If the batch size is too large,
the algorithm starts behaving more like Batch Gradient Descent, slowing down the training
process. If the batch size is too small, it might resemble SGD, reintroducing instability.
Additionally, it still requires more memory than pure SGD, though it’s far more manageable
than Batch Gradient Descent.
Mini-Batch Gradient Descent: Balances speed and stability by using small batches, making it well-
suited for practical use cases.
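The three variants differ only in how much data feeds each parameter update. The sketch below (toy linear-regression data, illustrative only) makes that concrete: one training function where the batch size alone selects the variant.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3 * X + rng.normal(scale=0.1, size=100)  # true weight is 3, plus noise

def train(batch_size, epochs=50, lr=0.05):
    # batch_size = len(X): Batch GD; 1: SGD; anything between: Mini-Batch
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        indices = rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            grad = 2 * np.mean((w * X[batch] - y[batch]) * X[batch])
            w -= lr * grad
    return w

print(train(batch_size=len(X)))  # Batch gradient descent
print(train(batch_size=1))       # Stochastic gradient descent
print(train(batch_size=16))      # Mini-batch gradient descent
```

All three recover a weight near 3, but batch GD makes one smooth update per epoch, SGD makes a hundred noisy ones, and mini-batch sits in between, which is exactly the trade-off summarized above.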
Imagine you’re looking at the possible values that a random variable X can take. For example, if X represents the number of hours it takes to finish a project, the CDF tells us, for any given number x, the likelihood (or probability) that X is less than or equal to that number x.
1. Always Increasing: The CDF is a function that starts at 0 for the lowest possible value of X and gradually increases to 1 as X approaches its maximum possible value. It’s always moving upward (or staying flat), but it never decreases. This makes sense because as you consider larger and larger values of x, you’re capturing more of the possible outcomes for X, meaning the probability accumulates.
2. Range Between 0 and 1: The CDF gives probabilities, so it always stays between 0 and 1. When x is very small (for instance, less than the smallest possible value of X), the CDF will be 0, meaning the probability that X is less than or equal to x is 0. As x grows larger, it will eventually reach 1, indicating that there’s a 100% probability that X is less than or equal to x.
3. Complete Description of the Distribution: The CDF provides a full picture of the distribution of the random variable. By knowing the CDF, you can answer any question about the probability of X falling within any range.
Let’s say we have a random variable X that represents a test score, which can range from 0 to 100.
Now, imagine we want to know the probability that someone scores 70 or below on the test.
● The CDF at 70, or F(70), gives us this probability. If F(70)=0.85, it means there’s an 85%
chance that someone will score 70 or less.
● As we increase the score x, say to 90, the CDF value will increase because the probability
that someone scores 90 or less is higher than the probability of scoring 70 or less. If
F(90)=0.95, it means there's a 95% chance of scoring 90 or less.
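These properties are easy to verify numerically with an empirical CDF. The simulated test scores below are hypothetical, just to mirror the example above:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated test scores, clipped to the valid range 0-100
scores = np.clip(rng.normal(loc=60, scale=15, size=1000), 0, 100)

def empirical_cdf(x):
    """Fraction of observed scores less than or equal to x."""
    return np.mean(scores <= x)

print(empirical_cdf(70))   # probability of scoring 70 or below
print(empirical_cdf(90))   # never smaller than the value at 70
print(empirical_cdf(-1))   # 0: below the smallest possible score
print(empirical_cdf(100))  # 1: at the largest possible score
```

The printed values demonstrate all three properties: the function never decreases, stays between 0 and 1, and hits those bounds below and above the support of X.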
Linear Regression: In linear regression, gradient descent is used to minimize the mean squared error (MSE). The algorithm adjusts the weights iteratively until it finds the values that minimize the error between the predicted and actual target values.
Neural Networks: In deep learning, gradient descent is essential for training neural networks. During
the backpropagation phase, the gradients of the loss function are calculated with respect to each
weight and bias in the network. These gradients are used to adjust the weights in the direction that
minimizes the loss. Optimization algorithms like Stochastic Gradient Descent (SGD) and Adam are
used to enhance the basic gradient descent algorithm by introducing momentum or adaptive learning
rates.
Real-World Example: In image recognition, a convolutional neural network (CNN) uses gradient
descent to update its weights during training. The CNN processes images in layers, where each layer
learns to identify features (e.g., edges, textures, and objects). Gradient descent ensures the weights
in these layers are adjusted to minimize the difference between the predicted output and the actual
label, improving the model’s accuracy over time.
NLP: In training models for sentiment analysis or language translation, gradient descent is used to
optimize the weights of recurrent neural networks (RNNs) or transformers. This allows models to
adjust their parameters so they can correctly predict the sentiment of a sentence or translate phrases
between languages.
Computer Vision: Gradient descent helps train models for image classification or object detection.
For example, training a deep CNN for facial recognition involves updating weights in response to the
gradients calculated during backpropagation, which helps the model learn to differentiate between
various facial features.
Recommendation Systems: In systems like Netflix or Amazon, gradient descent can be used in
collaborative filtering algorithms to optimize the parameters of the recommendation function,
improving the accuracy of user-item recommendations.
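A minimal sketch of that idea is matrix factorization trained with gradient descent: learn a latent vector per user and per item so their dot products reproduce the observed ratings. The tiny ratings matrix below is made up for illustration:

```python
import numpy as np

# Tiny user-item rating matrix; 0 marks an unobserved rating
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 1.0, 5.0],
])
observed = R > 0

rng = np.random.default_rng(0)
k = 2                                            # latent dimensions
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors
lr = 0.05

for _ in range(5000):
    # Error only on observed ratings; unobserved cells don't contribute
    error = (U @ V.T - R) * observed
    U_grad = error @ V       # gradient of squared error w.r.t. U
    V_grad = error.T @ U     # gradient of squared error w.r.t. V
    U -= lr * U_grad
    V -= lr * V_grad

predictions = U @ V.T  # filled-in matrix, including the unobserved cells
print(np.round(predictions, 1))
```

The observed entries are reproduced closely, and the previously empty cells now hold predicted ratings, which is the basis of collaborative-filtering recommendations (production systems add regularization and bias terms).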
Real-World Example: A company like Google uses gradient descent to train language models for
Google Translate. The model learns to translate between languages by minimizing the loss function
that measures the difference between the predicted translation and the actual translation.
Logistic Regression: In logistic regression, gradient descent minimizes the cross-entropy loss for binary classification. The sigmoid function maps the input to a value between 0 and 1, which can be interpreted as a probability.
Real-World Example: In medical diagnosis, logistic regression can be used to predict the probability
of a patient having a disease based on input features like age, blood pressure, and cholesterol levels.
Gradient descent adjusts the model parameters to find the best fit that minimizes the prediction error,
making the model more accurate in classifying patients as having or not having the disease.
Example 1: Implementing gradient descent from scratch for a simple logistic regression model
Python code:
import numpy as np

# Sigmoid function: maps any real value to (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Training data: features and binary labels
X = np.array([[1, 2], [1, 3], [2, 3], [3, 4], [4, 5]])
y = np.array([0, 0, 1, 1, 1])

# Initialize parameters
w = np.zeros(X.shape[1])  # weights
b = 0                     # bias
learning_rate = 0.01
iterations = 1000
m = len(X)                # number of training examples

for i in range(iterations):
    # Forward pass: compute predicted probabilities
    linear_model = np.dot(X, w) + b
    predictions = sigmoid(linear_model)
    # Gradients of the cross-entropy loss
    dw = (1/m) * np.dot(X.T, predictions - y)
    db = (1/m) * np.sum(predictions - y)
    # Update parameters in the opposite direction of the gradient
    w -= learning_rate * dw
    b -= learning_rate * db
    if i % 100 == 0:
        loss = -np.mean(y * np.log(predictions) + (1 - y) * np.log(1 - predictions))
        print(f"Iteration {i}: loss = {loss:.4f}")

print("Learned weights:", w, "bias:", b)
Python libraries like Scikit-learn simplify gradient descent implementations. Here’s an example using Scikit-learn for linear regression:
Python code:
import numpy as np
from sklearn.linear_model import SGDRegressor

# Training data: features and continuous targets (y = 2x)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# SGDRegressor fits a linear model using stochastic gradient descent
model = SGDRegressor(max_iter=1000, learning_rate="constant", eta0=0.01)
model.fit(X, y)

print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
print("Prediction for x=6:", model.predict([[6]]))
Theory to Practice
● A company wants to optimize its delivery routes to reduce fuel costs and delivery times.
Discuss how concepts from optimization and gradient descent could be adapted to solve
this problem. What challenges might arise in implementing such a solution?
● Imagine you’re picking a movie to watch, and you try different types until you find one you
love. How can this process of trial and error relate to the way models are trained to make
better predictions?
Summary
● The lesson starts with an introduction to optimization, explaining its definition, types of
problems, and applications in AI and Machine Learning. It progresses to real-world
examples to show how optimization techniques are used across industries.
● Optimization is the process of finding the best solution by minimizing or maximizing an
objective function within constraints.
● Types of optimization problems include linear, non-linear, convex, and combinatorial
optimization.
● Applications of optimization in AI and ML include model training, hyperparameter tuning,
and resource allocation.
● Real-world examples of optimization include route optimization in logistics and portfolio
optimization in finance.
● Gradient Descent is an iterative optimization algorithm that minimizes loss functions by
updating parameters in the direction of the steepest descent.
● Types of Gradient Descent include Batch, Stochastic, and Mini-batch Gradient Descent,
each with unique advantages for efficiency and stability.
● Gradient Descent is used in machine learning algorithms like linear regression, logistic
regression, and neural networks.
● Applications of Gradient Descent in AI include deep learning tasks such as image
classification and language translation.
● Logistic Regression uses Gradient Descent to minimize the cross-entropy loss function for
binary classification.
● Python enables Gradient Descent implementation from scratch using NumPy or through
advanced tools like TensorFlow, PyTorch, and Scikit-learn.
MCQs
3. Identify: Which variant of gradient descent processes the entire training dataset in each iteration?
a) Batch gradient descent
b) Stochastic gradient descent
c) Mini-batch gradient descent
d) Adaptive gradient descent
4. Analyse: What is the main advantage of stochastic gradient descent over batch gradient descent?
a) It converges faster
b) It uses less computational resources
c) It avoids local minima
d) It provides more precise parameter updates
1. What is the role of the learning rate in the gradient descent algorithm, and how does it
impact the optimization process?
2. Explain the concept of Lagrange's theorem and its application in optimization problems.
How does it help handle constraints in gradient descent optimization?
3. Compare and contrast the different variants of gradient descent, namely batch gradient
descent, stochastic gradient descent, and mini-batch gradient descent. What are the
advantages and limitations of each approach?
5. Reflect on the challenges and future advancements in gradient descent for Artificial
Intelligence. How can researchers and practitioners overcome the limitations and further
enhance the effectiveness of gradient descent in optimising AI systems?
Answers
MCQs
1. Correct answer: b) To optimise parameters and minimise a cost function. Explanation:
Gradient descent is primarily used in AI to optimise parameters by minimising a cost
function. It iteratively adjusts the parameters based on the computed gradients to find the
optimal values that minimise the cost.
3. Correct answer: a) Batch gradient descent. Explanation: Batch gradient descent processes
the entire training dataset in each iteration. It computes the gradient using all the training
examples, which can be computationally expensive but provides accurate updates.