In PyTorch, an optimizer is a concrete implementation of an optimization algorithm that updates the parameters of a neural network so as to minimize its loss. PyTorch provides various built-in optimizers such as SGD, Adam, and Adagrad that can be used out of the box. However, in some cases a built-in optimizer may not suit a particular problem or may not perform well on it. In such cases, we can create our own custom optimizer.
A custom optimizer in PyTorch is a class that inherits from the torch.optim.Optimizer base class. It should implement the __init__ and step methods: __init__ initializes the optimizer's internal state, and step updates the parameters of the model.
Creating a Custom Optimizer:
In PyTorch, creating a custom optimizer means defining a class that inherits from the torch.optim.Optimizer class and overrides the following methods (a minimal skeleton is sketched right after this list):
- __init__(self, params, ...): Initializes the optimizer. The model parameters and the default hyperparameters are passed to the base class, which stores them in self.param_groups.
- step(): Performs a single optimization step. It should update the model parameters based on their current gradients.
- zero_grad(): Sets the gradients of all parameters to zero. The base Optimizer class already provides this method, so it usually does not need to be overridden.
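As a point of reference, here is a minimal sketch of this structure: a plain gradient-descent optimizer that overrides only __init__ and step. The class name PlainSGD and its default learning rate are illustrative choices for this sketch, not part of any PyTorch API.
Python3

import torch


class PlainSGD(torch.optim.Optimizer):

    def __init__(self, params, lr=0.01):
        # Hyperparameters go into the defaults dict; the base class copies
        # them into every entry of self.param_groups.
        defaults = dict(lr=lr)
        super().__init__(params, defaults)

    def step(self):
        # Visit every parameter in every parameter group.
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # Plain gradient descent: p <- p - lr * grad
                p.data -= group['lr'] * p.grad.data

Note that zero_grad() is inherited from torch.optim.Optimizer, so the sketch does not redefine it.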
The __init__ Method:
The __init__ method initializes the optimizer's internal state. In this method, we define the optimizer's hyperparameters and set up any per-parameter state. For example, suppose we want to create a custom optimizer that implements the Momentum optimization algorithm.
In the example below, the hyperparameters are the learning rate lr and the momentum. We call super().__init__() to initialize the base class, and we set up a state dictionary that stores the velocity vector for each parameter.
Python3

import torch
import torch.nn as nn


class MomentumOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super(MomentumOptimizer, self).__init__(params, defaults={'lr': lr})
        self.momentum = momentum
        self.state = dict()
        # Initialize a velocity (momentum) buffer for every parameter
        for group in self.param_groups:
            for p in group['params']:
                self.state[p] = dict(mom=torch.zeros_like(p.data))

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                if p not in self.state:
                    self.state[p] = dict(mom=torch.zeros_like(p.data))
                mom = self.state[p]['mom']
                # v <- momentum * v - lr * grad
                mom = self.momentum * mom - group['lr'] * p.grad.data
                self.state[p]['mom'] = mom  # persist the updated velocity
                p.data += mom
The Step Method:
The step method updates the parameters of the model. It takes no arguments and updates both the optimizer's internal state and the model parameters. In the step method of MomentumOptimizer above, we iterate over all the parameters of the model and check whether they are in the state dictionary; if not, we add them with an initial velocity vector of zero. We then compute the new velocity vector from the momentum and the learning rate, store it back in the state, and use it to update the parameter's value.
Using the custom optimizer is just like using a built-in optimizer: we instantiate it with the model's parameters and the hyperparameters.
Illustration 1:
Let’s create a simple training loop that shows how to use the custom optimizer to train a model. The loop would perform the following steps:
- Initialize the gradients of the model’s parameters to zero using the optimizer’s zero_grad method.
- Compute the forward pass of the model on some input data and calculate the loss.
- Compute the gradients of the model’s parameters with respect to the loss using the backward method.
- Call the step method of the optimizer to update the model’s parameters based on the current gradients and the optimizer’s internal state.
Step 1: Import the necessary libraries:
Python3

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
Step 2: Define a custom optimizer class that inherits from torch.optim.Optimizer. In this example, we will create a custom optimizer that implements the Momentum optimization algorithm.
Python3

class MomentumOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super(MomentumOptimizer, self).__init__(params, defaults={'lr': lr})
        self.momentum = momentum
        self.state = dict()
        for group in self.param_groups:
            for p in group['params']:
                self.state[p] = dict(mom=torch.zeros_like(p.data))

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                if p not in self.state:
                    self.state[p] = dict(mom=torch.zeros_like(p.data))
                mom = self.state[p]['mom']
                mom = self.momentum * mom - group['lr'] * p.grad.data
                self.state[p]['mom'] = mom
                p.data += mom
Step 3: Define a simple model and loss function, and initialize an instance of the custom optimizer:
Python3

model = nn.Linear(2, 2)
criterion = nn.MSELoss()
optimizer = MomentumOptimizer(model.parameters(), lr=1e-3, momentum=0.9)
Step 4: Generate some random data to train the model
Python3

X = torch.randn(100, 2)
y = torch.randn(100, 2)  # targets shaped to match the model's two outputs
Step 5: Train the model with the custom optimizer and plot the training loss.
Python3

for i in range(2500):
    optimizer.zero_grad()
    y_pred = model(X)
    loss = criterion(y_pred, y)
    if i % 100 == 0:
        plt.plot(i, loss.item(), 'ro-')
    loss.backward()
    optimizer.step()

plt.title('Losses over iterations')
plt.xlabel('iterations')
plt.ylabel('Losses')
plt.show()
Output:
[Plot: training losses over iterations]
You will notice that your custom optimizer is correctly updating the parameters of the model and minimizing the loss function.
Note: The above loop is an example of how to use the custom optimizer; it should help you understand how the optimizer's step method works.
Customizing Optimizers:
There are many ways to customize optimizers in PyTorch. Some of them are as follows:
Changing the learning rate schedule:
The learning rate of the optimizer can be changed during training using a learning rate scheduler. PyTorch provides several built-in schedulers, such as torch.optim.lr_scheduler.StepLR and torch.optim.lr_scheduler.ExponentialLR. We can also create our own scheduler by inheriting from the torch.optim.lr_scheduler._LRScheduler class (a sketch of such a scheduler follows the example below).
In the code below, we use the torch.optim.lr_scheduler.StepLR scheduler, which multiplies the learning rate by a factor of gamma every step_size epochs.
Python3

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

num_epochs = 200
for i in range(num_epochs):
    optimizer.zero_grad()
    y_pred = model(X)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    scheduler.step()
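For the custom-scheduler route mentioned above, here is a minimal sketch that inherits from torch.optim.lr_scheduler._LRScheduler. The class name HalvingLR and its halve-every-period-epochs rule are illustrative assumptions for this sketch, not a built-in scheduler.
Python3

import torch


class HalvingLR(torch.optim.lr_scheduler._LRScheduler):
    # Illustrative rule: halve the learning rate every `period` calls to step().

    def __init__(self, optimizer, period=10, last_epoch=-1):
        self.period = period
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # self.base_lrs holds each param group's initial lr;
        # self.last_epoch counts completed scheduler steps.
        factor = 0.5 ** (self.last_epoch // self.period)
        return [base_lr * factor for base_lr in self.base_lrs]


# Used like any built-in scheduler: call scheduler.step() once per epoch.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = HalvingLR(optimizer, period=10)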
Adding regularization:
To add regularization to the optimizer, we can modify the step() method to include the regularization term in the update of the model parameters. For example, we can add L1 or L2 regularization by modifying the step() method to include a term that penalizes the absolute or squared values of the parameters respectively.
Python3

import math


class MyAdam(torch.optim.Adam):

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0):
        super().__init__(params, lr=lr, betas=betas)
        self.weight_decay = weight_decay

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients")

                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p.data)
                    state["exp_avg_sq"] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]
                state["step"] += 1

                # Weight decay: add weight_decay * p to the gradient
                if self.weight_decay != 0:
                    grad = grad.add(p.data, alpha=self.weight_decay)

                # Update biased first and second moment estimates
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group["eps"])
                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]
                step_size = group["lr"] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(exp_avg, denom, value=-step_size)


optimizer = MyAdam(model.parameters(), weight_decay=0.00002)
In the above code, we create a custom Adam optimizer that includes weight decay regularization by adding a weight_decay parameter to the optimizer and modifying the step() method to include the weight decay term in the parameter update. The weight decay term is applied to the gradients by grad = grad.add(p.data, alpha=self.weight_decay), which penalizes large parameter values by shrinking their updates.
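The same pattern works for L1 regularization: penalize the absolute values of the parameters by adding their sign, scaled by a penalty coefficient, to the gradient before the update. The sketch below applies this to plain SGD rather than Adam for brevity; the class name SGDWithL1 and its l1 parameter are illustrative, not part of PyTorch.
Python3

import torch


class SGDWithL1(torch.optim.Optimizer):

    def __init__(self, params, lr=0.01, l1=0.0):
        defaults = dict(lr=lr, l1=l1)
        super().__init__(params, defaults)

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if group['l1'] != 0:
                    # d/dp (l1 * |p|) = l1 * sign(p), added to the gradient
                    grad = grad.add(torch.sign(p.data), alpha=group['l1'])
                # Standard SGD update with the regularized gradient
                p.data.add_(grad, alpha=-group['lr'])


# Example usage: optimizer = SGDWithL1(model.parameters(), lr=0.01, l1=1e-4)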
Implementing a new optimization algorithm:
PyTorch provides several built-in optimization algorithms, such as SGD, Adam, and Adagrad. However, there are many other optimization algorithms that are not included in the library. By creating a custom optimizer, we can implement any optimization algorithm that we want.
Python3

class MyOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=0.01):
        defaults = dict(lr=lr)
        super(MyOptimizer, self).__init__(params, defaults)

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # Update based on the squared gradient instead of the gradient
                p.data = p.data - group['lr'] * p.grad.data ** 2


optimizer = MyOptimizer(model.parameters(), lr=0.001)
In this example, we created a new optimization algorithm, MyOptimizer, that updates the parameters based on the squared gradient values instead of the gradients themselves.
Using multiple optimizers:
In some cases, we may want to use different optimizers for different parts of the model. For example, we may want to use Adam for the parameters of the convolutional layers, and SGD for the parameters of the fully-connected layers. This can be achieved by creating multiple instances of the optimizer, one for each set of parameters.
Python3

params1 = model.conv_layers.parameters()
params2 = model.fc_layers.parameters()

optimizer1 = torch.optim.Adam(params1)
optimizer2 = torch.optim.SGD(params2, lr=0.01)

for i in range(num_epochs):
    ...
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    loss.backward()
    optimizer1.step()
    optimizer2.step()
In this example, we are using Adam optimizer for the parameters of the convolutional layers, and SGD optimizer with a fixed learning rate of 0.01 for the parameters of the fully-connected layers. This can help fine-tune the training of specific parts of the model.
Illustration 2:
Build a handwritten digit classification model using a custom optimizer.
Step 1:
Import the necessary libraries
Python3

import torch
import torch.nn as nn
from torch.optim import Optimizer
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torch.utils.tensorboard import SummaryWriter
import math
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Step 2:
Now, we’ll load the MNIST dataset, and create a data loader for it.
Python3

dataset = MNIST(root='.', train=True, download=True, transform=ToTensor())
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
dataloader.dataset
Output:
Dataset MNIST
    Number of datapoints: 60000
    Root location: .
    Split: Train
    StandardTransform
Transform: ToTensor()
Step 3:
Let’s visualize the first batch of our dataset.
Python3

# Visualize the first batch of images
for i, batch in enumerate(dataloader):
    figure = plt.figure(figsize=(16, 16))
    img, label = batch
    for j in range(img.shape[0]):
        figure.add_subplot(8, 8, j + 1)
        plt.imshow(img[j].squeeze(), cmap="gray")
        plt.title(label[j].item())
        plt.axis("off")
    plt.show()
    break
Output:
[Figure: the first batch of input images]
Step 4:
Next, we’ll define our model architecture, a simple fully connected network with two hidden layers
Python3

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x


model = Net().to(device)
Step 5:
Define the loss function; in this case, we'll use the cross-entropy loss.
Python3

loss_fn = nn.CrossEntropyLoss()
Step 6:
Next, we'll define our custom optimizer.
Python3

class MyAdam(torch.optim.Adam):

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0):
        super().__init__(params, lr=lr, betas=betas)
        self.weight_decay = weight_decay

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients")

                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p.data)
                    state["exp_avg_sq"] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]
                state["step"] += 1

                # Weight decay: add weight_decay * p to the gradient
                if self.weight_decay != 0:
                    grad = grad.add(p.data, alpha=self.weight_decay)

                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group["eps"])
                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]
                step_size = group["lr"] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(exp_avg, denom, value=-step_size)


optimizer = MyAdam(model.parameters(), weight_decay=0.00001)
Step 7:
Now, train the model with the custom optimizer and plot the training loss.
Python3

num_epochs = 10
for i in range(num_epochs):
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    plt.plot(i, loss.item(), 'ro-')
    print(i, '>> Loss :', loss.item())

plt.title('Losses over iterations')
plt.xlabel('iterations')
plt.ylabel('Losses')
plt.show()
Output:
0 >> Loss : nan
1 >> Loss : 1.2611686178923354e-44
2 >> Loss : nan
3 >> Loss : 8.407790785948902e-45
4 >> Loss : nan
5 >> Loss : 1.401298464324817e-45
6 >> Loss : nan
7 >> Loss : 0.0
8 >> Loss : nan
9 >> Loss : 1.401298464324817e-45

Losses
Note: Losses will be different for different devices.
Conclusion:
Creating custom optimizers in PyTorch is a powerful technique that allows us to fine-tune the training process of a machine learning model. By inheriting from the torch.optim.Optimizer class and implementing the __init__ and step methods (and, if needed, zero_grad), we can implement our own optimization algorithm, add regularization, change the learning rate schedule, or use multiple optimizers. Custom optimizers can help improve the performance of a model and make it better suited to a specific problem.