In PyTorch, an optimizer is a concrete implementation of an optimization algorithm that updates the parameters of a neural network so as to minimize its loss. PyTorch provides various built-in optimizers such as SGD, Adam, and Adagrad that can be used out of the box. However, in some cases a built-in optimizer may not suit a particular problem or may not perform well on it. In such cases, we can create our own custom optimizer.
A custom optimizer in PyTorch is a class that inherits from the torch.optim.Optimizer base class. It should implement the __init__ and step methods: __init__ initializes the optimizer's internal state, and step updates the parameters of the model.
Creating a Custom Optimizer:
In PyTorch, creating a custom optimizer means defining a class that inherits from the torch.optim.Optimizer class and overrides the following methods (a minimal skeleton is sketched right after this list):
- __init__(self, params, ...): Initializes the optimizer. The model parameters and the default hyperparameters are passed to the base class, which stores them in self.param_groups.
- step(): Performs a single optimization step. It should update the model parameters based on their current gradients.
- zero_grad(): Sets the gradients of all parameters to zero. The base Optimizer class already provides this method, so it usually does not need to be overridden.
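As a point of reference, here is a minimal sketch of this structure: a plain gradient-descent optimizer that overrides only __init__ and step. The class name PlainSGD and its default learning rate are illustrative choices for this sketch, not part of any PyTorch API.
Python3

import torch


class PlainSGD(torch.optim.Optimizer):

    def __init__(self, params, lr=0.01):
        # Hyperparameters go into the defaults dict; the base class copies
        # them into every entry of self.param_groups.
        defaults = dict(lr=lr)
        super().__init__(params, defaults)

    def step(self):
        # Visit every parameter in every parameter group.
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # Plain gradient descent: p <- p - lr * grad
                p.data -= group['lr'] * p.grad.data

Note that zero_grad() is inherited from torch.optim.Optimizer, so the sketch does not redefine it.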
The __init__ Method:
The __init__ method initializes the optimizer's internal state. In this method, we define the optimizer's hyperparameters and set up any per-parameter state. For example, suppose we want to create a custom optimizer that implements the Momentum optimization algorithm.
In the example below, the hyperparameters are the learning rate lr and the momentum. We call super().__init__() to initialize the base class, and we set up a state dictionary that stores the velocity vector for each parameter.
Python3

import torch
import torch.nn as nn


class MomentumOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super(MomentumOptimizer, self).__init__(params, defaults={'lr': lr})
        self.momentum = momentum
        self.state = dict()
        # Initialize a velocity (momentum) buffer for every parameter
        for group in self.param_groups:
            for p in group['params']:
                self.state[p] = dict(mom=torch.zeros_like(p.data))

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                if p not in self.state:
                    self.state[p] = dict(mom=torch.zeros_like(p.data))
                mom = self.state[p]['mom']
                # v <- momentum * v - lr * grad
                mom = self.momentum * mom - group['lr'] * p.grad.data
                self.state[p]['mom'] = mom  # persist the updated velocity
                p.data += mom
The Step Method:
The step method updates the parameters of the model. It takes no arguments and updates both the optimizer's internal state and the model parameters. In the step method of MomentumOptimizer above, we iterate over all the parameters of the model and check whether they are in the state dictionary; if not, we add them with an initial velocity vector of zero. We then compute the new velocity vector from the momentum and the learning rate, store it back in the state, and use it to update the parameter's value.
Using the custom optimizer is just like using a built-in optimizer: we instantiate it with the model's parameters and the hyperparameters.
Illustration 1:
Let’s create a simple training loop that shows how to use the custom optimizer to train a model. The loop would perform the following steps:
- Initialize the gradients of the model’s parameters to zero using the optimizer’s zero_grad method.
- Compute the forward pass of the model on some input data and calculate the loss.
- Compute the gradients of the model’s parameters with respect to the loss using the backward method.
- Call the step method of the optimizer to update the model’s parameters based on the current gradients and the optimizer’s internal state.
Step 1: Import the necessary libraries:
Python3

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
Step 2: Define a custom optimizer class that inherits from torch.optim.Optimizer. In this example, we will create a custom optimizer that implements the Momentum optimization algorithm.
Python3

class MomentumOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super(MomentumOptimizer, self).__init__(params, defaults={'lr': lr})
        self.momentum = momentum
        self.state = dict()
        for group in self.param_groups:
            for p in group['params']:
                self.state[p] = dict(mom=torch.zeros_like(p.data))

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                if p not in self.state:
                    self.state[p] = dict(mom=torch.zeros_like(p.data))
                mom = self.state[p]['mom']
                mom = self.momentum * mom - group['lr'] * p.grad.data
                self.state[p]['mom'] = mom
                p.data += mom
Step 3: Define a simple model and loss function, and initialize an instance of the custom optimizer:
Python3

model = nn.Linear(2, 2)
criterion = nn.MSELoss()
optimizer = MomentumOptimizer(model.parameters(), lr=1e-3, momentum=0.9)
Step 4: Generate some random data to train the model
Python3

X = torch.randn(100, 2)
y = torch.randn(100, 2)  # targets shaped to match the model's two outputs
Step 5: Train the model with the custom optimizer and plot the training loss.
Python3

for i in range(2500):
    optimizer.zero_grad()
    y_pred = model(X)
    loss = criterion(y_pred, y)
    if i % 100 == 0:
        plt.plot(i, loss.item(), 'ro-')
    loss.backward()
    optimizer.step()

plt.title('Losses over iterations')
plt.xlabel('iterations')
plt.ylabel('Losses')
plt.show()
Output:
[Plot: training losses over iterations]
You will notice that your custom optimizer is correctly updating the parameters of the model and minimizing the loss function.
Note: The above loop is an example of how to use the custom optimizer; it should help you understand how the optimizer's step method works.
Customizing Optimizers:
There are many ways to customize optimizers in PyTorch. Some of them are as follows:
Changing the learning rate schedule:
The learning rate of the optimizer can be changed during training using a learning rate scheduler. PyTorch provides several built-in schedulers, such as torch.optim.lr_scheduler.StepLR and torch.optim.lr_scheduler.ExponentialLR. We can also create our own scheduler by inheriting from the torch.optim.lr_scheduler._LRScheduler class (a sketch of such a scheduler follows the example below).
In the code below, we use the torch.optim.lr_scheduler.StepLR scheduler, which multiplies the learning rate by a factor of gamma every step_size epochs.
Python3

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

num_epochs = 200
for i in range(num_epochs):
    optimizer.zero_grad()
    y_pred = model(X)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    scheduler.step()
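For the custom-scheduler route mentioned above, here is a minimal sketch that inherits from torch.optim.lr_scheduler._LRScheduler. The class name HalvingLR and its halve-every-period-epochs rule are illustrative assumptions for this sketch, not a built-in scheduler.
Python3

import torch


class HalvingLR(torch.optim.lr_scheduler._LRScheduler):
    # Illustrative rule: halve the learning rate every `period` calls to step().

    def __init__(self, optimizer, period=10, last_epoch=-1):
        self.period = period
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # self.base_lrs holds each param group's initial lr;
        # self.last_epoch counts completed scheduler steps.
        factor = 0.5 ** (self.last_epoch // self.period)
        return [base_lr * factor for base_lr in self.base_lrs]


# Used like any built-in scheduler: call scheduler.step() once per epoch.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = HalvingLR(optimizer, period=10)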
Adding regularization:
To add regularization to the optimizer, we can modify the step() method to include the regularization term in the update of the model parameters. For example, we can add L1 or L2 regularization by modifying the step() method to include a term that penalizes the absolute or squared values of the parameters respectively.
Python3

import math


class MyAdam(torch.optim.Adam):

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0):
        super().__init__(params, lr=lr, betas=betas)
        self.weight_decay = weight_decay

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients")

                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p.data)
                    state["exp_avg_sq"] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]
                state["step"] += 1

                # Weight decay: add weight_decay * p to the gradient
                if self.weight_decay != 0:
                    grad = grad.add(p.data, alpha=self.weight_decay)

                # Update biased first and second moment estimates
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group["eps"])
                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]
                step_size = group["lr"] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(exp_avg, denom, value=-step_size)


optimizer = MyAdam(model.parameters(), weight_decay=0.00002)
In the above code, we create a custom Adam optimizer that includes weight decay regularization by adding a weight_decay parameter to the optimizer and modifying the step() method to include the weight decay term in the parameter update. The weight decay term is applied to the gradients by grad = grad.add(p.data, alpha=self.weight_decay), which penalizes large parameter values by shrinking their updates.
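The same pattern works for L1 regularization: penalize the absolute values of the parameters by adding their sign, scaled by a penalty coefficient, to the gradient before the update. The sketch below applies this to plain SGD rather than Adam for brevity; the class name SGDWithL1 and its l1 parameter are illustrative, not part of PyTorch.
Python3

import torch


class SGDWithL1(torch.optim.Optimizer):

    def __init__(self, params, lr=0.01, l1=0.0):
        defaults = dict(lr=lr, l1=l1)
        super().__init__(params, defaults)

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if group['l1'] != 0:
                    # d/dp (l1 * |p|) = l1 * sign(p), added to the gradient
                    grad = grad.add(torch.sign(p.data), alpha=group['l1'])
                # Standard SGD update with the regularized gradient
                p.data.add_(grad, alpha=-group['lr'])


# Example usage: optimizer = SGDWithL1(model.parameters(), lr=0.01, l1=1e-4)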
Implementing a new optimization algorithm:
PyTorch provides several built-in optimization algorithms, such as SGD, Adam, and Adagrad. However, there are many other optimization algorithms that are not included in the library. By creating a custom optimizer, we can implement any optimization algorithm that we want.
Python3

class MyOptimizer(torch.optim.Optimizer):

    def __init__(self, params, lr=0.01):
        defaults = dict(lr=lr)
        super(MyOptimizer, self).__init__(params, defaults)

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # Update based on the squared gradient instead of the gradient
                p.data = p.data - group['lr'] * p.grad.data ** 2


optimizer = MyOptimizer(model.parameters(), lr=0.001)
In this example, we created a new optimization algorithm, MyOptimizer, that updates the parameters based on the squared gradient values instead of the gradients themselves.
Using multiple optimizers:
In some cases, we may want to use different optimizers for different parts of the model. For example, we may want to use Adam for the parameters of the convolutional layers, and SGD for the parameters of the fully-connected layers. This can be achieved by creating multiple instances of the optimizer, one for each set of parameters.
Python3

params1 = model.conv_layers.parameters()
params2 = model.fc_layers.parameters()

optimizer1 = torch.optim.Adam(params1)
optimizer2 = torch.optim.SGD(params2, lr=0.01)

for i in range(num_epochs):
    ...
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    loss.backward()
    optimizer1.step()
    optimizer2.step()
In this example, we are using Adam optimizer for the parameters of the convolutional layers, and SGD optimizer with a fixed learning rate of 0.01 for the parameters of the fully-connected layers. This can help fine-tune the training of specific parts of the model.
Illustration 2:
Build a handwritten digit classification model using a custom optimizer.
Step 1:
Import the necessary libraries
Python3

import torch
import torch.nn as nn
from torch.optim import Optimizer
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torch.utils.tensorboard import SummaryWriter
import math
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Step 2:
Now, we’ll load the MNIST dataset, and create a data loader for it.
Python3

dataset = MNIST(root='.', train=True, download=True, transform=ToTensor())
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
dataloader.dataset
Output:
Dataset MNIST
    Number of datapoints: 60000
    Root location: .
    Split: Train
    StandardTransform
Transform: ToTensor()
Step 3:
Let’s visualize the first batch of our dataset.
Python3

# Visualize the first batch of images
for i, batch in enumerate(dataloader):
    figure = plt.figure(figsize=(16, 16))
    img, label = batch
    for j in range(img.shape[0]):
        figure.add_subplot(8, 8, j + 1)
        plt.imshow(img[j].squeeze(), cmap="gray")
        plt.title(label[j].item())
        plt.axis("off")
    plt.show()
    break
Output:
[Figure: the first batch of input images]
Step 4:
Next, we’ll define our model architecture, a simple fully connected network with two hidden layers
Python3

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x


model = Net().to(device)
Step 5:
Define the loss function; in this case, we'll use the cross-entropy loss.
Python3

loss_fn = nn.CrossEntropyLoss()
Step 6:
Next, we'll define our custom optimizer.
Python3

class MyAdam(torch.optim.Adam):

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0):
        super().__init__(params, lr=lr, betas=betas)
        self.weight_decay = weight_decay

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients")

                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p.data)
                    state["exp_avg_sq"] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]
                state["step"] += 1

                # Weight decay: add weight_decay * p to the gradient
                if self.weight_decay != 0:
                    grad = grad.add(p.data, alpha=self.weight_decay)

                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group["eps"])
                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]
                step_size = group["lr"] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(exp_avg, denom, value=-step_size)


optimizer = MyAdam(model.parameters(), weight_decay=0.00001)
Step 7:
Now, train the model with the custom optimizer and plot the training loss.
Python3

num_epochs = 10
for i in range(num_epochs):
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    plt.plot(i, loss.item(), 'ro-')
    print(i, '>> Loss :', loss.item())

plt.title('Losses over iterations')
plt.xlabel('iterations')
plt.ylabel('Losses')
plt.show()
Output:
0 >> Loss : nan
1 >> Loss : 1.2611686178923354e-44
2 >> Loss : nan
3 >> Loss : 8.407790785948902e-45
4 >> Loss : nan
5 >> Loss : 1.401298464324817e-45
6 >> Loss : nan
7 >> Loss : 0.0
8 >> Loss : nan
9 >> Loss : 1.401298464324817e-45

Losses
Note: Losses will be different for different devices.
Conclusion:
Creating custom optimizers in PyTorch is a powerful technique that allows us to fine-tune the training process of a machine learning model. By inheriting from the torch.optim.Optimizer class and implementing the __init__ and step methods (and, if needed, zero_grad), we can implement our own optimization algorithm, add regularization, change the learning rate schedule, or use multiple optimizers. Custom optimizers can help improve the performance of a model and make it better suited to a specific problem.