Initialize weights in PyTorch
When building a neural network, the layers must be initialized with some starting weights, which the training process then optimizes. The way those weights are initialized affects how quickly the model reaches a good solution and whether it runs into vanishing or exploding gradients. In this article, we look at how to initialize weights effectively using the PyTorch machine learning framework.
Why initialize weights?
Initializing the weights of a neural network is a vital step in the training process, since the choice of initialization directly affects how quickly and how well the network converges. If all weights start from the same value, every neuron in a layer computes the same output and receives the same gradient, so the model can get stuck in the same suboptimal solution regardless of the optimization algorithm used (the short sketch below illustrates this symmetry problem).
Weights initialized to very large values can lead to vanishing or exploding gradients, depending on the activation function, which makes the model converge slowly or not at all. Small random initial values generally lead to more stable training, because the early updates stay well scaled. Different initialization methods suit different types of problems and model architectures.
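As a quick, self-contained illustration of the symmetry problem (the tiny network and constant value below are illustrative, not from a real experiment), the following sketch initializes every weight of a two-layer network to the same constant and shows that both hidden neurons receive exactly the same gradient, so no amount of training can make them differ:
Python3
import torch

torch.manual_seed(0)

# A tiny network whose weights are all set to the same constant
net = torch.nn.Sequential(
    torch.nn.Linear(3, 2, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(2, 1, bias=False),
)
torch.nn.init.constant_(net[0].weight, 0.5)
torch.nn.init.constant_(net[2].weight, 0.5)

x = torch.randn(4, 3)   # a small random batch
loss = net(x).sum()     # dummy loss, just to produce gradients
loss.backward()

# Both rows (both hidden neurons) receive exactly the same gradient,
# so they stay identical after every update
print(net[0].weight.grad)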
Using the nn.init Module for Weights Initialization
The PyTorch nn.init module is the conventional way to initialize weights in a neural network; it provides a range of weight initialization methods, such as:
- Uniform initialization
- Xavier initialization
- Kaiming initialization
- Zeros initialization
- Ones initialization
- Normal initialization
An example implementation of each is provided below.
Uniform Initialization
Using a uniform distribution to initialize the weights can help prevent the ‘vanishing gradient’ problem, as the distribution has a finite range and the weights are distributed evenly across that range. However, this method can suffer from the ‘exploding gradient’ problem if the range is too large.
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)
torch.nn.init.uniform_(linear_layer.weight)
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[-0.1768, -0.4942],
[ 0.0756, -0.0967],
[-0.3923, 0.3283]], requires_grad=True)
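The bounds of the distribution can also be set explicitly. Below is a minimal sketch (the bounds -0.1 and 0.1 are just an example choice) that keeps the initial weights small and centred at zero:
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)

# uniform_ defaults to the range [0, 1); passing a and b explicitly
# controls how large the initial weights can be
torch.nn.init.uniform_(linear_layer.weight, a=-0.1, b=0.1)
print(linear_layer.weight)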
Xavier Initialization
Using Xavier initialization can help prevent the ‘vanishing gradient’ problem, as it scales the weights such that the variance of the outputs of each layer is the same as the variance of the inputs.
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)
torch.nn.init.xavier_uniform_(linear_layer.weight)
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[ 0.4442, -0.3890],
[-0.2876, -0.3379],
[-0.5261, 0.5227]], requires_grad=True)
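nn.init also provides a normal-distribution variant, xavier_normal_, along with a gain argument that rescales the variance for the activation following the layer. A short sketch (the tanh choice is just an example):
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)

# calculate_gain returns the recommended scaling factor for the given
# activation; xavier_normal_ draws from a normal distribution instead
gain = torch.nn.init.calculate_gain("tanh")
torch.nn.init.xavier_normal_(linear_layer.weight, gain=gain)
print(linear_layer.weight)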
Kaiming Initialization
Using Kaiming initialization can help prevent the ‘vanishing gradient’ problem, as it scales the weights such that the variance of the outputs is the same as the variance of the inputs, taking into account the nonlinearity of the activation function.
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)
torch.nn.init.kaiming_uniform_(linear_layer.weight,
                               a=0, mode="fan_in",
                               nonlinearity="relu")
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[ 0.0582, 0.4701],
[ 0.4982, 0.5452],
[-0.0384, 0.5999]], requires_grad=True)
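A normal-distribution variant, kaiming_normal_, is also available, and the mode argument can be switched to "fan_out" to preserve the variance of the gradients in the backward pass rather than the activations in the forward pass. A small sketch with a convolutional layer (the layer sizes are arbitrary):
Python3
import torch

conv_layer = torch.nn.Conv2d(3, 16, kernel_size=3)

# mode="fan_out" scales by the number of output connections, a common
# choice for conv layers followed by ReLU
torch.nn.init.kaiming_normal_(conv_layer.weight,
                              mode="fan_out",
                              nonlinearity="relu")
print(conv_layer.weight.shape)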
Zeros and Ones Initialization
Initializing the weights to zeros makes every neuron in a layer identical, so they all receive the same update and the model converges slowly, if at all. This can also lead to the ‘vanishing gradient’ problem.
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)
torch.nn.init.zeros_(linear_layer.weight)
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[0., 0.],
[0., 0.],
[0., 0.]], requires_grad=True)
Initializing the weights to ones suffers from the same symmetry problem, since all of the weights are updated in the same way; because the values are larger, it can also lead to the ‘exploding gradient’ problem.
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)
torch.nn.init.ones_(linear_layer.weight)
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[1., 1.],
[1., 1.],
[1., 1.]], requires_grad=True)
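Although constant values are a poor choice for weights, zero initialization is routinely used for biases together with a random scheme for the weights. A minimal sketch of that common pattern:
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)

# Random weights and zero biases, a typical default combination
torch.nn.init.kaiming_uniform_(linear_layer.weight, nonlinearity="relu")
torch.nn.init.zeros_(linear_layer.bias)
print(linear_layer.bias)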
Normal Initialization
Using a normal distribution to initialize the weights keeps the initial values centred around the mean, with most samples falling within a few standard deviations of it; choosing a small standard deviation therefore helps reduce the risk of the ‘exploding gradient’ problem. It must be noted that the neural network’s performance is not determined by the weights alone; the learning rate, the optimization algorithm and the other hyperparameters also play a crucial role in how efficiently the network trains.
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)
torch.nn.init.normal_(linear_layer.weight,
                      mean=0, std=1)
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[-0.1759, 0.5192],
[-0.5621, -0.3871],
[-0.6071, 0.3538]], requires_grad=True)
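If the unbounded tails of the normal distribution are a concern, recent PyTorch versions also provide trunc_normal_, which redraws any sample falling outside a fixed interval. A short sketch (the standard deviation and bounds below are just example values):
Python3
import torch

linear_layer = torch.nn.Linear(2, 3)

# Truncated normal: values outside [a, b] are resampled, so the
# largest initial weight is bounded
torch.nn.init.trunc_normal_(linear_layer.weight,
                            mean=0.0, std=0.02,
                            a=-0.04, b=0.04)
print(linear_layer.weight)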
Applying a Custom Function for Weights Initialization
An alternative method is to write a custom function that initializes the weights and apply it to the layer (or to a whole model) with the apply method.
Python3
import torch

def custom_weights(m):
    torch.nn.init.uniform_(m.weight, -0.5, 0.5)

linear_layer = torch.nn.Linear(2, 3)
linear_layer.apply(custom_weights)
print(linear_layer.weight)
Output:
Parameter containing:
tensor([[ 0.4341, -0.3424],
[ 0.2095, 0.1782],
[-0.4244, 0.1719]], requires_grad=True)
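In practice, apply is usually called on a whole model rather than a single layer; since it visits every submodule, the function should check the module type so that only the intended layers are touched. A sketch of that pattern (the init_weights name and the layer sizes are illustrative):
Python3
import torch

def init_weights(m):
    # apply() visits every submodule, so only touch the Linear layers
    if isinstance(m, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        torch.nn.init.zeros_(m.bias)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
model.apply(init_weights)
print(model[0].weight)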
Using a user-defined Layer Class for Weights Initialization
Another method is to create a user-defined class that inherits from torch.nn.Module and initialize the weights inside its constructor.
Python3
import torch

class MyLayer(torch.nn.Module):
    def __init__(self, independent, dependent):
        super(MyLayer, self).__init__()
        self.linear = torch.nn.Linear(independent, dependent)
        torch.nn.init.uniform_(self.linear.weight, -0.5, 0.5)

    def forward(self, x):
        return self.linear(x)

linear_layer = MyLayer(2, 3)
print(linear_layer.linear.weight)
Output:
Parameter containing:
tensor([[-0.1566, 0.2461],
[-0.3361, -0.0551],
[ 0.4607, 0.3077]], requires_grad=True)
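A related convention, mirroring PyTorch’s built-in layers, is to put the initialization logic in a reset_parameters method called from the constructor, so the weights can be re-initialized on demand. A sketch of that variant (the same layer as above, with an added bias initialization):
Python3
import torch

class MyLayer(torch.nn.Module):
    def __init__(self, independent, dependent):
        super().__init__()
        self.linear = torch.nn.Linear(independent, dependent)
        self.reset_parameters()

    def reset_parameters(self):
        # Keeping the init logic here lets the weights be re-initialized
        # at any time, just like PyTorch's built-in layers
        torch.nn.init.uniform_(self.linear.weight, -0.5, 0.5)
        torch.nn.init.zeros_(self.linear.bias)

    def forward(self, x):
        return self.linear(x)

layer = MyLayer(2, 3)
print(layer.linear.weight)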
In conclusion, initializing the weights of a neural network model is an important step in the training process, as it can have a significant impact on the model’s performance. PyTorch provides several built-in initialization methods, including uniform, normal, Xavier, Kaiming, ones, and zeros. Each of these methods has its own advantages and disadvantages, and the choice of method will depend on the specific problem and model architecture being used. It is important to choose an initialization method that is suitable for the problem at hand, as it can help prevent vanishing or exploding gradient problems and improve the convergence speed and final accuracy of the model.