
The Backpropagation algorithm for a math student

Saeed Damadi¹, Golnaz Moharrer², Mostafa Cham², Jinglai Shen¹
¹Department of Mathematics and Statistics, ²Department of Information Systems
University of Maryland, Baltimore County (UMBC)
Baltimore, USA
{sdamadi1, golnazm1, mcham2, shenj}@umbc.edu

Abstract—A Deep Neural Network (DNN) is a composite function of vector-valued functions, and in order to train a DNN, it is necessary to calculate the gradient of the loss function with respect to all parameters. This calculation can be a non-trivial task because the loss function of a DNN is a composition of several nonlinear functions, each with numerous parameters. The Backpropagation (BP) algorithm leverages the composite structure of the DNN to efficiently compute the gradient. As a result, the number of layers in the network does not significantly impact the complexity of the calculation. The objective of this paper is to express the gradient of the loss function in terms of a matrix multiplication using the Jacobian operator. This can be achieved by considering the total derivative of each layer with respect to its parameters and expressing it as a Jacobian matrix. The gradient can then be represented as the matrix product of these Jacobian matrices. This approach is valid because the chain rule can be applied to a composition of vector-valued functions, and the use of Jacobian matrices allows for the incorporation of multiple inputs and outputs. By providing concise mathematical justifications, the results can be made understandable and useful to a broad audience from various disciplines.

Fig. 1. Relationship between training of a NN and the BP algorithm.

I. INTRODUCTION

Understanding the process of training (Deep) Neural Networks ((D)NNs), as illustrated in Fig. 1, is not straightforward because it involves many detailed parts. Additionally, modern and sophisticated DNNs are often provided as pre-trained models in Python packages such as PyTorch [1] and TensorFlow [2], which can make it difficult for users to fully understand the training process. These packages abstract away many of the implementation details, making it more accessible for users to use these models for their own tasks, but also making it less transparent for the user to understand the inner workings of the model.

These pre-trained, off-the-shelf models can solve a variety of tasks such as computer vision or language processing. Convolutional neural networks (CNNs) are currently the most widely used architecture for image classification and other computer vision tasks. Some examples of successful CNN architectures for image classification include ResNet [3], Inception [4], DenseNet [5], and EfficientNet [6]. CNN models are also capable of solving image segmentation and object detection, which can be done using U-Net [7] and YOLO [8]. For natural language processing tasks, transformer models like BERT [9], GPT-3 [10], and T5 [11] have achieved state-of-the-art performance on many benchmarks.

Fig. 1 shows the training process that utilizes the Stochastic Gradient Descent (SGD) algorithm [12] to minimize the loss function of a DNN. As its name suggests, the SGD algorithm requires calculating the gradient¹ of the loss function of a DNN. All different variants of the SGD algorithm require calculating at least a single gradient associated with a single sample, i.e., x in Fig. 1. This calculation is a non-trivial task because the loss function of a DNN is a composition of several nonlinear vector-valued functions, each of which has numerous parameters. The Backpropagation (BP) algorithm, introduced by [13], is an efficient way to calculate the gradient of the loss function of a DNN. This algorithm leverages the composite structure of a DNN to efficiently calculate the gradient of the loss function with respect to the model's parameters, i.e., θ in Fig. 1.

In this paper we calculate the gradient of the loss function of a DNN associated with a single sample. The gradient will be derived as the matrix multiplication of Jacobian matrices². The derivation will be done by considering the total derivative of each layer with respect to its parameters and expressing it as a Jacobian matrix. The gradient can then be represented as the matrix product of these Jacobian matrices. This approach is well-founded because the chain rule is valid for the Jacobian operator; hence, the Jacobian operator can be applied to a composition of vector-valued functions. We provide concise mathematical justifications so the results can be made understandable and useful to a broad audience from various disciplines, even those without a deep understanding of the mathematics involved. This is particularly important when communicating complex technical concepts to non-experts, as it allows for a clear and accurate understanding of the results. Additionally, using mathematical notation allows for precise and unambiguous statements of results, which can facilitate replication and further research in the field.

¹ Please refer to Def. 2 in the Appendix for the definition of the gradient of a scalar-valued function.
² Please refer to Def. 1 in the Appendix for the definition of the Jacobian matrix of a vector-valued function.

Algorithm 1 The backpropagation algorithm
Require: Given an L-layer DNN or NN with a loss function ℓ, and a data pair (x, y). Let a[0] := x.
1: Calculate ∇_{z[L]} ℓ(y, f(z[L])) from Tab. I.
2: for l = 1, . . . , L do
3:   if l ≠ L then

         J_{W[l],b[l]} z[L] = (W[L])^⊤ J_{z[L−1]} f[L−1](z[L−1]) (W[L−1])^⊤ · · · (W[l+1])^⊤ J_{z[l]} f[l](z[l])
                              × [ (a[l−1])^⊤      0           0        ]
                                [      0         ...          0      I ]
                                [      0          0      (a[l−1])^⊤    ]

4:   else if l = L then

         J_{W[L],b[L]} z[L] = [ (a[L−1])^⊤      0           0        ]
                              [      0         ...          0      I ]
                              [      0          0      (a[L−1])^⊤    ]

5: Construct

         J_θ z[L](θ) = [ J_{W[1],b[1]} z[L](θ)  · · ·  J_{W[L],b[L]} z[L](θ) ].

6: Calculate the gradient

         ∇_θ ℓ(y, ŷ(θ)) = (J_θ z[L](θ))^⊤ ∇_{z[L]} ℓ(y, f(z[L])).

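To make the iteration in Alg. 1 concrete, the following NumPy sketch (our own illustration, not part of the original paper) carries out the same computation for a fully connected network with ReLU hidden layers and a softmax plus cross-entropy last layer, so that ∇_{z[L]} ℓ = −(y − ŷ) as in Tab. I; the function names forward and backprop_alg1 are assumed helpers.

import numpy as np

def forward(x, Ws, bs):
    """Return pre-activations z[l] and activations a[l], with a[0] := x."""
    a, zs = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W.T @ a[-1] + b                      # z[l] = (W[l])^T a[l-1] + b[l]
        zs.append(z)
        if l < len(Ws) - 1:                      # hidden layers use ReLU
            a.append(np.maximum(z, 0.0))
        else:                                    # last layer uses softmax
            e = np.exp(z - z.max())
            a.append(e / e.sum())
    return zs, a

def backprop_alg1(x, y, Ws, bs):
    """Gradient of CE(y, softmax(z[L])) w.r.t. theta = [Vec(W[1]); b[1]; ...; Vec(W[L]); b[L]]."""
    L = len(Ws)
    zs, a = forward(x, Ws, bs)
    grad_z = -(y - a[-1])                        # step 1: from Tab. I, for a one-hot y
    blocks = []
    for l in range(L):                           # steps 2-4: layer l+1 in the paper's 1-based indexing
        n_out = Ws[l].shape[1]
        # local Jacobian J_{W,b} z = [blockdiag(a[l-1]^T, ..., a[l-1]^T) | I]
        J = np.hstack([np.kron(np.eye(n_out), a[l][None, :]), np.eye(n_out)])
        for k in range(l + 1, L):                # chain through the layers above
            relu_jac = np.diag((zs[k - 1] > 0).astype(float))   # J_z f(z) for ReLU
            J = Ws[k].T @ relu_jac @ J           # multiply by (W[k+1])^T J_{z[k]} f[k](z[k])
        blocks.append(J.T @ grad_z)              # step 6, one block of (J_theta z[L])^T grad_z
    return np.concatenate(blocks)                # step 5: concatenation over layers

# Example usage: a 4-3-2 network on one sample; theta has 4*3+3 + 3*2+2 = 23 entries.
rng = np.random.default_rng(0)
sizes = [4, 3, 2]
Ws = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(n) for n in sizes[1:]]
g = backprop_alg1(rng.standard_normal(4), np.array([0.0, 1.0]), Ws, bs)
print(g.shape)                                   # (23,)
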
Our results are summarized in Alg. 1 for a network with L layers. As Alg. 1 shows, the matrix multiplication is done iteratively. The iterative nature comes from the fact that the loss function of a DNN is defined as a composition of L + 1 functions, where the last one, ℓ, is the final function which measures the loss (error) between the prediction and the actual value, i.e., ℓ(y, ŷ) in Fig. 1, where ŷ is the prediction and y the actual value. The algorithm presented in Alg. 1 is explained and justified by calculating the gradient of the loss function for networks with one, two, and three layers. These calculations provide insights into the gradients of loss functions for generic neural networks and demonstrate how the gradient of a single-layer network can serve as a model for the last layer of any DNN. Additionally, the calculation for a two-layer network is used to extend the calculation beyond two-layer networks, as seen in the calculation of the gradient of the loss function for LeNet-100-300 [14], which is a three-layer network. Finally, we show how convolutional layers can be converted to linear layers in order to calculate their Jacobian matrices. These results can be used to calculate the gradient of the loss function of a CNN.

II. NOTATION

The letters x, x, and W denote a scalar, a vector, and a matrix, respectively. The letter I represents the identity matrix. The i-th element of a vector x is denoted by xi. Likewise, wij denotes the ij-th element of a matrix W located at the i-th row and j-th column (sometimes written as wi,j for clarity). Also, Wi• and W•j denote the i-th row and the j-th column of the matrix, respectively. The vector form of a matrix W is denoted by Vec(W), where each column of W is stacked on top of each other, with the first column at the top. The letter f is reserved for a vector-valued nonlinear activation function of a layer in a DNN (NN), where z and a are its input and output, respectively, i.e., a = f(z). The letter x is reserved for the input to a DNN (NN), and y or y are reserved for the scalar or vector label of the input x. The predictions of a DNN (NN) associated with y or y are denoted by ŷ or ŷ, respectively. A superscripted index inside square brackets denotes the layer of a DNN (NN), e.g., z[l] is the z vector corresponding to the l-th layer.

III. RESULT

As we have explained earlier, the goal of this paper is to take the first step towards training a DNN, which involves calculating the gradient of a loss function with respect to all parameters of the DNN, i.e., ∇_θ ℓ(y, ŷ(θ)), where θ is the vector of parameters, ŷ is the predicted value by the network, y is the true value, and ℓ(y, ŷ(θ)) is the loss incurred to predict the output. To achieve this goal, two important observations can make the task easier. First, it involves separating the last-layer activation function from the network. Second, it involves using the relationship between the Jacobian operator and the gradient.

To fulfill the first step, we observe the following equality:

    ℓ(y, ŷ(θ)) = ℓ(y, f[L](z[L](θ)))

which is a consequence of the fact that ŷ = f[L](z[L](θ)), where f[L] is the last activation layer and z[L](θ) is its corresponding input. This equality can be illustrated more clearly as shown in Fig. 2.

Fig. 2. Separating the last-layer activation from a DNN.

The second observation utilizes the relationship between the Jacobian operator and the gradient of a scalar-valued function, stated in the following lemma.

Lemma 1 (Jacobian and gradient of a scalar-valued function): For f : Rn → R a scalar-valued differentiable function, ∇_x f(x) = (J_x f(x))^⊤, where ∇_x f(x) is the gradient and J_x f(x) is the Jacobian matrix of f at the point x ∈ Rn, respectively.

Proof 1: The equality follows from the definition of a Jacobian matrix as given in Def. 1 and the definition of the gradient of a scalar-valued function as given in Def. 2 in the Appendix.

By using the second observation, ∇_x f(x) = (J_x f(x))^⊤, and making use of the first one, ℓ(y, ŷ(θ)) = ℓ(y, f[L](z[L](θ))), one can write the following:

    ∇_θ ℓ(y, ŷ(θ)) = (J_θ ℓ(y, ŷ(θ)))^⊤
                   = (J_θ ℓ(y, f[L](z[L](θ))))^⊤
                   = (J_{z[L]} ℓ(y, f[L](z[L])) J_θ z[L](θ))^⊤
                   = (J_θ z[L](θ))^⊤ (J_{z[L]} ℓ(y, f[L](z[L])))^⊤
                   = (J_θ z[L](θ))^⊤ ∇_{z[L]} ℓ(y, f[L](z[L]))                    (1)

The advantage of Equation (1) is that it separates the original gradient calculation into two separate calculations, i.e., J_θ z[L](θ) and ∇_{z[L]} ℓ(y, f[L](z[L])).

The calculation of the second term is straightforward because the choice of a loss function and the last activation function in a DNN are not arbitrary. This is illustrated in Fig. 3, which shows three common combinations, each associated with a different problem.

Fig. 3. Last-layer activation function combined with its associated loss.

In a binary classification problem, a sigmoid function is used as the last activation function together with the Binary Cross Entropy (BCE) loss, as defined in Definition 3 and Definition 5 in the Appendix, respectively. Similarly, in a classification problem with more than two classes, a softmax function is used as the last activation function together with the Cross Entropy (CE) loss, as defined in Definition 4 and Definition 6 in the Appendix, respectively. In a regression problem, where the goal is to predict a continuous value, the last-layer activation function is an identity function, and the Square Error (SE) is used as the loss function, as defined in Definition 7. This means that the predicted output ŷ is equal to the final layer's output z[L]. Table I shows the combination of the last activation function with its corresponding loss and the expression for ∇_{z[L]} ℓ(y, f[L](z[L])). For the derivations, please refer to the Appendix.

More work is required to calculate the first term, i.e., J_θ z[L](θ). In the following subsections, we show how to derive this calculation concisely for any number of layers.

A. Gradient of a one-layer network

We will now focus on computing J_θ z[L](θ). This calculation can be facilitated by starting with a single-layer neural network that is capable of solving a classification problem. The single-layer network plays a crucial role in gradient computation as it can be considered as the last layer of any deep neural network (DNN). Due to the presence of only one layer, the superscripts in z[L] and f[L] can be omitted, giving us J_θ z(θ) := J_θ z[L](θ). The network depicted in Fig. 4 is designed to perform three-class classification based on inputs with four features. Consequently, the weight matrix W is 4 × 3 and the bias vector b is 3 × 1, i.e., W ∈ R4×3 and b ∈ R3×1. As shown in the concise representation of the network in the bottom part of Fig. 4, z ∈ R3, i.e., z = W^⊤x + b. This vector z is a vector-valued function involving 15 parameters, where these parameters are all the elements of W and b, i.e., 15 = 4 × 3 + 3 × 1. According to the definition of the Jacobian matrix as given in Def. 1 in the Appendix, J_{W,b} z(W, b) is a 3 × 15 matrix, i.e., J_{W,b} z(W, b) ∈ R3×15.

For notational simplicity, all the parameters are denoted by θ, which is a vector in R15, and is constructed by the process of vectorization. The vectorization process stacks each column of W on top of each other, with the first column on the top, to create Vec(W).

TABLE I
LAST-LAYER ACTIVATION FUNCTION COMBINED WITH ITS ASSOCIATED LOSS

Loss | ℓ(y, ŷ)                          | ∇_{z[L]} ℓ(y, ŷ) = ∇_{z[L]} ℓ(y, f[L](z[L])) | Proof
BCE  | −y log(ŷ) − (1 − y) log(1 − ŷ)   | −(y − ŷ)    (y, ŷ ∈ R)                       | Appendix A
CE   | −Σ_{i=1}^{c} yi log(ŷi)          | −(y − ŷ)    (y, ŷ ∈ Rc)                      | Appendix B
SE   | ∥y − ŷ∥²                         | −2(y − ŷ)   (y, ŷ ∈ Rm)                      | Appendix C

Then, the vector of network parameters can be written as θ^⊤ := [Vec(W)^⊤ (b)^⊤]^⊤. Therefore, to calculate J_θ z(θ) one can write the following:

    J_θ z(θ) = J_{W,b}(W^⊤ x + b)

             = J_{W,b} [ (W•1)^⊤ x + b1 ]
                       [ (W•2)^⊤ x + b2 ]
                       [ (W•3)^⊤ x + b3 ]

             = [ x^⊤   0    0   1 0 0 ]
               [ 0    x^⊤   0   0 1 0 ]
               [ 0     0   x^⊤  0 0 1 ]

             = [ x^⊤   0    0         ]
               [ 0    x^⊤   0    I3×3 ]  ∈ R3×15
               [ 0     0   x^⊤        ]

where W•i is a column of W for i = 1, 2, 3, and I3×3 appears because the derivatives of the elements of z with respect to the components of b are either zero or one. Therefore, the gradient of the loss function is calculated as follows:

    ∇_θ ℓ(y, ŷ(θ)) = (J_θ z(θ))^⊤ ∇_z ℓ(y, f(z))

                     [ x   0   0  ]
                   = [ 0   x   0  ]
                   − [ 0   0   x  ] (y − ŷ)
                     [    I3×3    ]

where the value of ∇_z ℓ(y, f(z)) is obtained from Tab. I. The above gradient is similar to the gradient of one-layer networks whose weights are vectors, not matrices, as shown in Tab. II. As can be seen from Tab. II, famous problems such as simple/multiple linear regression, simple binary classification, and logistic regression can be written as a one-layer network whose weights are vectors, not matrices.

Although a one-layer network provides a valuable understanding of the relationship between the Jacobian and the gradient of the loss function with respect to z, it is not practical in terms of performance, i.e., accuracy in prediction. To improve performance, adding more layers is recommended. As a result, the following subsection will demonstrate the calculation of the gradient of a two-layer network.

Fig. 4. 3-class classifier with 4 features and a single layer.

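As a numerical illustration of the one-layer case (a sketch of ours, not from the paper), the following snippet builds J_{W,b} z(θ) for the 4-feature, 3-class network of Fig. 4 with a softmax/CE last layer and checks ∇_θ ℓ = (J_θ z(θ))^⊤ (−(y − ŷ)) against central finite differences of the loss.

import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(3)
x = rng.standard_normal(4)
y = np.array([0.0, 1.0, 0.0])                     # one-hot label, 3 classes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(theta):                                  # theta = [Vec(W); b], column-stacked Vec
    Wt, bt = theta[:12].reshape(3, 4).T, theta[12:]
    return -np.log(softmax(Wt.T @ x + bt) @ y)    # cross-entropy for a one-hot y

theta = np.concatenate([W.T.ravel(), b])          # Vec(W) stacks the columns of W
z = W.T @ x + b
J = np.hstack([np.kron(np.eye(3), x[None, :]), np.eye(3)])   # J_{W,b} z, shape (3, 15)
grad = J.T @ (-(y - softmax(z)))                  # Eq. (1) combined with Tab. I

eps = 1e-6                                        # central finite-difference check
num = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                for e in np.eye(15)])
print(np.max(np.abs(grad - num)))                 # a small number, roughly 1e-9
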
TABLE II
FAMOUS PROBLEMS REPRESENTED AS SIMPLE NEURAL NETWORKS
(The architecture diagrams of the original table are omitted; each problem is a one-layer network.)

Simple Linear Regression:
  Input:       x ∈ R
  Parameters:  θ = [w b]^⊤ ∈ R2
  Prediction:  ŷ = wx + b
  Loss:        ℓ = (y − ŷ)²
  Gradient:    ∇_θ ℓ(y, ŷ) = −2(y − ŷ) [x 1]^⊤

Simple Binary Classifier:
  Input:       x = [x1 x2]^⊤ ∈ R2
  Parameters:  θ = [w^⊤ b]^⊤ ∈ R3
  Prediction:  ŷ = σ(θ^⊤ x)
  Loss:        ℓ = −y log(ŷ) − (1 − y) log(1 − ŷ)
  Gradient:    ∇_θ ℓ(y, ŷ) = −(y − σ(θ^⊤ x)) [x^⊤ 1]^⊤

Multiple Linear Regression:
  Input:       x = [x1 · · · xn]^⊤ ∈ Rn
  Parameters:  θ = [w^⊤ b]^⊤ ∈ Rn+1
  Prediction:  ŷ = w^⊤ x + b
  Loss:        ℓ = (y − ŷ)²
  Gradient:    ∇_θ ℓ(y, ŷ) = −2(y − ŷ) [x^⊤ 1]^⊤

Logistic Regression:
  Input:       x = [x1 · · · xn]^⊤ ∈ Rn
  Parameters:  θ = [w^⊤ b]^⊤ ∈ Rn+1
  Prediction:  ŷ = σ(θ^⊤ x)
  Loss:        ℓ = −y log(ŷ) − (1 − y) log(1 − ŷ)
  Gradient:    ∇_θ ℓ(y, ŷ) = −(y − σ(θ^⊤ x)) [x^⊤ 1]^⊤

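The rows of Tab. II are easy to verify numerically. As an example (our own sketch, not part of the paper), the snippet below checks the logistic-regression gradient from Tab. II against finite differences of the BCE loss.

import numpy as np

rng = np.random.default_rng(1)
n = 5
w, b = rng.standard_normal(n), rng.standard_normal()
x, y = rng.standard_normal(n), 1.0                 # label in {0, 1}

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def bce(theta):                                    # theta = [w; b]
    yhat = sigmoid(x @ theta[:n] + theta[n])
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

theta = np.append(w, b)
grad = -(y - sigmoid(w @ x + b)) * np.append(x, 1.0)   # closed form from Tab. II

eps = 1e-6
num = np.array([(bce(theta + eps * e) - bce(theta - eps * e)) / (2 * eps)
                for e in np.eye(n + 1)])
print(np.max(np.abs(grad - num)))                  # small, roughly 1e-10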

B. Gradient of a two-layer network

Studying a two-layer network can not only improve performance (accuracy) but also aid in developing a method for calculating the gradient of any deep neural network (DNN) with multiple layers. To demonstrate this, a two-layer network will be considered, as shown in Fig. 5, where the block-wise model helps in calculating the gradient of its loss.

Fig. 5. Two-layer network.

Similar to a one-layer network, one can write ∇_{z[2]} ℓ(y, f[2](z[2])) = −(y − ŷ) from Tab. I. To calculate J_θ z[2](θ), observe that it can be separated into the parameters of the second and first layers, as is θ. The separation for θ can be written as the following:

    θ^⊤ := [Vec(W[1])^⊤ (b[1])^⊤ Vec(W[2])^⊤ (b[2])^⊤]^⊤

which is used to write the below separation of Jacobian matrices:

    J_θ z[2](θ) = [ J_{W[1],b[1]} z[2](θ)   J_{W[2],b[2]} z[2](θ) ].

Calculating the second term is straightforward because one can write the following:

    J_{W[2],b[2]} z[2](θ) = J_{W[2],b[2]}((W[2])^⊤ a[1] + b[2])

                          = [ (a[1])^⊤      0         0             ]
                            [     0     (a[1])^⊤      0       I3×3  ]
                            [     0         0     (a[1])^⊤          ]

where a[1] ∈ R5 because W[2] ∈ R5×3. To calculate the first term, first observe that W[2] is not a function of W[1] nor b[1]. Then one can write the following:

    J_{W[1],b[1]} z[2](θ) = J_{W[1],b[1]}((W[2])^⊤ a[1] + b[2])
                          = (W[2])^⊤ J_{W[1],b[1]} a[1]
                          = (W[2])^⊤ J_{W[1],b[1]} f[1](z[1])

The last equality is where the chain rule needs to be used in order to obtain an expression in terms of W[1] and b[1], as shown below:

    J_{W[1],b[1]} z[2](θ) = (W[2])^⊤ J_{z[1]} f[1](z[1]) J_{W[1],b[1]}((W[1])^⊤ x + b[1])

                          = (W[2])^⊤ J_{z[1]} f[1](z[1]) [ x^⊤   0    0        ]
                                                         [  0   ...   0    I   ]
                                                         [  0    0   x^⊤       ]

Finally, the gradient would be the following:

    ∇_θ ℓ(y, ŷ(θ)) = (J_θ z[2](θ))^⊤ ∇_{z[2]} ℓ(y, f(z[2]))

                       [ [ x    0    0  ]                               ]
                       [ [ 0   ...   0  ] (J_{z[1]} f[1](z[1]))^⊤ W[2]  ]
                       [ [ 0    0    x  ]                               ]
                   = − [ [     I5×5     ]                               ] (y − ŷ)
                       [ [ a[1]   0     0   ]                           ]
                       [ [  0    a[1]   0   ]                           ]
                       [ [  0     0    a[1] ]                           ]
                       [ [       I3×3       ]                           ]

The next subsection will demonstrate the calculation of the gradient for a three-layer network, thereby illustrating the extension of the two-layer network gradient to an arbitrary number of layers.

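The two-layer formulas above can also be checked numerically. The following sketch (ours, not from the paper) assembles J_{W[1],b[1]} z[2] and J_{W[2],b[2]} z[2] as derived in this subsection for a 4-5-3 ReLU network with a softmax/CE last layer and compares the resulting gradient with finite differences.

import numpy as np

rng = np.random.default_rng(5)
W1, b1 = rng.standard_normal((4, 5)), rng.standard_normal(5)   # layer 1: R^4 -> R^5
W2, b2 = rng.standard_normal((5, 3)), rng.standard_normal(3)   # layer 2: R^5 -> R^3, so W[2] in R^{5x3}
x = rng.standard_normal(4)
y = np.array([1.0, 0.0, 0.0])                                  # one-hot label

softmax = lambda t: np.exp(t - t.max()) / np.exp(t - t.max()).sum()

z1 = W1.T @ x + b1
a1 = np.maximum(z1, 0.0)                                       # f[1] = ReLU
z2 = W2.T @ a1 + b2
yhat = softmax(z2)

# J_{W[2],b[2]} z[2] = [blockdiag(a[1]^T, a[1]^T, a[1]^T) | I_{3x3}]
J2 = np.hstack([np.kron(np.eye(3), a1[None, :]), np.eye(3)])
# J_{W[1],b[1]} z[2] = (W[2])^T J_{z[1]} f[1](z[1]) [blockdiag(x^T, ..., x^T) | I_{5x5}]
J1 = W2.T @ np.diag((z1 > 0).astype(float)) @ np.hstack([np.kron(np.eye(5), x[None, :]), np.eye(5)])

grad = np.hstack([J1, J2]).T @ (-(y - yhat))                   # Eq. (1) with Tab. I

def loss(theta):                                               # theta = [Vec(W1); b1; Vec(W2); b2]
    W1_, b1_ = theta[:20].reshape(5, 4).T, theta[20:25]
    W2_, b2_ = theta[25:40].reshape(3, 5).T, theta[40:]
    a1_ = np.maximum(W1_.T @ x + b1_, 0.0)
    return -np.log(softmax(W2_.T @ a1_ + b2_) @ y)

theta = np.concatenate([W1.T.ravel(), b1, W2.T.ravel(), b2])
eps = 1e-6
num = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                for e in np.eye(43)])
print(np.max(np.abs(grad - num)))                              # small, roughly 1e-8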

C. Gradient of a three-layer network

In this subsection, we evaluate the gradient of a loss function for the LeNet-100-300-10 network architecture, which consists of 100, 300, and 10 units (neurons) in the first, second, and third layers, respectively [14]. This network is used for image classification tasks [15]. The input to the network is a vectorized representation of 28 × 28 digit images (784 elements) and the output is a vector in R10. The input to the LeNet-100-300-10 is a vector x ∈ R784 representing the vectorized image.

Fig. 6. LeNet-100-300.

Similar to the one- and two-layer networks, ∇_{z[3]} ℓ(y, f[3](z[3])) = −(y − ŷ) ∈ R10, which is obtained from Tab. I. To calculate J_θ z[3](θ), three Jacobians are needed to separate out the parameters of the layers, as the following:

    J_θ z[3](θ) = [ J_{W[1],b[1]} z[3](θ)   J_{W[2],b[2]} z[3](θ)   J_{W[3],b[3]} z[3](θ) ].

By following the same steps as for the two-layer network, the gradient can be calculated as:

    ∇_θ ℓ(y, ŷ(θ)) = (J_θ z[3](θ))^⊤ ∇_{z[3]} ℓ(y, f(z[3]))

where the transposed Jacobian matrices of the layers, (J_{W[3],b[3]} z[3](θ))^⊤, (J_{W[2],b[2]} z[3](θ))^⊤, and (J_{W[1],b[1]} z[3](θ))^⊤, are:

    (J_{W[3],b[3]} z[3](θ))^⊤ = [ a[2]   0     0   ]
                                [  0    ...    0   ]
                                [  0     0    a[2] ]
                                [      I10×10      ]

    (J_{W[2],b[2]} z[3](θ))^⊤ = [ a[1]   0     0   ]
                                [  0    ...    0   ] (J_{z[2]} f[2](z[2]))^⊤ W[3]
                                [  0     0    a[1] ]
                                [     I100×100     ]

    (J_{W[1],b[1]} z[3](θ))^⊤ = [ x    0    0 ]
                                [ 0   ...   0 ] (J_{z[1]} f[1](z[1]))^⊤ W[2] (J_{z[2]} f[2](z[2]))^⊤ W[3]
                                [ 0    0    x ]
                                [   I300×300  ]

and ∇_{z[3]} ℓ(y, f(z[3])) = −(y − ŷ), which is consistent with Alg. 1.

D. Jacobian of activation functions

In this subsection we elaborate on J_{z[l]} f[l](z[l]), where f[l] is the l-th activation layer whose corresponding input is z[l], for l = 1, . . . , L − 1. When l ≠ L, the vector f[l](z[l]) is typically obtained by applying a single univariate function f to each element of z[l]. The most common activation function used in DNNs is the Rectified Linear Unit (ReLU) function, defined as f(x) = max(0, x) [16], [17], [18], [19]. Algebraically, this operation can be represented as

    f(z) = [ f(z1) · · · f(zd) ]^⊤

where f is the same univariate function applied to all elements of z, d is the output size of the layer, and we have omitted superscripts for clarity. This special structure results in a diagonal Jacobian matrix, i.e.,

    J_z f(z) = diag(f′(z1), · · · , f′(zd))

where f′ is the derivative of the univariate function. For the special case of the ReLU function, J_z f(z) is a diagonal matrix of zeros and ones, associated with the negative and positive elements of z, respectively. Multiplying such a matrix from the left with any matrix W results in removing the rows of W associated with the zero elements of J_z f(z), which greatly decreases the computation. The zero-th norm of the parameter vector can be minimized in sparse optimizations using these intuitions, as noted in [20].


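The following small NumPy check (our own illustration, not from the paper) shows the diagonal Jacobian of an element-wise ReLU layer and the row-removal effect described above; relu_jacobian is an assumed helper name.

import numpy as np

def relu_jacobian(z):
    # J_z f(z) = diag(f'(z1), ..., f'(zd)) with f(x) = max(0, x)
    return np.diag((z > 0).astype(float))

rng = np.random.default_rng(2)
z = rng.standard_normal(5)            # pre-activation of a hidden layer
W = rng.standard_normal((5, 3))       # weights of the next layer

J = relu_jacobian(z)                  # diagonal matrix of zeros and ones
print(np.diag(J))                     # e.g. a 0/1 pattern depending on the signs of z

# Left-multiplying W by J zeroes the rows of W that correspond to negative
# entries of z, so those rows drop out of the Jacobian product at this sample.
print(J @ W)
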
Note that the derivations so far have only considered fully connected networks. However, for computer vision tasks, CNN models are utilized, which employ convolution operations instead of matrix multiplication in some layers. In the next subsection, we will demonstrate how a convolutional layer can be transformed into a fully connected network.

E. Convolution as matrix multiplication

A 2-D convolution operation is a mathematical operation that is used to extract features or patterns from a 2-dimensional input (matrix), such as an image. It works by applying a filter or kernel, which is also a matrix, to the input image [14]. The filter is moved across the image, performing element-wise multiplications with the overlapping regions of the image and filter, and then summing the results. This process is repeated for every position of the filter on the image, resulting in a new matrix output, known as a feature map. Fig. 7 shows the process of a convolution operation where the 3×3 and 2×2 matrices represent the input and the filter, respectively. As Fig. 7 illustrates, the filter slides over the input image, one pixel at a time, and performs element-wise multiplications with the overlapping region of the image. The result of these multiplications is then summed, and the sum is stored in the corresponding location of the output feature map.

Fig. 7. An illustration of convolution.

The size of the filter and the stride (the number of pixels the filter is moved each time) determine the size of the output feature map. Additionally, the filter can be applied multiple times with different filter parameters to extract different features from the same input image.

Lemma 2: The convolution operation between two matrices, X and K, can be represented as a matrix multiplication. Specifically, it can be represented as the product of a Toeplitz matrix (or diagonal-constant matrix) of K and the vector obtained from stacking the columns of the transpose of X, in order, with the first one on top. Mathematically, this can be represented as:

    X ∗ K = Toep(K) Vec(X^⊤)

where X ∈ R^{mX×nX}, K ∈ R^{mK×nK}, and mX, nX, mK, nK ∈ N.

Remark 1: The above representation allows the convolution operation to be computed efficiently using matrix multiplication, which can be parallelized and accelerated on a GPU.

To clarify the above lemma, consider X ∈ R3×3 and K ∈ R2×2 as the input and filter, respectively. Then, one can verify the lemma by writing the following:

    X ∗ K = [ k1 k2 0  k3 k4 0  0  0  0  ] [ x1 ]
            [ 0  k1 k2 0  k3 k4 0  0  0  ] [ x2 ]
            [ 0  0  0  k1 k2 0  k3 k4 0  ] [ ⋮  ]
            [ 0  0  0  0  k1 k2 0  k3 k4 ] [ x9 ]

where x := Vec(X^⊤).

Lemma 3: For an input X to a convolutional layer that has r filters, each calculation X ∗ Ki + Bi is equivalent to (Wi)^⊤ x + bi, where Ki is the matrix of the i-th filter, Bi is the matrix of the associated bias for each filter, Wi = (Toep(Ki))^⊤, bi = Vec(Bi) for i = 1, . . . , r, and x := Vec(X^⊤).

According to the above lemma, we can convert a convolutional neural network to a typical fully connected one and find its gradient.


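To illustrate Lemma 2 numerically (our own sketch, not part of the paper), the snippet below builds the 4 × 9 matrix Toep(K) for a 3 × 3 input and a 2 × 2 filter and compares Toep(K) Vec(X^⊤) with a directly computed sliding-window sum; toeplitz_of is an assumed helper name and, as in the text, no kernel flip is applied.

import numpy as np

def toeplitz_of(K, in_shape):
    """Toep(K) such that the row-wise flattened feature map equals Toep(K) @ Vec(X^T)."""
    mK, nK = K.shape
    mX, nX = in_shape
    mO, nO = mX - mK + 1, nX - nK + 1          # output (feature-map) size
    T = np.zeros((mO * nO, mX * nX))
    for i in range(mO):
        for j in range(nO):
            window = np.zeros((mX, nX))
            window[i:i + mK, j:j + nK] = K     # place the filter over one window
            T[i * nO + j] = window.ravel()     # row-major order matches Vec(X^T)
    return T

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 3))
K = rng.standard_normal((2, 2))

# direct sliding-window computation, as described in the text (no kernel flip)
direct = np.array([[np.sum(X[i:i + 2, j:j + 2] * K) for j in range(2)] for i in range(2)])

T = toeplitz_of(K, X.shape)                    # shape (4, 9), as in the example above
vec_xt = X.ravel()                             # Vec(X^T): the rows of X stacked
print(np.allclose(T @ vec_xt, direct.ravel())) # True
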
IV. CONCLUSION

In this paper, we demonstrated the utilization of the Jacobian operator to simplify the gradient calculation process in DNNs. We presented a matrix multiplication-based algorithm that expresses the BP algorithm using Jacobian matrices and applied it to determine gradients for one-, two-, and three-layer networks. Our calculations offered insights into the gradients of loss functions in DNNs; for instance, the gradient of a single-layer network can serve as a model for the final layer of any DNN. We also expanded our findings to cover more intricate architectures such as LeNet-100-300-10 and demonstrated that convolutional layers can be transformed into linear layers so that their gradients can be calculated. These results can aid research on compressing DNNs that utilizes the full gradient, as noted in [21]. Furthermore, they can benefit sparse optimization in both deterministic and stochastic settings, where the Iterative Hard Thresholding (IHT) algorithm uses the full gradient for a sparse solution in deterministic settings [20] and the mini-batch Stochastic IHT algorithm is employed in the stochastic context [22]. We provided concise mathematical justifications to make the results clear and useful for people from different fields, even those without a deep understanding of the involved mathematics. This was particularly important when communicating complex technical concepts to non-experts, as it allowed for a clear and accurate understanding of the results. Additionally, using mathematical notation allowed for precise and unambiguous statements of results, facilitating replication and further research in the field. As next steps, we intend to study the calculation of gradients for loss functions in various types of neural networks such as residual, recurrent, Long Short-Term Memory (LSTM), and Transformer networks. We will also explore the Jacobian of batch normalization to further our understanding of the method.

REFERENCES

[1] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," arXiv preprint arXiv:1912.01703, 2019.
[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: a system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265-283.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[5] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700-4708.
[6] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105-6114.
[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020.
[11] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," J. Mach. Learn. Res., vol. 21, no. 140, pp. 1-67, 2020.
[12] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400-407, 1951.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[15] Y. LeCun, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[16] K. Fukushima, "Visual feature extraction by a multilayered network of analog threshold elements," IEEE Transactions on Systems Science and Cybernetics, vol. 5, no. 4, pp. 322-333, 1969.
[17] ——, "Cognitron: A self-organizing multilayered neural network," Biological Cybernetics, vol. 20, no. 3-4, pp. 121-136, 1975.
[18] D. E. Rumelhart, G. E. Hinton, J. L. McClelland et al., "A general framework for parallel distributed processing," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, no. 45-76, p. 26, 1986.
[19] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[20] S. Damadi and J. Shen, "Gradient properties of hard thresholding operator," arXiv preprint arXiv:2209.08247, 2022.
[21] S. Damadi, E. Nouri, and H. Pirsiavash, "Amenable sparse network investigator," arXiv preprint arXiv:2202.09284, 2022.
[22] S. Damadi and J. Shen, "Convergence of the mini-batch SIHT algorithm," arXiv preprint arXiv:2209.14536, 2022.

APPENDIX

Definition 1 (Jacobian matrix of a vector-valued function): Let f : Rn → Rm be a differentiable vector-valued function, where f(x) = [f1(x) · · · fm(x)]^⊤ and x ∈ Rn. This function takes a point x ∈ Rn as an input and produces f(x) ∈ Rm as the output. The Jacobian matrix of f with respect to x is defined to be the m × n matrix denoted by J_x f(x), as the following:

    J_x f(x) = [ ∂f1(x)/∂x1  · · ·  ∂f1(x)/∂xn ]   [ (∇_x f1(x))^⊤ ]
               [     ⋮         ⋱        ⋮      ] = [       ⋮       ]
               [ ∂fm(x)/∂x1  · · ·  ∂fm(x)/∂xn ]   [ (∇_x fm(x))^⊤ ]

Definition 2 (Gradient of a scalar-valued function): Let f : Rn → R be a differentiable scalar-valued function. The gradient ∇f : Rn → Rn at x ∈ Rn is defined as the following:

    ∇f(x) = [ ∂f(x)/∂x1  · · ·  ∂f(x)/∂xn ]^⊤.

Definition 3 (Sigmoid function): A sigmoid function σ : R → [0, 1] is defined as σ(x) = e^x / (1 + e^x).

Definition 4 (Softmax function): A softmax function σ : Rc → Rc is defined for c ≥ 3 as the following:

    σ(x) = [ e^{x1} / Σ_{j=1}^{c} e^{xj}   · · ·   e^{xc} / Σ_{j=1}^{c} e^{xj} ]^⊤ ∈ Rc.

Definition 5 (Binary Cross Entropy Loss): Let ŷ ∈ [0, 1] be a predicted probability for a true label whose value is either zero or one, i.e., y ∈ {0, 1}. The Binary Cross Entropy (BCE) loss is defined as follows:

    BCE(y, ŷ) = −(y log(ŷ) + (1 − y) log(1 − ŷ)).

Remark 2 (BCE loss): The BCE loss is minimized when the predicted label is close to the true label. The BCE loss has a smooth and continuous gradient, which makes it suitable for use with gradient-based optimization algorithms. It is also a convex function with respect to the variable ŷ.

Definition 6 (Cross Entropy Loss): Let ŷ ∈ (0, 1)^c ⊂ Rc (c ≥ 3) be a predicted probability vector for a true one-hot vector label y ∈ Rc, i.e., yj = 1 and yi = 0 for i ≠ j, i, j = 1, . . . , c. The Cross Entropy (CE) loss is defined as follows:

    CE(y, ŷ) = − Σ_{i=1}^{c} yi log(ŷi).

Remark 3: The cross entropy function is commonly used in machine learning and information theory to measure the difference between two probability distributions. It is often used as a loss function to evaluate the performance of classification models. In the context of machine learning, cross entropy is typically used to measure the difference between the predicted probability distribution (outputted by the model) and the true probability distribution (which represents the actual labels of the data). The cross entropy loss function is designed to penalize the model when it assigns low probabilities to the true labels, and to reward the model when it assigns high probabilities to the true labels.

Definition 7 (Squared Error Loss): Let ŷ ∈ R be a predicted value for a true label whose value is y ∈ R. The Square Error (SE) loss is defined as follows:

    SE(y, ŷ) = (y − ŷ)².

A. Sigmoid with BCE

Lemma 4 (Gradient of BCE loss): Let x(i) be an input to a one-layer network solving binary classification, let y(i) ∈ {0, 1} be its true label, and let ŷ(i) = σ(w^⊤ x(i) + b) be the predicted probability corresponding to the input, for i = 1, . . . , N. The gradient of the BCE loss is given as follows:

    ∇_θ BCE(y(i), σ(w^⊤ x(i) + b)) = −(y(i) − σ(w^⊤ x(i) + b)) [ x(i) ]
                                                               [  1   ]

for i = 1, . . . , N.

Proof 2: We calculate the following for a fixed i ∈ {1, . . . , N}:

    ∇_θ BCE(y(i), σ(w^⊤ x(i) + b)) = −∇_θ ( y(i) log(σ(w^⊤ x(i) + b)) + (1 − y(i)) log(1 − σ(w^⊤ x(i) + b)) ).

To calculate the above, observe the following:

    ∇_θ σ(w^⊤ x(i) + b) = σ(w^⊤ x(i) + b) (1 − σ(w^⊤ x(i) + b)) [ x(i) ]
                                                                [  1   ]

Hence, we can write the following:

    ∇_θ ℓ(y, ŷ(θ)) = ∇_θ BCE(y(i), σ(w^⊤ x(i) + b))

                   = − y(i) [ σ(w^⊤ x(i) + b)(1 − σ(w^⊤ x(i) + b)) / σ(w^⊤ x(i) + b) ] [ x(i) ]
                                                                                       [  1   ]
                     + (1 − y(i)) [ σ(w^⊤ x(i) + b)(1 − σ(w^⊤ x(i) + b)) / (1 − σ(w^⊤ x(i) + b)) ] [ x(i) ]
                                                                                                   [  1   ]

                   = − y(i) (1 − σ(w^⊤ x(i) + b)) [ x(i) ] + (1 − y(i)) σ(w^⊤ x(i) + b) [ x(i) ]
                                                  [  1   ]                              [  1   ]

                   = − (y(i) − σ(w^⊤ x(i) + b)) [ x(i) ]
                                                [  1   ]

B. Softmax with CE

Lemma 5 (Gradient of CE loss w.r.t. z): Let ŷ ∈ (0, 1)^c ⊂ Rc (c ≥ 3) be a predicted probability vector such that ŷ = σ(z), where σ is a softmax function as defined in Def. 4 and z ∈ Rc. And let y ∈ Rc be a true one-hot vector label, i.e., yj = 1 and yi = 0 for i ≠ j. The gradient of the CE loss with respect to z is the following:

    ∇_z ℓ(y, ŷ(z)) = ∇_z CE(y, ŷ(z)) = −(y − ŷ).

Proof 3: To calculate ∇_z CE(y, ŷ(z)) one can rewrite it as ∇_z CE(y, σ(z)). Then the loss can be expanded as follows:

    CE(y, σ(z)) = − Σ_{i=1}^{c} yi log( e^{zi} / Σ_{j=1}^{c} e^{zj} )
                = − Σ_{i=1}^{c} yi zi + Σ_{i=1}^{c} yi log( Σ_{j=1}^{c} e^{zj} ).

Each component can be calculated as the following:

    ∂/∂zj [ − Σ_{i=1}^{c} yi zi + Σ_{i=1}^{c} yi log( Σ_{j=1}^{c} e^{zj} ) ]
        = − yj + Σ_{i=1}^{c} yi · e^{zj} / Σ_{j=1}^{c} e^{zj}
        = − yj + Σ_{i=1}^{c} yi (σ(z))j
        = − yj + (σ(z))j Σ_{i=1}^{c} yi
        = − yj + (σ(z))j

where in the last equality we have used the fact that Σ_{i=1}^{c} yi = 1. Since the above calculation is for the j-th component and (ŷ)j = (σ(z))j, by calculating the other components we get the desired result.

C. Square Error

Lemma 6: Let x(i) be an input to a one-layer network, let y(i) ∈ R be the corresponding true value, and let ŷ(i) = w^⊤ x(i) + b ∈ R be the corresponding prediction for i = 1, . . . , N. The gradient of the SE loss is as follows:

    ∇_θ SE(y(i), w^⊤ x(i) + b) = ∇_θ (y(i) − (w^⊤ x(i) + b))² = −2 (y(i) − ŷ(i)) [ x(i) ]
                                                                                 [  1   ]

for i = 1, . . . , N.

Proof 4: Let θ := [w^⊤ b]^⊤. Then,

    ∇_θ (y(i) − ŷ(i))² = 2 (y(i) − ŷ(i)) ∇_θ (y(i) − ŷ(i)) = −2 (y(i) − ŷ(i)) ∇_θ ŷ(i) = −2 (y(i) − ŷ(i)) [ x(i) ]
                                                                                                           [  1   ]

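As a final numerical sanity check (our own addition, not part of the paper), the snippet below verifies the three rows of Tab. I, i.e., that the gradients with respect to z of BCE with sigmoid, CE with softmax, and SE with an identity last layer are −(y − ŷ), −(y − ŷ), and −2(y − ŷ), respectively, using central finite differences.

import numpy as np

def num_grad(fun, z, eps=1e-6):
    return np.array([(fun(z + eps * e) - fun(z - eps * e)) / (2 * eps) for e in np.eye(z.size)])

rng = np.random.default_rng(4)

# BCE with sigmoid: grad_z = -(y - sigmoid(z))
z, y = rng.standard_normal(1), 1.0
sig = lambda t: 1 / (1 + np.exp(-t))
bce = lambda z: -(y * np.log(sig(z[0])) + (1 - y) * np.log(1 - sig(z[0])))
print(np.allclose(num_grad(bce, z), -(y - sig(z[0])), atol=1e-6))

# CE with softmax: grad_z = -(y - softmax(z))
z = rng.standard_normal(4)
y = np.array([0.0, 0.0, 1.0, 0.0])
softmax = lambda t: np.exp(t - t.max()) / np.exp(t - t.max()).sum()
ce = lambda z: -np.sum(y * np.log(softmax(z)))
print(np.allclose(num_grad(ce, z), -(y - softmax(z)), atol=1e-6))

# SE with identity last layer: grad_z = -2(y - z)
z = rng.standard_normal(3)
y = rng.standard_normal(3)
se = lambda z: np.sum((y - z) ** 2)
print(np.allclose(num_grad(se, z), -2 * (y - z), atol=1e-6))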
