
The Backpropagation algorithm for a math student

Saeed Damadi¹, Golnaz Moharrer², Mostafa Cham², Jinglai Shen¹
¹Department of Mathematics and Statistics, ²Department of Information Systems
University of Maryland, Baltimore County (UMBC)
Baltimore, USA
{sdamadi1, golnazm1, mcham2, shenj}@umbc.edu

Abstract—A Deep Neural Network (DNN) is a composite function of vector-valued functions, and in order to train a DNN, it is necessary to calculate the gradient of the loss function with respect to all parameters. This calculation can be a non-trivial task because the loss function of a DNN is a composition of several nonlinear functions, each with numerous parameters. The Backpropagation (BP) algorithm leverages the composite structure of the DNN to efficiently compute the gradient. As a result, the number of layers in the network does not significantly impact the complexity of the calculation. The objective of this paper is to express the gradient of the loss function in terms of a matrix multiplication using the Jacobian operator. This can be achieved by considering the total derivative of each layer with respect to its parameters and expressing it as a Jacobian matrix. The gradient can then be represented as the matrix product of these Jacobian matrices. This approach is valid because the chain rule can be applied to a composition of vector-valued functions, and the use of Jacobian matrices allows for the incorporation of multiple inputs and outputs. By providing concise mathematical justifications, the results can be made understandable and useful to a broad audience from various disciplines.

Fig. 1. Relationship between training of a NN and the BP algorithm.

I. INTRODUCTION

Understanding the process of training (Deep) Neural Networks ((D)NNs), as illustrated in Fig. 1, is not straightforward because it involves many detailed parts. Additionally, modern and sophisticated DNNs are often provided as pre-trained models in Python packages such as PyTorch [1] and TensorFlow [2], which can make it difficult for users to fully understand the training process. These packages abstract away many of the implementation details, making it more accessible for users to use these models for their own tasks, but also making it less transparent for the user to understand the inner workings of the model.

These pre-trained, off-the-shelf models can solve a variety of tasks such as computer vision or language processing. Convolutional neural networks (CNNs) are currently the most widely used architecture for image classification and other computer vision tasks. Some examples of successful CNN architectures for image classification include ResNet [3], Inception [4], DenseNet [5], and EfficientNet [6]. CNN models are also capable of solving image segmentation and object detection, which can be done using U-Net [7] and YOLO [8]. For natural language processing tasks, transformer models like BERT [9], GPT-3 [10], and T5 [11] have achieved state-of-the-art performance on many benchmarks.

Fig. 1 shows the training process that utilizes the Stochastic Gradient Descent (SGD) algorithm [12] to minimize the loss function of a DNN. As its name suggests, the SGD algorithm requires calculating the gradient¹ of the loss function of a DNN. All different variants of the SGD algorithm require calculating at least a single gradient associated with a single sample, i.e., x in Fig. 1. This calculation is a non-trivial task because the loss function of a DNN is a composition of several nonlinear vector-valued functions, each of which has numerous parameters. The Backpropagation (BP) algorithm, introduced by [13], is an efficient way to calculate the gradient of the loss function of a DNN. This algorithm leverages the composite structure of a DNN to efficiently calculate the gradient of the loss function with respect to the model's parameters, i.e., θ in Fig. 1.

In this paper we calculate the gradient of the loss function of a DNN associated with a single sample. The gradient will be derived as the matrix multiplication of Jacobian matrices². The derivation will be done by considering the total derivative of each layer with respect to its parameters and expressing it as a Jacobian matrix. The gradient can then be represented as the matrix product of these Jacobian matrices. This approach is well-founded because the chain rule is valid for the Jacobian operator; hence, the Jacobian operator can be applied to a composition of vector-valued functions. We provide concise mathematical justifications so the results can be made understandable and useful to a broad audience from various disciplines, even those without a deep understanding of the mathematics involved. This is particularly important when communicating complex technical concepts to non-experts, as it allows for a clear and accurate understanding of the results. Additionally, using mathematical notation allows for precise and unambiguous statements of results, which can facilitate replication and further research in the field.

¹ Please refer to Def. 2 in the Appendix for the definition of the gradient of a scalar-valued function.
² Please refer to Def. 1 in the Appendix for the definition of the Jacobian matrix of a vector-valued function.

Algorithm 1 The backpropagation algorithm
Require: Given an L-layer DNN or NN with a loss function ℓ, and a data pair (x, y). Let a[0] := x.
1: Calculate ∇_{z[L]} ℓ(y, f(z[L])) from Tab. I.
2: for l = 1, . . . , L do
3:   if l ≠ L then

         J_{W[l],b[l]} z[L] = (W[L])^⊤ J_{z[L−1]} f[L−1](z[L−1]) (W[L−1])^⊤ · · · (W[l+1])^⊤ J_{z[l]} f[l](z[l])
                              × [ (a[l−1])^⊤      0           0        ]
                                [      0         ...          0      I ]
                                [      0          0      (a[l−1])^⊤    ]

4:   else if l = L then

         J_{W[L],b[L]} z[L] = [ (a[L−1])^⊤      0           0        ]
                              [      0         ...          0      I ]
                              [      0          0      (a[L−1])^⊤    ]

5: Construct

         J_θ z[L](θ) = [ J_{W[1],b[1]} z[L](θ)  · · ·  J_{W[L],b[L]} z[L](θ) ].

6: Calculate the gradient

         ∇_θ ℓ(y, ŷ(θ)) = (J_θ z[L](θ))^⊤ ∇_{z[L]} ℓ(y, f(z[L])).

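To make the iteration in Alg. 1 concrete, the following NumPy sketch (our own illustration, not part of the original paper) carries out the same computation for a fully connected network with ReLU hidden layers and a softmax plus cross-entropy last layer, so that ∇_{z[L]} ℓ = −(y − ŷ) as in Tab. I; the function names forward and backprop_alg1 are assumed helpers.

import numpy as np

def forward(x, Ws, bs):
    """Return pre-activations z[l] and activations a[l], with a[0] := x."""
    a, zs = [x], []
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W.T @ a[-1] + b                      # z[l] = (W[l])^T a[l-1] + b[l]
        zs.append(z)
        if l < len(Ws) - 1:                      # hidden layers use ReLU
            a.append(np.maximum(z, 0.0))
        else:                                    # last layer uses softmax
            e = np.exp(z - z.max())
            a.append(e / e.sum())
    return zs, a

def backprop_alg1(x, y, Ws, bs):
    """Gradient of CE(y, softmax(z[L])) w.r.t. theta = [Vec(W[1]); b[1]; ...; Vec(W[L]); b[L]]."""
    L = len(Ws)
    zs, a = forward(x, Ws, bs)
    grad_z = -(y - a[-1])                        # step 1: from Tab. I, for a one-hot y
    blocks = []
    for l in range(L):                           # steps 2-4: layer l+1 in the paper's 1-based indexing
        n_out = Ws[l].shape[1]
        # local Jacobian J_{W,b} z = [blockdiag(a[l-1]^T, ..., a[l-1]^T) | I]
        J = np.hstack([np.kron(np.eye(n_out), a[l][None, :]), np.eye(n_out)])
        for k in range(l + 1, L):                # chain through the layers above
            relu_jac = np.diag((zs[k - 1] > 0).astype(float))   # J_z f(z) for ReLU
            J = Ws[k].T @ relu_jac @ J           # multiply by (W[k+1])^T J_{z[k]} f[k](z[k])
        blocks.append(J.T @ grad_z)              # step 6, one block of (J_theta z[L])^T grad_z
    return np.concatenate(blocks)                # step 5: concatenation over layers

# Example usage: a 4-3-2 network on one sample; theta has 4*3+3 + 3*2+2 = 23 entries.
rng = np.random.default_rng(0)
sizes = [4, 3, 2]
Ws = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(n) for n in sizes[1:]]
g = backprop_alg1(rng.standard_normal(4), np.array([0.0, 1.0]), Ws, bs)
print(g.shape)                                   # (23,)
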
Our results are summarized in Alg. 1 for a network with L layers. As Alg. 1 shows, the matrix multiplication is done iteratively. The iterative nature comes from the fact that the loss function of a DNN is defined as a composition of L + 1 functions, where the last one, ℓ, is the final function which measures the loss (error) between the prediction and the actual value, i.e., ℓ(y, ŷ) in Fig. 1, where ŷ is the prediction and y the actual value. The algorithm presented in Alg. 1 is explained and justified by calculating the gradient of the loss function for networks with one, two, and three layers. These calculations provide insights into the gradients of loss functions for generic neural networks and demonstrate how the gradient of a single-layer network can serve as a model for the last layer of any DNN. Additionally, the calculation for a two-layer network is used to extend the calculation beyond two-layer networks, as seen in the calculation of the gradient of the loss function for LeNet-100-300 [14], which is a three-layer network. Finally, we show how convolutional layers can be converted to linear layers in order to calculate their Jacobian matrices. These results can be used to calculate the gradient of the loss function of a CNN.

II. NOTATION

The letters x, x, and W denote a scalar, a vector, and a matrix, respectively. The letter I represents the identity matrix. The i-th element of a vector x is denoted by xi. Likewise, wij denotes the ij-th element of a matrix W located at the i-th row and j-th column (sometimes written as wi,j for clarity). Also, Wi• and W•j denote the i-th row and the j-th column of the matrix, respectively. The vector form of a matrix W is denoted by Vec(W), where each column of W is stacked on top of each other, with the first column at the top. The letter f is reserved for a vector-valued nonlinear activation function of a layer in a DNN (NN), where z and a are its input and output, respectively, i.e., a = f(z). The letter x is reserved for the input to a DNN (NN), and y or y are reserved for the scalar or vector label of the input x. The predictions of a DNN (NN) associated with y or y are denoted by ŷ or ŷ, respectively. A superscripted index inside square brackets denotes the layer of a DNN (NN), e.g., z[l] is the z vector corresponding to the l-th layer.

III. RESULT

As we have explained earlier, the goal of this paper is to take the first step towards training a DNN, which involves calculating the gradient of a loss function with respect to all parameters of the DNN, i.e., ∇_θ ℓ(y, ŷ(θ)), where θ is the vector of parameters, ŷ is the predicted value by the network, y is the true value, and ℓ(y, ŷ(θ)) is the loss incurred to predict the output. To achieve this goal, two important observations can make the task easier. First, it involves separating the last-layer activation function from the network. Second, it involves using the relationship between the Jacobian operator and the gradient.

To fulfill the first step, we observe the following equality:

    ℓ(y, ŷ(θ)) = ℓ(y, f[L](z[L](θ)))

which is a consequence of the fact that ŷ = f[L](z[L](θ)), where f[L] is the last activation layer and z[L](θ) is its corresponding input. This equality can be illustrated more clearly as shown in Fig. 2.

Fig. 2. Separating the last-layer activation from a DNN.

The second observation utilizes the relationship between the Jacobian operator and the gradient of a scalar-valued function, stated in the following lemma.

Lemma 1 (Jacobian and gradient of a scalar-valued function): For f : Rn → R a scalar-valued differentiable function, ∇_x f(x) = (J_x f(x))^⊤, where ∇_x f(x) is the gradient and J_x f(x) is the Jacobian matrix of f at the point x ∈ Rn, respectively.

Proof 1: The equality follows from the definition of a Jacobian matrix as given in Def. 1 and the definition of the gradient of a scalar-valued function as given in Def. 2 in the Appendix.

By using the second observation, ∇_x f(x) = (J_x f(x))^⊤, and making use of the first one, ℓ(y, ŷ(θ)) = ℓ(y, f[L](z[L](θ))), one can write the following:

    ∇_θ ℓ(y, ŷ(θ)) = (J_θ ℓ(y, ŷ(θ)))^⊤
                   = (J_θ ℓ(y, f[L](z[L](θ))))^⊤
                   = (J_{z[L]} ℓ(y, f[L](z[L])) J_θ z[L](θ))^⊤
                   = (J_θ z[L](θ))^⊤ (J_{z[L]} ℓ(y, f[L](z[L])))^⊤
                   = (J_θ z[L](θ))^⊤ ∇_{z[L]} ℓ(y, f[L](z[L]))                    (1)

The advantage of Equation (1) is that it separates the original gradient calculation into two separate calculations, i.e., J_θ z[L](θ) and ∇_{z[L]} ℓ(y, f[L](z[L])).

The calculation of the second term is straightforward because the choice of a loss function and the last activation function in a DNN are not arbitrary. This is illustrated in Fig. 3, which shows three common combinations, each associated with a different problem.

Fig. 3. Last-layer activation function combined with its associated loss.

In a binary classification problem, a sigmoid function is used as the last activation function together with the Binary Cross Entropy (BCE) loss, as defined in Definition 3 and Definition 5 in the Appendix, respectively. Similarly, in a classification problem with more than two classes, a softmax function is used as the last activation function together with the Cross Entropy (CE) loss, as defined in Definition 4 and Definition 6 in the Appendix, respectively. In a regression problem, where the goal is to predict a continuous value, the last-layer activation function is an identity function, and the Square Error (SE) is used as the loss function, as defined in Definition 7. This means that the predicted output ŷ is equal to the final layer's output z[L]. Table I shows the combination of the last activation function with its corresponding loss and the expression for ∇_{z[L]} ℓ(y, f[L](z[L])). For the derivations, please refer to the Appendix.

More work is required to calculate the first term, i.e., J_θ z[L](θ). In the following subsections, we show how to derive this calculation concisely for any number of layers.

A. Gradient of a one-layer network

We will now focus on computing J_θ z[L](θ). This calculation can be facilitated by starting with a single-layer neural network that is capable of solving a classification problem. The single-layer network plays a crucial role in gradient computation as it can be considered as the last layer of any deep neural network (DNN). Due to the presence of only one layer, the superscripts in z[L] and f[L] can be omitted, giving us J_θ z(θ) := J_θ z[L](θ). The network depicted in Fig. 4 is designed to perform three-class classification based on inputs with four features. Consequently, the weight matrix W is 4 × 3 and the bias vector b is 3 × 1, i.e., W ∈ R4×3 and b ∈ R3×1. As shown in the concise representation of the network in the bottom part of Fig. 4, z ∈ R3, i.e., z = W^⊤x + b. This vector z is a vector-valued function involving 15 parameters, where these parameters are all the elements of W and b, i.e., 15 = 4 × 3 + 3 × 1. According to the definition of the Jacobian matrix as given in Def. 1 in the Appendix, J_{W,b} z(W, b) is a 3 × 15 matrix, i.e., J_{W,b} z(W, b) ∈ R3×15.

For notational simplicity, all the parameters are denoted by θ, which is a vector in R15, and is constructed by the process of vectorization. The vectorization process stacks each column of W on top of each other, with the first column on the top, to create Vec(W).

TABLE I
LAST-LAYER ACTIVATION FUNCTION COMBINED WITH ITS ASSOCIATED LOSS

Loss | ℓ(y, ŷ)                          | ∇_{z[L]} ℓ(y, ŷ) = ∇_{z[L]} ℓ(y, f[L](z[L])) | Proof
BCE  | −y log(ŷ) − (1 − y) log(1 − ŷ)   | −(y − ŷ)    (y, ŷ ∈ R)                       | Appendix A
CE   | −Σ_{i=1}^{c} yi log(ŷi)          | −(y − ŷ)    (y, ŷ ∈ Rc)                      | Appendix B
SE   | ∥y − ŷ∥²                         | −2(y − ŷ)   (y, ŷ ∈ Rm)                      | Appendix C

Then, the vector of network parameters can be written as θ^⊤ := [Vec(W)^⊤ (b)^⊤]^⊤. Therefore, to calculate J_θ z(θ) one can write the following:

    J_θ z(θ) = J_{W,b}(W^⊤ x + b)

             = J_{W,b} [ (W•1)^⊤ x + b1 ]
                       [ (W•2)^⊤ x + b2 ]
                       [ (W•3)^⊤ x + b3 ]

             = [ x^⊤   0    0   1 0 0 ]
               [ 0    x^⊤   0   0 1 0 ]
               [ 0     0   x^⊤  0 0 1 ]

             = [ x^⊤   0    0         ]
               [ 0    x^⊤   0    I3×3 ]  ∈ R3×15
               [ 0     0   x^⊤        ]

where W•i is a column of W for i = 1, 2, 3, and I3×3 appears because the derivatives of the elements of z with respect to the components of b are either zero or one. Therefore, the gradient of the loss function is calculated as follows:

    ∇_θ ℓ(y, ŷ(θ)) = (J_θ z(θ))^⊤ ∇_z ℓ(y, f(z))

                     [ x   0   0  ]
                   = [ 0   x   0  ]
                   − [ 0   0   x  ] (y − ŷ)
                     [    I3×3    ]

where the value of ∇_z ℓ(y, f(z)) is obtained from Tab. I. The above gradient is similar to the gradient of one-layer networks whose weights are vectors, not matrices, as shown in Tab. II. As can be seen from Tab. II, famous problems such as simple/multiple linear regression, simple binary classification, and logistic regression can be written as a one-layer network whose weights are vectors, not matrices.

Although a one-layer network provides a valuable understanding of the relationship between the Jacobian and the gradient of the loss function with respect to z, it is not practical in terms of performance, i.e., accuracy in prediction. To improve performance, adding more layers is recommended. As a result, the following subsection will demonstrate the calculation of the gradient of a two-layer network.

Fig. 4. 3-class classifier with 4 features and a single layer.

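As a numerical illustration of the one-layer case (a sketch of ours, not from the paper), the following snippet builds J_{W,b} z(θ) for the 4-feature, 3-class network of Fig. 4 with a softmax/CE last layer and checks ∇_θ ℓ = (J_θ z(θ))^⊤ (−(y − ŷ)) against central finite differences of the loss.

import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(3)
x = rng.standard_normal(4)
y = np.array([0.0, 1.0, 0.0])                     # one-hot label, 3 classes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(theta):                                  # theta = [Vec(W); b], column-stacked Vec
    Wt, bt = theta[:12].reshape(3, 4).T, theta[12:]
    return -np.log(softmax(Wt.T @ x + bt) @ y)    # cross-entropy for a one-hot y

theta = np.concatenate([W.T.ravel(), b])          # Vec(W) stacks the columns of W
z = W.T @ x + b
J = np.hstack([np.kron(np.eye(3), x[None, :]), np.eye(3)])   # J_{W,b} z, shape (3, 15)
grad = J.T @ (-(y - softmax(z)))                  # Eq. (1) combined with Tab. I

eps = 1e-6                                        # central finite-difference check
num = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                for e in np.eye(15)])
print(np.max(np.abs(grad - num)))                 # a small number, roughly 1e-9
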
TABLE II
FAMOUS PROBLEMS REPRESENTED AS SIMPLE NEURAL NETWORKS
(The architecture diagrams of the original table are omitted; each problem is a one-layer network.)

Simple Linear Regression:
  Input:       x ∈ R
  Parameters:  θ = [w b]^⊤ ∈ R2
  Prediction:  ŷ = wx + b
  Loss:        ℓ = (y − ŷ)²
  Gradient:    ∇_θ ℓ(y, ŷ) = −2(y − ŷ) [x 1]^⊤

Simple Binary Classifier:
  Input:       x = [x1 x2]^⊤ ∈ R2
  Parameters:  θ = [w^⊤ b]^⊤ ∈ R3
  Prediction:  ŷ = σ(θ^⊤ x)
  Loss:        ℓ = −y log(ŷ) − (1 − y) log(1 − ŷ)
  Gradient:    ∇_θ ℓ(y, ŷ) = −(y − σ(θ^⊤ x)) [x^⊤ 1]^⊤

Multiple Linear Regression:
  Input:       x = [x1 · · · xn]^⊤ ∈ Rn
  Parameters:  θ = [w^⊤ b]^⊤ ∈ Rn+1
  Prediction:  ŷ = w^⊤ x + b
  Loss:        ℓ = (y − ŷ)²
  Gradient:    ∇_θ ℓ(y, ŷ) = −2(y − ŷ) [x^⊤ 1]^⊤

Logistic Regression:
  Input:       x = [x1 · · · xn]^⊤ ∈ Rn
  Parameters:  θ = [w^⊤ b]^⊤ ∈ Rn+1
  Prediction:  ŷ = σ(θ^⊤ x)
  Loss:        ℓ = −y log(ŷ) − (1 − y) log(1 − ŷ)
  Gradient:    ∇_θ ℓ(y, ŷ) = −(y − σ(θ^⊤ x)) [x^⊤ 1]^⊤

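The rows of Tab. II are easy to verify numerically. As an example (our own sketch, not part of the paper), the snippet below checks the logistic-regression gradient from Tab. II against finite differences of the BCE loss.

import numpy as np

rng = np.random.default_rng(1)
n = 5
w, b = rng.standard_normal(n), rng.standard_normal()
x, y = rng.standard_normal(n), 1.0                 # label in {0, 1}

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def bce(theta):                                    # theta = [w; b]
    yhat = sigmoid(x @ theta[:n] + theta[n])
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

theta = np.append(w, b)
grad = -(y - sigmoid(w @ x + b)) * np.append(x, 1.0)   # closed form from Tab. II

eps = 1e-6
num = np.array([(bce(theta + eps * e) - bce(theta - eps * e)) / (2 * eps)
                for e in np.eye(n + 1)])
print(np.max(np.abs(grad - num)))                  # small, roughly 1e-10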

B. Gradient of a two-layer network

Studying a two-layer network can not only improve performance (accuracy) but also aid in developing a method for calculating the gradient of any deep neural network (DNN) with multiple layers. To demonstrate this, a two-layer network will be considered, as shown in Fig. 5, where the block-wise model helps in calculating the gradient of its loss.

Fig. 5. Two-layer network.

Similar to a one-layer network, one can write ∇_{z[2]} ℓ(y, f[2](z[2])) = −(y − ŷ) from Tab. I. To calculate J_θ z[2](θ), observe that it can be separated into the parameters of the second and first layers, as is θ. The separation for θ can be written as the following:

    θ^⊤ := [Vec(W[1])^⊤ (b[1])^⊤ Vec(W[2])^⊤ (b[2])^⊤]^⊤

which is used to write the below separation of Jacobian matrices:

    J_θ z[2](θ) = [ J_{W[1],b[1]} z[2](θ)   J_{W[2],b[2]} z[2](θ) ].

Calculating the second term is straightforward because one can write the following:

    J_{W[2],b[2]} z[2](θ) = J_{W[2],b[2]}((W[2])^⊤ a[1] + b[2])

                          = [ (a[1])^⊤      0         0             ]
                            [     0     (a[1])^⊤      0       I3×3  ]
                            [     0         0     (a[1])^⊤          ]

where a[1] ∈ R5 because W[2] ∈ R5×3. To calculate the first term, first observe that W[2] is not a function of W[1] nor b[1]. Then one can write the following:

    J_{W[1],b[1]} z[2](θ) = J_{W[1],b[1]}((W[2])^⊤ a[1] + b[2])
                          = (W[2])^⊤ J_{W[1],b[1]} a[1]
                          = (W[2])^⊤ J_{W[1],b[1]} f[1](z[1])

The last equality is where the chain rule needs to be used in order to obtain an expression in terms of W[1] and b[1], as shown below:

    J_{W[1],b[1]} z[2](θ) = (W[2])^⊤ J_{z[1]} f[1](z[1]) J_{W[1],b[1]}((W[1])^⊤ x + b[1])

                          = (W[2])^⊤ J_{z[1]} f[1](z[1]) [ x^⊤   0    0        ]
                                                         [  0   ...   0    I   ]
                                                         [  0    0   x^⊤       ]

Finally, the gradient would be the following:

    ∇_θ ℓ(y, ŷ(θ)) = (J_θ z[2](θ))^⊤ ∇_{z[2]} ℓ(y, f(z[2]))

                       [ [ x    0    0  ]                               ]
                       [ [ 0   ...   0  ] (J_{z[1]} f[1](z[1]))^⊤ W[2]  ]
                       [ [ 0    0    x  ]                               ]
                   = − [ [     I5×5     ]                               ] (y − ŷ)
                       [ [ a[1]   0     0   ]                           ]
                       [ [  0    a[1]   0   ]                           ]
                       [ [  0     0    a[1] ]                           ]
                       [ [       I3×3       ]                           ]

The next subsection will demonstrate the calculation of the gradient for a three-layer network, thereby illustrating the extension of the two-layer network gradient to an arbitrary number of layers.

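The two-layer formulas above can also be checked numerically. The following sketch (ours, not from the paper) assembles J_{W[1],b[1]} z[2] and J_{W[2],b[2]} z[2] as derived in this subsection for a 4-5-3 ReLU network with a softmax/CE last layer and compares the resulting gradient with finite differences.

import numpy as np

rng = np.random.default_rng(5)
W1, b1 = rng.standard_normal((4, 5)), rng.standard_normal(5)   # layer 1: R^4 -> R^5
W2, b2 = rng.standard_normal((5, 3)), rng.standard_normal(3)   # layer 2: R^5 -> R^3, so W[2] in R^{5x3}
x = rng.standard_normal(4)
y = np.array([1.0, 0.0, 0.0])                                  # one-hot label

softmax = lambda t: np.exp(t - t.max()) / np.exp(t - t.max()).sum()

z1 = W1.T @ x + b1
a1 = np.maximum(z1, 0.0)                                       # f[1] = ReLU
z2 = W2.T @ a1 + b2
yhat = softmax(z2)

# J_{W[2],b[2]} z[2] = [blockdiag(a[1]^T, a[1]^T, a[1]^T) | I_{3x3}]
J2 = np.hstack([np.kron(np.eye(3), a1[None, :]), np.eye(3)])
# J_{W[1],b[1]} z[2] = (W[2])^T J_{z[1]} f[1](z[1]) [blockdiag(x^T, ..., x^T) | I_{5x5}]
J1 = W2.T @ np.diag((z1 > 0).astype(float)) @ np.hstack([np.kron(np.eye(5), x[None, :]), np.eye(5)])

grad = np.hstack([J1, J2]).T @ (-(y - yhat))                   # Eq. (1) with Tab. I

def loss(theta):                                               # theta = [Vec(W1); b1; Vec(W2); b2]
    W1_, b1_ = theta[:20].reshape(5, 4).T, theta[20:25]
    W2_, b2_ = theta[25:40].reshape(3, 5).T, theta[40:]
    a1_ = np.maximum(W1_.T @ x + b1_, 0.0)
    return -np.log(softmax(W2_.T @ a1_ + b2_) @ y)

theta = np.concatenate([W1.T.ravel(), b1, W2.T.ravel(), b2])
eps = 1e-6
num = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                for e in np.eye(43)])
print(np.max(np.abs(grad - num)))                              # small, roughly 1e-8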

C. Gradient of a three-layer network

In this subsection, we evaluate the gradient of a loss function for the LeNet-100-300-10 network architecture, which consists of 100, 300, and 10 units (neurons) in the first, second, and third layers, respectively [14]. This network is used for image classification tasks [15]. The input to the network is a vectorized representation of 28 × 28 digit images (784 elements) and the output is a vector in R10. The input to the LeNet-100-300-10 is a vector x ∈ R784 representing the vectorized image.

Fig. 6. LeNet-100-300.

Similar to the one- and two-layer networks, ∇_{z[3]} ℓ(y, f[3](z[3])) = −(y − ŷ) ∈ R10, which is obtained from Tab. I. To calculate J_θ z[3](θ), three Jacobians are needed to separate out the parameters of the layers, as the following:

    J_θ z[3](θ) = [ J_{W[1],b[1]} z[3](θ)   J_{W[2],b[2]} z[3](θ)   J_{W[3],b[3]} z[3](θ) ].

By following the same steps as for the two-layer network, the gradient can be calculated as:

    ∇_θ ℓ(y, ŷ(θ)) = (J_θ z[3](θ))^⊤ ∇_{z[3]} ℓ(y, f(z[3]))

where the transposed Jacobian matrices of the layers, (J_{W[3],b[3]} z[3](θ))^⊤, (J_{W[2],b[2]} z[3](θ))^⊤, and (J_{W[1],b[1]} z[3](θ))^⊤, are:

    (J_{W[3],b[3]} z[3](θ))^⊤ = [ a[2]   0     0   ]
                                [  0    ...    0   ]
                                [  0     0    a[2] ]
                                [      I10×10      ]

    (J_{W[2],b[2]} z[3](θ))^⊤ = [ a[1]   0     0   ]
                                [  0    ...    0   ] (J_{z[2]} f[2](z[2]))^⊤ W[3]
                                [  0     0    a[1] ]
                                [     I100×100     ]

    (J_{W[1],b[1]} z[3](θ))^⊤ = [ x    0    0 ]
                                [ 0   ...   0 ] (J_{z[1]} f[1](z[1]))^⊤ W[2] (J_{z[2]} f[2](z[2]))^⊤ W[3]
                                [ 0    0    x ]
                                [   I300×300  ]

and ∇_{z[3]} ℓ(y, f(z[3])) = −(y − ŷ), which is consistent with Alg. 1.

D. Jacobian of activation functions

In this subsection we elaborate on J_{z[l]} f[l](z[l]), where f[l] is the l-th activation layer whose corresponding input is z[l], for l = 1, . . . , L − 1. When l ≠ L, the vector f[l](z[l]) is typically obtained by applying a single univariate function f to each element of z[l]. The most common activation function used in DNNs is the Rectified Linear Unit (ReLU) function, defined as f(x) = max(0, x) [16], [17], [18], [19]. Algebraically, this operation can be represented as

    f(z) = [ f(z1) · · · f(zd) ]^⊤

where f is the same univariate function applied to all elements of z, d is the output size of the layer, and we have omitted superscripts for clarity. This special structure results in a diagonal Jacobian matrix, i.e.,

    J_z f(z) = diag(f′(z1), · · · , f′(zd))

where f′ is the derivative of the univariate function. For the special case of the ReLU function, J_z f(z) is a diagonal matrix of zeros and ones, associated with the negative and positive elements of z, respectively. Multiplying such a matrix from the left with any matrix W results in removing the rows of W associated with the zero elements of J_z f(z), which greatly decreases the computation. The zero-th norm of the parameter vector can be minimized in sparse optimizations using these intuitions, as noted in [20].


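The following small NumPy check (our own illustration, not from the paper) shows the diagonal Jacobian of an element-wise ReLU layer and the row-removal effect described above; relu_jacobian is an assumed helper name.

import numpy as np

def relu_jacobian(z):
    # J_z f(z) = diag(f'(z1), ..., f'(zd)) with f(x) = max(0, x)
    return np.diag((z > 0).astype(float))

rng = np.random.default_rng(2)
z = rng.standard_normal(5)            # pre-activation of a hidden layer
W = rng.standard_normal((5, 3))       # weights of the next layer

J = relu_jacobian(z)                  # diagonal matrix of zeros and ones
print(np.diag(J))                     # e.g. a 0/1 pattern depending on the signs of z

# Left-multiplying W by J zeroes the rows of W that correspond to negative
# entries of z, so those rows drop out of the Jacobian product at this sample.
print(J @ W)
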
Note that the derivations so far have only considered fully connected networks. However, for computer vision tasks, CNN models are utilized, which employ convolution operations instead of matrix multiplication in some layers. In the next subsection, we will demonstrate how a convolutional layer can be transformed into a fully connected network.

E. Convolution as matrix multiplication

A 2-D convolution operation is a mathematical operation that is used to extract features or patterns from a 2-dimensional input (matrix), such as an image. It works by applying a filter or kernel, which is also a matrix, to the input image [14]. The filter is moved across the image, performing element-wise multiplications with the overlapping regions of the image and filter, and then summing the results. This process is repeated for every position of the filter on the image, resulting in a new matrix output, known as a feature map. Fig. 7 shows the process of a convolution operation where the 3×3 and 2×2 matrices represent the input and the filter, respectively. As Fig. 7 illustrates, the filter slides over the input image, one pixel at a time, and performs element-wise multiplications with the overlapping region of the image. The result of these multiplications is then summed, and the sum is stored in the corresponding location of the output feature map.

Fig. 7. An illustration of convolution.

The size of the filter and the stride (the number of pixels the filter is moved each time) determine the size of the output feature map. Additionally, the filter can be applied multiple times with different filter parameters to extract different features from the same input image.

Lemma 2: The convolution operation between two matrices, X and K, can be represented as a matrix multiplication. Specifically, it can be represented as the product of a Toeplitz matrix (or diagonal-constant matrix) of K and the vector obtained from stacking the columns of the transpose of X, in order, with the first one on top. Mathematically, this can be represented as:

    X ∗ K = Toep(K) Vec(X^⊤)

where X ∈ R^{mX×nX}, K ∈ R^{mK×nK}, and mX, nX, mK, nK ∈ N.

Remark 1: The above representation allows the convolution operation to be computed efficiently using matrix multiplication, which can be parallelized and accelerated on a GPU.

To clarify the above lemma, consider X ∈ R3×3 and K ∈ R2×2 as the input and filter, respectively. Then, one can verify the lemma by writing the following:

    X ∗ K = [ k1 k2 0  k3 k4 0  0  0  0  ] [ x1 ]
            [ 0  k1 k2 0  k3 k4 0  0  0  ] [ x2 ]
            [ 0  0  0  k1 k2 0  k3 k4 0  ] [ ⋮  ]
            [ 0  0  0  0  k1 k2 0  k3 k4 ] [ x9 ]

where x := Vec(X^⊤).

Lemma 3: For an input X to a convolutional layer that has r filters, each calculation X ∗ Ki + Bi is equivalent to (Wi)^⊤ x + bi, where Ki is the matrix of the i-th filter, Bi is the matrix of the associated bias for each filter, Wi = (Toep(Ki))^⊤, bi = Vec(Bi) for i = 1, . . . , r, and x := Vec(X^⊤).

According to the above lemma, we can convert a convolutional neural network to a typical fully connected one and find its gradient.


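To illustrate Lemma 2 numerically (our own sketch, not part of the paper), the snippet below builds the 4 × 9 matrix Toep(K) for a 3 × 3 input and a 2 × 2 filter and compares Toep(K) Vec(X^⊤) with a directly computed sliding-window sum; toeplitz_of is an assumed helper name and, as in the text, no kernel flip is applied.

import numpy as np

def toeplitz_of(K, in_shape):
    """Toep(K) such that the row-wise flattened feature map equals Toep(K) @ Vec(X^T)."""
    mK, nK = K.shape
    mX, nX = in_shape
    mO, nO = mX - mK + 1, nX - nK + 1          # output (feature-map) size
    T = np.zeros((mO * nO, mX * nX))
    for i in range(mO):
        for j in range(nO):
            window = np.zeros((mX, nX))
            window[i:i + mK, j:j + nK] = K     # place the filter over one window
            T[i * nO + j] = window.ravel()     # row-major order matches Vec(X^T)
    return T

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 3))
K = rng.standard_normal((2, 2))

# direct sliding-window computation, as described in the text (no kernel flip)
direct = np.array([[np.sum(X[i:i + 2, j:j + 2] * K) for j in range(2)] for i in range(2)])

T = toeplitz_of(K, X.shape)                    # shape (4, 9), as in the example above
vec_xt = X.ravel()                             # Vec(X^T): the rows of X stacked
print(np.allclose(T @ vec_xt, direct.ravel())) # True
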
IV. CONCLUSION

In this paper, we demonstrated the utilization of the Jacobian operator to simplify the gradient calculation process in DNNs. We presented a matrix multiplication-based algorithm that expresses the BP algorithm using Jacobian matrices and applied it to determine gradients for one-, two-, and three-layer networks. Our calculations offered insights into the gradients of loss functions in DNNs; for instance, the gradient of a single-layer network can serve as a model for the final layer of any DNN. We also expanded our findings to cover more intricate architectures such as LeNet-100-300-10 and demonstrated that convolutional layers can be transformed into linear layers so that their gradients can be calculated. These results can aid research on compressing DNNs that utilizes the full gradient, as noted in [21]. Furthermore, they can benefit sparse optimization in both deterministic and stochastic settings, where the Iterative Hard Thresholding (IHT) algorithm uses the full gradient for a sparse solution in deterministic settings [20] and the mini-batch Stochastic IHT algorithm is employed in the stochastic context [22]. We provided concise mathematical justifications to make the results clear and useful for people from different fields, even those without a deep understanding of the involved mathematics. This was particularly important when communicating complex technical concepts to non-experts, as it allowed for a clear and accurate understanding of the results. Additionally, using mathematical notation allowed for precise and unambiguous statements of results, facilitating replication and further research in the field. As next steps, we intend to study the calculation of gradients for loss functions in various types of neural networks such as residual, recurrent, Long Short-Term Memory (LSTM), and Transformer networks. We will also explore the Jacobian of batch normalization to further our understanding of the method.

REFERENCES

[1] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," arXiv preprint arXiv:1912.01703, 2019.
[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: a system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265-283.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[5] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700-4708.
[6] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105-6114.
[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877-1901, 2020.
[11] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," J. Mach. Learn. Res., vol. 21, no. 140, pp. 1-67, 2020.
[12] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400-407, 1951.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[15] Y. LeCun, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[16] K. Fukushima, "Visual feature extraction by a multilayered network of analog threshold elements," IEEE Transactions on Systems Science and Cybernetics, vol. 5, no. 4, pp. 322-333, 1969.
[17] ——, "Cognitron: A self-organizing multilayered neural network," Biological Cybernetics, vol. 20, no. 3-4, pp. 121-136, 1975.
[18] D. E. Rumelhart, G. E. Hinton, J. L. McClelland et al., "A general framework for parallel distributed processing," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, no. 45-76, p. 26, 1986.
[19] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[20] S. Damadi and J. Shen, "Gradient properties of hard thresholding operator," arXiv preprint arXiv:2209.08247, 2022.
[21] S. Damadi, E. Nouri, and H. Pirsiavash, "Amenable sparse network investigator," arXiv preprint arXiv:2202.09284, 2022.
[22] S. Damadi and J. Shen, "Convergence of the mini-batch SIHT algorithm," arXiv preprint arXiv:2209.14536, 2022.

APPENDIX

Definition 1 (Jacobian matrix of a vector-valued function): Let f : Rn → Rm be a differentiable vector-valued function, where f(x) = [f1(x) · · · fm(x)]^⊤ and x ∈ Rn. This function takes a point x ∈ Rn as an input and produces f(x) ∈ Rm as the output. The Jacobian matrix of f with respect to x is defined to be the m × n matrix denoted by J_x f(x), as the following:

    J_x f(x) = [ ∂f1(x)/∂x1  · · ·  ∂f1(x)/∂xn ]   [ (∇_x f1(x))^⊤ ]
               [     ⋮         ⋱        ⋮      ] = [       ⋮       ]
               [ ∂fm(x)/∂x1  · · ·  ∂fm(x)/∂xn ]   [ (∇_x fm(x))^⊤ ]

Definition 2 (Gradient of a scalar-valued function): Let f : Rn → R be a differentiable scalar-valued function. The gradient ∇f : Rn → Rn at x ∈ Rn is defined as the following:

    ∇f(x) = [ ∂f(x)/∂x1  · · ·  ∂f(x)/∂xn ]^⊤.

Definition 3 (Sigmoid function): A sigmoid function σ : R → [0, 1] is defined as σ(x) = e^x / (1 + e^x).

Definition 4 (Softmax function): A softmax function σ : Rc → Rc is defined for c ≥ 3 as the following:

    σ(x) = [ e^{x1} / Σ_{j=1}^{c} e^{xj}   · · ·   e^{xc} / Σ_{j=1}^{c} e^{xj} ]^⊤ ∈ Rc.

Definition 5 (Binary Cross Entropy Loss): Let ŷ ∈ [0, 1] be a predicted probability for a true label whose value is either zero or one, i.e., y ∈ {0, 1}. The Binary Cross Entropy (BCE) loss is defined as follows:

    BCE(y, ŷ) = −(y log(ŷ) + (1 − y) log(1 − ŷ)).

Remark 2 (BCE loss): The BCE loss is minimized when the predicted label is close to the true label. The BCE loss has a smooth and continuous gradient, which makes it suitable for use with gradient-based optimization algorithms. It is also a convex function with respect to the variable ŷ.

Definition 6 (Cross Entropy Loss): Let ŷ ∈ (0, 1)^c ⊂ Rc (c ≥ 3) be a predicted probability vector for a true one-hot vector label y ∈ Rc, i.e., yj = 1 and yi = 0 for i ≠ j, i, j = 1, . . . , c. The Cross Entropy (CE) loss is defined as follows:

    CE(y, ŷ) = − Σ_{i=1}^{c} yi log(ŷi).

Remark 3: The cross entropy function is commonly used in machine learning and information theory to measure the difference between two probability distributions. It is often used as a loss function to evaluate the performance of classification models. In the context of machine learning, cross entropy is typically used to measure the difference between the predicted probability distribution (outputted by the model) and the true probability distribution (which represents the actual labels of the data). The cross entropy loss function is designed to penalize the model when it assigns low probabilities to the true labels, and to reward the model when it assigns high probabilities to the true labels.

Definition 7 (Squared Error Loss): Let ŷ ∈ R be a predicted value for a true label whose value is y ∈ R. The Square Error (SE) loss is defined as follows:

    SE(y, ŷ) = (y − ŷ)².

A. Sigmoid with BCE

Lemma 4 (Gradient of BCE loss): Let x(i) be an input to a one-layer network solving binary classification, let y(i) ∈ {0, 1} be its true label, and let ŷ(i) = σ(w^⊤ x(i) + b) be the predicted probability corresponding to the input, for i = 1, . . . , N. The gradient of the BCE loss is given as follows:

    ∇_θ BCE(y(i), σ(w^⊤ x(i) + b)) = −(y(i) − σ(w^⊤ x(i) + b)) [ x(i) ]
                                                               [  1   ]

for i = 1, . . . , N.

Proof 2: We calculate the following for a fixed i ∈ {1, . . . , N}:

    ∇_θ BCE(y(i), σ(w^⊤ x(i) + b)) = −∇_θ ( y(i) log(σ(w^⊤ x(i) + b)) + (1 − y(i)) log(1 − σ(w^⊤ x(i) + b)) ).

To calculate the above, observe the following:

    ∇_θ σ(w^⊤ x(i) + b) = σ(w^⊤ x(i) + b) (1 − σ(w^⊤ x(i) + b)) [ x(i) ]
                                                                [  1   ]

Hence, we can write the following:

    ∇_θ ℓ(y, ŷ(θ)) = ∇_θ BCE(y(i), σ(w^⊤ x(i) + b))

                   = − y(i) [ σ(w^⊤ x(i) + b)(1 − σ(w^⊤ x(i) + b)) / σ(w^⊤ x(i) + b) ] [ x(i) ]
                                                                                       [  1   ]
                     + (1 − y(i)) [ σ(w^⊤ x(i) + b)(1 − σ(w^⊤ x(i) + b)) / (1 − σ(w^⊤ x(i) + b)) ] [ x(i) ]
                                                                                                   [  1   ]

                   = − y(i) (1 − σ(w^⊤ x(i) + b)) [ x(i) ] + (1 − y(i)) σ(w^⊤ x(i) + b) [ x(i) ]
                                                  [  1   ]                              [  1   ]

                   = − (y(i) − σ(w^⊤ x(i) + b)) [ x(i) ]
                                                [  1   ]

B. Softmax with CE

Lemma 5 (Gradient of CE loss w.r.t. z): Let ŷ ∈ (0, 1)^c ⊂ Rc (c ≥ 3) be a predicted probability vector such that ŷ = σ(z), where σ is a softmax function as defined in Def. 4 and z ∈ Rc. And let y ∈ Rc be a true one-hot vector label, i.e., yj = 1 and yi = 0 for i ≠ j. The gradient of the CE loss with respect to z is the following:

    ∇_z ℓ(y, ŷ(z)) = ∇_z CE(y, ŷ(z)) = −(y − ŷ).

Proof 3: To calculate ∇_z CE(y, ŷ(z)) one can rewrite it as ∇_z CE(y, σ(z)). Then the loss can be expanded as follows:

    CE(y, σ(z)) = − Σ_{i=1}^{c} yi log( e^{zi} / Σ_{j=1}^{c} e^{zj} )
                = − Σ_{i=1}^{c} yi zi + Σ_{i=1}^{c} yi log( Σ_{j=1}^{c} e^{zj} ).

Each component can be calculated as the following:

    ∂/∂zj [ − Σ_{i=1}^{c} yi zi + Σ_{i=1}^{c} yi log( Σ_{j=1}^{c} e^{zj} ) ]
        = − yj + Σ_{i=1}^{c} yi · e^{zj} / Σ_{j=1}^{c} e^{zj}
        = − yj + Σ_{i=1}^{c} yi (σ(z))j
        = − yj + (σ(z))j Σ_{i=1}^{c} yi
        = − yj + (σ(z))j

where in the last equality we have used the fact that Σ_{i=1}^{c} yi = 1. Since the above calculation is for the j-th component and (ŷ)j = (σ(z))j, by calculating the other components we get the desired result.

C. Square Error

Lemma 6: Let x(i) be an input to a one-layer network, let y(i) ∈ R be the corresponding true value, and let ŷ(i) = w^⊤ x(i) + b ∈ R be the corresponding prediction for i = 1, . . . , N. The gradient of the SE loss is as follows:

    ∇_θ SE(y(i), w^⊤ x(i) + b) = ∇_θ (y(i) − (w^⊤ x(i) + b))² = −2 (y(i) − ŷ(i)) [ x(i) ]
                                                                                 [  1   ]

for i = 1, . . . , N.

Proof 4: Let θ := [w^⊤ b]^⊤. Then,

    ∇_θ (y(i) − ŷ(i))² = 2 (y(i) − ŷ(i)) ∇_θ (y(i) − ŷ(i)) = −2 (y(i) − ŷ(i)) ∇_θ ŷ(i) = −2 (y(i) − ŷ(i)) [ x(i) ]
                                                                                                           [  1   ]

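As a final numerical sanity check (our own addition, not part of the paper), the snippet below verifies the three rows of Tab. I, i.e., that the gradients with respect to z of BCE with sigmoid, CE with softmax, and SE with an identity last layer are −(y − ŷ), −(y − ŷ), and −2(y − ŷ), respectively, using central finite differences.

import numpy as np

def num_grad(fun, z, eps=1e-6):
    return np.array([(fun(z + eps * e) - fun(z - eps * e)) / (2 * eps) for e in np.eye(z.size)])

rng = np.random.default_rng(4)

# BCE with sigmoid: grad_z = -(y - sigmoid(z))
z, y = rng.standard_normal(1), 1.0
sig = lambda t: 1 / (1 + np.exp(-t))
bce = lambda z: -(y * np.log(sig(z[0])) + (1 - y) * np.log(1 - sig(z[0])))
print(np.allclose(num_grad(bce, z), -(y - sig(z[0])), atol=1e-6))

# CE with softmax: grad_z = -(y - softmax(z))
z = rng.standard_normal(4)
y = np.array([0.0, 0.0, 1.0, 0.0])
softmax = lambda t: np.exp(t - t.max()) / np.exp(t - t.max()).sum()
ce = lambda z: -np.sum(y * np.log(softmax(z)))
print(np.allclose(num_grad(ce, z), -(y - softmax(z)), atol=1e-6))

# SE with identity last layer: grad_z = -2(y - z)
z = rng.standard_normal(3)
y = rng.standard_normal(3)
se = lambda z: np.sum((y - z) ** 2)
print(np.allclose(num_grad(se, z), -2 * (y - z), atol=1e-6))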
