The Backpropagation Algorithm for a Math Student
Algorithm 1 The backpropagation algorithm

Require: Given an L-layer DNN (NN) with a loss function ℓ and a data pair (x, y). Let a^[0] := x.
1: Calculate ∇_{z^[L]} ℓ(y, f^[L](z^[L])) from Tab. I.
2: for l = 1, …, L do
3:   if l ≠ L then
\[
J_{W^{[l]},b^{[l]}}\, z^{[L]} = \big(W^{[L]}\big)^{\top} J_{z^{[L-1]}}\, f^{[L-1]}\big(z^{[L-1]}\big) \cdots \big(W^{[l+1]}\big)^{\top} J_{z^{[l]}}\, f^{[l]}\big(z^{[l]}\big)
\begin{bmatrix}
\big(a^{[l-1]}\big)^{\top} & 0 & 0 & \\
0 & \ddots & 0 & I \\
0 & 0 & \big(a^{[l-1]}\big)^{\top} &
\end{bmatrix}
\]
4:   else if l = L then
\[
J_{W^{[L]},b^{[L]}}\, z^{[L]} =
\begin{bmatrix}
\big(a^{[L-1]}\big)^{\top} & 0 & 0 & \\
0 & \ddots & 0 & I \\
0 & 0 & \big(a^{[L-1]}\big)^{\top} &
\end{bmatrix}
\]
5: Construct
\[
J_{\theta}\, z^{[L]}(\theta) = \begin{bmatrix} J_{W^{[1]},b^{[1]}}\, z^{[L]}(\theta) & \cdots & J_{W^{[L]},b^{[L]}}\, z^{[L]}(\theta) \end{bmatrix}.
\]
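To make the iteration in Alg. 1 concrete, the following NumPy sketch builds J_{W^[l],b^[l]} z^[L] for every layer of a small two-layer network, stacks the blocks into J_θ z^[L](θ), and forms ∇_θ ℓ = (J_θ z^[L](θ))^⊤ ∇_{z^[L]} ℓ. It is only an illustration under our own assumptions, not the paper's code: the 4-5-3 sizes, the ReLU hidden layer, the softmax-plus-CE output (one of the combinations in Tab. I), and helper names such as layer_param_jac are ours, and the finite-difference check at the end is an added sanity test.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def relu_jac(z):
    # J_z f(z) for an element-wise ReLU is diagonal (see Sec. D).
    return np.diag((z > 0).astype(float))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A 4 -> 5 -> 3 network with z[l] = (W[l])^T a[l-1] + b[l] and a[0] := x.
sizes = [4, 5, 3]
Ws = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=(n,)) for n in sizes[1:]]
x = rng.normal(size=(sizes[0],))
y = np.array([0.0, 1.0, 0.0])                 # one-hot label

# Forward pass, keeping every layer input a[l-1] and pre-activation z[l].
acts, zs = [x], []
for l, (W, b) in enumerate(zip(Ws, bs), start=1):
    z = W.T @ acts[-1] + b
    zs.append(z)
    acts.append(softmax(z) if l == len(Ws) else relu(z))
y_hat = acts[-1]

def layer_param_jac(a_prev, out_dim):
    # J_{W,b}(W^T a_prev + b) = [ block-diag(a_prev^T, ..., a_prev^T)  I ].
    return np.hstack([np.kron(np.eye(out_dim), a_prev[None, :]), np.eye(out_dim)])

# Alg. 1: one Jacobian block per layer, chained up to z[L], then stacked.
blocks = []
for l in range(1, len(Ws) + 1):
    J = layer_param_jac(acts[l - 1], sizes[l])        # J_{W[l],b[l]} z[l]
    for k in range(l, len(Ws)):                       # multiply (W[k+1])^T J_{z[k]} f[k](z[k])
        J = Ws[k].T @ relu_jac(zs[k - 1]) @ J
    blocks.append(J)
J_theta = np.hstack(blocks)                           # J_theta z[L](theta)

grad_zL = y_hat - y                                   # softmax + CE: -(y - y_hat), per Tab. I
grad_theta = J_theta.T @ grad_zL

# Finite-difference spot check on W[1][2, 1]; its index inside Vec(W[1]) is 4*1 + 2.
def loss(Ws_, bs_):
    a = x
    for l, (W, b) in enumerate(zip(Ws_, bs_), start=1):
        z = W.T @ a + b
        a = softmax(z) if l == len(Ws_) else relu(z)
    return -float(np.sum(y * np.log(a)))

eps = 1e-6
W_pert = [W.copy() for W in Ws]
W_pert[0][2, 1] += eps
print(grad_theta[4 * 1 + 2], (loss(W_pert, bs) - loss(Ws, bs)) / eps)
```

The np.kron call is simply one way to materialize the block-diagonal factor; in practice these Jacobians would be applied as products rather than formed explicitly.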
it allows for a clear and accurate understanding of the results. Additionally, using mathematical notation allows for precise and unambiguous statements of results, which can facilitate replication and further research in the field.

Our results are summarized in Alg. 1 for L layers. As Alg. 1 shows, the matrix multiplication is done iteratively. The iterative nature comes from the fact that the loss function of a DNN is defined as a composition of L + 1 functions, where the last one, ℓ, is the final function that measures the loss (error) between the prediction and the actual value, i.e., ℓ(y, ŷ) in Fig. 1, where ŷ is the prediction and y the actual value. The algorithm presented in Alg. 1 is explained and justified by calculating the gradient of the loss function for networks with one, two, and three layers. These calculations provide insights into the gradients of loss functions for generic neural networks and demonstrate how the gradient of a single-layer network can serve as a model for the last layer of any DNN. Additionally, the calculation of a two-layer network is used to extend the calculation beyond two-layer networks, as seen in the calculation of the gradient of the loss function for LeNet-100-300 [14], which is a three-layer network. Finally, we show how convolutional layers can be converted to linear layers in order to calculate their Jacobian matrices. These results can be used to calculate the gradient of the loss function of a CNN.

II. NOTATION

The letters x, x, and W denote a scalar, vector, and matrix, respectively. The letter I represents the identity matrix. The i-th element of a vector x is denoted by x_i. Likewise, w_ij denotes the ij-th element of a matrix W, located at the i-th row and j-th column (sometimes written as w_{i,j} for clarity). Also, W_{i•} and W_{•j} denote the i-th row and the j-th column of the matrix, respectively. The vector form of a matrix W is denoted by Vec(W), where each column of W is stacked on top of each other, with the first column at the top. The letter f is reserved for a vector-valued non-linear activation function of a layer in a DNN (NN), where z and a are its input and output, respectively, i.e., a = f(z). The letter x is reserved for the input to a DNN (NN), and y or y are reserved for the scalar or vector label of the input x. The predictions of a DNN (NN) associated with y or y are denoted by ŷ or ŷ, respectively. A superscripted index inside square brackets denotes the layer of a DNN (NN), e.g., z^[l] is the z vector corresponding to the l-th layer.

III. RESULT

As we have explained earlier, the goal of this paper is to take the first step towards training a DNN, which involves calculating the gradient of a loss function with respect to all parameters of the DNN, i.e., ∇_θ ℓ(y, ŷ(θ)), where θ is the vector of parameters, ŷ is the predicted value by the network, y is the true value, and ℓ(y, ŷ(θ)) is the loss incurred to predict the output. To achieve this goal, two important observations can make the task easier. First, it involves separating the last-layer activation function from the network. Second, it involves using the relationship between the Jacobian operator and the gradient.

To fulfill the first step, we observe the following equality:
\[
\ell\big(y, \hat{y}(\theta)\big) = \ell\Big(y, f^{[L]}\big(z^{[L]}(\theta)\big)\Big),
\]
which is a consequence of the fact that ŷ = f^[L](z^[L](θ)), where f^[L] is the last activation layer and z^[L](θ) is its corresponding input. This equality can be illustrated more clearly as shown in Fig. 2.

Fig. 2. Separating the last layer activation from a DNN.
The second observation utilizes the relationship between the Jacobian operator and the gradient of a scalar-valued function, stated in the following lemma.

Lemma 1 (Jacobian and gradient of a scalar-valued function): For a scalar-valued differentiable function f : R^n → R,
\[
\nabla_{x} f(x) = \big(J_{x} f(x)\big)^{\top},
\]
where ∇_x f(x) is the gradient and J_x f(x) is the Jacobian matrix of f at the point x ∈ R^n, respectively.

Proof 1: The equality follows from the definition of a Jacobian matrix as given in Def. 1 and the definition of the gradient of a scalar-valued function as given in Def. 2 in the Appendix.

By using the second observation as ∇_x f(x) = (J_x f(x))^⊤, and making use of the first one as ℓ(y, ŷ(θ)) = ℓ(y, f^[L](z^[L](θ))), one can write the following:
\[
\begin{aligned}
\nabla_{\theta}\, \ell\big(y, \hat{y}(\theta)\big)
&= \Big(J_{\theta}\, \ell\big(y, \hat{y}(\theta)\big)\Big)^{\top} \\
&= \Big(J_{\theta}\, \ell\big(y, f^{[L]}(z^{[L]}(\theta))\big)\Big)^{\top} \\
&= \Big(J_{z^{[L]}}\, \ell\big(y, f^{[L]}(z^{[L]})\big)\; J_{\theta}\, z^{[L]}(\theta)\Big)^{\top} \\
&= \Big(J_{\theta}\, z^{[L]}(\theta)\Big)^{\top} \Big(J_{z^{[L]}}\, \ell\big(y, f^{[L]}(z^{[L]})\big)\Big)^{\top} \\
&= \Big(J_{\theta}\, z^{[L]}(\theta)\Big)^{\top} \nabla_{z^{[L]}}\, \ell\big(y, f^{[L]}(z^{[L]})\big). 
\end{aligned}
\tag{1}
\]

The advantage of Equation (1) is that it separates the original gradient calculation into two separate calculations, i.e., J_θ z^[L](θ) and ∇_{z^[L]} ℓ(y, f^[L](z^[L])).

The calculation of the second term is straightforward because the choice of a loss function and the last activation function in a DNN are not arbitrary. This is illustrated in Fig. 3, which shows three common combinations, each associated with a different problem.

Fig. 3. Last-layer activation function combined with its associated loss.

In a binary classification problem, a sigmoid function is used as the last activation function together with the Binary Cross Entropy (BCE) loss, as defined in Definition 3 and Definition 5 in the Appendix, respectively. Similarly, in a classification problem with more than two classes, a softmax function is used as the last activation function together with the Cross Entropy (CE) loss, as defined in Definition 4 and Definition 6 in the Appendix, respectively. In a regression problem, where the goal is to predict a continuous value, the last-layer activation function is an identity function, and the Square Error (SE) is used as the loss function, as defined in Definition 7. This means that the predicted output ŷ is equal to the final layer's output z^[L]. Table I shows the combination of the last activation function with its corresponding loss and the expression for ∇_{z^[L]} ℓ(y, f^[L](z^[L])). For the derivation, please refer to Appendix A.
TABLE I
LAST-LAYER ACTIVATION FUNCTION COMBINED WITH ITS ASSOCIATED LOSS

Loss ℓ(y, ŷ) | ∇_{z^[L]} ℓ(y, ŷ) = ∇_{z^[L]} ℓ(y, f^[L](z^[L])) | Proof
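As a quick numerical illustration of the kind of entry Tab. I collects (our own check, not taken from the paper), the snippet below verifies with central finite differences that, for the softmax-plus-CE combination, the last-layer gradient is ∇_{z^[L]} ℓ = ŷ − y = −(y − ŷ); the class count and the random test point are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce(y, z):
    # Cross entropy of a one-hot label y against softmax(z).
    return -float(np.sum(y * np.log(softmax(z))))

z = rng.normal(size=5)
y = np.eye(5)[2]                       # one-hot label for class 3

analytic = softmax(z) - y              # claimed last-layer gradient: y_hat - y
eps = 1e-6
numeric = np.array([(ce(y, z + eps * e) - ce(y, z - eps * e)) / (2 * eps)
                    for e in np.eye(5)])
print(np.max(np.abs(analytic - numeric)))   # ~1e-9
```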
More work is required to calculate the first term, i.e., J_θ z^[L](θ). In the following subsections, we show how to derive this calculation concisely for any number of layers.

A. Gradient of a one-layer network

We will now focus on computing J_θ z^[L](θ). This calculation can be facilitated by starting with a single-layer neural network that is capable of solving a classification problem. The single-layer network plays a crucial role in gradient computation, as it can be considered as the last layer of any deep neural network (DNN). Due to the presence of only one layer, the superscripts in z^[L] and f^[L] can be omitted, giving us J_θ z(θ) := J_θ z^[L](θ). The network depicted in Fig. 4 is designed to perform three-class classification based on inputs with four features. Consequently, the weight matrix W is 4×3 and the bias vector b is 3×1, i.e., W ∈ R^{4×3} and b ∈ R^{3×1}. As shown in the concise representation of the network in the bottom part of Fig. 4, z ∈ R^3, i.e., z = W^⊤x + b. This vector z is a vector-valued function of 15 parameters, where these parameters are all the elements in W and b, i.e., 15 = 4×3 + 3×1. According to the definition of the Jacobian matrix as given in Def. 1 in the Appendix, J_{W,b} z(W, b) is a 3×15 matrix, i.e., J_{W,b} z(W, b) ∈ R^{3×15}.

Fig. 4. 3-class classifier with 4 features and a single layer.

For notational simplicity, all the parameters are denoted by θ, which is a vector in R^15 and is constructed by the process of vectorization. The vectorization process stacks each column of W on top of each other, with the first column on the top, to create Vec(W). Then, the vector of network parameters can be written as θ^⊤ := [Vec(W)^⊤ (b)^⊤]. Therefore, to calculate J_θ z(θ) one can write the following:
\[
\begin{aligned}
J_{\theta}\, z(\theta) &= J_{W,b}\big(W^{\top}x + b\big)
= J_{W,b}\!\begin{bmatrix} W_{\bullet 1}^{\top}x + b_1 \\ W_{\bullet 2}^{\top}x + b_2 \\ W_{\bullet 3}^{\top}x + b_3 \end{bmatrix} \\
&= \begin{bmatrix}
x^{\top} & 0 & 0 & 1 & 0 & 0 \\
0 & x^{\top} & 0 & 0 & 1 & 0 \\
0 & 0 & x^{\top} & 0 & 0 & 1
\end{bmatrix}
= \begin{bmatrix}
x^{\top} & 0 & 0 & \\
0 & x^{\top} & 0 & I_{3\times 3} \\
0 & 0 & x^{\top} &
\end{bmatrix} \in \mathbb{R}^{3\times 15},
\end{aligned}
\]
where W_{•i} is a column of W for i = 1, 2, 3, and I_{3×3} appears because the derivative of each element of z with respect to the components of b is either zero or one. Therefore, the gradient of the loss function is calculated as follows:
\[
\nabla_{\theta}\, \ell\big(y, \hat{y}(\theta)\big) = \big(J_{\theta}\, z(\theta)\big)^{\top} \nabla_{z}\, \ell\big(y, f(z)\big)
= -\begin{bmatrix}
x & 0 & 0 \\
0 & x & 0 \\
0 & 0 & x \\
& I_{3\times 3} &
\end{bmatrix} (y - \hat{y}),
\]
where the value for ∇_z ℓ(y, f(z)) is obtained from Tab. I.
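The 3×15 structure above is easy to verify numerically. The short check below is our own illustration, with a random W, b, x, y and the softmax-plus-CE choice from Tab. I: it builds [blockdiag(x^⊤) I_{3×3}], forms the gradient −[blockdiag(x); I](y − ŷ), and compares it against finite differences of the loss over all 15 parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W, b = rng.normal(size=(4, 3)), rng.normal(size=3)
x, y = rng.normal(size=4), np.eye(3)[0]          # 4 features, one-hot 3-class label

def loss(W_, b_):
    return -float(np.sum(y * np.log(softmax(W_.T @ x + b_))))

# J_theta z(theta) = [ blockdiag(x^T, x^T, x^T)  I_{3x3} ], a 3 x 15 matrix.
J = np.hstack([np.kron(np.eye(3), x[None, :]), np.eye(3)])

y_hat = softmax(W.T @ x + b)
grad = -J.T @ (y - y_hat)                        # gradient w.r.t. theta = [Vec(W); b]

# Finite-difference comparison over all 15 parameters.
theta = np.concatenate([W.flatten(order="F"), b])
def loss_theta(t):
    return loss(t[:12].reshape(4, 3, order="F"), t[12:])

eps = 1e-6
fd = np.array([(loss_theta(theta + eps * e) - loss_theta(theta - eps * e)) / (2 * eps)
               for e in np.eye(15)])
print(np.max(np.abs(grad - fd)))                 # ~1e-9
```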
The above gradient is similar to the gradient of one-layer networks whose weights are vectors rather than matrices, as shown in Tab. II. As can be seen from Tab. II, famous problems such as simple/multiple linear regression, simple binary classification, and logistic regression can be written as a one-layer network whose weights are vectors rather than matrices.

TABLE II
FAMOUS PROBLEMS REPRESENTED AS SIMPLE NEURAL NETWORKS

- Simple linear regression: input x ∈ R; parameters θ = [w b]^⊤ ∈ R^2; prediction ŷ = wx + b; loss ℓ = (y − ŷ)^2; gradient ∇_θ ℓ(y, ŷ) = −2(y − ŷ) [x 1]^⊤.
- Simple binary classifier: input x = [x_1 x_2]^⊤ ∈ R^2; parameters θ = [w^⊤ b]^⊤ ∈ R^3; prediction ŷ = σ(θ^⊤x); loss ℓ = −y log(ŷ) − (1 − y) log(1 − ŷ); gradient ∇_θ ℓ(y, ŷ) = −(y − σ(θ^⊤x)) [x^⊤ 1]^⊤.
- Multiple linear regression: input x = [x_1 ⋯ x_n]^⊤ ∈ R^n; parameters θ = [w^⊤ b]^⊤ ∈ R^{n+1}; prediction ŷ = w^⊤x + b; loss ℓ = (y − ŷ)^2; gradient ∇_θ ℓ(y, ŷ) = −2(y − ŷ) [x^⊤ 1]^⊤.
- Logistic regression: input x = [x_1 ⋯ x_n]^⊤ ∈ R^n; parameters θ = [w^⊤ b]^⊤ ∈ R^{n+1}; prediction ŷ = σ(θ^⊤x); loss ℓ = −y log(ŷ) − (1 − y) log(1 − ŷ); gradient ∇_θ ℓ(y, ŷ) = −(y − σ(θ^⊤x)) [x^⊤ 1]^⊤.
Although a one-layer network provides valuable understanding of the relationship between the Jacobian and the gradient of the loss function with respect to z, it is not practical in terms of performance, i.e., accuracy in prediction. To improve performance, adding more layers is recommended. As a result, the following subsection will demonstrate the calculation of the gradient of a two-layer network.

B. Gradient of a two-layer network

Studying a two-layer network can not only improve performance (accuracy) but also aid in developing a method for calculating the gradient of any deep neural network (DNN) with multiple layers. To demonstrate this, a two-layer network will be considered, as shown in Fig. 5, where the block-wise model helps in calculating the gradient of its loss. Similar to a one-layer network, one can write ∇_{z^[2]} ℓ(y, f^[2](z^[2])) = −(y − ŷ) from Tab. I. To calculate J_θ z^[2](θ), observe that it can be separated into the parameters of the second and the first layer, as does θ. The separation for θ can be written as the following:
\[
\theta^{\top} := \begin{bmatrix} \mathrm{Vec}\big(W^{[1]}\big)^{\top} & \big(b^{[1]}\big)^{\top} & \mathrm{Vec}\big(W^{[2]}\big)^{\top} & \big(b^{[2]}\big)^{\top} \end{bmatrix},
\]
which is used to write the below separation of Jacobian matrices:
\[
J_{\theta}\, z^{[2]}(\theta) = \begin{bmatrix} J_{W^{[1]},b^{[1]}}\, z^{[2]}(\theta) & J_{W^{[2]},b^{[2]}}\, z^{[2]}(\theta) \end{bmatrix}.
\]
Calculating the second term is straightforward because one can write the following:
\[
J_{W^{[2]},b^{[2]}}\, z^{[2]}(\theta) = J_{W^{[2]},b^{[2]}}\Big(\big(W^{[2]}\big)^{\top} a^{[1]} + b^{[2]}\Big)
= \begin{bmatrix}
\big(a^{[1]}\big)^{\top} & 0 & 0 & \\
0 & \big(a^{[1]}\big)^{\top} & 0 & I_{3\times 3} \\
0 & 0 & \big(a^{[1]}\big)^{\top} &
\end{bmatrix},
\]
where a^[1] ∈ R^5 because W^[2] ∈ R^{5×3}. To calculate the first term, first observe that W^[2] is not a function of W^[1]; applying the chain rule then gives the expression shown below:
\[
\begin{aligned}
J_{W^{[1]},b^{[1]}}\, z^{[2]}(\theta) &= \big(W^{[2]}\big)^{\top} J_{z^{[1]}}\, f^{[1]}\big(z^{[1]}\big)\; J_{W^{[1]},b^{[1]}}\Big(\big(W^{[1]}\big)^{\top} x + b^{[1]}\Big) \\
&= \big(W^{[2]}\big)^{\top} J_{z^{[1]}}\, f^{[1]}\big(z^{[1]}\big)
\begin{bmatrix}
x^{\top} & 0 & 0 & \\
0 & \ddots & 0 & I \\
0 & 0 & x^{\top} &
\end{bmatrix}.
\end{aligned}
\]
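As with the one-layer case, this factorization can be checked numerically. The sketch below is our own: it assumes a 4-5-3 network (the 5-unit hidden layer and 3 outputs match the text above, while the 4 input features and the ReLU hidden activation are our choices), forms J_{W^[1],b^[1]} z^[2] as (W^[2])^⊤ J_{z^[1]} f^[1](z^[1]) [blockdiag(x^⊤) I], and compares it against a finite-difference Jacobian of z^[2].

```python
import numpy as np

rng = np.random.default_rng(3)

relu = lambda z: np.maximum(z, 0.0)

W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)
x = rng.normal(size=4)

def z2(W1_, b1_):
    return W2.T @ relu(W1_.T @ x + b1_) + b2       # z[2] as a function of (W[1], b[1])

# Analytic block: (W[2])^T J_{z[1]} f[1](z[1]) [ blockdiag(x^T, ..., x^T)  I_{5x5} ]
z1 = W1.T @ x + b1
J_f1 = np.diag((z1 > 0).astype(float))             # ReLU Jacobian, diagonal
J_block = W2.T @ J_f1 @ np.hstack([np.kron(np.eye(5), x[None, :]), np.eye(5)])

# Finite-difference Jacobian over theta_1 = [Vec(W[1]); b[1]] (25 parameters).
theta1 = np.concatenate([W1.flatten(order="F"), b1])
def z2_theta(t):
    return z2(t[:20].reshape(4, 5, order="F"), t[20:])

eps = 1e-6
fd = np.column_stack([(z2_theta(theta1 + eps * e) - z2_theta(theta1 - eps * e)) / (2 * eps)
                      for e in np.eye(25)])
print(J_block.shape, np.max(np.abs(J_block - fd)))  # (3, 25), ~1e-9
```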
Fig. 6. LeNet-100-300.

vectorized image. Similar to the one- and two-layer networks, ∇_{z^[3]} ℓ(y, f^[3](z^[3])) = −(y − ŷ) ∈ R^10, which is obtained from Tab. I. To calculate J_θ z^[3](θ), three Jacobians are needed to separate out the parameters of the layers, as the following:
\[
J_{\theta}\, z^{[3]}(\theta) = \begin{bmatrix} J_{W^{[1]},b^{[1]}}\, z^{[3]}(\theta) & J_{W^{[2]},b^{[2]}}\, z^{[3]}(\theta) & J_{W^{[3]},b^{[3]}}\, z^{[3]}(\theta) \end{bmatrix}.
\]
By following the same steps as for the two-layer network, the gradient can be calculated as:
\[
\nabla_{\theta}\, \ell\big(y, \hat{y}(\theta)\big) = \big(J_{\theta}\, z^{[3]}(\theta)\big)^{\top} \nabla_{z^{[3]}}\, \ell\big(y, f(z^{[3]})\big),
\]
where the transposed Jacobian matrices of the layers, (J_{W^[3],b^[3]} z^[3](θ))^⊤, (J_{W^[2],b^[2]} z^[3](θ))^⊤, and (J_{W^[1],b^[1]} z^[3](θ))^⊤, are:
\[
\begin{bmatrix}
a^{[2]} & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & a^{[2]} \\
& I_{10\times 10} &
\end{bmatrix},
\qquad
\begin{bmatrix}
a^{[1]} & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & a^{[1]} \\
& I_{100\times 100} &
\end{bmatrix}
\Big(J_{z^{[2]}}\, f^{[2]}\big(z^{[2]}\big)\Big)^{\top} W^{[3]},
\qquad
\begin{bmatrix}
x & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & x \\
& I_{300\times 300} &
\end{bmatrix}
\Big(J_{z^{[1]}}\, f^{[1]}\big(z^{[1]}\big)\Big)^{\top} W^{[2]} \Big(J_{z^{[2]}}\, f^{[2]}\big(z^{[2]}\big)\Big)^{\top} W^{[3]}.
\]

D. Jacobian of activation functions

In this subsection we elaborate on J_{z^[l]} f^[l](z^[l]), where f^[l] is the l-th activation layer whose corresponding input is z^[l], for l = 1, …, L − 1. When l ≠ L, the vector f^[l](z^[l]) is typically obtained by applying a single univariate function f to each element of z^[l]. The most common activation function used in DNNs is the Rectified Linear Unit (ReLU) function, defined as f(x) = max(0, x) [16], [17], [18], [19]. Algebraically, this operation can be represented as
\[
f^{\top}(z) = \begin{bmatrix} f(z_1) & \cdots & f(z_d) \end{bmatrix},
\]
where f is the same univariate function applied to all elements of z, d is the output size of the layer, and we have omitted the superscripts for clarity. This special structure results in a diagonal matrix, i.e.,
\[
J_{z}\, f(z) = \mathrm{diag}\big(f'(z_1), \cdots, f'(z_d)\big),
\]
where f′ is the derivative of the univariate function. For the special case of the ReLU function, J_z f(z) is a diagonal matrix of zeros and ones, which are associated with the negative and positive elements of z, respectively. Multiplying such a matrix from the left with any matrix W results in removing the rows of W associated with the zero elements of J_z f(z), which greatly decreases the computation. The zero-th norm of the parameter vector can be minimized in sparse optimizations using these intuitions, as noted in [20].
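The row-removal effect described above can be seen directly in code. The toy example below (ours, not the paper's) builds the diagonal ReLU Jacobian for a random z and shows that left-multiplying a weight matrix by it zeroes exactly the rows belonging to the non-positive entries of z, so those rows never contribute to subsequent products.

```python
import numpy as np

rng = np.random.default_rng(4)

z = rng.normal(size=6)                    # pre-activation of a hidden layer
W = rng.normal(size=(6, 4))               # next layer's weights (rows indexed like z)

J = np.diag((z > 0).astype(float))        # J_z f(z) for ReLU: zeros and ones on the diagonal

JW = J @ W                                # rows of W with z_i <= 0 become zero rows
print(np.where(z <= 0)[0])                # indices of the "removed" rows
print(JW)

# The same product without materializing the diagonal matrix:
assert np.allclose(JW, W * (z > 0)[:, None])
```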
Note that the derivations so far have only considered fully connected networks. However, for computer vision tasks, CNN models are utilized, which employ convolution operations.
E. Convolution as matrix multiplication

A 2-D convolution operation is a mathematical operation that is used to extract features or patterns from a 2-dimensional input (matrix), such as an image. It works by applying a filter or kernel, which is also a matrix, to the input image [14]. The filter is moved across the image, performing element-wise multiplications with the overlapping regions of the image and filter, and then summing the results. This process is repeated for every position of the filter on the image, resulting in a new matrix output, known as a feature map. Fig. 7 shows the process of a convolution operation, where the 3×3 and 2×2 matrices represent the input and the filter, respectively. As Fig. 7 illustrates, the filter slides over the input image, one pixel at a time, and performs element-wise multiplications with the overlapping region of the image. The result of these multiplications is then summed, and the sum is stored in the corresponding location of the output feature map.

Fig. 7. An illustration of convolution.

The size of the filter and the stride (the number of pixels the filter is moved each time) determine the size of the output feature map. Additionally, the filter can be applied multiple times with different filter parameters, to extract different features from the same input image.

Lemma 2: The convolution operation between two matrices, X and K, can be represented as a matrix multiplication. Specifically, it can be represented as the product of a Toeplitz matrix (or diagonal-constant matrix) of K and the vector obtained from stacking the columns of the transpose of X, in the order of the first one on top. Mathematically, this can be represented as:
\[
X \ast K = \mathrm{Toep}(K)\,\mathrm{Vec}\big(X^{\top}\big),
\]
where X ∈ R^{m_X×n_X}, K ∈ R^{m_K×n_K}, and m_X, n_X, m_K, n_K ∈ N.

Remark 1: The above representation allows for the convolution operation to be computed efficiently using matrix multiplication, which can be parallelized and accelerated on a GPU.

To clarify the above lemma, consider X ∈ R^{3×3} and K ∈ R^{2×2} as the input and filter, respectively. Then, one can verify the lemma by writing the following:
\[
X \ast K =
\begin{bmatrix}
k_1 & k_2 & 0 & k_3 & k_4 & 0 & 0 & 0 & 0 \\
0 & k_1 & k_2 & 0 & k_3 & k_4 & 0 & 0 & 0 \\
0 & 0 & 0 & k_1 & k_2 & 0 & k_3 & k_4 & 0 \\
0 & 0 & 0 & 0 & k_1 & k_2 & 0 & k_3 & k_4
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \end{bmatrix}.
\]

Lemma 3: For an input X to a convolutional layer that has r filters, each calculation X ∗ K_i + B_i is equivalent to (W_i)^⊤x + b_i, where K_i is the matrix of the i-th filter, B_i is the matrix of the associated bias for each filter, W_i = (Toep(K_i))^⊤, b_i = Vec(B_i) for i = 1, …, r, and x := Vec(X^⊤).

According to the above lemma, we can convert a convolutional neural network to a typical fully connected one and find its gradient.
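A small script makes Lemma 2 and the example above concrete. It is our own illustration, using only NumPy: the slide function plays the role of the sliding-filter operation of Fig. 7 (stride 1, no padding), toep builds Toep(K) under that same convention, and the final comparison checks that Toep(K) Vec(X^⊤) reproduces the sliding-window result.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.integers(0, 9, size=(3, 3)).astype(float)    # input "image"
K = rng.integers(0, 9, size=(2, 2)).astype(float)    # 2x2 filter

def slide(X, K):
    # The sliding-filter operation of Fig. 7 (stride 1, no padding).
    mh, mw = X.shape[0] - K.shape[0] + 1, X.shape[1] - K.shape[1] + 1
    out = np.zeros((mh, mw))
    for i in range(mh):
        for j in range(mw):
            out[i, j] = np.sum(X[i:i + K.shape[0], j:j + K.shape[1]] * K)
    return out

def toep(K, in_shape):
    # Toep(K) acting on Vec(X^T) (rows of X stacked), stride 1, no padding.
    mh, mw = in_shape[0] - K.shape[0] + 1, in_shape[1] - K.shape[1] + 1
    T = np.zeros((mh * mw, in_shape[0] * in_shape[1]))
    for i in range(mh):
        for j in range(mw):
            for a in range(K.shape[0]):
                for b in range(K.shape[1]):
                    T[i * mw + j, (i + a) * in_shape[1] + (j + b)] = K[a, b]
    return T

x = X.T.flatten(order="F")                             # Vec(X^T) = rows of X stacked
out_matmul = toep(K, X.shape) @ x
print(np.allclose(out_matmul, slide(X, K).flatten()))  # True
print(toep(K, X.shape))                                 # the 4x9 matrix of the example
```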
IV. CONCLUSION

In this paper, we demonstrated the utilization of the Jacobian operator to simplify the gradient calculation process in DNNs. We presented a matrix multiplication-based algorithm that expresses the BP algorithm using Jacobian matrices and applied it to determine gradients for single-, double-, and three-layer networks. Our calculations offered insights into the gradients of loss functions in DNNs; for instance, the gradient of a single-layer network can serve as a model for the final layer of any DNN. We also expanded our findings to cover more intricate architectures such as LeNet-100-300-10 and demonstrated that convolutional neural network layers can be transformed into linear layers for computing their gradients. These results can aid research on compressing DNNs that utilize the full gradient, as noted in [21]. Furthermore, they can benefit sparse optimization in both deterministic and stochastic settings, where the Iterative Hard Thresholding (IHT) algorithm uses the full gradient for a sparse solution in deterministic settings [20] and the mini-batch Stochastic IHT algorithm is employed in the stochastic context [22]. We provided concise mathematical justifications to make the results clear and useful for people from different fields, even those without a deep understanding of the involved mathematics. This was particularly important when communicating complex technical concepts to non-experts, as it allowed for a clear and accurate understanding of the results. Additionally, using mathematical notation allowed for precise and unambiguous statements of results, facilitating replication and further research in the field. As next steps, we intend to study the calculation of gradients for loss functions in various types of neural networks such as residual, recurrent, Long Short-Term Memory (LSTM), and Transformer networks. We will also explore the Jacobian of batch normalization to further our understanding of the method.

REFERENCES

[1] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," arXiv preprint arXiv:1912.01703, 2019.
[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[5] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[6] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.
[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, 2020.

Definition 2 (Gradient of a scalar-valued function): Let f : R^n → R be a differentiable scalar-valued function. The gradient ∇f : R^n → R^n at x ∈ R^n is defined as the following:
\[
\nabla f(x) = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} \\ \vdots \\ \dfrac{\partial f(x)}{\partial x_n} \end{bmatrix}.
\]

Definition 3 (Sigmoid function): A sigmoid function σ : R → [0, 1] is defined as σ(x) = e^x / (1 + e^x).

Definition 4 (Softmax function): A softmax function σ : R^c → R^c is defined for c ≥ 3 as the following:
\[
\sigma(x) = \frac{1}{\sum_{j=1}^{c} e^{x_j}} \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_c} \end{bmatrix}.
\]
A. Sigmoid with BCE

Lemma 4 (Gradient of BCE loss): Let x^(i) be an input to a one-layer network solving binary classification, with y^(i) ∈ {0, 1} its true label and ŷ^(i) = σ(w^⊤x^(i) + b) the predicted probability corresponding to the input, for i = 1, …, N. The gradient of the BCE loss is given as follows:
\[
\nabla_{\theta}\, \mathrm{BCE}\Big(y^{(i)}, \sigma\big(w^{\top}x^{(i)} + b\big)\Big)
= -\Big(y^{(i)} - \sigma\big(w^{\top}x^{(i)} + b\big)\Big)
\begin{bmatrix} x^{(i)} \\ 1 \end{bmatrix},
\]
for i = 1, …, N.

Proof 2: We calculate the following for a fixed i ∈ {1, …, N}:
\[
\nabla_{\theta}\, \mathrm{BCE}\Big(y^{(i)}, \sigma\big(w^{\top}x^{(i)} + b\big)\Big)
= -\nabla_{\theta}\bigg( y^{(i)} \log \sigma\big(w^{\top}x^{(i)} + b\big)
+ \big(1 - y^{(i)}\big) \log\Big(1 - \sigma\big(w^{\top}x^{(i)} + b\big)\Big) \bigg).
\]
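As an added sanity check of Lemma 4 (not part of the paper), the snippet below evaluates the closed-form gradient −(y − σ(w^⊤x + b))[x; 1] at a random point and compares it with central finite differences of the BCE loss; the three-dimensional input is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(6)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

w, b = rng.normal(size=3), rng.normal()
x, y = rng.normal(size=3), 1.0                     # single example, label in {0, 1}

def bce(theta):
    p = sigmoid(theta[:3] @ x + theta[3])
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

theta = np.append(w, b)
analytic = -(y - sigmoid(w @ x + b)) * np.append(x, 1.0)   # Lemma 4

eps = 1e-6
numeric = np.array([(bce(theta + eps * e) - bce(theta - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.max(np.abs(analytic - numeric)))                  # ~1e-10
```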
B. Softmax with CE

For
\[
\mathrm{CE}\big(y, \sigma(z)\big) = -\sum_{i=1}^{c} y_i \log\!\left( \frac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}} \right),
\]
each component can be calculated as the following:
\[
\begin{aligned}
\frac{\partial}{\partial z_j} \sum_{i=1}^{c} -y_i\Big( z_i - \log\sum_{k=1}^{c} e^{z_k} \Big)
&= \frac{\partial}{\partial z_j}\left( -\sum_{i=1}^{c} y_i z_i + \sum_{i=1}^{c} y_i \log\sum_{k=1}^{c} e^{z_k} \right) \\
&= -y_j + \sum_{i=1}^{c} y_i \frac{e^{z_j}}{\sum_{k=1}^{c} e^{z_k}} \\
&= -y_j + \sum_{i=1}^{c} y_i \big(\sigma(z)\big)_j \\
&= -y_j + \big(\sigma(z)\big)_j \sum_{i=1}^{c} y_i \\
&= -y_j + \big(\sigma(z)\big)_j,
\end{aligned}
\]
where in the last equality we have used the fact that Σ_{i=1}^{c} y_i = 1. Since the above calculation is for the j-th component and (ŷ)_j = (σ(z))_j, by calculating the other components we get the desired result.

C. Square Error

Lemma 6: Let x^(i) be an input to a one-layer network with y^(i) ∈ R the corresponding true value and