Deep learning
Dr. Aissa Boulmerka
[email protected] 2023-2024
CHAPTER 2
SHALLOW NEURAL NETWORKS
What is a Neural Network?
[Figure: a single unit taking the inputs $x_1, x_2, x_3$ and producing the output $\hat{y} = a$]

Computation graph of one unit (logistic regression):
$x, w, b \;\longrightarrow\; z = w^T x + b \;\longrightarrow\; a = \sigma(z) \;\longrightarrow\; \mathcal{L}(a, y)$
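A minimal NumPy sketch of this single-unit computation graph, assuming the binary cross-entropy loss $\mathcal{L}(a, y) = -\big(y \log a + (1 - y) \log(1 - a)\big)$ (the numeric values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One logistic-regression unit on a single example with n_x = 3 features.
x = np.array([[1.0], [2.0], [-1.0]])   # input, shape (3, 1)
w = np.array([[0.1], [-0.2], [0.3]])   # weights, shape (3, 1)
b = 0.0                                # bias
y = 1                                  # true label

z = w.T @ x + b                                      # z = w^T x + b, shape (1, 1)
a = sigmoid(z)                                       # a = sigma(z)
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))    # L(a, y)
print(a.item(), loss.item())
```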
What is a Neural Network?
[Figure: a two-layer network — inputs $x_1, x_2, x_3$, a hidden layer with parameters $W^{[1]}, b^{[1]}$, and an output unit with parameters $W^{[2]}, b^{[2]}$ producing $\hat{y} = a^{[2]}$]

Computation graph of the network:
$x, W^{[1]}, b^{[1]} \longrightarrow z^{[1]} = W^{[1]} x + b^{[1]} \longrightarrow a^{[1]} = \sigma(z^{[1]}) \longrightarrow z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \longrightarrow a^{[2]} = \sigma(z^{[2]}) \longrightarrow \mathcal{L}(a^{[2]}, y)$
Neural Network Representation
A 2-layer neural network (by convention the input layer is not counted):

[Figure: inputs $x_1, x_2, x_3$ (input layer), four hidden units $a_1^{[1]}, \dots, a_4^{[1]}$ (hidden layer), and one output unit (output layer) producing $\hat{y} = a^{[2]}$]

• Input layer: $a^{[0]} = x$
• Hidden layer: parameters $W^{[1]}$ of shape $(4, 3)$ and $b^{[1]}$ of shape $(4, 1)$; activations $a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix}$
• Output layer: parameters $W^{[2]}$ of shape $(1, 4)$ and $b^{[2]}$ of shape $(1, 1)$; output $\hat{y} = a^{[2]}$
Neural Network Representation
[Figure: zooming in on a single unit — it takes $x_1, x_2, x_3$, computes $w^T x + b$, applies $\sigma(z)$, and outputs $a = \hat{y}$]

Each unit performs two steps of computation:
$z = w^T x + b$
$a = \sigma(z)$
Neural Network Representation
[Figure: the same two-step computation repeated for each hidden unit of the network]

The first hidden unit computes:
$z_1^{[1]} = w_1^{[1]T} x + b_1^{[1]}, \quad a_1^{[1]} = \sigma(z_1^{[1]})$

The second hidden unit computes:
$z_2^{[1]} = w_2^{[1]T} x + b_2^{[1]}, \quad a_2^{[1]} = \sigma(z_2^{[1]})$
Neural Network Representation
[Figure: inputs $x_1, x_2, x_3$ feeding the four hidden units $a_1^{[1]}, \dots, a_4^{[1]}$]

Writing out the four hidden units:
$z_1^{[1]} = w_1^{[1]T} x + b_1^{[1]}, \quad a_1^{[1]} = \sigma(z_1^{[1]})$
$z_2^{[1]} = w_2^{[1]T} x + b_2^{[1]}, \quad a_2^{[1]} = \sigma(z_2^{[1]})$
$z_3^{[1]} = w_3^{[1]T} x + b_3^{[1]}, \quad a_3^{[1]} = \sigma(z_3^{[1]})$
$z_4^{[1]} = w_4^{[1]T} x + b_4^{[1]}, \quad a_4^{[1]} = \sigma(z_4^{[1]})$

Stacking the row vectors $w_i^{[1]T}$ into the matrix $W^{[1]}$ of shape $(4, 3)$ vectorizes the whole layer:
$z^{[1]} = \begin{bmatrix} w_1^{[1]T} \\ w_2^{[1]T} \\ w_3^{[1]T} \\ w_4^{[1]T} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix} = \begin{bmatrix} w_1^{[1]T} x + b_1^{[1]} \\ w_2^{[1]T} x + b_2^{[1]} \\ w_3^{[1]T} x + b_3^{[1]} \\ w_4^{[1]T} x + b_4^{[1]} \end{bmatrix} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \\ z_4^{[1]} \end{bmatrix}$
with shapes $W^{[1]}: (4, 3)$, $x: (3, 1)$, $b^{[1]}$ and $z^{[1]}: (4, 1)$.

$a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix} = \sigma(z^{[1]})$
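A rough NumPy sketch of this stacking step, checking that the unit-by-unit loop and the matrix form give the same $z^{[1]}$ (the random values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))        # one input example, shape (3, 1)
W1 = rng.standard_normal((4, 3))       # rows are w_1^[1]T, ..., w_4^[1]T
b1 = rng.standard_normal((4, 1))

# Unit-by-unit computation: one row of W1 at a time.
z_loop = np.array([[W1[i] @ x[:, 0] + b1[i, 0]] for i in range(4)])

# Vectorized computation of the whole layer.
z_vec = W1 @ x + b1                    # z^[1], shape (4, 1)
a1 = sigmoid(z_vec)                    # a^[1] = sigma(z^[1])

assert np.allclose(z_loop, z_vec)      # both routes give the same z^[1]
```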
Neural Network Representation Learning
[Figure: the same 2-layer network — inputs $x_1, x_2, x_3$, hidden units $a_1^{[1]}, \dots, a_4^{[1]}$, output $a^{[2]} = \hat{y}$]

Given input $x = a^{[0]}$:
$z^{[1]} = W^{[1]} x + b^{[1]}$   with shapes $(4, 1) = (4, 3)(3, 1) + (4, 1)$
$a^{[1]} = \sigma(z^{[1]})$   with shapes $(4, 1) \leftarrow (4, 1)$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$   with shapes $(1, 1) = (1, 4)(4, 1) + (1, 1)$
$a^{[2]} = \sigma(z^{[2]})$   with shapes $(1, 1) \leftarrow (1, 1)$
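As a sketch, the whole forward pass for one example fits in a few NumPy lines; the shape comments mirror the slide (the helper name and random values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_single_example(x, W1, b1, W2, b2):
    """Forward pass of the 2-layer network for one column vector x."""
    z1 = W1 @ x + b1          # (4, 1) = (4, 3) @ (3, 1) + (4, 1)
    a1 = sigmoid(z1)          # (4, 1)
    z2 = W2 @ a1 + b2         # (1, 1) = (1, 4) @ (4, 1) + (1, 1)
    a2 = sigmoid(z2)          # (1, 1), the prediction y-hat
    return a2

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 1))
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
y_hat = forward_single_example(x, W1, b1, W2, b2)
assert y_hat.shape == (1, 1)
```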
For loop across multiple examples
[Figure: the same 2-layer network, now applied to each training example in turn]

For one example $x$, the forward pass is:
$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]}), \quad z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]})$

Applied to the whole training set:
$x^{(1)} \longrightarrow a^{[2](1)} = \hat{y}^{(1)}$
$x^{(2)} \longrightarrow a^{[2](2)} = \hat{y}^{(2)}$
$\quad\vdots$
$x^{(m)} \longrightarrow a^{[2](m)} = \hat{y}^{(m)}$

for i = 1 to m:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$

Notation: in $a^{[2](i)}$ the superscript $[2]$ refers to layer 2 and $(i)$ to training example $i$.
Vectorizing across multiple examples
Non-vectorized (loop over the examples):
for i = 1 to m:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$

Vectorized (all $m$ examples at once):
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = \sigma(Z^{[2]})$

The training examples are stacked as columns:
$X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix}$ of shape $(n_x, m)$, $\quad Z^{[1]} = \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \end{bmatrix}$, $\quad A^{[1]} = \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \end{bmatrix}$

Horizontally these matrices index the training examples; vertically they index the input features (for $X$) or the hidden units (for $Z^{[1]}$ and $A^{[1]}$).
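A NumPy sketch of the same computation both ways, confirming that the vectorized version reproduces the per-example loop (the sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_1, m = 3, 4, 5                        # features, hidden units, examples
X = rng.standard_normal((n_x, m))            # examples stacked as columns
W1, b1 = rng.standard_normal((n_1, n_x)), np.zeros((n_1, 1))
W2, b2 = rng.standard_normal((1, n_1)), np.zeros((1, 1))

# Loop version: one column of X at a time.
A2_loop = np.zeros((1, m))
for i in range(m):
    x_i = X[:, i:i+1]                        # keep the column shape (n_x, 1)
    a1 = sigmoid(W1 @ x_i + b1)
    A2_loop[:, i:i+1] = sigmoid(W2 @ a1 + b2)

# Vectorized version: all m examples at once; b1 and b2 broadcast across columns.
A1 = sigmoid(W1 @ X + b1)                    # (n_1, m)
A2 = sigmoid(W2 @ A1 + b2)                   # (1, m)

assert np.allclose(A2, A2_loop)
```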
Justification for vectorized implementation
For the first three examples:
$z^{[1](1)} = W^{[1]} x^{(1)} + b^{[1]}, \quad z^{[1](2)} = W^{[1]} x^{(2)} + b^{[1]}, \quad z^{[1](3)} = W^{[1]} x^{(3)} + b^{[1]}$

Each product $W^{[1]} x^{(i)}$ is a column vector. Stacking the inputs as the columns of $X$ therefore stacks these products as the columns of $W^{[1]} X$:
$W^{[1]} \begin{bmatrix} x^{(1)} & x^{(2)} & x^{(3)} \end{bmatrix} + b^{[1]} = \begin{bmatrix} W^{[1]} x^{(1)} + b^{[1]} & W^{[1]} x^{(2)} + b^{[1]} & W^{[1]} x^{(3)} + b^{[1]} \end{bmatrix} = \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & z^{[1](3)} \end{bmatrix}$

Hence $Z^{[1]} = W^{[1]} X + b^{[1]}$, with $b^{[1]}$ added to every column.
Recap of vectorizing across multiple examples
[Figure: the 2-layer network, shown next to the loop version and the vectorized version]

Loop over the examples:
for i = 1 to m:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$

Vectorized, with $X = A^{[0]} = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix}$ and $A^{[1]} = \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \end{bmatrix}$:
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = \sigma(Z^{[2]})$
Activation functions
[Figure: the 2-layer network with tanh activations in the hidden layer and a sigmoid at the output]

Hidden layer: $g^{[1]}(z^{[1]}) = \tanh(z^{[1]})$
Output layer: $g^{[2]}(z^{[2]}) = \sigma(z^{[2]})$

Given $x$:
$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = g^{[1]}(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = g^{[2]}(z^{[2]})$

Common activation functions:
• Sigmoid: $a = \dfrac{1}{1 + e^{-z}}$
• tanh: $a = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• ReLU: $a = \max(0, z)$
• Leaky ReLU: $a = \max(0.01 z, z)$
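A minimal NumPy sketch of the four activation functions (the helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

z = np.linspace(-5, 5, 11)
for g in (sigmoid, tanh, relu, leaky_relu):
    print(f"{g.__name__:>10}: {np.round(g(z), 3)}")
```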
Pros and cons of activation functions
• Sigmoid: $a = \dfrac{1}{1 + e^{-z}}$ — almost never use this in hidden layers; the main exception is the output layer when you are doing binary classification.
• tanh: $a = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ — for hidden units, tanh is pretty much strictly superior to the sigmoid.
• ReLU: $a = \max(0, z)$ — the default and most commonly used activation function; if you are not sure what else to use, use the ReLU.
• Leaky ReLU: $a = \max(0.01 z, z)$ — you can also try the leaky ReLU.
Why do you need non-linear activation functions?

[Figure: the same network with a linear ("lin") activation in every unit, output $\hat{y} \in \mathbb{R}$]

Suppose every unit uses the linear (identity) activation function $g(z) = z$. Given $x$:
$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = g(z^{[1]}) = z^{[1]}$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = g(z^{[2]}) = z^{[2]}$

Substituting:
$a^{[2]} = W^{[2]} \left( W^{[1]} x + b^{[1]} \right) + b^{[2]} = \left( W^{[2]} W^{[1]} \right) x + \left( W^{[2]} b^{[1]} + b^{[2]} \right) = W' x + b'$

So with linear activations the whole network computes nothing more than a linear function of the input, no matter how many layers it has; the hidden layer adds no expressive power.
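A quick numerical check of this collapse, using the layer shapes from the earlier slides (the random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = rng.standard_normal((3, 1))

# Two layers with the identity activation g(z) = z ...
a2 = W2 @ (W1 @ x + b1) + b2

# ... collapse into a single linear layer W'x + b'.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
assert np.allclose(a2, W_prime @ x + b_prime)
```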
Derivatives of activation functions
Sigmoid activation function: $g(z) = \dfrac{1}{1 + e^{-z}}$

$g'(z) = \dfrac{d}{dz} g(z) = \text{slope of } g(z) \text{ at } z = \dfrac{1}{1 + e^{-z}} \left( 1 - \dfrac{1}{1 + e^{-z}} \right) = g(z) \big( 1 - g(z) \big)$

If $a = g(z)$, then $g'(z) = a(1 - a)$.

Sanity checks:
• $z = 10 \Rightarrow g(z) \approx 1 \Rightarrow g'(z) \approx 1 \cdot (1 - 1) \approx 0$
• $z = -10 \Rightarrow g(z) \approx 0 \Rightarrow g'(z) \approx 0 \cdot (1 - 0) \approx 0$
• $z = 0 \Rightarrow g(z) = 1/2 \Rightarrow g'(z) = \frac{1}{2} \left( 1 - \frac{1}{2} \right) = 1/4$
Derivatives of activation functions
tanh activation function: $g(z) = \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

$g'(z) = \dfrac{d}{dz} g(z) = \text{slope of } g(z) \text{ at } z = 1 - \big( \tanh(z) \big)^2$

If $a = g(z)$, then $g'(z) = 1 - a^2$.

Sanity checks:
• $z = 10 \Rightarrow \tanh(z) \approx 1 \Rightarrow g'(z) \approx 0$
• $z = -10 \Rightarrow \tanh(z) \approx -1 \Rightarrow g'(z) \approx 0$
• $z = 0 \Rightarrow \tanh(z) = 0 \Rightarrow g'(z) = 1$
Derivatives of activation functions
ReLU: $g(z) = \max(0, z)$
$g'(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$

Leaky ReLU: $g(z) = \max(0.01 z, z)$
$g'(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0.01 & \text{if } z < 0 \end{cases}$

(Strictly, the derivative is undefined at exactly $z = 0$; in practice either value can be used.)
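A NumPy sketch of these derivatives, with a finite-difference sanity check on the sigmoid case (the helper names are illustrative; the ReLU derivative at exactly $z = 0$ is set to 1 here by convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    a = sigmoid(z)
    return a * (1 - a)            # g'(z) = a(1 - a)

def dtanh(z):
    a = np.tanh(z)
    return 1 - a ** 2             # g'(z) = 1 - a^2

def drelu(z):
    return (z >= 0).astype(float)

def dleaky_relu(z, slope=0.01):
    return np.where(z >= 0, 1.0, slope)

# Finite-difference check of the sigmoid derivative at a few points.
z = np.array([-10.0, 0.0, 10.0])
eps = 1e-5
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(numeric, dsigmoid(z), atol=1e-6)
print(dsigmoid(z))                # roughly [0, 0.25, 0]
```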
Gradient descent for neural networks
Parameters: $W^{[1]}$ of shape $(n^{[1]}, n^{[0]})$, $b^{[1]}$ of shape $(n^{[1]}, 1)$, $W^{[2]}$ of shape $(n^{[2]}, n^{[1]})$, $b^{[2]}$ of shape $(n^{[2]}, 1)$, with $n^{[0]} = n_x$ input features, $n^{[1]}$ hidden units, and $n^{[2]} = 1$ output unit.

Cost function:
$J\left(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}\right) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)$

Gradient descent:
Repeat {
    Compute the predictions $\hat{y}^{(i)}$, $i = 1, \dots, m$
    $dW^{[1]} = \dfrac{\partial J}{\partial W^{[1]}}, \quad db^{[1]} = \dfrac{\partial J}{\partial b^{[1]}}, \quad dW^{[2]} = \dfrac{\partial J}{\partial W^{[2]}}, \quad db^{[2]} = \dfrac{\partial J}{\partial b^{[2]}}$
    $W^{[1]} := W^{[1]} - \alpha \, dW^{[1]}$
    $b^{[1]} := b^{[1]} - \alpha \, db^{[1]}$
    $W^{[2]} := W^{[2]} - \alpha \, dW^{[2]}$
    $b^{[2]} := b^{[2]} - \alpha \, db^{[2]}$
}
Formulas for computing derivatives
Forward propagation:
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]})$

Back propagation:
$dZ^{[2]} = A^{[2]} - Y$
$dW^{[2]} = \dfrac{1}{m} \, dZ^{[2]} A^{[1]T}$
$db^{[2]} = \dfrac{1}{m} \, \mathrm{np.sum}(dZ^{[2]}, \mathrm{axis}{=}1, \mathrm{keepdims}{=}\mathrm{True})$   — keepdims keeps the shape $(n^{[2]}, 1)$ instead of the rank-1 shape $(n^{[2]},)$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$   — element-wise product; all three factors have shape $(n^{[1]}, m)$
$dW^{[1]} = \dfrac{1}{m} \, dZ^{[1]} X^{T}$
$db^{[1]} = \dfrac{1}{m} \, \mathrm{np.sum}(dZ^{[1]}, \mathrm{axis}{=}1, \mathrm{keepdims}{=}\mathrm{True})$   — keepdims keeps the shape $(n^{[1]}, 1)$ instead of $(n^{[1]},)$, so no reshape is needed
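Putting forward propagation, back propagation, and the gradient-descent update together, a minimal NumPy sketch of one training iteration could look like the following (tanh hidden units and a sigmoid output as on the earlier slides; the function name and learning rate $\alpha$ are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, Y, W1, b1, W2, b2, alpha=0.1):
    """One gradient-descent step for the 2-layer network (sketch)."""
    m = X.shape[1]

    # Forward propagation
    Z1 = W1 @ X + b1                                     # (n1, m)
    A1 = np.tanh(Z1)                                     # g^[1] = tanh
    Z2 = W2 @ A1 + b2                                    # (1, m)
    A2 = sigmoid(Z2)                                     # (1, m)

    # Back propagation
    dZ2 = A2 - Y
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                   # g^[1]'(Z1) = 1 - tanh(Z1)^2
    dW1 = (1 / m) * dZ1 @ X.T
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    # Gradient-descent update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2
```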
What happens if you initialize weights to zero?
[Figure: a tiny network with inputs $x_1, x_2$, two hidden units $a_1^{[1]}, a_2^{[1]}$, and one output unit $a_1^{[2]} = \hat{y}$; here $n^{[0]} = 2$ and $n^{[1]} = 2$]

Suppose we initialize
$W^{[1]} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

Then $a_1^{[1]} = a_2^{[1]}$ (the hidden units are symmetric), and by the same symmetry $dz_1^{[1]} = dz_2^{[1]}$, so the gradient has identical rows:
$dW^{[1]} = \begin{bmatrix} u & v \\ u & v \end{bmatrix}, \quad W^{[1]} := W^{[1]} - \alpha \, dW^{[1]}$

• The bias terms $b$ can be initialized to 0, but initializing $W$ to all 0s is a problem:
• The two activations $a_1^{[1]}$ and $a_2^{[1]}$ will be the same, because both hidden units compute exactly the same function.
• After every single iteration of training, the two hidden units are still computing exactly the same function, as the sketch below illustrates.
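A small numerical illustration of the symmetry problem: with all weights initialized to zero, the two rows of $W^{[1]}$ (and the two hidden activations) stay identical after every gradient step (the data, learning rate, and iteration count are made up for the demo):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 5
X = rng.standard_normal((2, m))               # n^[0] = 2 inputs
Y = (rng.random((1, m)) > 0.5) * 1.0

W1, b1 = np.zeros((2, 2)), np.zeros((2, 1))   # all-zero initialization
W2, b2 = np.zeros((1, 2)), np.zeros((1, 1))

for _ in range(100):                          # gradient-descent iterations
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    W1 -= 0.5 * dW1; b1 -= 0.5 * db1
    W2 -= 0.5 * dW2; b2 -= 0.5 * db2

# The hidden units never break symmetry: the rows of W1 (and of A1) stay equal.
assert np.allclose(W1[0], W1[1]) and np.allclose(A1[0], A1[1])
```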
Random initialization
[Figure: the same tiny network — inputs $x_1, x_2$, hidden units $a_1^{[1]}, a_2^{[1]}$, output $a_1^{[2]} = \hat{y}$]

$W^{[1]}$ = np.random.randn(2, 2) * 0.01
$b^{[1]}$ = np.zeros((2, 1))
$W^{[2]}$ = np.random.randn(1, 2) * 0.01
$b^{[2]}$ = 0

• The multiplier 0.01 keeps the initial values of $z$ small, so tanh or sigmoid units start where their slope is not close to zero and learning is not slowed down.
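Wrapped as a reusable helper for arbitrary layer sizes (the function name and signature are illustrative, not part of the course code):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01, seed=0):
    """Small random weights, zero biases, for a 2-layer network (sketch)."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, n_x)) * scale
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((n_y, n_h)) * scale
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2

W1, b1, W2, b2 = initialize_parameters(n_x=2, n_h=2, n_y=1)
```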
Vectorization demo
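The live demo itself is not reproduced in the deck; presumably it is the classic comparison of an explicit Python loop against np.dot, along these lines (the array size and timings are illustrative):

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Explicit for loop over the elements.
tic = time.time()
c_loop = 0.0
for i in range(n):
    c_loop += a[i] * b[i]
print(f"for loop:   {1000 * (time.time() - tic):.1f} ms")

# Vectorized dot product with NumPy.
tic = time.time()
c_vec = np.dot(a, b)
print(f"vectorized: {1000 * (time.time() - tic):.1f} ms")

assert np.isclose(c_loop, c_vec)
```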
References
Andrew Ng. Deep learning. Coursera.
Geoffrey Hinton. Neural Networks for Machine Learning.
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.