Machine Learning
Gradient Descent Update Rule:
w ← w − η∇L
where:
- w: weights vector
- η: learning rate (step size)
- L: loss function
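To make the update rule concrete, here is a minimal Python sketch that repeatedly applies w ← w − η∇L to a toy loss; the loss L(w) = (w − 3)², the starting point, and the step count are illustrative choices, not taken from these notes.

def grad_L(w):
    # Gradient of the toy loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w = 0.0       # initial weight
eta = 0.1     # learning rate (step size)
for step in range(50):
    w = w - eta * grad_L(w)   # w <- w - eta * grad(L)

print(round(w, 4))  # approaches the minimizer w = 3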
Backpropagation Mechanism
For a basic neural network, the output can be represented as:
y = f (wx + b)
where f is an activation function and b is the bias. The loss function L(y, ŷ)
compares the predicted output ŷ with the true output y.
Using the chain rule, the gradient of the loss with respect to the weights is:
dL/dw = (dL/dy) · (dy/dw)
For a multi-layer neural network, the chain rule is applied layer by layer:
1. For Output Layer:
δ = ∇L(y, ŷ) · f ′ (z)
where z = wx + b and f ′ is the derivative of the activation function.
2. For Hidden Layer:
δ_hidden = (wᵀ δ_next ) · f ′ (z)
where w are the weights of the following layer, δ_next is its error term, and the product with f ′ (z) is taken element-wise.
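To make these chain-rule steps concrete, the following NumPy sketch runs one forward and backward pass through a tiny one-hidden-layer network with sigmoid activations and a squared-error loss; all array values, shapes, and variable names are illustrative assumptions, not from these notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5], [0.2]])               # input, shape (2, 1)
y_true = np.array([[1.0]])                 # target output

W1 = np.array([[0.1, -0.3], [0.4, 0.2]])   # hidden-layer weights
b1 = np.zeros((2, 1))
W2 = np.array([[0.3, -0.1]])               # output-layer weights
b2 = np.zeros((1, 1))

# Forward pass: z = wx + b, activation a = f(z)
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)

# Output layer: delta2 = dL/dy_hat * f'(z2), with L = 0.5 * (y_hat - y_true)^2
delta2 = (y_hat - y_true) * y_hat * (1 - y_hat)

# Hidden layer: delta1 = (W2^T delta2) * f'(z1), element-wise
delta1 = (W2.T @ delta2) * a1 * (1 - a1)

# Gradients of the loss with respect to the weights (chain rule)
dW2 = delta2 @ a1.T
dW1 = delta1 @ x.T
print(dW2)
print(dW1)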
Different activation functions affect the gradient calculations:
- **ReLU**:
f (x) = x if x > 0, 0 otherwise
Derivative:
f ′ (x) = 1 if x > 0, 0 otherwise
- **Sigmoid**:
f (x) = 1 / (1 + e^(−x))
Derivative:
f ′ (x) = f (x) · (1 − f (x))
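The formulas above translate directly into code; the following NumPy sketch evaluates both activations and their derivatives on a few illustrative inputs.

import numpy as np

def relu(x):
    return np.where(x > 0, x, 0.0)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # f'(x) = f(x) * (1 - f(x))

x = np.array([-2.0, 0.0, 3.0])    # illustrative inputs
print(relu(x), relu_grad(x))      # [0. 0. 3.] [0. 0. 1.]
print(sigmoid(x), sigmoid_grad(x))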
TensorFlow can compute such gradients automatically with tf.GradientTape:
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x**3 + 2*x + 5          # record the forward computation on the tape
grad = tape.gradient(y, x)      # dy/dx = 3x^2 + 2 = 14 at x = 2
print(grad.numpy())  # Output: 14.0
Module 4: Probability & Distributions in Deep Learning
1. Common Probability Distributions
1. Gaussian (Normal) Distribution:
- Probability density function:
f (x) = (1 / √(2πσ²)) · e^(−(x−µ)² / (2σ²))
- Used in weight initialization (e.g., Xavier, He Initialization).
2. Bernoulli Distribution:
- Probability mass function:
P (X = 1) = p, P (X = 0) = 1 − p
3. Exponential Distribution:
- Probability density function:
f (x; λ) = λe^(−λx), x ≥ 0
where µ is the mean, σ² is the variance, and ϵ is a small value to prevent division by zero.
For example, weights are often initialized by sampling from a Gaussian distribution:
P (w) ∼ N (µ, σ²)
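As an illustration of how these distributions appear in practice, the NumPy sketch below draws Gaussian weights (here with He-style variance 2/fan_in), Bernoulli samples (e.g., as a dropout-style mask), and exponential samples with rate λ; the layer sizes, keep probability, and rate are made-up values.

import numpy as np

rng = np.random.default_rng(0)

# Gaussian: He-style initialization draws W ~ N(0, 2 / fan_in); sizes are illustrative.
fan_in, fan_out = 256, 128
W = rng.normal(loc=0.0, scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

# Bernoulli: P(X = 1) = p, here used as a dropout-style keep mask (p is illustrative).
p = 0.8
mask = rng.binomial(n=1, p=p, size=(fan_out, 1))

# Exponential: f(x; lambda) = lambda * exp(-lambda * x); NumPy's scale is 1 / lambda.
lam = 1.5
samples = rng.exponential(scale=1.0 / lam, size=5)

print(W.std(), mask.mean(), samples)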
Module 5: Optimization in Deep Learning
1. Impact of Learning Rate in Gradient Descent
Gradient Descent Update Rule:
w ← w − η∇L
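A learning rate that is too small makes progress slow, while one that is too large can overshoot the minimum or diverge. The Python sketch below runs the update rule on a made-up quadratic loss L(w) = w² (gradient 2w) with three different values of η; the loss, starting point, and step count are illustrative.

def run_gd(eta, steps=20, w=5.0):
    for _ in range(steps):
        w = w - eta * 2.0 * w    # w <- w - eta * grad(L), with grad(L) = 2w
    return w

for eta in (0.01, 0.1, 1.1):
    print(f"eta={eta}: w after 20 steps = {run_gd(eta):.4f}")

# eta = 0.01: slow progress toward the minimum at w = 0
# eta = 0.1:  rapid convergence toward 0
# eta = 1.1:  the updates overshoot and |w| grows (divergence)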
1. Momentum: The update maintains an exponentially weighted moving average of past gradients (a code sketch follows the Adam update below):
v ← βv + (1 − β)∇L
w ← w − ηv
where β is the momentum term (typically set to 0.9).
2. Adam Optimizer: A combination of momentum and RMSProp. Updates
are computed as:
vt = β1 vt−1 + (1 − β1 )∇L
st = β2 st−1 + (1 − β2 )(∇L)2
w ← w − η · vt / (√st + ϵ)
where β1 and β2 are the decay rates for the moving averages, and ϵ is a small
constant for numerical stability.
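Putting these updates together, the following NumPy sketch applies both the momentum rule and the Adam rule above to a toy quadratic loss. The loss L(w) = (w − 3)², the hyperparameter values, and the step counts are illustrative choices, and the bias-correction terms of the full Adam algorithm are omitted to match the simplified equations in these notes.

import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2 (illustrative, not from the notes).
    return 2.0 * (w - 3.0)

# --- Momentum: v <- beta*v + (1 - beta)*grad, w <- w - eta*v ---
beta, eta = 0.9, 0.1
w, v = 0.0, 0.0
for _ in range(100):
    v = beta * v + (1 - beta) * grad(w)
    w = w - eta * v
print("momentum:", round(w, 3))   # close to the minimizer w = 3

# --- Adam (simplified, no bias correction) ---
beta1, beta2, eta, eps = 0.9, 0.999, 0.1, 1e-8
w, v, s = 0.0, 0.0, 0.0
for _ in range(200):
    g = grad(w)
    v = beta1 * v + (1 - beta1) * g          # v_t: moving average of gradients
    s = beta2 * s + (1 - beta2) * g ** 2     # s_t: moving average of squared gradients
    w = w - eta * v / (np.sqrt(s) + eps)     # w <- w - eta * v_t / (sqrt(s_t) + eps)
print("adam:", round(w, 3))       # converges to a value near w = 3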