
Mathematics for Machine Learning - Assignment 1

Module 3: Vector Calculus and Deep Learning


1. Gradients in Deep Learning & Backpropagation
The gradient of a function f : ℝⁿ → ℝ is a vector of partial derivatives representing the function's rate of change with respect to its input variables. The gradient is defined as:

∇f(x) = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)
In deep learning, gradients dictate the weight adjustments during training
using the gradient descent algorithm:

w ← w − η∇L
where:
- w: weights vector
- η: learning rate (step size)
- L: loss function
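A minimal sketch of this update rule in NumPy (the toy quadratic loss, starting weights, and learning rate below are assumptions chosen for illustration, not part of the assignment):

import numpy as np

# Assumed toy objective: L(w) = ||w||^2, so grad L(w) = 2w
def grad_L(w):
    return 2 * w

w = np.array([1.0, -2.0])    # arbitrary initial weights
eta = 0.1                    # learning rate (step size)

for _ in range(5):
    w = w - eta * grad_L(w)  # w <- w - eta * grad(L)

print(w)  # weights shrink toward the minimum at 0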

Backpropagation Mechanism
For a basic neural network, the output can be represented as:

y = f (wx + b)
where f is an activation function and b is the bias. The loss function L(y, ŷ)
compares the predicted output ŷ with the true output y.
To compute gradients using the chain rule, the gradient of the loss with
respect to the weights is calculated:
dL/dw = dL/dy · dy/dw
For a multi-layer neural network, the chain rule is applied layer by layer:
1. For the Output Layer:

δ = ∇L(y, ŷ) · f′(z)

where z = wx + b and f′ is the derivative of the activation function.
2. For the Hidden Layer:

δ_hidden = (wᵀδ) · f′(z_hidden)
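A short NumPy sketch of these two delta computations for a network with one hidden layer (the layer sizes, sigmoid activation, and squared-error loss are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))          # input
y_true = np.array([[1.0]])           # target

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # hidden layer
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # output layer

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)

# Backward pass for squared-error loss L = 0.5 * (y_hat - y_true)^2
delta_out = (y_hat - y_true) * y_hat * (1 - y_hat)      # grad L(y, y_hat) * f'(z)
delta_hidden = (W2.T @ delta_out) * a1 * (1 - a1)       # (W^T delta) * f'(z_hidden)

dW2 = delta_out @ a1.T      # gradient for output-layer weights
dW1 = delta_hidden @ x.T    # gradient for hidden-layer weights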

Different activation functions affect the gradient calculations:
- **ReLU**:

f(x) = x if x > 0, and 0 otherwise

Derivative:

f′(x) = 1 if x > 0, and 0 otherwise

- **Sigmoid**:

f(x) = 1 / (1 + e⁻ˣ)

Derivative:

f′(x) = f(x) · (1 − f(x))
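A brief NumPy sketch of both activations and their derivatives exactly as defined above (the test inputs are arbitrary):

import numpy as np

def relu(x):
    return np.where(x > 0, x, 0.0)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)   # 1 if x > 0, 0 otherwise

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # f'(x) = f(x) * (1 - f(x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_grad(x))           # [0. 0. 3.] [0. 0. 1.]
print(sigmoid(x), sigmoid_grad(x))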

2. Automatic Differentiation in TensorFlow/PyTorch


**Forward Mode Autodiff** computes derivatives alongside the function evaluation and is efficient when a function has few inputs, while **Reverse Mode Autodiff** propagates derivatives backward from the output through the computational graph, which makes it efficient for deep learning, where a scalar loss depends on many parameters.
Example in PyTorch:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x**3 + 2*x + 5
y.backward()
print(x.grad) # dy/dx = 14
Example in TensorFlow:

import tensorflow as tf
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x**3 + 2*x + 5
grad = tape.gradient(y, x)
print(grad.numpy()) # Output: 14

3. Comparison of Differentiation Methods


Differentiation Type | Approach | Advantages | Limitations
Symbolic | Manipulates exact symbolic expressions | Exact and precise | Computationally expensive
Numerical | Central difference: (f(x+h) − f(x−h)) / (2h) | Simple to implement | Prone to approximation errors
Automatic | Chain rule over computational graphs | Efficient for complex functions | Requires extra memory
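To see the numerical and automatic approaches side by side, here is a small PyTorch sketch; the test function x³ + 2x + 5 is taken from the earlier example, and the step size h is an assumption:

import torch

def f(x):
    return x**3 + 2*x + 5

# Numerical: central difference (f(x+h) - f(x-h)) / (2h)
x0, h = 2.0, 1e-4
numerical = (f(x0 + h) - f(x0 - h)) / (2 * h)

# Automatic: reverse-mode autodiff over the computational graph
x = torch.tensor(x0, requires_grad=True)
f(x).backward()

print(numerical, x.grad.item())  # both are close to the exact value 14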

Module 4: Probability & Distributions in Deep Learning
1. Common Probability Distributions
1. Gaussian (Normal) Distribution: - Probability density function:

f(x) = (1 / √(2πσ²)) · e^(−(x−µ)² / (2σ²))

- Used in weight initialization (e.g., Xavier, He Initialization).
2. Bernoulli Distribution: - Probability mass function:

P (X = 1) = p, P (X = 0) = 1 − p

- Applicable for binary classification.


3. Exponential Distribution: - Probability density function:

f(x; λ) = λe^(−λx), for x ≥ 0

- Used for modeling the time until an event occurs.
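A brief sketch of drawing samples from these three distributions using torch.distributions (the parameter values are arbitrary illustrations):

import torch

normal = torch.distributions.Normal(loc=0.0, scale=1.0)     # Gaussian(mu, sigma)
bernoulli = torch.distributions.Bernoulli(probs=0.3)        # P(X = 1) = p
exponential = torch.distributions.Exponential(rate=2.0)     # f(x; lambda) = lambda * e^(-lambda x)

print(normal.sample((3,)))        # weight-initialization-style draws
print(bernoulli.sample((3,)))     # 0/1 outcomes, as in binary labels
print(exponential.sample((3,)))   # non-negative waiting times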

2. Gaussian Distribution-Based Regularization


- Dropout Regularization: Reduces overfitting by randomly deactivating neurons:

y^(d) = y · Dropout(p)

where p is the dropout probability.
- Batch Normalization: Normalizes activations using:

ŷ = (y − µ) / √(σ² + ϵ)

where µ is the mean, σ² is the variance, and ϵ is a small value to prevent division by zero.
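A minimal PyTorch sketch showing both layers in use (the batch size, feature size, and dropout probability are assumptions for illustration):

import torch
import torch.nn as nn

x = torch.randn(8, 16)            # a batch of 8 activation vectors of size 16

dropout = nn.Dropout(p=0.5)       # randomly zeroes neurons with probability p
batchnorm = nn.BatchNorm1d(16)    # normalizes with (y - mu) / sqrt(var + eps)

dropout.train()                   # dropout is only active in training mode
y = batchnorm(dropout(x))

print(y.mean(dim=0), y.var(dim=0))  # per-feature mean near 0, variance near 1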

3. Bayesian Deep Learning for Uncertainty Estimation


Bayesian Neural Networks (BNNs) assign probability distributions to weights
rather than deterministic values: - A weight w can be represented as:

w ∼ N(µ, σ²)

- Provides uncertainty quantification in predictions.
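A tiny sketch of the idea: draw a weight from N(µ, σ²) repeatedly and use the spread of the resulting predictions as an uncertainty estimate. The linear model, µ, and σ below are assumptions; real BNNs learn these distributions, e.g. with variational inference or MCMC:

import torch

mu, sigma = 0.5, 0.1                       # assumed posterior parameters for one weight
weight_dist = torch.distributions.Normal(mu, sigma)

x = torch.tensor(2.0)                      # a single input
preds = torch.stack([weight_dist.sample() * x for _ in range(1000)])

print(preds.mean().item())                 # predictive mean
print(preds.std().item())                  # predictive uncertainty (std of predictions)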

Module 5: Optimization in Deep Learning
1. Impact of Learning Rate in Gradient Descent
Gradient Descent Update Rule:

w ← w − η∇L

A learning rate that is too small slows convergence, while one that is too large can overshoot the minimum or diverge. Adaptive Learning Rate Methods address this:


1. Momentum: Accelerates gradient descent using past gradients:

v ← βv + (1 − β)∇L

w ← w − ηv
where β is the momentum term (typically set to 0.9).
2. Adam Optimizer: A combination of momentum and RMSProp. Updates
are computed as:
vₜ = β₁·vₜ₋₁ + (1 − β₁)·∇L
sₜ = β₂·sₜ₋₁ + (1 − β₂)·(∇L)²

w ← w − η·vₜ / (√sₜ + ϵ)
where β1 and β2 are the decay rates for the moving averages, and ϵ is a small
constant for numerical stability.
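A compact NumPy sketch of this Adam-style update exactly as written above (without bias correction, matching the formulas here); the toy gradient and hyperparameter values are assumptions:

import numpy as np

w = np.array([1.0, -1.0])
v = np.zeros_like(w)                 # first moment (momentum term)
s = np.zeros_like(w)                 # second moment (RMSProp term)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

def grad_L(w):                       # assumed toy gradient of L(w) = ||w||^2
    return 2 * w

for _ in range(100):
    g = grad_L(w)
    v = beta1 * v + (1 - beta1) * g          # v_t
    s = beta2 * s + (1 - beta2) * g**2       # s_t
    w = w - eta * v / (np.sqrt(s) + eps)     # w <- w - eta * v_t / (sqrt(s_t) + eps)

print(w)  # weights move toward the minimum at 0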
