MODULE 2
Feedforward Networks: Introduction to feedforward neural networks, Gradient-Based Learning,
Back-Propagation and Other Differentiation Algorithms. Regularization for Deep Learning.
1. Describe the architecture of a feedforward neural network. Explain the roles of input
layers, hidden layers, and output layers.
Deep Feedforward Networks:
Deep feedforward networks, also known as feedforward neural networks or multilayer perceptrons (MLPs),
are a basic and important type of deep learning model. Their main goal is to approximate a function f∗.
For example, in a classification problem, this function maps an input x (like an image) to an output y (like a
category or label). The network models this as y=f(x;θ), where θ represents the parameters the network
learns to make accurate predictions.
Called "feedforward" because information flows in one direction (input to output) without feedback, they
form the basis of many applications like image recognition (e.g., convolutional networks) and are stepping
stones to recurrent neural networks used in natural language processing.
These networks are structured as chains of layers, where each layer processes the output of the previous
one. Simple, efficient, and versatile, they are essential tools in modern machine learning.
These networks are called "neural" because they are inspired by the brain's structure. Each hidden layer is
made up of multiple units, which are similar to neurons in that they process inputs and produce an output
(called an activation value). These layers and units work together to represent complex functions. While the
architecture is inspired by neuroscience, the goal of neural networks is to approximate functions effectively
for generalization, not to mimic brain processes exactly.
1. Depth of model: The number of layers in a feedforward network defines its depth, leading to the
term "deep learning."
2. Output layer: The final layer directly learns to produce the desired output y for each input x.
3. Training process: The network adjusts its parameters to approximate the target function f∗, based
on noisy, approximate training data.
4. Hidden layers: Intermediate layers (hidden layers) transform the input, but their specific behavior is
not directly guided by the training data.
5. Learning algorithm: The algorithm decides how to use hidden layers to approximate the
function f∗, as training data only specifies the output layer’s behaviour.
Key Components:
Layers:
• Input Layer: Takes in the input features x.
• Hidden Layers: Intermediate layers that learn complex patterns. Each hidden layer contains neurons
that apply an activation function to a linear combination of inputs.
• Output Layer: Produces the network's final output, which could be a probability distribution for
classification tasks or a numerical value for regression.
Activation Functions:
• Activation functions introduce non-linearity, enabling the network to learn complex mappings.
Common functions include ReLU, sigmoid, and tanh.
Learning Process:
• The network is trained using gradient-based optimization (e.g., stochastic gradient descent) to
minimize a loss function that measures the error between predicted and true outputs.
6.1 Example: Learning XOR
Task: Learn the XOR function, a binary operation returning 1 when exactly one of x1 or x2 is
1, otherwise returning 0.
Goal of the Network:
We want the neural network to learn the XOR function, that is, the target function y=f∗(x). The
model’s function is y=f(x;θ), where θ represents the trainable parameters.
Why a Linear Model Fails
Why Nonlinearity is Crucial
• If the hidden layer were linear, the network as a whole would still behave like a linear model, f(x)=w⊤x.
• By using a nonlinear activation g, the network learns a new feature space that allows linear
separation of XOR points.
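The role of the nonlinearity can be checked by hand. The sketch below is a minimal illustration, assuming a hidden layer of two ReLU units and one well-known hand-picked set of weights; the numbers and the name `xor_net` are illustrative assumptions and are not learned by training.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

# Hand-picked parameters (an assumption for illustration, not learned values).
W = [[1.0, 1.0], [1.0, 1.0]]  # W[i][j]: weight from input i to hidden unit j
c = [0.0, -1.0]               # hidden biases
w = [1.0, -2.0]               # output weights
b = 0.0                       # output bias

def xor_net(x1, x2):
    """f(x) = w . relu(W^T x + c) + b"""
    x = (x1, x2)
    h = [relu(sum(W[i][j] * x[i] for i in range(2)) + c[j]) for j in range(2)]
    return sum(w[j] * h[j] for j in range(2)) + b

outputs = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

In the hidden space the four inputs map to h = (0,0), (1,0), (1,0) and (2,1), which a single linear output unit can separate.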
6.2 Gradient-Based Learning
Explain briefly about gradient descent algorithm.
MQP> Explain the concept of gradient-based learning in neural networks. Why is it important?
To design a deep neural network (DNN) with gradient-based learning, we need to carefully consider the
architecture, cost function, output units, and hidden units.
Training a neural network is like training any other machine learning model using gradient descent. The key
differences are:
Gradient-Based Learning with Respect to the Cost Function
Gradient-based learning is central to optimizing neural networks.
Cost Function: This measures the error between the network’s predictions and actual target values, guiding
parameter updates.
o Mean Squared Error (MSE): Used in regression tasks, MSE minimizes the squared difference
between predicted and actual values.
o Cross-Entropy Loss: Common in classification tasks, cross-entropy measures the difference between
predicted probabilities and true labels. For binary classification, binary cross-entropy is used, while for
multiclass classification, categorical cross-entropy is used.
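The two cost functions above can be written out directly. This is a minimal sketch using plain Python lists; the function names and the clipping constant `eps` are assumptions made for the example.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

regression_cost = mse([1.0, 2.0], [1.0, 4.0])
confident_cost = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure_cost = binary_cross_entropy([1, 0], [0.6, 0.4])
```

Predictions that put more probability on the correct label give a lower cross-entropy, which is what the gradient updates push towards.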
6.2.1.1 Learning Conditional Distributions with Maximum Likelihood
This section covers the use of maximum likelihood for training modern neural networks and links it to cost
functions such as negative log-likelihood and cross-entropy.
Maximum Likelihood Training:
• Most modern neural networks are trained to maximize the likelihood of the observed data.
• The cost function for this is the negative log-likelihood, which is equivalent to the cross-entropy
between the training data and the model's predicted distribution.
1. Optimization via Gradient Descent:
o Gradient Descent: Computes the gradient of the cost function with respect to the model
parameters and updates these parameters in the direction that reduces the cost.
o Backpropagation: Efficiently calculates the gradient by propagating errors backward through
the network, layer by layer.
o Learning Rate: Controls the step size for each parameter update. Smaller learning rates
provide more precise convergence but slower training, while larger rates speed up training but
risk overshooting optimal values.
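The update rule and the effect of the learning rate can be seen on a one-parameter problem. The sketch below assumes a toy cost J(θ) = (θ − 3)², chosen only for illustration; `gradient_descent` and `grad_J` are hypothetical names.

```python
def gradient_descent(grad, theta, learning_rate, steps):
    """Repeatedly step against the gradient: theta <- theta - lr * dJ/dtheta."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

def grad_J(theta):
    """Gradient of the toy cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

theta_small = gradient_descent(grad_J, 0.0, learning_rate=0.1, steps=100)
theta_large = gradient_descent(grad_J, 0.0, learning_rate=1.1, steps=100)
```

With the small rate the parameter settles at the minimum θ = 3; with the large rate every step overshoots by more than the previous one and the iterate diverges.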
Design of Output Units
The choice of output units depends on the type of prediction task:
• Gaussian Output Units (Regression Tasks): For continuous outputs, the network predicts the mean μ of a
Gaussian distribution with fixed variance σ². Maximizing the likelihood under this model is equivalent to
minimizing MSE. This is used for tasks like predicting house prices or temperatures.
• Bernoulli Output Units (Binary Classification): For binary outputs (0 or 1), the Bernoulli
distribution models the probability of each outcome. Binary cross-entropy is the cost function here,
common in tasks like spam detection.
• Multinoulli (Categorical) Output Units (Multiclass Classification): For multiclass problems, the
Multinoulli distribution (or categorical distribution) models probabilities across classes. A softmax
activation is applied in the output layer to produce class probabilities, and categorical cross-entropy
is the cost function. This is ideal for problems like image classification.
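As an illustration of the Multinoulli output unit, the sketch below computes softmax probabilities from raw scores and the categorical cross-entropy for one example. The function names are assumptions, and subtracting the maximum score is a common numerical-stability choice rather than something stated in these notes.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])
loss_if_class_0 = categorical_cross_entropy(probs, 0)
loss_if_class_2 = categorical_cross_entropy(probs, 2)
```

The loss is small when the true class has the highest score and large when it has the lowest.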
Design of Hidden Units
Hidden units in a neural network allow it to learn complex patterns by transforming the input data non-linearly.
Choosing the right activation function is crucial for effective learning:
• ReLU (Rectified Linear Unit): Outputs zero for negative inputs and the input value if positive,
promoting sparse and efficient representations. ReLU is widely used for its computational efficiency
and helps avoid vanishing gradients.
• Leaky ReLU and Parametric ReLU (PReLU): Variants of ReLU that allow a small gradient for
negative inputs to mitigate the “dying ReLU” problem, where neurons stop updating if stuck in
negative values.
• Sigmoid and Tanh Functions: While less common in deep networks, these functions are useful for
bounded output and probabilistic interpretations. However, they are prone to vanishing gradient issues
and can hinder training in deeper networks. Tanh, in particular, outputs values between -1 and 1,
providing zero-centered data for more balanced updates.
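The hidden-unit activations listed above are short enough to write out. This is a minimal sketch; the leaky slope `alpha=0.01` is a commonly used value assumed here, and the two-branch sigmoid is just one way to avoid overflow for large negative inputs.

```python
import math

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return z if z > 0 else alpha * z

def sigmoid(z):
    """Logistic sigmoid with outputs in (0, 1), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    """Hyperbolic tangent with zero-centered outputs in (-1, 1)."""
    return math.tanh(z)

samples = [-2.0, 0.0, 3.0]
relu_out = [relu(z) for z in samples]
leaky_out = [leaky_relu(z) for z in samples]
```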
6.3.1 Rectified Linear Units and Their Generalizations / Role of Activation Functions:
Benefits of Maxout Units:
• Flexibility: With a large enough number of pieces, maxout can approximate any convex function,
making it versatile.
• Redundancy: Maxout units have multiple filters, which can help with catastrophic forgetting, where
networks forget previously learned tasks. This redundancy can help preserve past knowledge during
training.
• Efficiency: In some cases, maxout units can reduce the number of parameters by summarizing
multiple features using the maximum.
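A maxout unit outputs the maximum over several affine "pieces" of its input. The sketch below is a minimal illustration with hand-picked pieces; the function name `maxout` and the example weights are assumptions chosen to show that two pieces can already reproduce ReLU or the absolute value.

```python
def maxout(x, weights, biases):
    """Maximum over k affine pieces w_j . x + b_j."""
    return max(
        sum(wi * xi for wi, xi in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )

# Two pieces (x, -x) give the absolute value, a convex function.
abs_of_minus_3 = maxout([-3.0], weights=[[1.0], [-1.0]], biases=[0.0, 0.0])

# Two pieces (x, 0) give ReLU.
relu_of_2 = maxout([2.0], weights=[[1.0], [0.0]], biases=[0.0, 0.0])
relu_of_minus_2 = maxout([-2.0], weights=[[1.0], [0.0]], biases=[0.0, 0.0])
```

Adding more pieces lets the unit trace out a convex piecewise-linear function with more segments.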
6.3.2 Logistic Sigmoid and Hyperbolic Tangent
• Characteristics:
• Saturation: The sigmoid function saturates at both extremes. As z→∞, σ(z) approaches 1, and as z→−∞, σ(z)
approaches 0. The function is sensitive to its input only near 0.
• Gradient Saturation: When the input is very large or very negative, the gradient (used for learning) becomes
close to zero, which makes learning slow. This is why the sigmoid is not ideal for hidden layers in deep networks.
• Use: Despite these drawbacks, the logistic sigmoid is often used as an output activation function for binary
classification tasks, where the output needs to be a probability between 0 and 1. However, it is not
recommended as a hidden unit activation function due to poor gradient propagation.
Gradient: The gradient of the tanh function is larger than the sigmoid's, especially near 0. This makes the
tanh function preferable over sigmoid when used in hidden layers, as it helps mitigate the vanishing gradient
problem to some extent.
Why Not Use Sigmoid and Tanh in Hidden Layers?
• Both sigmoid and tanh can cause the network to struggle with learning in deep networks because
their gradients become very small (this is called the vanishing gradient problem).
• ReLU (Rectified Linear Units) and its variations (like leaky ReLU) are now preferred for hidden
layers because their gradient stays at 1 for positive inputs, so they are far less prone to this problem
and are easier to optimize.
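The size of the problem can be estimated from the derivatives alone. The sketch below multiplies the activation derivative across 10 layers in the most favourable case for each function; it deliberately ignores the weight matrices, so it is a rough illustration under that assumption rather than a full analysis.

```python
import math

def d_sigmoid(z):
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

depth = 10
# Best case for each activation: sigmoid and tanh at z = 0, ReLU at any z > 0.
sigmoid_factor = d_sigmoid(0.0) ** depth  # 0.25 ** 10
tanh_factor = d_tanh(0.0) ** depth        # 1.0 ** 10
relu_factor = d_relu(1.0) ** depth        # 1.0 ** 10

# Away from zero the sigmoid saturates and its derivative is tiny.
saturated_slope = d_sigmoid(10.0)
```

Even at its steepest point the sigmoid scales the gradient by 0.25 per layer, so ten layers shrink it by roughly a factor of a million.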
6.3.3 Other Hidden Units
6.5 Back-Propagation and Other Differentiation Algorithms
MQP > Discuss the working of Back propagation.
• Forward Propagation:
• In a feedforward neural network, information flows from the input x through the hidden units in each
layer to produce the output ŷ. This process is known as forward propagation.
• During training, forward propagation continues until a scalar cost function J(θ) is produced.
• Back-Propagation:
• The back-propagation algorithm (or simply backprop) computes the gradient of the cost function
with respect to the network's parameters.
• This is achieved by propagating the information from the cost J(θ) backwards through the network,
allowing the computation of gradients needed for optimization.
• Gradient Computation:
• Although an analytical expression for the gradient is straightforward, evaluating this expression
directly can be computationally expensive. Backpropagation simplifies this by using an efficient,
inexpensive method to compute the gradient.
• Misconceptions about Back-Propagation:
• Back-propagation is often misunderstood as the entire learning algorithm for neural networks, but it
only refers to the method for computing gradients. The actual learning is performed using another
algorithm, such as stochastic gradient descent, which uses the computed gradient to adjust the
model's parameters.
• Back-propagation is also not exclusive to multi-layer neural networks. It can compute derivatives for
any function, and can be used in various tasks beyond neural networks. For example, it can compute
derivatives of a function f with respect to a set of variables x (whose gradients are needed), while other
variables y (inputs to the function) may not require their gradients.
6.5.1 Computational Graphs
• Purpose of Computational Graphs:
• To describe the back-propagation algorithm more precisely, we use computational graphs, which
provide a formal way of representing the flow of computation in a neural network.
• In these graphs, each node represents a variable (which could be a scalar, vector, matrix, or tensor).
• Operations are the functions applied to one or more variables to generate an output. These
operations are the basic building blocks in the graph.
Operations:
• An operation is a function that takes one or more variables as input and returns a single output. For
example, it could be an addition, multiplication, or any other basic mathematical function.
• More complex functions can be represented by combining simpler operations.
Graph Representation:
• If a variable y is computed by applying an operation to a variable x, we draw a directed edge from x
to y in the graph. This shows the flow of information from x to y.
• The node representing the output variable y may be annotated with the name of the operation, though
sometimes this label is omitted if the operation is clear from context.
Examples of Computational Graphs (Figure 6.8)
These examples demonstrate how complex expressions are broken down into smaller steps, each represented
by a node in the graph. This helps in understanding how the gradient flow works during back-propagation.
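A computational graph can be sketched in a few lines of code. The example below is a toy illustration, not the figure's graphs: the `Node` class, the `add` and `mul` operations, and the expression z = x·y + x are all assumptions chosen for the example. Each node stores its value, its parent nodes, and the local derivative along each edge.

```python
class Node:
    """A variable in the graph; parents holds (parent_node, local_derivative) pairs."""
    def __init__(self, value, parents=(), op=""):
        self.value = value
        self.parents = parents
        self.op = op
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)), "+")

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)), "*")

def backward(output):
    """Visit nodes from the output back to the inputs, accumulating gradients."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

x, y = Node(2.0), Node(3.0)
z = add(mul(x, y), x)  # z = x * y + x
backward(z)
```

Because x feeds into two operations, its gradient is the sum of the contributions along both paths: dz/dx = y + 1 = 4 and dz/dy = x = 2.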
6.5.2 Chain Rule of Calculus
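The chain rule of calculus is a mathematical principle used to compute derivatives of composed functions
when their individual derivatives are known. Backpropagation leverages this rule to calculate gradients in a
computationally efficient manner.
For scalars, if y = g(x) and z = f(y), then dz/dx = (dz/dy)(dy/dx). For vectors x and y with a scalar z, this
generalizes to ∇xz = (∂y/∂x)⊤ ∇yz, where:
-- ∂y/∂x is the Jacobian matrix of g,
-- ∇xz is the gradient of z with respect to x,
-- ∇yz is the gradient of z with respect to y.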
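The vector form of the chain rule, ∇xz = (∂y/∂x)⊤ ∇yz, can be checked numerically. The sketch below assumes a toy composition y = Wx and z = Σ yᵢ², for which the Jacobian ∂y/∂x is simply W and ∇yz = 2y; the matrix `W`, the point `x`, and the function names are illustrative assumptions.

```python
W = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # Jacobian of y = W x

def g(x):
    """y = W x"""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def f(y):
    """z = sum of squares of y"""
    return sum(yi ** 2 for yi in y)

def grad_x_chain(x):
    """Chain rule: grad_x z = W^T (2 y)."""
    grad_y = [2.0 * yi for yi in g(x)]
    return [sum(W[i][j] * grad_y[i] for i in range(len(W))) for j in range(len(x))]

def grad_x_numeric(x, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += eps
        down[j] -= eps
        grads.append((f(g(up)) - f(g(down))) / (2 * eps))
    return grads

x = [0.5, -1.5]
analytic = grad_x_chain(x)
numeric = grad_x_numeric(x)
```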
6.5.3 Recursively Applying the Chain Rule to Obtain Backpropagation:
The backpropagation algorithm applies the chain rule recursively to compute gradients layer by layer, starting
from the output layer and moving backward to the input layer. This recursive gradient computation enables
the neural network to update its weights to minimize the loss function.
Recursive Gradient Calculation:
We calculate the gradients from the output layer (last layer) backward to the input layer (first layer). This is
the essence of backpropagation.
2.HIDDEN UNIT :
2. Demonstrate the Back-Propagation algorithm in Fully-Connected MLP
Practical Implications
• Batch vs Single Input: In practice, computations are typically done for a batch of inputs rather than
a single input x. This batch processing significantly improves efficiency by allowing parallel
computation and optimizing the gradient updates.
• Gradient Computation: After performing the forward pass to compute the output, the next step is to
apply backpropagation (as detailed in Algorithm 6.4), which computes the gradients with respect to
the parameters (weights and biases). These gradients are then used to update the parameters during
training, enabling the model to learn.
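The forward pass followed by back-propagation can be sketched for a small fully connected network. This is a minimal sketch, not a reproduction of Algorithm 6.4: it assumes one tanh hidden layer, a single linear output, a squared-error cost averaged over a minibatch, and made-up data. tanh is used only so that the finite-difference check at the end is smooth.

```python
import math
import random

def forward(params, x):
    """Forward pass: hidden activations h and prediction y_hat for one input."""
    W1, b1, w2, b2 = params
    a = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [math.tanh(ai) for ai in a]
    y_hat = sum(w * hi for w, hi in zip(w2, h)) + b2
    return h, y_hat

def cost(params, X, Y):
    """Mean of 0.5 * (y_hat - y)^2 over the minibatch."""
    return sum(0.5 * (forward(params, x)[1] - y) ** 2 for x, y in zip(X, Y)) / len(X)

def backprop(params, X, Y):
    """Gradients of the cost with respect to every weight and bias."""
    W1, b1, w2, b2 = params
    n = len(X)
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gw2 = [0.0] * len(w2)
    gb2 = 0.0
    for x, y in zip(X, Y):
        h, y_hat = forward(params, x)
        d_y = (y_hat - y) / n                      # dJ/dy_hat
        gb2 += d_y
        for j, hj in enumerate(h):
            gw2[j] += d_y * hj                     # output-layer weight
            d_a = d_y * w2[j] * (1.0 - hj * hj)    # back through tanh
            gb1[j] += d_a
            for i, xi in enumerate(x):
                gW1[j][i] += d_a * xi              # hidden-layer weight
    return gW1, gb1, gw2, gb2

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b1 = [0.1, -0.2, 0.3]
w2 = [rng.uniform(-1, 1) for _ in range(3)]
b2 = 0.05
params = (W1, b1, w2, b2)
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, -0.5]]  # made-up minibatch
Y = [1.0, 1.0, 0.0, 0.5]
gW1, gb1, gw2, gb2 = backprop(params, X, Y)

def numeric_grad_W1(j, i, eps=1e-5):
    """Finite-difference estimate of dJ/dW1[j][i], for checking only."""
    original = W1[j][i]
    W1[j][i] = original + eps
    up = cost(params, X, Y)
    W1[j][i] = original - eps
    down = cost(params, X, Y)
    W1[j][i] = original
    return (up - down) / (2 * eps)
```

Comparing `gW1[j][i]` with `numeric_grad_W1(j, i)` is a common way to confirm that a hand-written backward pass is correct before using it in training.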
REGULARIZATION:
MQP > Discuss other regularization techniques like dropout and early stopping. How
do they help prevent overfitting?
#DROPOUT:
Dropout is a technique introduced by Srivastava et al. in 2014 that regularizes neural networks by reducing
overfitting. It works by randomly "dropping" or turning off some neurons during training, making them
inactive. This forces the network to learn to make decisions without relying too much on any one neuron.
The effect is similar to training multiple smaller networks, but without the extra computational cost. This
technique helps the model generalize better to new data by preventing it from memorizing the training data
too closely.
How Dropout Works:
• Bagging (Bootstrap Aggregating) involves training multiple models and combining them for better
results. However, training large neural networks for bagging is expensive in terms of time and memory.
Dropout provides a way to approximate bagging without the computational cost.
• The Dropout Algorithm:
o Dropout works by randomly turning off neurons (units) in the network during training. The
neurons are randomly dropped, and this randomness forces the network to learn better
representations by not relying too heavily on any one unit.
o For each training step, a random mask (a binary vector) is applied to the network, which
decides which neurons will be kept and which ones will be dropped. Typically, 80% of input
units and 50% of hidden units are kept active.
o The mask is independent for each training example (or minibatch), so each forward pass
through the network is slightly different.
3. Training Process:
o Dropout uses a method like stochastic gradient descent (SGD). For each minibatch, a different
mask is applied to the network, and the network performs forward propagation and
backpropagation as usual.
o During training, the weights (parameters) of the network are shared among all sub-networks.
Instead of having separate models, a single network with shared parameters is trained on
different configurations.
Key Points in Dropout Training:
1. Mask Vector (µ): This binary vector indicates which units should be active in the network for each
training step. A value of 1 means the unit is active, and 0 means it's dropped.
2. Cost Function (J): The model's error is calculated based on the current weights (θ) and the mask (µ).
Dropout aims to minimize the expected cost across all possible masks. Since there are exponentially
many possible masks, this is approximated by sampling a few random masks during training.
3. Expectation (Eµ): Dropout approximates the cost’s expectation by averaging over many sampled
masks. This gives an unbiased estimate of the gradient for updating the weights.
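The mask sampling described above can be sketched as follows. This is a minimal illustration of training-time masking only, assuming the keep probabilities of 0.8 for inputs and 0.5 for hidden units mentioned earlier; the function names and the example activation values are made up for the example.

```python
import random

def sample_mask(n_units, keep_prob, rng):
    """Binary mask: 1 keeps a unit, 0 drops it."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_units)]

def apply_mask(values, mask):
    """Zero out the dropped units."""
    return [v * m for v, m in zip(values, mask)]

rng = random.Random(0)
inputs = [0.2, -1.0, 0.7, 1.5]           # illustrative input features
hidden = [0.3, 0.0, 2.1, 0.9, 1.4, 0.6]  # illustrative hidden activations

input_mask = sample_mask(len(inputs), 0.8, rng)
hidden_mask = sample_mask(len(hidden), 0.5, rng)
masked_inputs = apply_mask(inputs, input_mask)
masked_hidden = apply_mask(hidden, hidden_mask)

# Over many samples, the kept fraction approaches the keep probability.
many = sample_mask(10000, 0.5, rng)
kept_fraction = sum(many) / len(many)
```

A fresh mask would be drawn for every minibatch, so each gradient step trains a different sub-network that shares the same weights.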
Difference from Bagging:
1. Bagging:
• Models are trained independently on different data subsets, each with its own parameters.
• Models are fully trained to convergence.
2. Dropout:
• Many sub-networks are formed by applying different masks, but they share parameters.
• Dropout approximates an ensemble by training a single large model with random subsets of
active neurons, instead of separate models.
3. Training:
• Bagging trains models fully, while dropout trains only a small fraction of possible sub-networks
each time.
• Parameter sharing improves the performance of all sub-networks, even if only a few are
trained.
4. Training Data:
• Like bagging, dropout uses a subset of training data (sampled with replacement) for each sub-
network, ensuring diversity.
7.8 Early Stopping
Early stopping is a regularization technique used in training machine learning models, especially in deep
learning, to prevent overfitting.
• Training and Validation Errors: As a model trains, the training error decreases, indicating that the model
is learning and fitting the training data. However, after a certain point, the validation error (the error on a
separate set of data not seen during training) may start to increase, which suggests that the model is beginning
to overfit to the training data.
• Overfitting: Overfitting occurs when the model becomes too complex and starts to learn the noise or
irrelevant details in the training data. As a result, while its performance on the training data continues to
improve, its ability to generalize to unseen data (like the validation set) diminishes.
• Early Stopping: This technique aims to avoid overfitting by monitoring the validation set error during
training. Every time the validation error improves, the model parameters (weights, biases, etc.) are saved. If
the validation error does not improve for a pre-specified number of iterations (also called "patience"), the
training is stopped, and the model parameters from the iteration with the lowest validation error are restored.
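The patience rule can be sketched on a list of validation errors. In this minimal sketch the errors are made-up numbers standing in for real evaluations, and `early_stopping` is a hypothetical helper; a real training loop would also save the parameters at each improvement and restore them at the end.

```python
def early_stopping(val_errors, patience):
    """Return (best_step, best_error, stop_step) for a run of validation errors."""
    best_error = float("inf")
    best_step = None
    waited = 0
    for step, error in enumerate(val_errors):
        if error < best_error:
            # Improvement: this is where the parameters would be saved.
            best_error, best_step, waited = error, step, 0
        else:
            waited += 1
            if waited >= patience:
                return best_step, best_error, step  # stop and restore best_step
    return best_step, best_error, len(val_errors) - 1

# Made-up validation curve: improves, then starts rising as overfitting sets in.
val_errors = [0.90, 0.70, 0.60, 0.55, 0.58, 0.57, 0.60, 0.62, 0.50]
best_step, best_error, stop_step = early_stopping(val_errors, patience=3)
```

Training stops at step 6 after three evaluations without improvement, and the parameters from step 3 would be restored. The later value 0.50 is never seen, which shows the trade-off in choosing the patience.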
MQP > What is regularization? How does regularization help in reducing overfitting?
Regularization is a technique used in machine learning to prevent overfitting by adding a penalty to the loss
function.
• The penalty discourages the model from fitting the noise in the training data, encouraging simpler
models with better generalization.
How Regularization Helps Reduce Overfitting
1. Simplifies the Model
o Regularization reduces the complexity of the model by shrinking the coefficients of less
important features toward zero.
o Simpler models are less likely to overfit and are better at generalizing to new data.
2. Avoids Over-reliance on Specific Features
o Penalizes large weights, ensuring the model does not rely too heavily on a small subset of
features.
o Promotes learning from the broader structure of the data.
3. Types of Regularization
o L1 Regularization (Lasso): Adds the sum of absolute values of weights to the loss function.
§ Shrinks some weights to zero, effectively performing feature selection.
o L2 Regularization (Ridge): Adds the sum of squared values of weights to the loss function.
§ Reduces the magnitude of all weights but does not eliminate them entirely.
o Elastic Net: Combines L1 and L2 penalties for more flexibility.
4. Dropout Regularization
o In neural networks, dropout randomly disables a subset of neurons during training.
o Prevents neurons from co-adapting, improving generalization.
5. Improves Model Generalization
o By reducing the model's capacity to memorize the training data, regularization ensures better
performance on unseen data.
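The L1, L2 and Elastic Net penalties can be written as small functions added to the data loss. This is a minimal sketch; the function names, the penalty strengths, and the example weights are assumptions made for illustration.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Data loss plus penalties; setting both l1 and l2 gives Elastic Net."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)

weights = [1.0, -2.0, 0.0]
lasso = l1_penalty(weights, 0.5)  # 0.5 * (1 + 2 + 0)
ridge = l2_penalty(weights, 0.5)  # 0.5 * (1 + 4 + 0)
total = regularized_loss(1.0, weights, l1=0.5, l2=0.5)
```

Because the penalty grows with the size of the weights, minimizing the total loss trades a slightly worse fit on the training data for smaller weights.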
Advantages of Regularization
• Reduces overfitting while retaining model accuracy.
• Encourages sparsity in features (L1) and prevents large weights (L2).
• Enhances model robustness and generalization capabilities.
MQP> Explain briefly about gradient descent algorithm.