Deep Learning - 5 Mark Questions with Detailed Answers
1. Explain the structure and working of a McCulloch-Pitts neuron with an example.
The McCulloch-Pitts neuron is a mathematical model that mimics the working of a biological neuron.
It takes binary inputs, applies fixed weights, and sums them. The output is binary, determined by
comparing the weighted sum with a threshold value. If the sum is greater than or equal to the
threshold, the neuron fires (output = 1), otherwise, it does not (output = 0). This model can simulate
logic gates like AND, OR, and NOT using suitable weights and thresholds. However, it cannot learn
or adapt as its weights are fixed.
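A minimal Python sketch of the model simulating an AND gate (the unit weights and threshold of 2 are one valid choice, not the only one):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fire (1) if the weighted sum >= threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# AND gate: both inputs weighted 1, threshold 2 -- fires only for (1, 1)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), mp_neuron([x1, x2], [1, 1], threshold=2))
```

With weights [1, 1] and threshold 1 the same unit computes OR, illustrating how the fixed parameters, not learning, determine the logic function.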
2. Describe the Least Mean Squares (LMS) algorithm and its significance in neural learning.
The LMS algorithm is an adaptive learning algorithm used to minimize the mean square error
between the desired output and the predicted output of a neuron. It adjusts weights iteratively
according to the rule w(t+1) = w(t) + η · e(t) · x(t), where e(t) = d(t) - y(t) is the error and η is
the learning rate. The algorithm converges
towards the optimal weights that minimize error. LMS is computationally simple and widely used in
adaptive filters and neural networks for continuous learning and noise reduction.
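The update rule can be sketched directly in NumPy; the two-weight target system, learning rate, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])   # unknown system the filter should learn
w = np.zeros(2)                  # initial weight estimate
eta = 0.05                       # learning rate

for _ in range(500):
    x = rng.standard_normal(2)   # input sample x(t)
    d = w_true @ x               # desired output d(t)
    y = w @ x                    # predicted output y(t)
    e = d - y                    # error e(t) = d(t) - y(t)
    w = w + eta * e * x          # LMS update
print(w)                         # converges toward [2, -1]
```

Because each step only needs the current sample, LMS suits online settings where data arrives as a stream.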
3. What is a Perceptron? Explain its architecture, learning rule, and limitations.
A Perceptron is a single-layer neural network that performs binary classification. It consists of input
nodes, weights, a summation unit, and an activation function (usually a step function). It learns by
adjusting weights using the rule w_i ← w_i + η(d - y)x_i, where η is the learning rate. The perceptron can
solve only linearly separable problems like AND and OR but fails for non-linear problems like XOR.
This limitation was later solved by introducing hidden layers in multi-layer perceptrons.
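A small sketch of the learning rule on the linearly separable AND problem (learning rate, epoch count, and the appended bias input are assumptions for illustration):

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Perceptron rule: w_i <- w_i + eta*(d - y)*x_i, bias folded into w."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append constant bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, t in zip(Xb, d):
            y = 1 if w @ x >= 0 else 0          # step activation
            w += eta * (t - y) * x              # update only on mistakes
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d_and = np.array([0, 0, 0, 1])                  # AND: linearly separable
w = train_perceptron(X, d_and)
preds = [1 if w @ np.append(x, 1) >= 0 else 0 for x in X]
print(preds)  # [0, 0, 0, 1]
```

Running the same code with XOR targets [0, 1, 1, 0] never converges, which is exactly the limitation described above.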
4. Explain the concept of a Multi-Layer Perceptron (MLP) and how it overcomes perceptron
limitations.
A Multi-Layer Perceptron consists of an input layer, one or more hidden layers, and an output layer.
Each neuron applies a non-linear activation function such as ReLU or sigmoid. MLPs use
backpropagation to adjust weights, allowing them to learn complex non-linear relationships. They
overcome the perceptron's limitation by introducing non-linearity and multiple layers, enabling
solutions to problems like XOR and image recognition.
5. Derive the gradient descent algorithm and explain its role in optimization.
Gradient Descent is an optimization algorithm used to minimize a loss function by iteratively
updating weights in the opposite direction of the gradient. For a cost function J(w), the update rule is
w ← w - η∇J(w), where η is the learning rate. The gradient ∇J(w) represents the direction of steepest
ascent, so subtracting it moves towards the minimum. It is the foundation for training neural
networks, helping minimize loss functions like mean square error or cross-entropy.
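A minimal worked example on the one-dimensional cost J(w) = (w - 3)^2, whose gradient is 2(w - 3) (the starting point and learning rate are arbitrary choices):

```python
# Minimize J(w) = (w - 3)^2 by stepping against the gradient dJ/dw = 2(w - 3)
w = 0.0
eta = 0.1
for _ in range(100):
    grad = 2 * (w - 3)   # direction of steepest ascent
    w = w - eta * grad   # move opposite the gradient
print(round(w, 4))       # → 3.0, the minimizer
```

Each step multiplies the distance to the minimum by (1 - 2η), so too large a learning rate (here η ≥ 1) would make the iterates diverge instead.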
6. Discuss the forward and backpropagation algorithms with proper mathematical
expressions.
Forward propagation computes the output of a neural network by passing input data through layers
of neurons. Each layer performs a weighted sum followed by an activation function.
Backpropagation is used to update weights by calculating the gradient of the loss function with
respect to each weight using the chain rule. This process propagates the error backward through
the network, adjusting weights to minimize loss.
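Both passes can be shown numerically on a tiny one-hidden-layer network (layer sizes, sigmoid activations, and the squared-error loss are assumptions for illustration); a finite-difference check confirms the chain-rule gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(3)                 # input vector
t = np.array([1.0])                        # target
W1 = rng.standard_normal((4, 3))           # hidden-layer weights
W2 = rng.standard_normal((1, 4))           # output-layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: weighted sum followed by activation, layer by layer
z1 = W1 @ x;  a1 = sigmoid(z1)
z2 = W2 @ a1; y = sigmoid(z2)
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass: chain rule propagates the error toward the input
delta2 = (y - t) * y * (1 - y)              # dL/dz2
grad_W2 = np.outer(delta2, a1)              # dL/dW2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # dL/dz1
grad_W1 = np.outer(delta1, x)               # dL/dW1

# Numerical check on one weight validates the analytic gradient
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((sigmoid(W2 @ sigmoid(W1p @ x)) - t) ** 2)
print(abs((loss_p - loss) / eps - grad_W1[0, 0]) < 1e-4)  # True
```

The finite-difference comparison at the end is a standard sanity check when implementing backpropagation by hand.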
7. Explain vanishing and exploding gradient problems and methods to solve them.
In deep networks, during backpropagation, gradients may become very small (vanishing) or very
large (exploding). Vanishing gradients cause slow or no learning, while exploding gradients make
weights unstable. Solutions include using activation functions like ReLU, gradient clipping,
normalization techniques, and architectures such as LSTM or ResNet that maintain stable gradients
across layers.
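Gradient clipping, one of the solutions named above, is a one-liner in practice; this sketch rescales by the global L2 norm (the threshold value is an arbitrary choice):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale a gradient vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])             # exploding gradient, norm = 50
print(clip_by_norm(g, max_norm=5.0))   # [3. 4.] — rescaled to norm 5
```

Clipping preserves the gradient's direction while bounding its magnitude, which is why it tames exploding (but not vanishing) gradients.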
8. What is regularization in deep learning? Explain its types and purpose.
Regularization techniques prevent overfitting by penalizing model complexity. Common types
include L1 and L2 regularization, dropout, and early stopping. L1 promotes sparsity by adding
absolute weight penalties, while L2 adds squared penalties to reduce large weights. Dropout
randomly disables neurons during training, and early stopping halts training when validation loss
stops improving.
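The L1 and L2 penalties above amount to adding one term to the loss; a small sketch (the weight vector and penalty coefficients are illustrative):

```python
import numpy as np

def regularized_loss(data_loss, w, l1=0.0, l2=0.0):
    """Data loss plus L1 (absolute) and L2 (squared) weight penalties."""
    return data_loss + l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)

w = np.array([2.0, -1.0, 0.0])
print(regularized_loss(0.5, w, l1=0.1))   # 0.5 + 0.1 * 3 = 0.8
print(regularized_loss(0.5, w, l2=0.1))   # 0.5 + 0.1 * 5 = 1.0
```

Note the zero weight contributes nothing to either penalty: L1's gradient pushes small weights exactly to zero (sparsity), while L2 merely shrinks all weights proportionally.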
9. Discuss the working of Dropout and how it helps in reducing overfitting.
Dropout randomly deactivates a fraction of neurons during training, forcing the network to learn
redundant representations and preventing over-reliance on specific neurons. During inference, all
neurons stay active; in the original formulation their outputs are scaled by the keep probability,
while the common inverted-dropout variant instead rescales surviving activations during training.
This improves generalization and reduces overfitting.
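A sketch of the inverted-dropout variant, which rescales during training so inference needs no change (the drop probability and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(a, p_drop, training):
    """Inverted dropout: zero a fraction p_drop of units during training
    and rescale survivors so the expected activation is unchanged."""
    if not training:
        return a                          # all units active at inference
    mask = rng.random(a.shape) >= p_drop  # keep each unit with prob 1-p
    return a * mask / (1.0 - p_drop)

a = np.ones(10)
out = dropout(a, p_drop=0.5, training=True)
print(out)   # a mix of 0.0 (dropped) and 2.0 (kept, rescaled) entries
```

Because survivors are scaled by 1/(1 - p), the expected output equals the input, so the same forward pass works at test time with no correction factor.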
10. Explain the difference between Stochastic, Mini-Batch, and Batch Gradient Descent.
Batch Gradient Descent updates weights after processing the entire dataset, providing stable
convergence but requiring high computation. Stochastic Gradient Descent (SGD) updates weights
after each training sample, introducing noise but faster updates. Mini-Batch Gradient Descent
combines both approaches, updating weights after small batches of data for better convergence and
efficiency.
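The mini-batch variant can be sketched on a small linear-regression problem; dataset size, batch size, learning rate, and epoch count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = X @ np.array([1.5, -0.5])            # noise-free linear targets

w, eta, batch_size = np.zeros(2), 0.05, 10
for epoch in range(100):
    perm = rng.permutation(len(X))       # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # mean-squared-error gradient averaged over the mini-batch
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= eta * grad
print(w)   # ≈ [1.5, -0.5]
```

Setting batch_size = 1 turns this loop into SGD and batch_size = len(X) into batch gradient descent, which is why mini-batch is described as the middle ground.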
11. Write short notes on Adam optimizer. Compare it with SGD.
Adam (Adaptive Moment Estimation) combines momentum and RMSProp. It maintains moving
averages of gradients (m) and squared gradients (v) to adapt learning rates. The update rule is
θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε), where m̂_t and v̂_t are the bias-corrected moment estimates.
Adam converges faster and performs well on sparse gradients, while SGD is
simpler but requires careful tuning of the learning rate.
12. Explain the structure and working of Convolutional Neural Networks (CNNs).
CNNs process image data using convolutional layers that detect spatial hierarchies. Each layer
extracts features like edges and textures using convolution filters. Pooling layers reduce
dimensionality while retaining key information. Finally, fully connected layers perform classification.
CNNs are highly effective for vision tasks like image recognition and segmentation.
13. Discuss different types of pooling operations and their roles in CNNs.
Pooling layers reduce the spatial size of feature maps, decreasing computation and overfitting. Max
pooling selects the highest value in a region, preserving dominant features, while average pooling
computes the mean. Global pooling averages or maximizes over entire feature maps, often used in
final layers of CNNs.
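Max pooling with a 2x2 window and stride 2 can be written in a few lines of NumPy (the input feature map is made up for illustration):

```python
import numpy as np

def max_pool2x2(fmap):
    """2x2 max pooling, stride 2, on a feature map with even dimensions."""
    h, w = fmap.shape
    # Split into non-overlapping 2x2 blocks, then take the max of each
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]])
print(max_pool2x2(fmap))
# [[4 2]
#  [2 8]]
```

Replacing `.max(...)` with `.mean(axis=(1, 3))` gives average pooling over the same blocks.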
14. Explain the concept of a Convolution kernel and its use in feature extraction.
A convolution kernel or filter is a small matrix that slides over the input to compute dot products with
local regions. It extracts features like edges or textures by emphasizing spatial patterns. Different
kernels detect horizontal, vertical, and diagonal features, enabling deeper networks to learn complex
patterns.
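A sliding-window sketch of the operation (valid padding, stride 1, and a hand-picked gradient kernel are assumptions; CNN libraries implement this far more efficiently):

```python
import numpy as np

def convolve2d(img, kernel):
    """Valid-mode 2D cross-correlation, as used inside CNN layers."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

# A [-1, 1] kernel responds only where intensity changes left to right
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 1]], dtype=float)
print(convolve2d(img, kernel))   # nonzero only at the vertical edge
```

Transposing the kernel to a column would detect horizontal edges instead, illustrating how different kernels pick out different spatial patterns.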
15. Discuss the working and applications of Recurrent Neural Networks (RNNs).
RNNs handle sequential data by maintaining hidden states that store past information. The current
output depends on both the current input and the previous hidden state. Applications include text
generation, speech recognition, and time-series forecasting. However, standard RNNs suffer from
vanishing gradients for long sequences.
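The recurrence can be sketched as a single loop; the dimensions, tanh activation, random weights, and sequence length are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Wxh = rng.standard_normal((4, 3)) * 0.1   # input-to-hidden weights
Whh = rng.standard_normal((4, 4)) * 0.1   # hidden-to-hidden (recurrent)
h = np.zeros(4)                           # initial hidden state

sequence = [rng.standard_normal(3) for _ in range(5)]
for x in sequence:
    # h_t = tanh(Wxh x_t + Whh h_{t-1}): depends on input AND history
    h = np.tanh(Wxh @ x + Whh @ h)
print(h)   # final hidden state summarizing the whole sequence
```

Because the same Whh is multiplied in at every step, gradients through long sequences involve repeated products of that matrix, which is the source of the vanishing-gradient issue mentioned above.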
16. Explain the architecture and functioning of LSTM networks in detail.
LSTM (Long Short-Term Memory) networks solve vanishing gradient problems using gates that
control information flow. The forget gate decides what to discard, the input gate updates cell state,
and the output gate determines the next hidden state. This gating mechanism allows LSTMs to
remember long-term dependencies effectively.
17. Compare PCA, FA, and ICA in terms of goals, assumptions, and mathematical modeling.
PCA (Principal Component Analysis) finds orthogonal directions that maximize variance. FA (Factor
Analysis) models observed variables as linear combinations of latent factors plus noise. ICA
(Independent Component Analysis) finds statistically independent components using higher-order
statistics. PCA uses covariance, FA assumes Gaussian factors, and ICA assumes non-Gaussian
independent sources.
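PCA's variance-maximizing directions fall out of the covariance eigendecomposition directly; a sketch on synthetic correlated data (the data-generating model is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: nearly all variance lies along the direction (1, 2)
z = rng.standard_normal(500)
X = np.column_stack([z, 2 * z + 0.1 * rng.standard_normal(500)])

Xc = X - X.mean(axis=0)                  # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                     # direction of maximum variance
print(eigvals[-1] / eigvals.sum())       # fraction of variance explained
```

Here the first component explains nearly all the variance and aligns with (1, 2)/√5; FA and ICA would start from the same data but fit latent-factor and independence models rather than diagonalizing the covariance.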
18. Explain the structure and working of an Autoencoder and its different types.
An Autoencoder is a neural network that learns to reconstruct its input through an encoder and
decoder. The encoder compresses the input into a lower-dimensional latent representation, and the
decoder reconstructs it. Types include Sparse Autoencoder (enforces sparsity), Denoising
Autoencoder (reconstructs clean data from noise), and Contractive Autoencoder (penalizes
sensitivity to small input changes).
19. What is a Variational Autoencoder (VAE)? Derive and explain its objective function.
A VAE is a generative model that learns a latent probability distribution. The encoder outputs mean
and variance parameters of latent variable z. The objective function maximizes the Evidence Lower
Bound (ELBO): L = E_q[log p(x|z)] - KL(q(z|x)||p(z)). The first term ensures reconstruction accuracy,
and the KL term regularizes latent space to follow a normal distribution.
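For a diagonal-Gaussian encoder and standard normal prior, the KL term has the well-known closed form -½ Σ(1 + log σ² - μ² - σ²); a small sketch (the example μ and log σ² values are arbitrary):

```python
import numpy as np

def kl_gaussian(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) used in the VAE ELBO."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

print(kl_gaussian(np.zeros(2), np.zeros(2)))          # zero: q matches prior
print(kl_gaussian(np.array([1.0, 0.0]), np.zeros(2))) # 0.5: shifted mean penalized
```

In a VAE loss this term is added to the (negative) reconstruction likelihood, pulling every posterior toward N(0, I) so the latent space stays smooth enough to sample from.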
20. Explain the concept and functioning of Generative Stochastic Networks (GSNs).
GSNs generalize denoising autoencoders by learning a stochastic transition between noisy and
clean samples. The model defines a Markov chain that alternates between corruption and
reconstruction steps. Over iterations, this transition distribution converges to the true data
distribution, allowing sample generation.
21. What is Transfer Learning? Explain its types and benefits with suitable examples.
Transfer Learning transfers knowledge from a source domain/task to a related target domain/task. It
is useful when target data is scarce. Types include Inductive TL (different tasks), Transductive TL or
Domain Adaptation (same task, different domains), and Unsupervised TL (no labels). Example:
using ImageNet pretrained CNNs for medical image classification.
22. Describe Domain Adaptation and explain methods like TCA and KMM.
Domain Adaptation addresses differences in source and target distributions. Transfer Component
Analysis (TCA) learns a common subspace minimizing Maximum Mean Discrepancy (MMD)
between domains. Kernel Mean Matching (KMM) reweights source samples to align means in
feature space. Both reduce domain shift for better generalization.
23. What are Eigen Domain Transformation (EDT) and Domain Invariant Features (DIF)
methods?
EDT aligns source and target eigenspaces via PCA-like subspace transformation to minimize
domain distance. DIF minimizes domain discrepancy by finding a projection that reduces MMD while
preserving local geometry and class separation. These methods achieve unsupervised domain
adaptation effectively.
24. Explain the basic architecture and objective function of a Generative Adversarial Network
(GAN).
A GAN consists of two networks: a Generator (G) that creates fake samples from noise, and a
Discriminator (D) that distinguishes real from fake samples. They compete in a minimax game:
min_G max_D V(D,G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]. The generator improves until it
produces realistic data indistinguishable from real samples.
25. Discuss the training process and challenges faced in GANs such as mode collapse.
Training alternates between optimizing D and G. D learns to classify real vs fake, and G learns to
fool D. Problems include mode collapse (G generates limited variations), non-convergence, and
vanishing gradients when D becomes too strong. Solutions include mini-batch discrimination, label
conditioning, and loss variants like WGAN.
26. Explain DCGAN architecture and its improvements over vanilla GANs.
DCGAN (Deep Convolutional GAN) replaces fully connected layers with convolutional and
transposed convolutional layers. It uses batch normalization, ReLU in generator, and Leaky ReLU in
discriminator. DCGANs produce stable training and high-quality images.
27. Explain the working principle and purpose of InfoGAN and WGAN.
InfoGAN introduces a latent code c in the generator and maximizes mutual information I(c; G(z,c))
for interpretable representations. WGAN replaces Jensen-Shannon divergence with Wasserstein
distance, improving stability and solving mode collapse by enforcing Lipschitz continuity via weight
clipping or gradient penalty.
28. Discuss the structure and application of Coupled GAN (CoGAN).
CoGAN trains two GANs for related domains (e.g., photos and sketches) with shared layers in
generators and discriminators. Shared weights enforce feature alignment without paired samples,
learning a joint distribution across domains. It is useful for cross-domain translation and image
synthesis.
29. Explain LAPGAN architecture and how it improves image quality.
LAPGAN generates high-resolution images progressively using a Laplacian pyramid. Each GAN
level generates residual details at a different scale, starting from a coarse low-resolution base.
Combining residuals at each level results in sharper and more realistic images.
30. Explain Adversarially Learned Inference (ALI) and its significance in learning latent
representations.
ALI, also called BiGAN, extends GANs by adding an encoder network E(x) that maps data to latent
space. The discriminator distinguishes between real pairs (x, E(x)) and fake pairs (G(z), z). At
convergence, joint distributions p(x,z) and q(x,z) match. ALI learns both generation and inference
simultaneously, enabling better representation learning.