PART A — (10 × 2 = 20 marks)
1. Differentiate supervised and unsupervised learning.
• Data: Supervised learning uses labeled data (input-output pairs); unsupervised learning uses unlabeled data (only inputs).
• Goal: Supervised learning learns a mapping from inputs to outputs for prediction or classification; unsupervised learning finds hidden patterns, structures, or groupings in the data.
• Examples: Supervised: Regression, Classification. Unsupervised: Clustering, Dimensionality Reduction.
2. What is stochastic gradient descent?
Stochastic Gradient Descent (SGD) is an iterative optimization algorithm used to
minimize the loss function of a model. Unlike standard Gradient Descent, which uses
the entire dataset to compute the gradient, SGD calculates the gradient and updates the
model's parameters using a single, randomly selected training example (or a small mini-
batch) at a time. This makes it much faster and allows for online learning, though the
path to the minimum is noisier.
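A minimal NumPy sketch of one SGD loop for linear regression (the function names, learning rate, and toy data are illustrative, not part of the question):

```python
import numpy as np

def sgd_step(w, b, x_i, y_i, lr=0.01):
    """One SGD update for linear regression on a SINGLE example.

    Uses the squared-error loss L = (w.x_i + b - y_i)**2, whose
    gradients are 2*err*x_i (for w) and 2*err (for b).
    """
    err = np.dot(w, x_i) + b - y_i
    w = w - lr * 2 * err * x_i   # update from one sample, not the full dataset
    b = b - lr * 2 * err
    return w, b

# Toy data: y = 3x, visited in random order (the "stochastic" part)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0]
w, b = np.zeros(1), 0.0
for epoch in range(50):
    for i in rng.permutation(len(X)):
        w, b = sgd_step(w, b, X[i], y[i])
print(round(w[0], 2))  # close to 3.0
```

Each update uses a single example, so the loss fluctuates from step to step, but on average the parameters drift toward the minimum.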
3. What are sparse interactions in a convolutional neural network?
Sparse interactions (or sparse connectivity) refer to the concept in CNNs where a
neuron in a layer is connected to only a small, local region of the previous layer, rather
than every neuron. This is achieved using a small kernel/filter. It drastically reduces the
number of parameters and computations, makes the model more efficient, and helps in
detecting local features like edges and corners.
4. Present an outline of pooling layer in a convolutional neural network.
A pooling layer is used in CNNs to progressively reduce the spatial size (height, width)
of the representation, which reduces the computational load, memory usage, and
number of parameters. It also helps in making the detection of features somewhat
invariant to scale and orientation. The most common type is Max Pooling, which
outputs the maximum value from a region of the feature map covered by the filter.
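Max pooling can be sketched in a few lines of NumPy; the 4x4 feature map below is an illustrative example:

```python
import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    """2x2 max pooling with stride 2 over a single-channel feature map."""
    h, w = fmap.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the maximum value in each size x size window
            out[i, j] = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 6, 7, 1],
                 [2, 2, 3, 4]], dtype=float)
pooled = max_pool2d(fmap)
print(pooled)  # [[4. 5.]
               #  [6. 7.]]
```

Note how the spatial size halves (4x4 to 2x2) while the strongest response in each region is kept.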
5. Define a recurrent neural network.
A Recurrent Neural Network (RNN) is a class of artificial neural networks designed
for sequential data. Unlike feedforward networks, RNNs have "memory" through
internal loops, allowing information to persist. The output from a previous step is fed
back as input to the current step, making them suitable for tasks like time series
prediction, speech recognition, and natural language processing.
6. What is LSTM? How does it differ from an RNN?
LSTM (Long Short-Term Memory) is a special kind of RNN architecture designed to
solve the vanishing gradient problem of standard RNNs.
• RNN: Has a simple repeating module (usually a single tanh layer). It struggles to
learn long-term dependencies due to vanishing gradients.
• LSTM: Has a complex repeating module with three gates (Input, Forget,
Output) and a cell state. This gating mechanism allows it to selectively remember
or forget information over long periods, making it highly effective for long-range
dependencies.
7. What is a baseline model in deep learning?
A baseline model is a simple model, often a non-deep-learning method or a very small
neural network, used as a reference point for evaluating the performance of more complex
models. It provides a minimum performance threshold that any new, sophisticated
model must significantly exceed to be considered useful. Examples include a logistic
regression model for classification or a simple feedforward network with one hidden
layer.
8. Define random search.
Random Search is a hyperparameter tuning technique where a fixed number of
hyperparameter settings are sampled from a defined search space randomly. The model
is trained and evaluated for each of these random combinations. It is often more
efficient than grid search because it has a better chance of finding good
hyperparameters by exploring the search space more broadly.
9. What is a regularized autoencoder?
A regularized autoencoder is an autoencoder that does not rely on an undercomplete
(bottleneck) hidden layer for dimensionality reduction. Instead, it uses a loss function with a
regularization term to prevent the network from simply learning the identity function,
even if the hidden layer is the same size or larger than the input. This forces the model
to learn useful properties of the data. Types include Sparse, Denoising, and Contractive
Autoencoders.
10. Define a stochastic encoder.
A stochastic encoder is an encoder, typically used in Variational Autoencoders (VAEs),
where the encoding process is probabilistic. Instead of encoding an input into a fixed
point in the latent space, it encodes it into a probability distribution (e.g., a Gaussian).
The latent vector is then sampled from this distribution. This introduces continuity and
completeness in the latent space, enabling generative capabilities.
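The sampling step can be sketched with NumPy using the reparameterization trick; the mu and log_var values below are hypothetical encoder outputs, not taken from the question:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The encoder outputs a distribution (mu, log_var) rather than a fixed
    point; sampling via eps keeps the path differentiable w.r.t. mu and sigma.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Hypothetical encoder outputs for one input
mu = np.array([0.5, -1.0])
log_var = np.array([-2.0, -2.0])   # small variance around each mean
z = sample_latent(mu, log_var)     # a different z on every call
print(z.shape)  # (2,)
```

Because z is drawn from a distribution centered on mu, nearby inputs map to overlapping regions of the latent space, which is what gives the VAE its generative behavior.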
PART B — (5 × 13 = 65 marks)
11. (a) (i) Discuss the Bias - Variance trade off.
The bias-variance trade-off is a fundamental concept in machine learning that describes the
tension between a model's simplicity and its ability to fit the training data.
• Bias: Error due to overly simplistic assumptions in the learning algorithm. A high-
bias model (e.g., linear regression for a complex problem) fails to capture the
underlying trends of the data, leading to underfitting.
• Variance: Error due to excessive sensitivity to small fluctuations in the training set. A
high-variance model (e.g., a very deep decision tree) learns the noise in the training
data as if it were a true pattern, leading to overfitting.
The Trade-off:
• As model complexity increases, bias decreases (it fits the training data better) but
variance increases (it becomes more sensitive to the specific training set).
• The goal is to find the optimal model complexity that minimizes the total error (the
sum of bias error, variance error, and irreducible error).
(ii) Discuss overfitting and underfitting with an example.
• Overfitting: Occurs when a model learns the training data too well, including its
noise and outliers. It performs excellently on training data but poorly on unseen test
data.
o Example: A student who memorizes a textbook word-for-word but fails to
answer application-based questions in the exam.
o In ML: A deep neural network achieving 99% accuracy on the training set but
only 60% on the test set.
• Underfitting: Occurs when a model is too simple to capture the underlying structure
of the data. It performs poorly on both training and test data.
o Example: A student who has only read the chapter titles and fails to answer
both direct and application-based questions.
o In ML: Using a linear model to fit a complex, non-linear dataset, resulting in
high error on both sets.
(b) Explain the operations of a deep feedforward network with a diagram.
A Deep Feedforward Network (DFN) or Multi-Layer Perceptron (MLP) is the quintessential
deep learning model. Information flows from input to output without any feedback loops.
Operations:
1. Input Layer: Receives the feature vector.
2. Hidden Layers: Multiple layers between input and output. Each layer consists of
neurons.
3. Forward Propagation: For each neuron in a hidden/output layer:
a. Weighted Sum: z = (w1*x1 + w2*x2 + ... + wn*xn) + b where w are
weights, x are inputs, and b is bias.
b. Activation Function: An activation function f (e.g., ReLU, Sigmoid) is applied
to z to introduce non-linearity: a = f(z).
c. This output a becomes the input for the next layer.
4. Output Layer: The final layer produces the network's prediction.
5. Loss Calculation: The difference between the prediction and the actual target is
calculated using a loss function (e.g., Mean Squared Error, Cross-Entropy).
6. Backpropagation & Optimization: The loss is propagated backward through the
network using the chain rule to calculate the gradient of the loss with respect to each
weight. An optimizer (e.g., SGD, Adam) then updates the weights to minimize the
loss.
[Diagram Description: A diagram would show an input layer on the left with 3 nodes, two
hidden layers in the middle with 4 nodes each, and an output layer on the right with 1 node.
All nodes in one layer would be fully connected to all nodes in the next layer, representing a
fully connected network. Arrows would indicate the forward flow of data.]
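The forward-propagation steps above can be sketched in NumPy; the layer sizes follow the diagram (3 inputs, two hidden layers of 4, one output), and the random weights are purely illustrative:

```python
import numpy as np

def relu(z):
    """Activation function introducing non-linearity: a = max(0, z)."""
    return np.maximum(0, z)

def forward(x, layers):
    """Forward pass through a fully connected network.

    layers is a list of (W, b) pairs; each hidden layer computes
    a = f(Wx + b), and the final layer is left linear.
    """
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)          # weighted sum, then activation
    W, b = layers[-1]
    return W @ a + b                 # output layer prediction

# 3 inputs -> 4 hidden -> 4 hidden -> 1 output, as in the diagram
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((4, 4)), np.zeros(4)),
          (rng.standard_normal((1, 4)), np.zeros(1))]
y_pred = forward(np.array([1.0, 0.5, -0.2]), layers)
print(y_pred.shape)  # (1,)
```

Loss calculation and backpropagation (steps 5 and 6) would then compare y_pred with the target and push gradients back through these same matrices.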
12. (a) What is a convolutional neural network? Outline transposed and dilated
convolutions with an example.
A Convolutional Neural Network (CNN) is a specialized neural network for processing grid-
like data such as images. Its core components are convolutional layers, pooling layers, and
fully connected layers. It uses shared weights and sparse connectivity to efficiently capture
spatial hierarchies of features.
• Transposed Convolution (Deconvolution): It is essentially a reverse convolution
used for upsampling a feature map to a higher resolution. It applies a filter but
increases the spatial dimensions.
o Example: In a segmentation task, the initial layers downsample the image.
Transposed convolutions in the decoder part of the network are used to
upsample the feature maps back to the original image size to predict a label for
each pixel.
• Dilated Convolution (Atrous Convolution): A convolution in which the kernel is
applied over an area larger than its own size by inserting gaps (zeros) between the
kernel elements. The dilation rate defines the spacing.
o Example: In image segmentation (e.g., DeepLab), dilated convolutions are
used to exponentially increase the receptive field without increasing the
number of parameters or losing resolution (avoiding pooling), thus capturing
more context from the image.
(b) How to introduce non-linearity in a convolutional neural network? Explain with an
example.
Non-linearity is introduced in a CNN through activation functions applied element-wise to
the output of a convolutional layer. Without it, the entire network would be a linear
transformation, no matter how many layers, severely limiting its ability to learn complex
patterns.
• Mechanism: After the convolution operation produces a feature map, each value in
that map is passed through a non-linear activation function like ReLU (Rectified
Linear Unit): f(x) = max(0, x).
• Example: Consider a CNN for cat vs. dog classification. The first layer might detect
simple edges. If we only had linear activations, subsequent layers would only be able
to combine these edges in a linear way (e.g., weighted sums). With ReLU, the
network can learn to "ignore" negative activations (like weak edges) and "activate"
only strong, positive features. This allows it to build up a hierarchy: edges -> textures
-> patterns -> parts of a cat's face -> the concept of a "cat". This complex, non-linear
mapping is only possible due to the activation functions.
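Both points can be shown with a short NumPy sketch; the 2x2 feature map values are illustrative:

```python
import numpy as np

def relu(x):
    """ReLU applied element-wise: negative responses are zeroed out."""
    return np.maximum(0, x)

# A hypothetical feature map: positive values = strong edge responses,
# negative values = weak/opposite responses the network can "ignore".
feature_map = np.array([[-1.2,  0.8],
                        [ 2.5, -0.3]])
print(relu(feature_map))   # -1.2 and -0.3 become 0; 0.8 and 2.5 pass through

# Without non-linearity, stacking two linear layers is still linear:
W1, W2 = np.array([[2.0]]), np.array([[3.0]])
x = np.array([[5.0]])
# W2 @ (W1 @ x) == (W2 @ W1) @ x, so one layer would suffice
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)
```

The final assertion is the key point: any stack of purely linear layers collapses into a single matrix, so depth only helps once a non-linearity sits between the layers.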
13. (a) What is a bi-directional recurrent neural network? Explain the architecture with
a diagram.
A Bidirectional Recurrent Neural Network (Bi-RNN) is an RNN architecture that processes
sequential data in both forward and backward directions. This allows the network to have
information about both past and future context for any point in the sequence, which is not
possible in a standard unidirectional RNN.
Architecture:
1. It consists of two separate hidden layers:
o A forward hidden layer that processes the sequence from start to end (t=1 to
t=T).
o A backward hidden layer that processes the sequence from end to start (t=T
to t=1).
2. At each time step t, the output is computed based on the concatenation or combination
of the hidden states from both the forward and backward layers (h_t = [h_forward_t,
h_backward_t]).
3. This combined context is then passed to the output layer.
[Diagram Description: A diagram would show a sequence of inputs (x1, x2, x3). Each input
is connected to two hidden layers: one processing left-to-right (forward RNN) and another
processing right-to-left (backward RNN). The hidden states from both directions at each time
step are combined (e.g., concatenated) and fed to an output layer (y1, y2, y3).]
(b) What is long short term memory? Compare and contrast LSTM and gated
recurrent units.
LSTM (Long Short-Term Memory) is a type of RNN with a gating mechanism to control
the flow of information. It solves the vanishing gradient problem and can learn long-term
dependencies.
LSTM vs. GRU (Gated Recurrent Unit):
• Gates: LSTM has three gates (Input, Forget, Output); GRU has two (Update, Reset).
• Internal State: LSTM keeps two state vectors, the Cell State (c_t) and the Hidden State (h_t); GRU keeps one, the Hidden State (h_t).
• Functionality: In an LSTM, the cell state acts as a "conveyor belt" for long-term memory, carefully regulated by the gates; in a GRU, the hidden state captures both long- and short-term dependencies without a separate cell state.
• Complexity: LSTM is more complex and has more parameters; GRU is simpler with fewer parameters.
• Training Speed: LSTM can be slower to train; GRU is often faster due to its simplicity.
• Performance: LSTM can model very long sequences effectively; GRU often performs comparably on many tasks with less data.
Conclusion: GRU is a simpler, more efficient alternative to LSTM and often performs just as
well, especially on smaller datasets. LSTM might still be preferred for tasks requiring
modeling of very long-term dependencies.
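The parameter-count difference can be made concrete with a quick calculation; the layer sizes 128 and 256 below are arbitrary choices, and bias handling varies slightly between frameworks:

```python
def lstm_params(input_size, hidden_size):
    """One LSTM layer: 4 weight blocks (input, forget, output gates +
    cell candidate), each with input weights, recurrent weights, a bias."""
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

def gru_params(input_size, hidden_size):
    """One GRU layer: only 3 weight blocks (update, reset, candidate)."""
    return 3 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

n_lstm = lstm_params(128, 256)
n_gru = gru_params(128, 256)
print(n_lstm, n_gru)  # the GRU layer needs 25% fewer parameters
```

The 4-vs-3 ratio holds for any layer size, which is why GRUs are consistently cheaper to train.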
14. (a) Discuss the various performance metrics to evaluate a deep learning model with
an example.
Performance metrics quantify a model's effectiveness. The choice depends on the task.
• For Classification:
o Accuracy: (TP+TN)/(TP+TN+FP+FN). Proportion of correct predictions.
Good for balanced classes.
▪ Example: A cat/dog classifier with 95% accuracy means it's correct
95% of the time.
o Precision: TP/(TP+FP). How many of the predicted positives are actual
positives.
▪ Example: A spam detector with high precision means when it says
"spam," it's very likely correct (low false positives).
o Recall (Sensitivity): TP/(TP+FN). How many of the actual positives were
correctly predicted.
▪ Example: A cancer detection model with high recall misses very few
actual cancer cases (low false negatives).
o F1-Score: Harmonic mean of precision and recall. Balances the two.
o Confusion Matrix: A table showing correct and incorrect predictions for each
class.
• For Regression:
o Mean Absolute Error (MAE): Average of absolute differences between
predicted and actual values. Robust to outliers.
o Mean Squared Error (MSE): Average of squared differences. Punishes
larger errors more heavily.
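The classification metrics above can be computed from scratch; the toy labels below are illustrative (1 = positive class):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy spam-detector predictions (1 = spam): TP=3, TN=3, FP=1, FN=1
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(acc, prec, rec, f1)  # 0.75 0.75 0.75 0.75
```

The guard clauses matter in practice: with no predicted positives, precision would otherwise divide by zero.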
(b) What are hyperparameters? Discuss the steps to perform hyperparameter tuning.
Hyperparameters are configuration parameters external to the model whose values cannot
be estimated from the data. They are set before the training process begins. Examples:
Learning rate, number of layers, number of neurons per layer, batch size, dropout rate.
Steps for Hyperparameter Tuning:
1. Define a Search Space: Identify the hyperparameters to tune and their possible value
ranges (e.g., learning rate: [0.1, 0.01, 0.001]).
2. Choose a Tuning Method:
o Manual Search: Manually tweaking based on intuition and experience.
o Grid Search: Exhaustively searching over a specified set of values. It's
thorough but computationally expensive.
o Random Search: Randomly sampling combinations from the search space.
Often more efficient than grid search.
o Bayesian Optimization: A probabilistic model that uses past evaluation
results to choose the next hyperparameters to evaluate. It's very efficient for
expensive models.
3. Select a Performance Metric: Choose a metric to evaluate the models (e.g.,
validation accuracy, F1-score).
4. Execute the Search: Run the training and evaluation process for each
hyperparameter combination using cross-validation.
5. Select the Best Model: Identify the hyperparameter set that yielded the best
performance on the validation metric.
6. Evaluate on Test Set: Finally, assess the performance of the best-tuned model on the
held-out test set to get an unbiased estimate of its generalization error.
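The tuning steps can be sketched with random search; the search space and the stand-in evaluate_model scorer below are hypothetical, and a real run would train and cross-validate the network instead:

```python
import random

random.seed(0)

# Step 1: define the search space (values here are illustrative)
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [16, 32, 64],
    "dropout": [0.0, 0.25, 0.5],
}

def evaluate_model(config):
    """Stand-in for 'train + cross-validate'; returns a fake score.
    In practice this would train the model and return, e.g.,
    mean validation accuracy."""
    return random.random()

# Steps 2-5: sample configurations, evaluate each, keep the best
best_score, best_config = -1.0, None
for _ in range(10):                              # fixed budget of 10 trials
    config = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate_model(config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config)  # retrain with these settings, then test (step 6)
```

Swapping the random sampling for an exhaustive loop over all combinations turns this into grid search.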
15. (a) Justify how autoencoders are more suitable than Principal Component Analysis
(PCA) for dimensionality reduction.
While both Autoencoders and PCA are used for dimensionality reduction, autoencoders are
generally more powerful and suitable for complex data due to their non-linear nature.
• PCA: Is a linear technique. It performs a linear transformation of the data to find the
orthogonal directions (principal components) of maximum variance. It is simple, fast,
and deterministic.
• Autoencoder: Is a non-linear technique. It uses a neural network with an encoder (to
compress) and a decoder (to reconstruct). The bottleneck layer acts as the low-
dimensional representation.
Justification for Autoencoders:
1. Non-Linearity: Autoencoders can learn complex, non-linear manifolds in the data,
whereas PCA is limited to linear subspaces. For real-world data like images (which lie
on non-linear manifolds), autoencoders can capture the structure much more
effectively.
2. Representation Power: The hidden layers in an autoencoder can learn hierarchical
features, leading to a more powerful and meaningful latent space representation.
3. Flexibility: The architecture is highly flexible. We can use convolutional layers for
images, different types of regularizations (sparse, denoising), and different loss
functions tailored to the data.
Conclusion: For simple, linear data, PCA is sufficient and efficient. However, for complex,
high-dimensional data like images, audio, and text, autoencoders are far more suitable as they
can learn a more efficient and powerful non-linear reduced representation.
(b) What is a generative adversarial network? Explain the architecture with a diagram.
A Generative Adversarial Network (GAN) is a class of deep learning frameworks designed
for generative modeling. It consists of two neural networks, a Generator and
a Discriminator, that are trained simultaneously in a competitive game.
Architecture:
1. Generator (G): Takes random noise as input and tries to generate fake data that is
indistinguishable from real data. Its goal is to fool the Discriminator.
2. Discriminator (D): Takes both real data and fake data from the generator as input. It
tries to correctly classify whether the input is real or fake. Its goal is to become a
perfect classifier.
The Training Process (The Adversarial Game):
• The generator and discriminator are pitted against each other.
• The generator tries to produce more realistic data to fool the discriminator.
• The discriminator gets better at distinguishing real from fake.
• This competition drives both networks to improve until the generator produces highly
realistic data.
[Diagram Description: A diagram would show a "Noise Vector" input going into the
"Generator" network, which produces "Fake Data." Both "Real Data" from the dataset and
"Fake Data" from the generator are fed into the "Discriminator" network. The Discriminator
outputs a probability ("Real" or "Fake"). Arrows would indicate the flow of data and the
adversarial feedback loop, where the discriminator's output is used to update both networks.]
PART C — (1 × 15 = 15 marks)
16. (a) Discuss the various loss functions in neural networks.
A loss function (or cost function) measures the discrepancy between the model's prediction
and the actual target value. It is the objective that the model aims to minimize during training.
1. For Regression Tasks:
• Mean Squared Error (MSE): MSE = (1/n) * Σ(y_true - y_pred)²
o Pros: Easily differentiable, convex.
o Cons: Sensitive to outliers (due to squaring).
• Mean Absolute Error (MAE): MAE = (1/n) * Σ|y_true - y_pred|
o Pros: Robust to outliers.
o Cons: Gradients are not smooth around zero, which can slow down
convergence.
2. For Classification Tasks:
• Binary Cross-Entropy: Used for binary classification (2 classes).
o L = -[y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred)]
o Heavily penalizes confident but wrong predictions.
• Categorical Cross-Entropy: Used for multi-class classification (>2 classes) with
one-hot encoded labels.
o L = - Σ y_true_i * log(y_pred_i)
o Measures the difference between the predicted probability distribution and the
true distribution.
• Sparse Categorical Cross-Entropy: Same as above, but used when the labels are
integers (not one-hot encoded), which is more memory efficient.
3. Other Specialized Loss Functions:
• Huber Loss: Combines MSE and MAE. It is less sensitive to outliers than MSE and
is smooth around zero.
• Hinge Loss: Used for "maximum-margin" classification, notably in Support Vector
Machines (SVMs).
The choice of loss function is critical as it directly guides the learning process of the neural
network.
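A minimal sketch of three of these losses in plain Python; the toy targets and predictions are illustrative:

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: squaring punishes large errors heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error: more robust to outliers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred):
    """BCE for binary labels; y_pred must be in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_pred)) / len(y_true)

# One outlier (error 3) dominates MSE but not MAE
print(mse([0, 0, 3], [0, 0, 0]))  # 3.0
print(mae([0, 0, 3], [0, 0, 0]))  # 1.0

# A confident wrong prediction (p=0.9 for a true negative) costs far
# more than the nearly correct one (p=0.9 for a true positive)
print(round(binary_cross_entropy([1, 0], [0.9, 0.9]), 3))  # 1.204
```

Comparing the two regression losses on the same data makes the outlier-sensitivity point concrete: MSE triples the MAE value here purely because of the squaring.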
(b) Discuss the steps involved in grid search with an example.
Grid Search is a traditional method for hyperparameter tuning that performs an exhaustive
search over a manually specified subset of the hyperparameter space.
Steps:
1. Define the Model: Choose the algorithm (e.g., a Support Vector Classifier).
2. Define the Hyperparameter Grid: Specify the hyperparameters and the values you
want to try for each.
o Example Grid for an SVM:
▪ 'kernel': ['linear', 'rbf']
▪ 'C': [0.1, 1, 10, 100]
▪ 'gamma': [0.01, 0.1, 1]
3. Define the Evaluation Metric: Choose a scoring metric to evaluate performance
(e.g., 'accuracy').
4. Perform Cross-Validation: Typically, k-fold cross-validation (e.g., 5-fold) is used
for each hyperparameter combination to get a robust performance estimate and avoid
overfitting to a single train-validation split.
5. Execute the Search: The algorithm will train and evaluate a model for every single
combination of hyperparameters in the grid.
o For the example above, the number of combinations is 2 (kernels) × 4 (C
values) × 3 (gamma values) = 24 unique models.
o With 5-fold cross-validation, it will train 24 × 5 = 120 models in total.
6. Identify the Best Parameters: After all runs are complete, the combination of
hyperparameters that achieved the highest average cross-validation score is selected
as the best.
7. Train the Final Model: Finally, train a new model on the entire training set using
these best-found hyperparameters and evaluate it on the held-out test set.
Limitation: It can be computationally very expensive when the hyperparameter space is
large.
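The combination counting in step 5 can be verified with a short sketch; evaluate below is a stand-in scorer, not a real cross-validation:

```python
from itertools import product

# The example grid from the steps above
grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1],
}

# Build every combination (the Cartesian product of all value lists)
keys = list(grid)
combos = [dict(zip(keys, values)) for values in product(*grid.values())]
print(len(combos))          # 2 * 4 * 3 = 24 unique models
print(len(combos) * 5)      # 120 training runs with 5-fold CV

def evaluate(config):
    """Stand-in scorer; in practice: mean cross-validation score of an
    SVC trained with this config. Here we pretend C=10 is best."""
    return -abs(config["C"] - 10)

best = max(combos, key=evaluate)
print(best["C"])  # 10
```

The exhaustive loop is what makes grid search both thorough and expensive: adding one more value to any list multiplies the total number of runs.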