Deep Learning Interview Questions & Answers
1. What is Deep Learning?
Deep learning is a subset of machine learning that uses neural networks with many layers (deep
architectures) to learn complex patterns from large datasets.
2. How is Deep Learning different from Machine Learning?
● Machine Learning: Features are often manually extracted; algorithms work on these
features.
● Deep Learning: Automatically learns features from raw data using neural networks,
especially effective with large datasets.
3. What is a Neural Network?
A neural network is a collection of connected nodes (neurons) organized into layers: an input
layer, one or more hidden layers, and an output layer. Each neuron computes a weighted sum of
its inputs and applies an activation function to the result.
4. What is an Activation Function?
An activation function introduces non-linearity into the network, allowing it to learn complex
relationships. Examples: ReLU, Sigmoid, Tanh, Softmax.
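As a minimal NumPy sketch, three of these activations (softmax is covered separately in Q26):

```python
import numpy as np

# Common activation functions, element-wise over an input array.
def relu(x):
    return np.maximum(0.0, x)           # max(0, x); zero for negatives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                   # squashes values into (-1, 1)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(sigmoid(0.0))  # 0.5
```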
5. Why do we use the ReLU activation function?
● Faster convergence
● Reduces vanishing gradient problem
● Simple to compute
6. What is the Vanishing Gradient Problem?
In deep networks, gradients become very small during backpropagation, slowing learning or
making it impossible for early layers to update.
7. How can you prevent the Vanishing Gradient Problem?
● Use ReLU/Leaky ReLU activations
● Batch normalization
● Skip connections (ResNet)
● Proper weight initialization
8. What is the Exploding Gradient Problem?
Gradients become excessively large, causing unstable training and large weight updates.
9. How do you prevent Exploding Gradients?
● Gradient clipping
● Proper weight initialization
● Lower learning rate
10. What is Backpropagation?
An algorithm to compute gradients of the loss function with respect to weights, propagating
errors backward from the output to the input layers.
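A hand-worked example for a single linear neuron with squared-error loss (all values here are illustrative) shows the chain rule in action:

```python
# Forward and backward pass for y_pred = w*x + b, loss = (y_pred - y)^2.
x, y_true = 2.0, 10.0        # one training example
w, b = 3.0, 1.0              # current parameters

# Forward pass
y_pred = w * x + b               # 7.0
loss = (y_pred - y_true) ** 2    # 9.0

# Backward pass: chain rule from the loss back to each parameter
dloss_dpred = 2 * (y_pred - y_true)   # -6.0
dloss_dw = dloss_dpred * x            # -12.0
dloss_db = dloss_dpred * 1.0          # -6.0

# One gradient-descent update
lr = 0.1
w -= lr * dloss_dw               # 3.0 - 0.1 * (-12.0) = 4.2
b -= lr * dloss_db               # 1.0 - 0.1 * (-6.0) = 1.6
```

Repeating forward pass, backward pass, and update over many examples is exactly what training a deep network does, just with many more parameters and layers.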
11. What is the difference between Batch Gradient Descent, Stochastic
Gradient Descent, and Mini-Batch Gradient Descent?
● Batch: Uses the whole dataset for each update.
● SGD: Uses one sample at a time.
● Mini-Batch: Uses a subset of data for each update (most common).
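A toy sketch of the mini-batch scheme, fitting a single weight to y = 2x (batch GD would pass the whole dataset to the gradient; SGD a single sample):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X                      # ground truth: w = 2

def grad(w, xb, yb):
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return np.mean(2 * (w * xb - yb) * xb)

w, lr, batch_size = 0.0, 0.1, 16
for epoch in range(50):
    idx = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]  # one mini-batch of indices
        w -= lr * grad(w, X[b], y[b])
print(round(w, 3))   # converges to ~2.0
```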
12. What is the role of the Learning Rate?
Controls how much weights are updated during training. Too high → unstable, too low → slow
convergence.
13. What is Dropout in Deep Learning?
A regularization technique that randomly drops neurons during training to prevent overfitting.
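A sketch of "inverted" dropout, the common variant: survivors are scaled by 1/(1-p) during training so the expected activation is unchanged, and test-time inference needs no change at all:

```python
import numpy as np

def dropout(a, p=0.5, training=True):
    if not training or p == 0.0:
        return a                          # identity at test time
    mask = np.random.random(a.shape) >= p # keep with probability 1-p
    return a * mask / (1.0 - p)           # rescale surviving activations

a = np.ones((4, 4))
out = dropout(a, p=0.5)
# roughly half the entries are zeroed; the survivors become 2.0
```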
14. What is Batch Normalization?
A technique to normalize activations in each mini-batch, speeding up training and improving
stability.
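A simplified forward pass for one mini-batch (training mode); gamma and beta stand in for the learnable scale and shift:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)                   # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps) # normalize each feature
    return gamma * x_hat + beta             # learnable rescale/shift

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
out = batch_norm(x)
# each column now has (approximately) zero mean and unit variance
```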
15. What are Hyperparameters in Deep Learning?
Parameters not learned during training, e.g., learning rate, batch size, number of layers,
optimizer type.
16. What are Word Embeddings?
Vector representations of words that capture semantic meaning. Examples: Word2Vec, GloVe.
17. Difference between CNN and RNN?
● CNN: Best for spatial data like images.
● RNN: Best for sequential data like text or time series.
18. What is a Convolutional Neural Network (CNN)?
A deep learning architecture that uses convolutional layers to extract spatial features from data.
19. What is Pooling in CNNs?
Reduces the spatial size of feature maps to lower computational cost and control overfitting.
Types: Max Pooling, Average Pooling.
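A plain-loop sketch of 2x2 max pooling with stride 2 (real frameworks vectorize this):

```python
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            # take the maximum of each non-overlapping 2x2 window
            out[i // 2, j // 2] = x[i:i+2, j:j+2].max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 0],
                 [7, 2, 9, 8],
                 [0, 1, 3, 4]], dtype=float)
print(max_pool_2x2(fmap))
# [[6. 4.]
#  [7. 9.]]
```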
20. What is Padding in CNNs?
Adding extra values (typically zeros) around the border of the input so that spatial dimensions
are preserved after convolution and edge information is not lost.
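With stride 1 and odd kernel size k, choosing padding p = (k-1)/2 ("same" padding) keeps the output the same size as the input, since out = n + 2p - k + 1:

```python
import numpy as np

x = np.ones((5, 5))              # 5x5 input
k = 3                            # 3x3 kernel
p = (k - 1) // 2                 # "same" padding -> 1

padded = np.pad(x, p)            # zeros on every side -> 7x7
out_size = padded.shape[0] - k + 1   # n + 2p - k + 1 = 5

print(padded.shape, out_size)    # (7, 7) 5
```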
21. What is Transfer Learning?
Using a pre-trained model and fine-tuning it for a related task to save training time and improve
performance.
22. What is an RNN?
A Recurrent Neural Network processes sequential data by maintaining a hidden state that is
updated at each time step.
23. What is the Vanishing Gradient Problem in RNNs?
RNNs struggle to learn long-term dependencies due to repeated multiplication of small
gradients.
24. How does LSTM solve the Vanishing Gradient Problem?
LSTMs use gates (input, forget, output) to control information flow, enabling long-term memory.
25. Difference between LSTM and GRU?
● LSTM: Has three gates and a separate cell state.
● GRU: Has two gates, no separate cell state, fewer parameters.
26. What is the Softmax function used for?
Converts logits into probability distributions for multi-class classification.
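A numerically stable sketch: subtracting max(z) does not change the result but avoids overflow in exp for large logits:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # stability trick; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()       # normalize so the outputs sum to 1

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.round(3))        # [0.659 0.242 0.099]
# probs sums to 1 (up to floating-point error)
```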
27. What is a Cost Function in Deep Learning?
A function that measures the error between predicted and actual values (e.g., Cross-Entropy
Loss, MSE).
28. What is Cross-Entropy Loss?
A loss function for classification that takes the negative log of the probability the model
assigned to the true class, so confident wrong predictions are penalized heavily.
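For a single example this reduces to -log(p) of the true class, which makes the asymmetry easy to see:

```python
import math

def cross_entropy(p_true_class):
    # loss for one example: -log of the probability given to the true class
    return -math.log(p_true_class)

print(round(cross_entropy(0.9), 3))   # 0.105 (confident and correct: small loss)
print(round(cross_entropy(0.1), 3))   # 2.303 (confident and wrong: large loss)
```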
29. What is Overfitting in Deep Learning?
When a model performs well on training data but poorly on unseen data.
30. How to prevent Overfitting?
● Dropout
● Data augmentation
● Early stopping
● Regularization (L1/L2)
31. What is Early Stopping?
Stopping training when validation loss stops improving to prevent overfitting.
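A minimal patience-based sketch; the validation losses here are made up for illustration:

```python
# Stop when validation loss has not improved for `patience` epochs in a row.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
patience = 3
best, wait, stop_epoch = float("inf"), 0, None

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, wait = loss, 0      # improvement: reset the counter
    else:
        wait += 1                 # no improvement this epoch
        if wait >= patience:
            stop_epoch = epoch    # patience exhausted: stop training
            break

print(stop_epoch, best)   # 5 0.6
```

In practice one would also restore the model weights from the best epoch, not just stop.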
32. What is Gradient Clipping?
Restricting the gradient’s magnitude to prevent exploding gradients.
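A sketch of clipping by global norm: if the gradient's norm exceeds a threshold, the whole vector is rescaled so its norm equals that threshold (direction is preserved):

```python
import numpy as np

def clip_by_norm(g, max_norm):
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)   # rescale, keeping the direction
    return g

g = np.array([3.0, 4.0])            # norm 5.0
clipped = clip_by_norm(g, 1.0)
print(clipped)                      # [0.6 0.8], norm 1.0
```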
33. What is the purpose of Weight Initialization?
Proper initialization prevents vanishing/exploding gradients and speeds up convergence.
34. What is Xavier Initialization?
Initializes weights based on the number of input and output neurons to maintain variance across
layers.
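A sketch of the uniform variant (Glorot uniform): weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.shape)    # (256, 128)
# variance of U(-l, l) is l**2 / 3 = 2 / (fan_in + fan_out),
# which is what keeps activation variance roughly constant across layers
```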
35. What is Adam Optimizer?
An optimization algorithm combining momentum and adaptive learning rates (RMSProp + SGD
with momentum).
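One Adam update step for a parameter vector, following the standard formulation (first and second moment estimates with bias correction):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # momentum-like first moment
    v = b2 * v + (1 - b2) * g**2     # RMSProp-like second moment
    m_hat = m / (1 - b1**t)          # bias correction (t = step count)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
g = np.array([0.5, -0.5])            # gradient at w
w, m, v = adam_step(w, g, m, v, t=1)
# on the very first step the update is ~lr in the direction opposite
# each gradient's sign, regardless of the gradient's magnitude
```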
36. Difference between Adam, SGD, and RMSProp?
● SGD: Simple, slower convergence.
● RMSProp: Adapts learning rate for each parameter.
● Adam: Combines RMSProp and momentum.
37. What is a Residual Network (ResNet)?
A network with skip connections to avoid vanishing gradients in very deep architectures.
38. What is Attention Mechanism in Deep Learning?
A technique that allows the model to focus on relevant parts of the input sequence.
39. What is a Transformer model?
An architecture that relies on self-attention instead of recurrence or convolution, originally
developed for NLP.
40. What is BERT?
Bidirectional Encoder Representations from Transformers, a pre-trained NLP model that can be
fine-tuned for a wide range of tasks.
41. What is an Autoencoder?
A neural network used for unsupervised learning that compresses input into a
lower-dimensional representation and reconstructs it.
42. What is a Generative Adversarial Network (GAN)?
A model with two competing networks, a generator and a discriminator, trained adversarially so
the generator learns to produce realistic data.
43. What is the role of the Discriminator in GANs?
Classifies whether an input sample is real (from the training data) or fake (produced by the
generator).
44. What is the role of the Generator in GANs?
Generates fake data similar to real data.
45. What is the difference between Supervised, Unsupervised, and
Reinforcement Learning?
● Supervised: Labeled data
● Unsupervised: No labels
● Reinforcement: Learn by interacting with environment
46. What is Reinforcement Learning’s Reward Function?
A function that assigns feedback to the agent for each action taken.
47. What is One-Hot Encoding?
A method of representing categorical variables as binary vectors.
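A short NumPy sketch: each category index becomes a binary vector with a single 1 at that index:

```python
import numpy as np

def one_hot(labels, num_classes):
    out = np.zeros((len(labels), num_classes), dtype=int)
    out[np.arange(len(labels)), labels] = 1   # set one position per row
    return out

print(one_hot([0, 2, 1], num_classes=3))
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]]
```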
48. What is Data Augmentation in Deep Learning?
Creating new training examples by modifying existing data (e.g., rotations, flips, noise).
49. Why use GPU for Deep Learning?
GPUs handle parallel computations efficiently, speeding up training.
50. What is Model Deployment in Deep Learning?
Making the trained model available for real-world use via APIs, web apps, or embedded
systems.