120 Important Deep Learning Interview Questions + Answers
Notes
-- Amar Sharma
Important Deep Learning Interview Questions with Answers
1. What is deep learning? How is it different from machine learning?
Answer:
Deep learning is a subset of machine learning that uses neural networks
with multiple layers to automatically learn representations from data.
Key differences:
Deep learning requires large datasets and computational power.
It learns features directly from data, whereas traditional machine
learning often requires feature engineering.
Deep learning algorithms are typically based on neural networks with
many hidden layers.
2. What is a neural network, and how does it work?
Answer:
A neural network is a computational model inspired by the human brain,
consisting of layers of interconnected nodes (neurons).
Input layer receives data.
Hidden layers perform computations and learn features.
Output layer provides predictions.
The network learns by adjusting weights using a process called
backpropagation and an optimization algorithm like gradient descent.
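To make the layer structure concrete, here is a minimal NumPy sketch of a single forward pass through one hidden layer. The weights here are random toy values, not a trained model:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One forward pass: input -> hidden (ReLU) -> output."""
    h = np.maximum(0, W1 @ x + b1)   # hidden layer computes learned features
    return W2 @ h + b2               # output layer produces the prediction

# Toy sizes: 3 inputs, 4 hidden neurons, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
y = forward(np.array([1.0, 2.0, 3.0]), W1, b1, W2, b2)
print(y.shape)  # (1,)
```

Training would then adjust W1, b1, W2, b2 via backpropagation, as described below.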
3. What is backpropagation?
Answer:
Backpropagation is an algorithm used to train neural networks by
minimizing the error.
The error from the output layer is propagated backward through the
network.
Gradients are computed for each weight using the chain rule.
Weights are updated using an optimizer (e.g., SGD or Adam) to reduce
the error.
4. What are activation functions, and why are they important?
Answer:
Activation functions introduce non-linearity to neural networks, enabling
them to model complex relationships. Common functions:
ReLU (Rectified Linear Unit): Fast convergence, avoids vanishing
gradient issues.
Sigmoid: Output between 0 and 1, used for binary classification.
Softmax: Outputs probabilities for multiclass classification.
Tanh: Outputs between -1 and 1, centered around zero.
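The four functions above can each be written in one line of NumPy (a sketch for intuition, not a framework implementation):

```python
import numpy as np

def relu(x):    return np.maximum(0, x)        # zero for negatives, identity otherwise
def sigmoid(x): return 1 / (1 + np.exp(-x))    # squashes to (0, 1)
def tanh(x):    return np.tanh(x)              # squashes to (-1, 1), zero-centered
def softmax(x):
    e = np.exp(x - x.max())                    # subtract max for numerical stability
    return e / e.sum()                         # probabilities summing to 1

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))             # [0. 0. 2.]
print(sigmoid(0.0))        # 0.5
print(softmax(x).sum())    # ~1.0
```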
5. What is overfitting, and how can it be prevented?
Answer:
Overfitting occurs when a model performs well on training data but
poorly on unseen data.
Prevention techniques:
Use regularization methods like L1/L2 (Lasso, Ridge).
Apply dropout layers.
Reduce model complexity.
Use more training data or data augmentation.
Perform early stopping during training.
6. What is the difference between batch size, epochs, and iterations?
Answer:
Batch size: Number of samples processed before updating the model's
weights.
Epoch: One complete pass through the entire training dataset.
Iteration: One batch update during training. For example, if you have
1000 samples and a batch size of 100, there will be 10 iterations per
epoch.
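The relationship reduces to one line of arithmetic (ceiling division covers the case where the last batch is smaller):

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    # One iteration = one batch update; an epoch must cover every sample once.
    return math.ceil(num_samples / batch_size)

print(iterations_per_epoch(1000, 100))  # 10, matching the example above
print(iterations_per_epoch(1050, 100))  # 11 (the last batch has only 50 samples)
```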
7. What is the vanishing gradient problem, and how can it be mitigated?
Answer:
The vanishing gradient problem occurs when gradients become very small
in deep networks, slowing or stopping learning.
Mitigation techniques:
Use activation functions like ReLU.
Initialize weights properly (e.g., Xavier or He initialization).
Use batch normalization.
Build networks with skip connections (e.g., ResNet).
8. What is transfer learning?
Answer:
Transfer learning involves using a pre-trained model on a new task.
Instead of training from scratch, the model's pre-trained weights are
fine-tuned for the target task.
This is useful when data is limited and for tasks like image recognition
or natural language processing.
9. Explain the difference between CNNs and RNNs.
Answer:
CNNs (Convolutional Neural Networks): Designed for spatial data like
images. They use convolutional layers to capture spatial hierarchies.
RNNs (Recurrent Neural Networks): Designed for sequential data like
time series or text. They have memory cells to capture temporal
dependencies.
10. What are gradient descent and its variants?
Answer:
Gradient descent is an optimization algorithm used to minimize the loss
function.
Common variants:
Batch Gradient Descent: Uses the entire dataset for each update (slow
for large datasets).
Stochastic Gradient Descent (SGD): Uses one sample per update (faster
but noisy).
Mini-batch Gradient Descent: Uses a subset (batch) of the data for each
update (balances speed and stability).
Adam Optimizer: Combines momentum and adaptive learning rates for
efficient training.
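As a sketch of the mini-batch variant, here is gradient descent on a toy least-squares problem (synthetic data; the learning rate and batch size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_w = np.array([3.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)   # targets with a little noise

w, lr, batch = np.zeros(2), 0.1, 100            # mini-batch balances speed/stability
for epoch in range(50):
    idx = rng.permutation(len(X))               # shuffle each epoch
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the batch
        w -= lr * grad                                   # one update per batch
print(np.round(w, 2))  # ≈ [ 3. -1.]
```

Setting `batch = 1` gives SGD, and `batch = len(X)` gives batch gradient descent.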
11. What is the role of the loss function in neural networks?
Answer:
The loss function measures the difference between the predicted output
and the actual target value. It guides the optimization
process by providing a metric for minimizing the error.
Common loss functions:
Mean Squared Error (MSE): For regression tasks.
Binary Cross-Entropy: For binary classification.
Categorical Cross-Entropy: For multi-class classification.
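The first two losses are a few lines each in NumPy (a minimal sketch; frameworks add reductions and weighting options on top of this):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average squared difference, for regression.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p, eps=1e-12):
    # Clip predicted probabilities so log(0) never occurs.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([1.0, 2.0]), np.array([1.0, 3.0])))               # 0.5
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.105 (= -log 0.9)
```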
12. What are weight initialization techniques, and why are they
important?
Answer:
Weight initialization techniques help ensure faster convergence and avoid
issues like vanishing/exploding gradients.
Random Initialization: Assigns random values to weights.
Xavier Initialization: Keeps the variance of activations constant across
layers.
He Initialization: Optimized for ReLU activations.
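The two schemes differ only in the variance of the sampling distribution. A NumPy sketch (using the 1/fan_in variant of Xavier; some libraries use 2/(fan_in + fan_out) instead):

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Variance 1/fan_in keeps activation variance roughly stable (sigmoid/tanh).
    return rng.normal(0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

def he_init(fan_in, fan_out, rng):
    # Variance 2/fan_in compensates for ReLU zeroing about half the activations.
    return rng.normal(0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)
print(round(W.std(), 2))  # ≈ sqrt(2/512) ≈ 0.06
```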
13. What is the difference between L1 and L2 regularization?
Answer:
L1 Regularization: Adds the absolute value of weights to the loss function
(Lasso). Encourages sparsity, making some weights zero.
L2 Regularization: Adds the squared value of weights to the loss function
(Ridge). Penalizes large weights and prevents overfitting.
14. What are autoencoders, and how are they used?
Answer:
Autoencoders are neural networks used for unsupervised learning,
designed to reconstruct input data. They have an encoder (to compress
data) and a decoder (to reconstruct it).
Applications:
Dimensionality reduction.
Anomaly detection.
Denoising data.
15. What is the role of batch normalization?
Answer:
Batch normalization normalizes the input of each layer to improve
stability and convergence during training.
Benefits:
Reduces internal covariate shift.
Allows for higher learning rates.
Acts as a regularizer, reducing the need for dropout.
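The forward pass of batch normalization is short enough to write out directly (a training-mode sketch; real layers also track running statistics for inference, which is omitted here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch, then scale/shift
    # (gamma and beta are learnable in a real layer).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
out = batch_norm(x)
print(np.round(out.mean(axis=0), 6))  # ~[0. 0.]: each feature is now zero-mean
```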
16. What is a recurrent neural network (RNN), and how does it handle
sequential data?
Answer:
RNNs are designed to process sequences of data by maintaining a hidden
state, capturing
dependencies in data like time series, text, or speech.
Variants like LSTM (Long Short-Term Memory) and GRU (Gated
Recurrent Unit) address issues like vanishing gradients.
17. What is the purpose of dropout in deep learning?
Answer:
Dropout is a regularization technique that randomly sets a fraction of
neurons to zero during training.
Prevents overfitting by introducing noise.
Encourages the network to learn more robust features.
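A sketch of "inverted" dropout, the variant most frameworks use: surviving activations are rescaled during training so that nothing needs to change at inference time:

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    if not training:
        return x                       # full network used at inference
    mask = rng.random(x.shape) >= rate # keep each neuron with prob (1 - rate)
    return x * mask / (1 - rate)       # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones(10_000)
out = dropout(x, rate=0.5, rng=rng)
print(round(out.mean(), 1))  # ~1.0: expected activation is preserved
```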
18. What are GANs (Generative Adversarial Networks)?
Answer:
GANs are neural networks consisting of two components:
Generator: Creates fake data resembling real data.
Discriminator: Distinguishes between real and fake data.
They are trained together, improving the generator's ability to create
realistic data.
Applications:
Image generation.
Style transfer.
Data augmentation.
19. What is the difference between supervised, unsupervised, and
reinforcement learning?
Answer:
Supervised Learning: The model learns from labeled data (e.g.,
classification, regression).
Unsupervised Learning: The model identifies patterns in unlabeled data
(e.g., clustering, dimensionality reduction).
Reinforcement Learning: The model learns by interacting with the
environment and receiving feedback in the form of rewards or penalties.
20. What are attention mechanisms in deep learning?
Answer:
Attention mechanisms allow the model to focus on relevant parts of the
input while making predictions.
Example: In machine translation, the attention mechanism helps the
model focus on specific words in the source sentence while translating.
Applications:
Transformer models like BERT and GPT.
Image captioning.
Text summarization.
21. What are the main components of a convolutional neural network
(CNN)?
Answer:
Convolutional Layers: Extract features by applying filters over the input.
Pooling Layers: Reduce the spatial dimensions of feature maps (e.g., max
pooling).
Fully Connected Layers: Combine high-level features for classification or
regression.
Dropout/Bias Layers: Prevent overfitting and improve generalization.
22. What is the difference between a feedforward neural network and a
recurrent neural network?
Answer:
Feedforward Neural Network (FNN): Processes input data in one
direction, without loops. Ideal for tasks like image recognition.
Recurrent Neural Network (RNN): Processes sequential data with
feedback loops to maintain memory. Used for time-series and language
modeling.
23. What are LSTMs and GRUs? How are they different?
Answer:
LSTMs (Long Short-Term Memory): Use gates (input, forget, output) to
maintain long-term dependencies in sequences.
GRUs (Gated Recurrent Units): A simplified version of LSTMs,
combining forget and input gates into one update gate.
GRUs are computationally faster, while LSTMs handle complex
dependencies better.
24. What is the difference between parameterized and non-
parameterized layers?
Answer:
Parameterized Layers: Contain trainable parameters (e.g., Dense,
Convolutional layers).
Non-parameterized Layers: Do not contain trainable parameters but
modify data (e.g., Activation, Pooling layers).
25. What is the exploding gradient problem, and how is it mitigated?
Answer:
Exploding gradients occur when large gradient values cause instability
during training.
Solutions:
Gradient clipping: Restrict gradients to a maximum value.
Use better initialization methods.
Use architectures like LSTMs/GRUs for sequential data.
26. What is the purpose of the softmax function?
Answer:
Softmax converts raw scores (logits) into probabilities that sum to 1.
Used in the output layer for multi-class classification.
Formula:
softmax(x_i) = exp(x_i) / sum_j exp(x_j)
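In practice the formula is implemented with a max-subtraction trick, since exp() of a large logit overflows. Subtracting the maximum logit leaves the result unchanged (it cancels in the ratio):

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # same probabilities, but no overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

p = softmax(np.array([1000.0, 999.0]))  # naive exp(1000) would overflow
print(np.round(p, 3))                   # [0.731 0.269]
print(p.sum())                          # ~1.0
```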
27. What is the difference between supervised pretraining and self-
supervised learning?
Answer:
Supervised Pretraining: The model is trained on a related labeled dataset,
then fine-tuned on the target dataset.
Self-Supervised Learning: The model generates pseudo-labels from data
(e.g., predicting masked tokens in BERT) and learns representations.
28. What is the Transformer architecture, and how does it work?
Answer:
The Transformer is a deep learning architecture designed for sequence-
to-sequence tasks. It uses:
Self-Attention Mechanism: To focus on relevant parts of input sequences.
Positional Encoding: To maintain order in input sequences.
It replaced RNNs for tasks like machine translation (e.g., BERT, GPT
models).
29. What are the main challenges in training deep neural networks?
Answer:
Vanishing/exploding gradients.
Overfitting on training data.
Difficulty in hyperparameter tuning.
Data scarcity or imbalance.
30. What is the difference between model-based and data-based
parallelism in deep learning?
Answer:
Model-based Parallelism: Splits the model across multiple devices (e.g.,
splitting layers of a large neural network).
Data-based Parallelism: Splits the data into batches processed in parallel
across devices.
31. What is transfer learning, and why is it important in deep learning?
Answer:
Transfer learning involves using a pre-trained model on a related task
and fine-tuning it for a target task.
Benefits:
Reduces training time.
Requires less data for the target task.
Leverages learned features from a larger dataset (e.g., ImageNet).
32. What is the purpose of an activation function in a neural network?
Answer:
Activation functions introduce non-linearity into the network, enabling it
to learn complex patterns.
ReLU (Rectified Linear Unit): max(0, x).
Tanh: Outputs values between -1 and 1.
Softmax: Converts outputs into probabilities.
33. What is knowledge distillation in deep learning?
Answer:
Knowledge distillation transfers knowledge from a large, complex model
(teacher) to a smaller, simpler model (student) without significant
performance loss.
Steps:
Train the teacher model.
Use the teacher's soft predictions to train the student.
34. What is the role of learning rate scheduling in training deep learning
models?
Answer:
Learning rate scheduling adjusts the learning rate during training to
improve convergence.
Types of schedules:
Step decay: Reduce the learning rate at fixed intervals.
Exponential decay: Multiply the learning rate by a factor at each step.
Cyclic learning rates: Oscillate the learning rate within a range.
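Step decay, the simplest schedule above, fits in one line (the drop factor and interval are arbitrary illustrative choices):

```python
def step_decay(initial_lr, epoch, drop=0.5, every=10):
    # Halve the learning rate every `every` epochs.
    return initial_lr * (drop ** (epoch // every))

print([step_decay(0.1, e) for e in (0, 9, 10, 25)])
# [0.1, 0.1, 0.05, 0.025]
```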
35. What are the differences between instance normalization, batch
normalization, and layer normalization?
Answer:
Batch Normalization: Normalizes activations across a batch of data.
Useful for training stability.
Instance Normalization: Normalizes activations for each sample. Often
used in style transfer tasks.
Layer Normalization: Normalizes across features for each sample.
Effective for RNNs and transformer architectures.
36. What is the vanishing gradient problem, and how do activation
functions like ReLU address it?
Answer:
Vanishing gradients occur when gradients shrink exponentially during
backpropagation, preventing effective weight updates.
ReLU: Avoids vanishing gradients by allowing gradients to pass
unchanged for positive values, as its derivative is either 0 or 1.
37. What are the differences between Adam and SGD optimizers?
Answer:
SGD (Stochastic Gradient Descent): Updates weights using the gradient
of the loss function. Slower convergence.
Adam (Adaptive Moment Estimation): Combines momentum and adaptive
learning rates for faster convergence and improved stability.
38. What are attention heads in the transformer model?
Answer:
Attention heads in transformers allow the model to focus on different
parts of the input simultaneously.
Multi-head attention splits the queries, keys, and values into multiple
parts, computes attention independently, and combines results for better
contextual understanding.
39. What is the difference between gradient clipping and gradient
normalization?
Answer:
Gradient Clipping: Limits the magnitude of gradients to a pre-defined
threshold to prevent exploding gradients.
Gradient Normalization: Scales gradients to have a consistent magnitude.
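A sketch of clipping by global norm, the common variant: gradients below the threshold pass through untouched, and larger ones are rescaled onto the threshold:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # Rescale only if the gradient norm exceeds the threshold.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])      # norm 5
print(clip_by_norm(g, 1.0))   # [0.6 0.8] -> norm 1
print(clip_by_norm(g, 10.0))  # unchanged: already under the threshold
```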
40. What is the difference between early stopping and checkpointing in
training?
Answer:
Early Stopping: Stops training when performance on a validation set
stops improving, preventing overfitting.
Checkpointing: Saves model weights periodically during training. Useful
for recovering from interruptions or selecting the best-performing model.
41. What is the difference between the encoder and decoder in sequence-
to-sequence models?
Answer:
Encoder: Processes the input sequence and encodes it into a fixed-length
vector or context.
Decoder: Takes the encoded context and generates the output sequence step
by step.
Examples: Used in machine translation (e.g., English to French).
42. What is the role of positional encoding in transformers?
Answer:
Transformers do not process data sequentially, so positional encoding is
added to input embeddings to provide information about the order of
tokens.
Positional encodings are sinusoidal functions of different frequencies.
43. What are the challenges of deploying deep learning models in
production?
Answer:
High inference latency and memory usage.
Ensuring model robustness to real-world data.
Scalability under high traffic.
Maintaining model versioning and reproducibility.
44. What is Layer-wise Relevance Propagation (LRP)?
Answer:
LRP is an explainability technique for neural networks. It decomposes
the output prediction back to the input features to show their relevance.
It helps interpret model decisions and is used in sensitive domains like
healthcare.
45. What is the difference between semantic segmentation and instance
segmentation?
Answer:
Semantic Segmentation: Classifies each pixel of an image into a category.
Instance Segmentation: Identifies individual objects of the same class.
46. What is a dilated convolution, and where is it used?
Answer:
A dilated convolution (also called atrous convolution) expands the
receptive field by introducing spaces between kernel elements.
Used in:
Semantic segmentation (e.g., DeepLab).
Audio and time-series data analysis.
47. What are the benefits of using cosine similarity over dot product for
measuring vector similarity?
Answer:
Cosine Similarity: Measures the cosine of the angle between two vectors,
focusing on orientation rather than magnitude.
Benefits:
Prevents large magnitude differences from dominating the similarity.
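The magnitude-invariance is easy to see in code: cosine similarity is just the dot product divided by the two vector norms, so scaling either vector cancels out:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: magnitudes cancel in the ratio.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([100.0, 0.0])))  # 1.0: same direction, any scale
print(cosine_similarity(a, np.array([0.0, 5.0])))    # 0.0: orthogonal vectors
```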
48. What is zero-shot learning, and how does it work?
Answer:
Zero-shot learning enables a model to make predictions for classes it has
not seen during training.
Mechanism:
Leverages a shared semantic space (e.g., word embeddings) to transfer
knowledge from seen to unseen classes.
49. What is a Siamese network, and where is it used?
Answer:
A Siamese network uses two identical subnetworks to compare inputs by
learning a similarity metric.
Applications:
Face verification.
One-shot learning.
50. What is the purpose of weight initialization in deep learning?
Answer:
Proper weight initialization prevents vanishing or exploding gradients
and accelerates convergence.
Xavier Initialization: Suitable for activations like sigmoid or tanh.
He Initialization: Designed for ReLU activation functions.
51. What are vanishing and exploding gradients, and how do they
impact deep learning models?
Answer:
Vanishing Gradients: Gradients become very small, causing weights to
update slowly and halting learning.
Exploding Gradients: Gradients become very large, leading to unstable
updates and possible divergence.
Solutions:
Use activation functions like ReLU.
Implement gradient clipping.
Use batch normalization or better initialization methods like He
initialization.
52. What are the differences between data augmentation and data
synthesis?
Answer:
Data Augmentation: Applies transformations to existing data (e.g.,
rotations, flips, noise). It enhances diversity without altering class
distribution.
Data Synthesis: Generates entirely new data using techniques like GANs
or simulations. Useful for handling imbalanced or rare classes.
53. What are the key differences between RNNs, GRUs, and LSTMs?
Answer:
RNNs: Process sequential data but suffer from vanishing gradients for
long sequences.
GRUs (Gated Recurrent Units): Simplified LSTMs with fewer
parameters; combine the forget and input gates.
LSTMs (Long Short-Term Memory): Use separate forget, input, and
output gates to handle long-term dependencies effectively.
54. What is the purpose of gradient accumulation in deep learning?
Answer:
Gradient accumulation splits the batch into smaller micro-batches to
compute gradients iteratively, then updates the weights after processing
all micro-batches.
Benefits:
Reduces memory usage for large models or small GPUs.
Simulates larger batch sizes for better convergence.
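A toy least-squares sketch of the idea: summing gradients over micro-batches and applying one update gives the same result as a single large-batch update, just with a smaller peak memory footprint:

```python
import numpy as np

def accumulated_update(w, X, y, lr, micro_batch):
    """One 'large-batch' update built from micro-batch gradients (toy least squares)."""
    grad_sum, n = np.zeros_like(w), len(X)
    for s in range(0, n, micro_batch):
        Xb, yb = X[s:s + micro_batch], y[s:s + micro_batch]
        grad_sum += 2 * Xb.T @ (Xb @ w - yb)  # accumulate, don't update yet
    return w - lr * grad_sum / n              # single update after all micro-batches

rng = np.random.default_rng(0)
X, w0 = rng.normal(size=(64, 3)), np.zeros(3)
y = X @ np.array([1.0, 2.0, 3.0])
w_full = accumulated_update(w0, X, y, 0.1, micro_batch=64)  # one big batch
w_accum = accumulated_update(w0, X, y, 0.1, micro_batch=8)  # 8 micro-batches
print(np.allclose(w_full, w_accum))  # True: same update either way
```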
55. What are capsule networks, and how do they differ from CNNs?
Answer:
Capsule networks model spatial relationships between features using
vectors, instead of scalars like CNNs.
Advantages:
Better handling of spatial hierarchies.
Preserves orientation and pose information.
Example: Used in tasks like image classification with fewer training
examples.
56. What is deep reinforcement learning (DRL), and what are its applications?
Answer:
DRL combines deep learning and reinforcement learning, where agents
learn optimal policies through trial and error.
Applications:
Game playing (e.g., AlphaGo, Dota 2).
Robotics and control systems.
Autonomous vehicles.
57. How does dropout work in deep learning, and why is it effective?
Answer:
Dropout randomly disables a fraction of neurons during training,
preventing overfitting by reducing co-dependencies among neurons.
During inference, the full network is used with scaled-down weights.
58. What is label smoothing, and why is it used?
Answer:
Label smoothing replaces hard labels (e.g., 1 or 0) with smoothed
probabilities (e.g., 0.9 and 0.1).
Benefits:
Reduces overconfidence in predictions.
Helps the model generalize better.
Example: Common in image classification with cross-entropy loss.
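A sketch of the standard uniform-smoothing recipe: move a fraction epsilon of the probability mass from the true class and spread it evenly over all classes:

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    # True class keeps (1 - epsilon) plus its share of the uniform mass.
    k = one_hot.shape[-1]
    return one_hot * (1 - epsilon) + epsilon / k

y = np.array([0.0, 0.0, 1.0])   # hard one-hot label
print(smooth_labels(y))         # [0.0333... 0.0333... 0.9333...], still sums to 1
```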
59. What is the difference between dense and sparse embeddings?
Answer:
Dense Embeddings: Low-dimensional, continuous-valued vectors (e.g.,
Word2Vec, BERT). Compact and efficient for downstream tasks.
Sparse Embeddings: High-dimensional, mostly zero vectors (e.g., one-hot
encoding). Inefficient but straightforward.
60. What is the difference between teacher forcing and free-running in
sequence models?
Answer:
Teacher Forcing: During training, the model uses the ground truth as
input for the next time step. Speeds up convergence but can lead to
exposure bias.
Free-Running: During inference, the model uses its own predictions as
inputs. Better simulates real-world usage.
61. What is the purpose of skip connections in deep neural networks?
Answer:
Skip connections, like those used in ResNet, allow gradients to flow more
easily through the network, mitigating the vanishing gradient problem.
They also enable the model to learn identity mappings for shallow layers.
62. What are adversarial examples, and how do they affect deep
learning models?
Answer:
Adversarial examples are inputs deliberately perturbed to fool a model
into making incorrect predictions. They expose vulnerabilities in deep
learning models and highlight the need for robust training techniques like
adversarial training.
63. How does the attention mechanism work in deep learning models?
Answer:
The attention mechanism assigns different weights to different parts of
the input sequence, focusing on the most relevant features for a specific
task. For example, in translation tasks, attention aligns words between
the source and target languages.
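The weighting step can be sketched as scaled dot-product attention, the form used in transformers (toy random matrices here, single head, no masking):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys, row-wise
    return weights @ V                              # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(2, 4))
V = rng.normal(size=(2, 4))
out = attention(Q, K, V)
print(out.shape)  # (2, 4): one weighted value vector per query
```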
64. What are variational autoencoders (VAEs), and how do they differ
from standard autoencoders?
Answer:
VAEs generate new data by learning a probabilistic latent space. Unlike
standard autoencoders, they optimize a variational lower bound using
both reconstruction loss and a KL divergence term to ensure smooth
latent space representations.
65. What is the difference between transfer learning and fine-tuning?
Answer:
Transfer Learning: Reusing a pre-trained model's features without
further training.
Fine-Tuning: Adapting a pre-trained model to a specific task by training
some or all of its layers with a new dataset.
66. What is the concept of gradient penalty in GANs?
Answer:
Gradient penalty is used to enforce the Lipschitz continuity condition in
Wasserstein GANs. It adds a penalty to the loss function based on the
gradient norm of the discriminator, stabilizing training.
67. What is self-supervised learning, and what are its applications?
Answer:
Self-supervised learning creates labels from raw data itself to train
models without manual annotations.
Applications:
Pre-training models like BERT and SimCLR.
Applications in computer vision and NLP.
68. What is label imbalance, and how can it be addressed in deep
learning?
Answer:
Label imbalance occurs when classes in a dataset are not equally
represented.
Solutions:
Oversampling minority classes.
Undersampling majority classes.
Using class weights in the loss function.
69. What are group normalization and layer normalization?
Answer:
Group Normalization: Normalizes activations within groups of channels.
Effective for small batch sizes.
Layer Normalization: Normalizes across all features of a single data
point. Common in NLP tasks.
70. What are hyperparameter optimization techniques in deep learning?
Answer:
Grid Search.
Random Search.
Bayesian Optimization.
Hyperband or Population-Based Training.
Tools like Optuna and Ray Tune help automate this process.
71. What is knowledge distillation, and why is it useful?
Answer:
Knowledge distillation transfers knowledge from a large, complex model
(teacher) to a smaller, faster model (student). It improves inference speed.
72. What is the difference between BatchNorm, LayerNorm, and
InstanceNorm?
Answer:
BatchNorm: Normalizes over a mini-batch of samples.
LayerNorm: Normalizes across features of a single sample.
InstanceNorm: Normalizes across spatial dimensions for each sample,
commonly used in style transfer.
73. What is spectral normalization, and why is it used in GANs?
Answer:
Spectral normalization constrains the Lipschitz constant of the
discriminator by normalizing its weight matrices. It stabilizes training
and prevents mode collapse.
74. What is the SWA (Stochastic Weight Averaging) technique?
Answer:
SWA averages weights from multiple SGD steps during training. It
improves generalization by converging to flat minima in the loss
landscape.
75. What is the Softmax bottleneck problem in language models?
Answer:
The Softmax bottleneck limits the expressiveness of language models due
to its restricted output distribution. Techniques like adaptive Softmax and
Mixture of Softmaxes help address this issue.
76. How does the transformer architecture handle long sequences
efficiently?
Answer:
Transformers use self-attention mechanisms that process sequences in
parallel, unlike RNNs. They can model long-range dependencies without
sequential computation.
77. What is a mixture of experts (MoE) model?
Answer:
An MoE model combines several sub-models (experts) and uses a gating
mechanism to assign weights to each expert for a given input. It is
computationally efficient for scaling large models.
78. What are the main differences between Mask R-CNN and Faster R-
CNN?
Answer:
Faster R-CNN: Detects objects and generates bounding boxes.
Mask R-CNN: Extends Faster R-CNN by adding a mask head for pixel-
level instance segmentation.
79. What is the purpose of cosine annealing in learning rate scheduling?
Answer:
Cosine annealing gradually decreases the learning rate following a cosine
curve. It helps achieve better convergence by encouraging the model to
settle into a minimum slowly.
80. What is the difference between active learning and semi-supervised
learning?
Answer:
Active Learning: Identifies the most informative samples to label from an
unlabeled pool.
Semi-Supervised Learning: Combines a small labeled dataset with a
large unlabeled dataset to improve performance.
81. What is the purpose of gradient clipping, and when is it used?
Answer:
Gradient clipping limits the gradient magnitude to prevent exploding
gradients, commonly used in RNNs and deep networks. It stabilizes
training when gradients become excessively large.
82. What is focal loss, and why is it useful?
Answer:
Focal loss is designed to address class imbalance by down-weighting the
loss for well-classified examples and focusing on hard-to-classify
examples.
Formula:
FL(p_t) = -(1 - p_t)^γ * log(p_t)
where γ controls the focusing effect.
83. What is dilated convolution, and how does it differ from standard
convolution?
Answer:
Dilated convolution increases the receptive field without increasing the
number of parameters by introducing spaces between kernel elements. It is
useful in tasks like semantic segmentation.
84. What is weight regularization, and how does it work?
Answer:
Weight regularization reduces overfitting by penalizing large weights.
L1 Regularization: Adds |w| to the loss.
L2 Regularization (Weight Decay): Adds w^2 to the loss.
85. What are the advantages of using mixed precision training?
Answer:
Reduces memory usage.
Increases training speed.
Achieves comparable accuracy by using lower precision (e.g., FP16) for
calculations and higher precision (e.g., FP32) for key operations.
86. How does early stopping prevent overfitting?
Answer:
Early stopping halts training when performance on a validation set stops
improving. It prevents the model from overfitting to the training data by
stopping at an optimal point.
87. What is the difference between supervised pretraining and self-
supervised pretraining?
Answer:
Supervised Pretraining: Pretraining on a labeled dataset before fine-
tuning on a specific task.
Self-Supervised Pretraining: Pretraining using self-generated labels
without human annotations, commonly used in NLP and vision.
88. What is neural architecture search (NAS)?
Answer:
NAS is an automated process to find optimal neural network
architectures, using search strategies such as gradient-based methods.
89. What is an encoder-decoder architecture?
Answer:
An encoder-decoder is used in sequence-to-sequence tasks.
Encoder: Compresses input into a latent representation.
Decoder: Generates output from the latent representation.
Examples: Translation and summarization.
90. What is the purpose of cosine similarity in NLP tasks?
Answer:
Cosine similarity measures the angle between two vectors, making it useful for
comparing word or sentence embeddings.
91. What is the difference between Seq2Seq models with and without
attention?
Answer:
Without Attention: Encodes the entire input into a fixed-length vector.
With Attention: Dynamically focuses on relevant parts of the input
sequence for better performance.
92. How does transfer learning benefit small datasets?
Answer:
Transfer learning leverages features learned from a large dataset,
reducing the need for extensive data. It avoids overfitting and improves
generalization on small datasets.
93. What are transposed convolutions, and where are they used?
Answer:
Transposed (deconvolutional) convolutions increase spatial resolution,
often used in generative tasks like image super-resolution or semantic
segmentation.
94. What are GNNs (Graph Neural Networks), and where are they
applied?
Answer:
GNNs work on graph-structured data, propagating information between
nodes.
Applications:
Social networks.
Molecular analysis.
Recommendation systems.
95. What is the purpose of masked language models (MLMs)?
Answer:
MLMs, like BERT, predict missing words in a sentence by masking parts
of the input. This bidirectional understanding improves performance on
NLP tasks.
96. What is layer-wise learning rate scaling?
Answer:
Layer-wise learning rate scaling assigns different learning rates to
different layers, often smaller rates for pre-trained layers and larger
rates for newly added layers.
|| 92 what és brausledge graph embedding?
Answer:
Knewledge graph embedding represents entities and relationships in a
knowledge graph as leuw—dimensienal vectors.
Applications:
Question answering.
Recommendation systems.
98. What is a feature pyramid network (FPN)?
Answer:
FPN builds a multi-scale feature hierarchy by combining low-resolution, semantically strong features with high-resolution, spatially precise features.
99. What is the difference between a sparse and dense layer?
Answer:
Sparse Layer: Uses sparse matrices to save memory and computational resources.
Dense Layer: Fully connected, requiring more resources but capturing all feature interactions.
100. What is capsule routing in capsule networks?
Answer:
Capsule routing ensures that lower-layer capsules send their outputs to higher-layer capsules based on agreement scores. This process preserves spatial hierarchies between features.
101. What is the vanishing gradient problem, and how is it mitigated?
Answer:
The vanishing gradient problem occurs when gradients become extremely small during backpropagation, preventing effective weight updates in earlier layers. This often happens in deep networks with activation functions like sigmoid or tanh.
Mitigation Strategies:
1. Use ReLU activation functions: ReLU avoids vanishing gradients by having a constant gradient for positive values.
2. Batch normalization: Normalizes layer inputs to stabilize and maintain gradients.
3. Residual connections: Allow gradients to flow directly through skip connections in deep networks (e.g., ResNet).
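The shrinkage can be illustrated with a toy calculation: treat the backpropagated gradient as a product of per-layer activation derivatives (this ignores weight matrices, so it is only a sketch of the mechanism, not a full backprop):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sigmoid's derivative is at most 0.25, so a product over many layers
# shrinks geometrically; ReLU's derivative is 1 for positive inputs,
# so the product survives.
x = 0.5          # a toy pre-activation value assumed at every layer
n_layers = 20

sigmoid_chain = np.prod([sigmoid(x) * (1 - sigmoid(x))] * n_layers)
relu_chain = np.prod([1.0 if x > 0 else 0.0] * n_layers)

print(f"sigmoid chain after {n_layers} layers: {sigmoid_chain:.2e}")
print(f"relu chain after {n_layers} layers: {relu_chain:.2e}")
```

After only 20 layers the sigmoid product is already vanishingly small, which is why deep pre-ReLU networks were so hard to train.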
102. Explain the concept of teacher forcing in RNNs. Why is it useful?
Answer:
Teacher forcing is a technique used in sequence-to-sequence models where the actual target output is used as input to the next time step during training, instead of the predicted output.
Advantages:
Speeds up convergence by providing ground-truth inputs.
Prevents errors from compounding across time steps during training.
Challenge:
At inference, the model might struggle without ground-truth inputs (exposure bias).
Scheduled sampling can gradually reduce reliance on teacher forcing.
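The contrast can be sketched with a deliberately flawed toy "model" (everything here is illustrative, no real RNN involved): with teacher forcing each step sees the ground truth, so the error stays bounded; in free running each step sees the previous prediction, so the error compounds.

```python
# A toy "model" with a systematic off-by-one error: the true sequence
# increments by 1, but the model always adds 2.
def toy_predict(token):
    return token + 2

target = [0, 1, 2, 3, 4]

# With teacher forcing: each step is fed the ground-truth previous token.
tf_preds = [toy_predict(t) for t in target[:-1]]

# Without teacher forcing (free running): each step is fed the model's
# own previous prediction, so the error grows at every step.
free_preds, prev = [], target[0]
for _ in range(len(target) - 1):
    prev = toy_predict(prev)
    free_preds.append(prev)

print(tf_preds)    # errors stay bounded: [2, 3, 4, 5]
print(free_preds)  # errors compound:     [2, 4, 6, 8]
```

The free-running drift is exactly the exposure bias the answer describes.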
103. What is the difference between label smoothing and hard labels?
Answer:
Hard Labels: Assign a one-hot encoding to the target classes (e.g., [1, 0, 0]).
Label Smoothing: Modifies hard labels by assigning a small probability to incorrect classes to make the model less confident in its predictions.
For example: [0.9, 0.05, 0.05].
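A small NumPy sketch matching the example above (this variant spreads ε over the incorrect classes only; some libraries instead spread it uniformly over all classes, including the true one):

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Keep 1 - epsilon on the true class, split epsilon over the rest."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * epsilon / (n_classes - 1)

hard = np.array([1.0, 0.0, 0.0])
soft = smooth_labels(hard, epsilon=0.1)
print(soft)  # [0.9, 0.05, 0.05]
```

The smoothed vector is still a valid probability distribution (it sums to 1), so it plugs directly into the usual cross-entropy loss.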
Advantages of Label Smoothing:
Improves generalization by preventing overconfidence.
104. What is knowledge distillation, and how is it applied?
Answer:
Knowledge distillation transfers knowledge from a large, complex model (teacher) to a smaller, simpler model (student).
How it works:
The student model is trained to mimic the teacher's softened output probabilities instead of the hard labels.
Loss Function:
L = (1 - α) * cross_entropy(y, ŷ) + α * KL_divergence(softmax(z_teacher / T), softmax(z_student / T))
where T is the temperature and α balances the loss terms.
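The loss above can be sketched in NumPy for a single example (function names and logits are illustrative; a common refinement, omitted here, multiplies the KL term by T² to keep gradient scales comparable):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(z_student, z_teacher, y_true, T=2.0, alpha=0.5):
    """(1 - alpha) * hard-label cross-entropy + alpha * soft-label KL term."""
    p_student = softmax(z_student)
    hard_loss = -np.log(p_student[y_true])

    p_t = softmax(z_teacher, T)   # softened teacher distribution
    p_s = softmax(z_student, T)   # softened student distribution
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))

    return (1 - alpha) * hard_loss + alpha * kl

loss = distillation_loss([2.0, 0.5, 0.1], [3.0, 1.0, 0.2], y_true=0)
print(loss)
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains.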
Applications:
Deploying efficient models on resource-constrained devices.
Model compression.
105. What is gradient centralization, and why is it used?
Answer:
Gradient centralization normalizes gradients by subtracting their mean before updating weights.
Improves optimization stability.
Reduces variance in gradients, especially in deep networks.
Commonly used in conjunction with optimizers like SGD or Adam.
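The operation itself is a one-liner; a minimal sketch (the axis choice here, per output row of a 2-D weight gradient, is one common convention):

```python
import numpy as np

def centralize_gradient(grad):
    """Subtract the mean over the last axis, so each gradient row
    sums to zero before the optimizer step."""
    return grad - grad.mean(axis=-1, keepdims=True)

g = np.array([[0.3, -0.1, 0.4],
              [0.2,  0.2, 0.2]])
gc = centralize_gradient(g)
print(gc.mean(axis=-1))  # each row now has zero mean
```

An optimizer would apply this to each weight gradient just before the update rule.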
106. What is the difference between transductive and inductive learning?
Answer:
Transductive Learning: Learns to predict labels only for the given test data, without generalizing to unseen data. Example: Graph-based semi-supervised learning.
Inductive Learning: Learns a general function or model that can make predictions on unseen data. Example: Most deep learning models like CNNs or RNNs.
107. What are adversarial examples, and how do you defend against them?
Answer:
Adversarial examples are inputs deliberately perturbed to deceive a model into making incorrect predictions while appearing unchanged to humans.
Defenses:
1. Adversarial training: Train the model on adversarially perturbed data.
2. Gradient masking: Obfuscate gradients to make it harder for attackers to compute perturbations.
3. Input preprocessing: Techniques like JPEG compression or Gaussian blurring can remove adversarial perturbations.
108. What are transformer models, and how do they differ from RNNs?
Answer:
Transformer models use self-attention mechanisms to process sequences, unlike RNNs that process inputs sequentially.
Key Differences:
1. Parallelism: Transformers process all input tokens simultaneously, while RNNs process them sequentially.
2. Long-term dependencies: Transformers capture long-range dependencies more effectively through self-attention.
3. Efficiency: Transformers are more efficient on GPUs due to parallelization but require more memory.
Examples: BERT, GPT, T5.
109. What is the purpose of positional encoding in transformers?
Answer:
Positional encoding allows transformers, which lack inherent sequence awareness, to incorporate the order of tokens in a sequence.
Formula for sinusoidal encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where pos is the position, i is the dimension index, and d is the embedding size.
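The sinusoidal formula can be sketched in NumPy (vectorized over all positions at once; the function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model // 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions
    pe[:, 1::2] = np.cos(angle)              # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: sin(0)=0 and cos(0)=1 alternate
```

Each position gets a unique pattern of values in [-1, 1], and the encoding is simply added to the token embeddings.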
110. What is the concept of layer normalization, and how does it differ from batch normalization?
Answer:
Layer Normalization: Normalizes inputs across features within a single training example. Commonly used in NLP and transformers.
Formula:
y = (x - mean) / sqrt(variance + ε)
Batch Normalization: Normalizes inputs across the batch for each feature. Common in CNNs.
Differences:
Batch normalization depends on batch size; layer normalization does not.
Layer normalization is more effective in sequence-based models.
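The difference is just which axis the statistics are taken over; a NumPy sketch (learnable scale/shift parameters γ and β are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the feature axis, separately for each example."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Normalize over the batch axis, separately for each feature."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
ln = layer_norm(x)
bn = batch_norm(x)
print(ln.mean(axis=-1))  # ~[0, 0]: each example normalized on its own
print(bn.mean(axis=0))   # ~[0, 0, 0]: each feature normalized over the batch
```

Because layer norm never looks across the batch, it behaves identically at batch size 1, which is why it suits autoregressive and sequence models.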
111. What are attention mechanisms, and why are they important?
Answer:
Attention mechanisms allow models to focus on specific parts of the input sequence when making predictions, assigning varying importance (weights) to different tokens or elements.
Types of Attention:
1. Self-Attention: Helps capture relationships within a single sequence.
2. Cross-Attention: Used in sequence-to-sequence models to relate input and output sequences.
Importance:
Captures long-range dependencies.
Enhances interpretability by showing what the model is focusing on.
Forms the backbone of transformer architectures.
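The core computation, scaled dot-product attention, fits in a few lines of NumPy (shapes and random inputs here are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """weights = softmax(Q K^T / sqrt(d_k)); output = weights V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens, dimension 8
K = rng.normal(size=(6, 8))   # 6 key tokens
V = rng.normal(size=(6, 8))   # one value vector per key
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (4, 8): one output vector per query
print(w.sum(axis=-1)) # each query's weights sum to 1
```

Each output row is a weighted average of the value vectors, with the weights revealing which tokens the query attended to.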
112. What are capsule networks, and how do they differ from traditional CNNs?
Answer:
Capsule networks are designed to model spatial hierarchies by encoding the pose and orientation of features in addition to their presence.
Key Differences:
Capsules: Groups of neurons represent the probability and parameters of a feature.
Dynamic Routing: Capsules communicate with higher-level capsules using iterative routing-by-agreement.
Advantages:
Better at understanding hierarchical relationships.
More robust to changes in orientation and spatial distortions.
113. What is the difference between data augmentation and data synthesis?
Answer:
Data Augmentation: Modifies existing data to increase diversity (e.g., flipping, cropping, adding noise). Commonly used for regularization.
Data Synthesis: Generates entirely new data points from a model (e.g., GANs, VAEs).
Use Cases:
Augmentation is useful when existing data is limited and its distribution must not change drastically.
Synthesis is useful for creating data in underrepresented categories.
114. What are GANs, and how do they work?
Answer:
Generative Adversarial Networks (GANs) consist of two models:
Generator: Creates fake data.
Discriminator: Distinguishes between real and fake data.
Training Process:
The generator learns to create realistic data by fooling the discriminator.
Both models play a minimax game.
Loss Function:
min_G max_D E[log(D(real))] + E[log(1 - D(fake))]
Applications: Image generation, style transfer, and super-resolution.
115. What is the role of the softmax function in neural networks?
Answer:
The softmax function converts raw scores (logits) into probabilities, ensuring they sum to 1. It is commonly used in the output layer of classification tasks.
Formula:
softmax(x_i) = exp(x_i) / sum(exp(x_j) for j in range(n))
Advantages:
Provides interpretable class probabilities.
Highlights the most likely class while suppressing others.
Helps during loss calculation using cross-entropy.
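The formula above in NumPy, with the standard max-subtraction trick for numerical stability (which does not change the result):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # highest logit gets the highest probability
print(probs.sum())  # 1.0
```

Without the max subtraction, large logits (e.g. 1000) would overflow `exp`; subtracting the max keeps every exponent at or below zero.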
116. What are variational autoencoders (VAEs), and how are they different from standard autoencoders?
Answer:
Variational autoencoders are probabilistic models that learn a latent representation as a distribution (mean and variance) rather than a fixed vector.
Differences:
Standard Autoencoders: Compress input into fixed latent vectors.
VAEs: Use a probabilistic approach to generate diverse outputs.
Loss Function:
L = reconstruction_loss + KL_divergence(latent || prior)
Applications: Image synthesis, anomaly detection, and latent space exploration.
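For a diagonal Gaussian latent and a standard normal prior, the KL term has a closed form; a sketch with MSE as the reconstruction term (the inputs here are toy values, and real VAEs would use the reparameterization trick during training):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction (MSE here) plus KL(N(mu, sigma^2) || N(0, 1)).
    Closed form: KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    recon = np.sum((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

x = np.array([0.5, 0.2])
x_recon = np.array([0.4, 0.25])
mu = np.array([0.1, -0.2])
log_var = np.array([-0.1, 0.05])
print(vae_loss(x, x_recon, mu, log_var))
```

When the encoder outputs exactly the prior (mu = 0, log_var = 0) and reconstruction is perfect, the loss is zero; any deviation in either term pushes it up.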
117. What is transfer learning, and why is it effective?
Answer:
Transfer learning involves reusing a pre-trained model on a related task to improve performance and reduce training time.
Effectiveness:
Pre-trained models like ResNet and BERT already learn general features, reducing the need for large labeled datasets.
Fine-tuning adapts these features to specific tasks.
Examples:
Using ImageNet-trained CNNs for medical imaging.
Adapting BERT for sentiment analysis.
118. What is the difference between gradient clipping and gradient normalization?
Answer:
Gradient Clipping: Limits the magnitude of gradients to prevent exploding gradients.
If ||gradient|| > threshold:
gradient = gradient * (threshold / ||gradient||)
Gradient Normalization: Adjusts gradients by dividing them by their norm.
Use Cases:
Clipping is common in RNNs with vanishing/exploding gradients.
Normalization is used for ensuring smooth optimization.
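Both operations in NumPy, to make the difference concrete: clipping only rescales when the norm exceeds the threshold, normalization always rescales to unit norm (function names are illustrative):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad only if its L2 norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

def normalize_gradient(grad, eps=1e-8):
    """Always rescale grad to (approximately) unit L2 norm."""
    return grad / (np.linalg.norm(grad) + eps)

g = np.array([3.0, 4.0])          # L2 norm = 5
print(clip_by_norm(g, 1.0))       # rescaled to norm 1: [0.6, 0.8]
print(clip_by_norm(g, 10.0))      # left unchanged: [3. 4.]
print(np.linalg.norm(normalize_gradient(g)))  # ~1.0
```

Clipping preserves small gradients untouched, which is why it is the usual choice for taming occasional exploding gradients in RNNs.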
119. How do you calculate the receptive field of a convolutional layer?
Answer:
The receptive field is the area of input pixels that influences a particular output feature.
Formula: For layer n:
R_n = R_{n-1} + (K_n - 1) * J_{n-1}
where R_n is the receptive field after layer n, K_n is the kernel size, and J_{n-1} (the jump) is the product of the strides of all preceding layers.
Importance: Determines the spatial context captured by a convolutional layer.
120. What are the limitations of backpropagation?
Answer:
Vanishing/exploding gradients: Can hinder optimization in deep networks.
High computation cost: Requires significant memory and computation for large networks.
Dependence on labeled data: Backpropagation requires labeled datasets, which can be expensive to acquire.
Non-convexity: Optimization often converges to local minima or saddle points.
Solutions: Advanced optimizers (Adam, RMSProp) and careful weight initialization.
Amar Sharma
AI Engineer
Follow me on LinkedIn for more informative content