120 Deep Learning Important Questions + Answers

120 important deep learning questions with answers, for interview preparation.

Uploaded by

Joely Silva
Copyright © All Rights Reserved
120 Important Deep Learning Interview Questions + Answers
by Amar Sharma

1. What is deep learning? How is it different from machine learning?
Answer: Deep learning is a subset of machine learning that uses neural networks with multiple layers to automatically learn representations from data. Key differences: deep learning requires large datasets and computational power; it learns features directly from data, whereas traditional machine learning often requires manual feature engineering; deep learning algorithms are typically based on neural networks with many hidden layers.

2. What is a neural network, and how does it work?
Answer: A neural network is a computational model inspired by the human brain, consisting of layers of interconnected nodes (neurons). The input layer receives data, hidden layers perform computations and learn features, and the output layer provides predictions. The network learns by adjusting weights using a process called backpropagation and an optimization algorithm such as gradient descent.

3. What is backpropagation?
Answer: Backpropagation is an algorithm used to train neural networks by minimizing the error. The error from the output layer is propagated backward through the network, gradients are computed for each weight using the chain rule, and weights are updated by an optimizer (e.g., SGD or Adam) to reduce the error.

4. What are activation functions, and why are they important?
Answer: Activation functions introduce non-linearity into neural networks, enabling them to model complex relationships. Common functions:
ReLU (Rectified Linear Unit): fast convergence, avoids vanishing gradient issues.
Sigmoid: output between 0 and 1, used for binary classification.
Softmax: outputs probabilities for multi-class classification.
Tanh: outputs between -1 and 1, centered around zero.
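The activation functions from Question 4 are one-liners; a minimal pure-Python sketch (function names are my own, not from the notes):

```python
import math

def relu(x):
    # max(0, x): passes positive values unchanged, zeroes out negatives
    return max(0.0, x)

def sigmoid(x):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # squashes into (-1, 1), centered around zero
    return math.tanh(x)

print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
print(tanh(0.0))               # 0.0
```

Softmax is covered separately under Question 26, since it acts on a whole vector of scores rather than a single value.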
5. What is overfitting, and how can it be prevented?
Answer: Overfitting occurs when a model performs well on training data but poorly on unseen data. Prevention techniques: use regularization methods like L1/L2 (Lasso, Ridge); apply dropout layers; reduce model complexity; use more training data or data augmentation; perform early stopping during training.

6. What is the difference between batch size, epochs, and iterations?
Answer: Batch size: the number of samples processed before updating the model's weights. Epoch: one complete pass through the entire training dataset. Iteration: one batch update during training. For example, if you have 1000 samples and a batch size of 100, there will be 10 iterations per epoch.

7. What is the vanishing gradient problem, and how can it be mitigated?
Answer: The vanishing gradient problem occurs when gradients become very small in deep networks, slowing or stopping learning. Mitigation techniques: use activation functions like ReLU; initialize weights properly (e.g., Xavier or He initialization); use batch normalization; build networks with skip connections (e.g., ResNet).

8. What is transfer learning?
Answer: Transfer learning involves using a pre-trained model on a new task. Instead of training from scratch, the model's pre-trained weights are fine-tuned for the target task. This is useful when data is limited and for tasks like image recognition or natural language processing.

9. Explain the difference between CNNs and RNNs.
Answer: CNNs (Convolutional Neural Networks): designed for spatial data like images. They use convolutional layers to capture spatial hierarchies. RNNs (Recurrent Neural Networks): designed for sequential data like time series or text. They have memory cells to capture temporal dependencies.
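The batch/epoch/iteration arithmetic from Question 6 is easy to verify directly:

```python
num_samples = 1000
batch_size = 100

# One iteration = one batch update; one epoch = one full pass over the data.
iterations_per_epoch = num_samples // batch_size
print(iterations_per_epoch)  # 10

epochs = 5
total_updates = epochs * iterations_per_epoch
print(total_updates)  # 50
```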
10. What are gradient descent and its variants?
Answer: Gradient descent is an optimization algorithm used to minimize the loss function. Common variants: Batch Gradient Descent: uses the entire dataset for each update (slow for large datasets). Stochastic Gradient Descent (SGD): uses one sample per update (faster but noisy). Mini-batch Gradient Descent: uses a subset (batch) of the data for each update (balances speed and stability). Adam optimizer: combines momentum and adaptive learning rates for efficient training.

11. What is the role of the loss function in neural networks?
Answer: The loss function measures the difference between the predicted output and the actual target value. It guides the optimization process by providing a metric for minimizing the error. Common loss functions: Mean Squared Error (MSE) for regression tasks; binary cross-entropy for binary classification; categorical cross-entropy for multi-class classification.

12. What are weight initialization techniques, and why are they important?
Answer: Weight initialization techniques help ensure faster convergence and avoid issues like vanishing/exploding gradients. Random initialization: assigns random values to weights. Xavier initialization: keeps the variance of activations constant across layers. He initialization: optimized for ReLU activations.

13. What is the difference between L1 and L2 regularization?
Answer: L1 regularization: adds the absolute value of weights to the loss function (Lasso). Encourages sparsity, making some weights zero. L2 regularization: adds the squared value of weights to the loss function (Ridge). Penalizes large weights and prevents overfitting.

14. What are autoencoders, and how are they used?
Answer: Autoencoders are neural networks used for unsupervised learning, designed to reconstruct input data. They have an encoder (to compress data) and a decoder (to reconstruct it).
Applications: dimensionality reduction; anomaly detection; denoising data.

15. What is the role of batch normalization?
Answer: Batch normalization normalizes the input of each layer to improve stability and convergence during training. Benefits: reduces internal covariate shift; allows for higher learning rates; acts as a regularizer, reducing the need for dropout.

16. What is a recurrent neural network (RNN), and how does it handle sequential data?
Answer: RNNs are designed to process sequences of data by maintaining a hidden state that captures dependencies in data like time series, text, or speech. Variants like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) address issues like vanishing gradients.

17. What is the purpose of dropout in deep learning?
Answer: Dropout is a regularization technique that randomly sets a fraction of neurons to zero during training. It prevents overfitting by introducing noise and encourages the network to learn more robust features.

18. What are GANs (Generative Adversarial Networks)?
Answer: GANs are neural networks consisting of two components. Generator: creates fake data resembling real data. Discriminator: distinguishes between real and fake data. They are trained together, improving the generator's ability to create realistic data. Applications: image generation; style transfer; data augmentation.

19. What is the difference between supervised, unsupervised, and reinforcement learning?
Answer: Supervised learning: the model learns from labeled data (e.g., classification, regression). Unsupervised learning: the model identifies patterns in unlabeled data (e.g., clustering, dimensionality reduction). Reinforcement learning: the model learns by interacting with the environment and receiving feedback in the form of rewards or penalties.
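Question 17's dropout can be simulated directly. This sketch uses "inverted dropout", where surviving activations are scaled by 1/(1-p) during training so inference needs no rescaling; that convention is my assumption, not stated in the notes:

```python
import random

def dropout(activations, p, training=True, seed=None):
    # Inverted dropout: zero each unit with probability p during training
    # and scale survivors by 1/(1-p); at inference, pass values through.
    if not training or p == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [0.5, -1.2, 0.8, 2.0]
print(dropout(acts, p=0.5, seed=0))          # some units zeroed, rest doubled
print(dropout(acts, p=0.5, training=False))  # unchanged at inference
```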
20. What are attention mechanisms in deep learning?
Answer: Attention mechanisms allow the model to focus on relevant parts of the input while making predictions. Example: in machine translation, the attention mechanism helps the model focus on specific words in the source sentence while translating. Applications: transformer models like BERT and GPT; image captioning; text summarization.

21. What are the main components of a convolutional neural network (CNN)?
Answer: Convolutional layers: extract features by applying filters over the input. Pooling layers: reduce the spatial dimensions of feature maps (e.g., max pooling). Fully connected layers: combine high-level features for classification or regression. Dropout layers: prevent overfitting and improve generalization.

22. What is the difference between a feedforward neural network and a recurrent neural network?
Answer: Feedforward Neural Network (FNN): processes input data in one direction, without loops. Ideal for tasks like image recognition. Recurrent Neural Network (RNN): processes sequential data with feedback loops to maintain memory. Used for time series and language modeling.

23. What are LSTMs and GRUs? How are they different?
Answer: LSTMs (Long Short-Term Memory): use gates (input, forget, output) to maintain long-term dependencies in sequences. GRUs (Gated Recurrent Units): a simplified version of LSTMs, combining the forget and input gates into one update gate. GRUs are computationally faster, while LSTMs handle complex dependencies better.

24. What is the difference between parameterized and non-parameterized layers?
Answer: Parameterized layers: contain trainable parameters (e.g., dense, convolutional layers). Non-parameterized layers: do not contain trainable parameters but modify data (e.g., activation, pooling layers).
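The core operation of Question 21's convolutional layer is sliding a filter over the input. A minimal "valid" 2-D convolution (strictly, cross-correlation, as most deep learning libraries implement it; stride 1, no padding, names mine):

```python
def conv2d_valid(image, kernel):
    # Slide the kernel over the image; each output cell is the sum of
    # elementwise products between the kernel and the current window.
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
edge = [[1, -1]]  # horizontal difference filter
print(conv2d_valid(img, edge))  # [[-1, -1], [-1, -1], [-1, -1]]
```

Each neighboring pair differs by exactly 1 in this image, so the difference filter produces a constant response.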
25. What is the exploding gradient problem, and how is it mitigated?
Answer: Exploding gradients occur when large gradient values cause instability during training. Solutions: gradient clipping (restrict gradients to a maximum value); better weight initialization methods; architectures like LSTMs/GRUs for sequential data.

26. What is the purpose of the softmax function?
Answer: Softmax converts raw scores (logits) into probabilities that sum to 1. It is used in the output layer for multi-class classification. Formula: softmax(x_i) = exp(x_i) / sum_j exp(x_j).

27. What is the difference between supervised pretraining and self-supervised learning?
Answer: Supervised pretraining: the model is trained on a related labeled dataset, then fine-tuned on the target dataset. Self-supervised learning: the model generates pseudo-labels from the data itself (e.g., predicting masked tokens in BERT) and learns representations.

28. What is the Transformer architecture, and how does it work?
Answer: The Transformer is a deep learning architecture designed for sequence-to-sequence tasks. It uses a self-attention mechanism to focus on relevant parts of input sequences, and positional encoding to maintain order in input sequences. It replaced RNNs for tasks like machine translation (e.g., BERT and GPT models).

29. What are the main challenges in training deep neural networks?
Answer: Vanishing/exploding gradients; overfitting on training data; difficulty in hyperparameter tuning; data scarcity or imbalance.

30. What is the difference between model-based and data-based parallelism in deep learning?
Answer: Model-based parallelism: splits the model across multiple devices (e.g., splitting layers of a large neural network). Data-based parallelism: splits the data into batches processed in parallel across devices.
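Question 26's softmax formula, written out with the standard max-subtraction trick for numerical stability (an addition of mine, not in the notes; it leaves the result unchanged because a common factor cancels):

```python
import math

def softmax(logits):
    # Subtract the max so exp() never overflows; the result is identical
    # to exp(x_i) / sum_j exp(x_j).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # largest logit gets the largest probability
print(sum(probs))                    # 1.0 (up to float rounding)
```

Without the max-subtraction, softmax([1000.0, 1000.0]) would overflow; with it, the call returns [0.5, 0.5] exactly.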
31. What is transfer learning, and why is it important in deep learning?
Answer: Transfer learning involves using a pre-trained model on a related task and fine-tuning it for a target task. Benefits: reduces training time; requires less data for the target task; leverages features learned from a larger dataset (e.g., ImageNet).

32. What is the purpose of an activation function in a neural network?
Answer: Activation functions introduce non-linearity into the network, enabling it to learn complex patterns. ReLU (Rectified Linear Unit): max(0, x). Tanh: outputs values between -1 and 1. Softmax: converts outputs into probabilities.

33. What is knowledge distillation in deep learning?
Answer: Knowledge distillation transfers knowledge from a large, complex model (teacher) to a smaller, simpler model (student) without significant performance loss. Steps: train the teacher model; use the teacher's soft predictions to train the student.

34. What is the role of learning rate scheduling in training deep learning models?
Answer: Learning rate scheduling adjusts the learning rate during training to improve convergence. Types of schedules: step decay (reduce the learning rate at fixed intervals); exponential decay (multiply the learning rate by a factor at each step); cyclic learning rates (oscillate the learning rate within a range).

35. What are the differences between instance normalization, batch normalization, and layer normalization?
Answer: Batch normalization: normalizes activations across a batch of data; useful for training stability. Instance normalization: normalizes activations for each sample; often used in style transfer tasks. Layer normalization: normalizes across features for each sample; effective for RNNs and transformer architectures.
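The three schedules from Question 34, as simple functions of the training step (all hyperparameter values here are illustrative, not from the notes):

```python
def step_decay(lr0, step, drop=0.5, every=10):
    # Halve the learning rate every `every` steps.
    return lr0 * (drop ** (step // every))

def exponential_decay(lr0, step, gamma=0.95):
    # Multiply the learning rate by a constant factor at each step.
    return lr0 * (gamma ** step)

def cyclic_lr(lr_min, lr_max, step, period=20):
    # Oscillate between lr_min and lr_max with a triangular cycle.
    phase = (step % period) / period      # goes 0 -> 1 over one period
    tri = 1.0 - abs(2.0 * phase - 1.0)    # goes 0 -> 1 -> 0
    return lr_min + (lr_max - lr_min) * tri

print(step_decay(0.1, 25))                  # 0.1 * 0.5**2 = 0.025
print(round(exponential_decay(0.1, 2), 5))  # 0.09025
print(cyclic_lr(0.001, 0.01, 10))           # mid-cycle peak (= lr_max)
```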
36. What is the vanishing gradient problem, and how do activation functions like ReLU address it?
Answer: Vanishing gradients occur when gradients shrink exponentially during backpropagation, preventing effective weight updates. ReLU avoids vanishing gradients by letting gradients pass through unchanged for positive values, as its derivative is either 0 or 1.

37. What are the differences between Adam and SGD optimizers?
Answer: SGD (Stochastic Gradient Descent): updates weights using the gradient of the loss function; slower convergence. Adam (Adaptive Moment Estimation): combines momentum and adaptive learning rates for faster convergence and improved stability.

38. What are attention heads in the Transformer model?
Answer: Attention heads in Transformers allow the model to focus on different parts of the input simultaneously. Multi-head attention splits the queries, keys, and values into multiple parts, computes attention independently, and combines the results for better contextual understanding.

39. What is the difference between gradient clipping and gradient normalization?
Answer: Gradient clipping: limits the magnitude of gradients to a pre-defined threshold to prevent exploding gradients. Gradient normalization: scales gradients to have a consistent magnitude.

40. What is the difference between early stopping and checkpointing in training?
Answer: Early stopping: stops training when performance on a validation set stops improving, preventing overfitting. Checkpointing: saves model weights periodically during training; useful for recovering from interruptions or for selecting the best-performing model.

41. What is the difference between the encoder and decoder in sequence-to-sequence models?
Answer: Encoder: processes the input sequence and encodes it into a fixed-length vector or context. Decoder: takes the encoded context and generates the output sequence step by step.
Examples: used in machine translation (e.g., English to French).

42. What is the role of positional encoding in transformers?
Answer: Transformers do not process data sequentially, so positional encoding is added to the input embeddings to provide information about the order of tokens. Positional encodings are sinusoidal functions of different frequencies.

43. What are the challenges of deploying deep learning models in production?
Answer: High inference latency and memory usage; ensuring model robustness to real-world data; scalability under high traffic; maintaining model versioning and reproducibility.

44. What is Layer-wise Relevance Propagation (LRP)?
Answer: LRP is an explainability technique for neural networks. It decomposes the output prediction back onto the input features to show their relevance. It helps interpret model decisions and is used in sensitive domains like healthcare.

45. What is the difference between semantic segmentation and instance segmentation?
Answer: Semantic segmentation: classifies each pixel of an image into a category. Instance segmentation: identifies individual objects of the same class.

46. What is a dilated convolution, and where is it used?
Answer: A dilated convolution (also called atrous convolution) expands the receptive field by inserting spaces between kernel elements. Used in: semantic segmentation (e.g., DeepLab); audio and time-series data analysis.

47. What are the benefits of using cosine similarity over dot product for measuring vector similarity?
Answer: Cosine similarity measures the cosine of the angle between two vectors, focusing on orientation rather than magnitude. Benefits: prevents large magnitude differences from dominating the similarity.

48. What is zero-shot learning, and how does it work?
Answer: Zero-shot learning enables a model to make predictions for classes it has not seen during training.
Mechanism: leverages a shared semantic space (e.g., word embeddings) to transfer knowledge from seen to unseen classes.

49. What is a Siamese network, and where is it used?
Answer: A Siamese network uses two identical subnetworks to compare inputs by learning a similarity metric. Applications: face verification; one-shot learning.

50. What is the purpose of weight initialization in deep learning?
Answer: Proper weight initialization prevents vanishing or exploding gradients and accelerates convergence. Xavier initialization: suitable for activations like sigmoid or tanh. He initialization: designed for ReLU activation functions.

51. What are vanishing and exploding gradients, and how do they impact deep learning models?
Answer: Vanishing gradients: gradients become very small, causing weights to update slowly and halting learning. Exploding gradients: gradients become very large, leading to unstable updates and possible divergence. Solutions: use activation functions like ReLU; implement gradient clipping; use batch normalization or better initialization methods like He initialization.

52. What are the differences between data augmentation and data synthesis?
Answer: Data augmentation: applies transformations to existing data (e.g., rotations, flips, noise). It enhances diversity without altering the class distribution. Data synthesis: generates entirely new data using techniques like GANs or simulations. Useful for handling imbalanced or rare classes.

53. What are the key differences between RNNs, GRUs, and LSTMs?
Answer: RNNs: process sequential data but suffer from vanishing gradients for long sequences. GRUs (Gated Recurrent Units): simplified LSTMs with fewer parameters; combine the forget and input gates. LSTMs (Long Short-Term Memory): use separate forget, input,
and output gates to handle long-term dependencies effectively.

54. What is the purpose of gradient accumulation in deep learning?
Answer: Gradient accumulation splits the batch into smaller micro-batches to compute gradients iteratively, then updates the weights after processing all micro-batches. Benefits: reduces memory usage for large models or small GPUs; simulates larger batch sizes for better convergence.

55. What are capsule networks, and how do they differ from CNNs?
Answer: Capsule networks model spatial relationships between features using vectors instead of scalars like CNNs. Advantages: better handling of spatial hierarchies; preserve orientation and pose information. Example: used in tasks like image classification with fewer training examples.

56. What are deep reinforcement learning (DRL) and its applications?
Answer: DRL combines deep learning and reinforcement learning, where agents learn optimal policies through trial and error. Applications: game playing (e.g., AlphaGo, Dota 2); robotics and control systems; autonomous vehicles.

57. How does dropout work in deep learning, and why is it effective?
Answer: Dropout randomly disables a fraction of neurons during training, preventing overfitting by reducing co-dependencies among neurons. During inference, the full network is used with scaled-down weights.

58. What is label smoothing, and why is it used?
Answer: Label smoothing replaces hard labels (e.g., 1 or 0) with smoothed probabilities (e.g., 0.9 and 0.1). Benefits: reduces overconfidence in predictions; helps the model generalize better. Example: common in image classification with cross-entropy loss.
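Question 54's gradient accumulation can be sketched for a single scalar weight minimizing squared error on y = 2x (a toy setup of mine; real frameworks accumulate tensors the same way):

```python
def grad_mse(w, x, y):
    # Derivative of (w*x - y)^2 with respect to w.
    return 2.0 * (w * x - y) * x

def train_with_accumulation(w, data, lr, accum_steps):
    # Sum gradients over `accum_steps` micro-batches, then apply a single
    # update with the averaged gradient, simulating a larger batch size.
    acc, count = 0.0, 0
    for x, y in data:
        acc += grad_mse(w, x, y)
        count += 1
        if count == accum_steps:
            w -= lr * acc / accum_steps
            acc, count = 0.0, 0
    return w

data = [(1.0, 2.0), (2.0, 4.0), (1.0, 2.0), (2.0, 4.0)]
w = 0.0
for _ in range(50):
    w = train_with_accumulation(w, data, lr=0.05, accum_steps=2)
print(round(w, 3))  # converges toward 2.0, since the targets satisfy y = 2x
```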
59. What is the difference between dense and sparse embeddings?
Answer: Dense embeddings: low-dimensional, continuous-valued vectors (e.g., Word2Vec, BERT). Compact and efficient for downstream tasks. Sparse embeddings: high-dimensional, mostly zero vectors (e.g., one-hot encoding). Inefficient but straightforward.

60. What is the difference between teacher forcing and free-running in sequence models?
Answer: Teacher forcing: during training, the model uses the ground truth as input for the next time step. Speeds up convergence but can lead to exposure bias. Free-running: during inference, the model uses its own predictions as inputs. Better simulates real-world usage.

61. What is the purpose of skip connections in deep neural networks?
Answer: Skip connections, like those used in ResNet, allow gradients to flow more easily through the network, mitigating the vanishing gradient problem. They also enable the model to learn identity mappings for shallow layers.

62. What are adversarial examples, and how do they affect deep learning models?
Answer: Adversarial examples are inputs deliberately perturbed to fool a model into making incorrect predictions. They expose vulnerabilities in deep learning models and highlight the need for robust training techniques like adversarial training.

63. How does the attention mechanism work in deep learning models?
Answer: The attention mechanism assigns different weights to different parts of the input sequence, focusing on the most relevant features for a specific task. For example, in translation tasks, attention aligns words between the source and target languages.

64. What are variational autoencoders (VAEs), and how do they differ from standard autoencoders?
Answer: VAEs generate new data by learning a probabilistic latent space.
Unlike standard autoencoders, they optimize a variational lower bound using both a reconstruction loss and a KL-divergence term to ensure smooth latent space representations.

65. What is the difference between transfer learning and fine-tuning?
Answer: Transfer learning: reusing a pre-trained model's features without retraining them. Fine-tuning: adapting a pre-trained model to a specific task by training some or all of its layers with a new dataset.

66. What is the concept of gradient penalty in GANs?
Answer: Gradient penalty is used to enforce the Lipschitz continuity condition in Wasserstein GANs. It adds a penalty to the loss function based on the gradient norm of the discriminator, stabilizing training.

67. What are self-supervised learning and its applications?
Answer: Self-supervised learning creates labels from the raw data itself to train models without manual annotations. Applications: pre-training models like BERT and SimCLR; applications in computer vision and NLP.

68. What is label imbalance, and how can it be addressed in deep learning?
Answer: Label imbalance occurs when classes in a dataset are not equally represented. Solutions: oversampling minority classes; undersampling majority classes; using class weights in the loss function.

69. What are group normalization and layer normalization?
Answer: Group normalization: normalizes activations within groups of channels; effective for small batch sizes. Layer normalization: normalizes across all features of a single data point; common in NLP tasks.

70. What are hyperparameter optimization techniques in deep learning?
Answer: Grid search; random search; Bayesian optimization; Hyperband or Population Based Training. Tools like Optuna and Ray Tune help automate this process.

71. What is knowledge distillation, and why is it useful?
Answer: Knowledge distillation transfers knowledge from a large, complex model (teacher) to a smaller, faster model (student).
It improves inference speed.

72. What is the difference between BatchNorm, LayerNorm, and InstanceNorm?
Answer: BatchNorm: normalizes over a mini-batch of samples. LayerNorm: normalizes across the features of a single sample. InstanceNorm: normalizes across spatial dimensions for each sample; commonly used in style transfer.

73. What is spectral normalization, and why is it used in GANs?
Answer: Spectral normalization constrains the Lipschitz constant of the discriminator by normalizing its weight matrices. It stabilizes training and prevents mode collapse.

74. What is the SWA (Stochastic Weight Averaging) technique?
Answer: SWA averages weights from multiple SGD steps during training. It improves generalization by converging to flat minima in the loss landscape.

75. What is the Softmax bottleneck problem in language models?
Answer: The Softmax bottleneck limits the expressiveness of language models due to its restricted output distribution. Techniques like adaptive Softmax and Mixture of Softmaxes help address this issue.

76. How does the transformer architecture handle long sequences efficiently?
Answer: Transformers use self-attention mechanisms that process sequences in parallel, unlike RNNs. They can model long-range dependencies without sequential computation.

77. What is a mixture of experts (MoE) model?
Answer: An MoE model combines several sub-models (experts) and uses a gating mechanism to assign weights to each expert for a given input. It is computationally efficient for scaling large models.

78. What are the main differences between Mask R-CNN and Faster R-CNN?
Answer: Faster R-CNN: detects objects and generates bounding boxes. Mask R-CNN: extends Faster R-CNN by adding a mask head for pixel-level segmentation.
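The normalization axes from Question 72 are easy to see on a tiny 2-sample, 3-feature batch. A pure-Python sketch (the epsilon term and learned scale/shift parameters of real BatchNorm/LayerNorm are omitted for brevity):

```python
def mean(v):
    return sum(v) / len(v)

def normalize(v):
    # Zero-mean, unit-variance (population variance; no epsilon).
    m = mean(v)
    var = mean([(x - m) ** 2 for x in v])
    return [(x - m) / var ** 0.5 for x in v]

batch = [[1.0, 2.0, 3.0],   # sample 0
         [3.0, 6.0, 9.0]]   # sample 1

# BatchNorm: normalize each feature (column) across the batch.
cols = list(zip(*batch))
batchnorm = list(zip(*[normalize(list(c)) for c in cols]))

# LayerNorm: normalize each sample (row) across its own features.
layernorm = [normalize(row) for row in batch]

print(batchnorm[0])  # (-1.0, -1.0, -1.0): sample 0 is below the mean on every feature
print(layernorm[0])  # symmetric around 0 within the row
```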
79. What is the purpose of cosine annealing in learning rate scheduling?
Answer: Cosine annealing gradually decreases the learning rate following a cosine curve. It helps achieve better convergence by encouraging the model to settle into a minimum slowly.

80. What is the difference between active learning and semi-supervised learning?
Answer: Active learning: identifies the most informative samples to label from an unlabeled pool. Semi-supervised learning: combines a small labeled dataset with a large unlabeled dataset to improve performance.

81. What is the purpose of gradient clipping, and when is it used?
Answer: Gradient clipping limits the gradient magnitude to prevent exploding gradients, commonly used in RNNs and deep networks. It stabilizes training when gradients become excessively large.

82. What is focal loss, and why is it useful?
Answer: Focal loss is designed to address class imbalance by down-weighting the loss for well-classified examples and focusing on hard-to-classify examples. Formula: FL(p_t) = -(1 - p_t)^γ * log(p_t), where γ controls the focusing effect.

83. What is dilated convolution, and how does it differ from standard convolution?
Answer: Dilated convolution increases the receptive field without increasing the number of parameters by introducing spaces between kernel elements. It is useful in tasks like semantic segmentation.

84. What is weight regularization, and how does it work?
Answer: Weight regularization reduces overfitting by penalizing large weights. L1 regularization: adds |w| to the loss. L2 regularization (weight decay): adds w^2 to the loss.

85. What are the advantages of using mixed precision training?
Answer: Reduces memory usage; increases training speed; achieves comparable accuracy by using lower precision (e.g., FP16) for calculations and higher precision (e.g., FP32) for key operations.
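Question 82's focal loss formula, written out for a single prediction. Here p_t is the predicted probability of the true class and gamma the focusing parameter; the optional alpha class weight from the original formulation is omitted:

```python
import math

def focal_loss(p_t, gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# A well-classified example (high p_t) is down-weighted far more
# aggressively than a hard, misclassified one (low p_t):
easy = focal_loss(0.95)
hard = focal_loss(0.20)
print(round(easy, 5), round(hard, 5))

# With gamma = 0 the modulating factor disappears and focal loss
# reduces to plain cross-entropy, -log(p_t):
print(round(focal_loss(0.5, gamma=0.0), 5))
```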
86. How does early stopping prevent overfitting?
Answer: Early stopping halts training when performance on a validation set stops improving. It prevents the model from overfitting to the training data by stopping at an optimal point.

87. What is the difference between supervised pretraining and self-supervised pretraining?
Answer: Supervised pretraining: pretraining on a labeled dataset before fine-tuning on a specific task. Self-supervised pretraining: pretraining using self-generated labels without human annotations, commonly used in NLP and vision.

88. What is neural architecture search (NAS)?
Answer: NAS is an automated process to find optimal neural network architectures, for example with gradient-based methods.

89. What is an encoder-decoder architecture?
Answer: An encoder-decoder is used in sequence-to-sequence tasks. Encoder: compresses the input into a latent representation. Decoder: generates the output from the latent representation. Examples: translation and summarization.

90. What is the purpose of cosine similarity in NLP tasks?
Answer: It is used for comparing word or sentence embeddings.

91. What is the difference between Seq2Seq models with and without attention?
Answer: Without attention: encodes the entire input into a fixed-length vector. With attention: dynamically focuses on relevant parts of the input sequence for better performance.

92. How does transfer learning benefit small datasets?
Answer: Transfer learning leverages features learned from a large dataset, reducing the need for extensive data. It avoids overfitting and improves generalization on small datasets.

93. What are transposed convolutions, and where are they used?
Answer: Transposed (deconvolutional) convolutions increase spatial resolution, often used in generative tasks like image super-resolution or semantic segmentation.

94. What are GNNs (Graph Neural Networks), and where are they applied?
Answer: GNNs work on graph-structured data, propagating information between nodes.
Applications: social networks; molecular analysis; recommendation systems.

95. What is the purpose of masked language models (MLMs)?
Answer: MLMs, like BERT, predict missing words in a sentence by masking parts of the input. This bidirectional understanding improves performance on NLP tasks.

96. What is layer-wise learning rate scaling?
Answer: Layer-wise learning rate scaling assigns different learning rates to different layers, often smaller rates for pre-trained layers and larger rates for newly added layers.

97. What is knowledge graph embedding?
Answer: Knowledge graph embedding represents entities and relationships in a knowledge graph as low-dimensional vectors. Applications: question answering; recommendation systems.

98. What is a feature pyramid network (FPN)?
Answer: FPN builds a multi-scale feature hierarchy by combining low-resolution, semantically strong features with high-resolution, spatially precise features.

99. What is the difference between a sparse and dense layer?
Answer: Sparse layer: uses sparse matrices to save memory and computational resources. Dense layer: fully connected, requiring more resources but capturing all feature interactions.

100. What is capsule routing in capsule networks?
Answer: Capsule routing ensures that lower-layer capsules send their outputs to higher-layer capsules based on agreement scores. This process preserves spatial relationships between features.

101. What is the vanishing gradient problem, and how is it mitigated?
Answer: The vanishing gradient problem occurs when gradients become extremely small during backpropagation, preventing effective weight updates in earlier layers. This often happens in deep networks with activation functions like sigmoid or tanh. Mitigation strategies: 1. Use ReLU activation functions: ReLU avoids vanishing gradients by having a constant gradient for positive values.
Batch normalization: Normalizes layer inputs to stabilize and maintain gradients. 3. Residual connections: Allow gradients to flow directly through skip connections in deep networks (e.g., ResNet).

102. Explain the concept of teacher forcing in RNNs. Why is it useful?
Answer: Teacher forcing is a technique used in sequence-to-sequence models where the actual target output is used as input to the next time step during training, instead of the predicted output. Advantages: Speeds up convergence by providing ground-truth inputs. Reduces exposure bias (the discrepancy between training and inference). Challenge: At inference, the model might struggle without ground-truth inputs. Scheduled sampling can gradually reduce reliance on teacher forcing.

103. What is the difference between label smoothing and hard labels?
Answer: Hard Labels: Assign a one-hot encoding for the target classes (e.g., (1, 0, 0)). Label Smoothing: Modifies hard labels by assigning a small probability to incorrect classes to make the model less confident in its predictions. For example: (0.9, 0.05, 0.05). Advantages of Label Smoothing: Improves generalization by preventing overconfidence.

104. What is knowledge distillation, and how is it applied?
Answer: Knowledge distillation transfers knowledge from a large, complex model (teacher) to a smaller, simpler model (student). How it works: The student model is trained to mimic the teacher's softened output probabilities instead of the hard labels. Loss Function:
L = (1 - α) * cross_entropy(y, y_student) + α * KL_divergence(softmax(z_teacher / T), softmax(z_student / T))
where T is the temperature and α balances the loss terms. Applications: Deploying efficient models on resource-constrained devices. Model compression.

105. What is gradient centralization, and why is it used?
Answer: Gradient centralization normalizes gradients by subtracting their mean before updating weights. Improves optimization stability. Reduces variance in gradients, especially in deep networks. Commonly used in conjunction with optimizers like SGD or Adam.

106. What is the difference between transductive and inductive learning?
Answer: Transductive Learning: Learns to predict labels only for the given test data, without generalizing to unseen data. Example: Graph-based semi-supervised learning. Inductive Learning: Learns a general function or model that can make predictions on unseen data. Example: Most deep learning models like CNNs or RNNs.

107. What are adversarial examples, and how do you defend against them?
Answer: Adversarial examples are inputs deliberately perturbed to deceive a model into making incorrect predictions while appearing unchanged to humans. Defenses: 1. Adversarial training: Train the model on adversarially perturbed data. 2. Gradient masking: Obfuscate gradients to make it harder for attackers to compute perturbations. 3. Input preprocessing: Techniques like JPEG compression or Gaussian blurring can remove small perturbations.

108. What are transformer models, and how do they differ from RNNs?
Answer: Transformer models use self-attention mechanisms to process sequences, unlike RNNs that process inputs sequentially. Key Differences: 1. Parallelism: Transformers process all input tokens simultaneously, while RNNs process sequentially. 2. Long-term dependencies: Transformers capture long-range dependencies more effectively. 3. Efficiency: Transformers are more efficient with GPUs due to parallelization but require more memory. Examples: BERT, GPT, T5.

109. What is the purpose of positional encoding in transformers?
Answer: Positional encoding allows transformers, which lack inherent sequence awareness, to incorporate the order of tokens in a sequence.
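The standard sinusoidal encoding can be sketched in plain Python. This assumes the usual sin/cos form with base 10000, where each dimension pair (2i, 2i+1) shares one frequency; the embedding size d is assumed even.

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal positional encoding for a single position pos with an
    (even) embedding size d: sin on even indices, cos on odd indices."""
    encoding = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        encoding.append(math.sin(angle))  # index 2i
        encoding.append(math.cos(angle))  # index 2i + 1
    return encoding

print(positional_encoding(0, 8))  # position 0: sin terms 0.0, cos terms 1.0
```

Because each position maps to a distinct pattern of phases, the model can recover token order from these vectors, which are simply added to the token embeddings.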
Formula for sinusoidal encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where pos is the position, i is the dimension index, and d is the embedding size.

110. What is the concept of layer normalization, and how does it differ from batch normalization?
Answer: Layer Normalization: Normalizes inputs across features within a single training example. Commonly used in NLP and transformers. Formula: y = (x - mean) / sqrt(variance + ε). Batch Normalization: Normalizes inputs across the batch for each feature. Common in CNNs. Differences: Batch normalization depends on batch size; layer normalization does not. Layer normalization is more effective in sequence-based models.

111. What are attention mechanisms, and why are they important?
Answer: Attention mechanisms allow models to focus on specific parts of the input sequence when making predictions, assigning varying importance (weights) to different tokens or elements. Types of Attention: 1. Self-Attention: Helps capture relationships within a single sequence. 2. Cross-Attention: Used in sequence-to-sequence models to relate input and output sequences. Importance: Captures long-range dependencies. Enhances interpretability by showing what the model is focusing on. Forms the backbone of transformer architectures.

112. What are capsule networks, and how do they differ from traditional CNNs?
Answer: Capsule networks are designed to model spatial hierarchies by encoding the pose and orientation of features in addition to their presence.
Key Differences: Capsules: Groups of neurons represent the probability and parameters (e.g., pose) of a feature. Dynamic Routing: Capsules communicate with higher-level capsules using routing by agreement. Advantages: Better at understanding hierarchical relationships. More robust to changes in orientation and spatial distortions.

113. What is the difference between data augmentation and data synthesis?
Answer: Data Augmentation: Modifies existing data to increase diversity (e.g., flipping, cropping, adding noise). Commonly used for regularization. Data Synthesis: Generates entirely new data points from a model (e.g., GANs, VAEs). Use Cases: Augmentation helps when the dataset cannot be expanded drastically. Synthesis is useful for creating data in underrepresented categories.

114. What are GANs, and how do they work?
Answer: Generative Adversarial Networks (GANs) consist of two models: Generator: Creates fake data. Discriminator: Distinguishes between real and fake data. Training Process: The generator learns to create realistic data by fooling the discriminator. Both models play a minimax game. Loss Function:
min_G max_D E[log(D(real))] + E[log(1 - D(fake))]
Applications: Image generation, style transfer, and super-resolution.

115. What is the role of the softmax function in neural networks?
Answer: The softmax function converts raw scores (logits) into probabilities, ensuring they sum to 1. It is commonly used in the output layer of classification tasks. Formula: softmax(x_i) = exp(x_i) / sum(exp(x_j) for j in range(n)). Advantages: Provides interpretable class probabilities. Highlights the most likely class while suppressing others. Helps during loss calculation using cross-entropy.

116. What are variational autoencoders (VAEs), and how are they different from standard autoencoders?
Answer: Variational autoencoders are probabilistic models that learn a latent representation as a distribution (mean and variance) rather than a fixed vector. Differences: Standard Autoencoders: Compress input into fixed latent vectors. VAEs: Use a probabilistic approach to generate diverse outputs. Loss Function: L = reconstruction_loss + KL_divergence(latent || prior). Applications: Image synthesis, anomaly detection, and latent space exploration.

117. What is transfer learning, and why is it effective?
Answer: Transfer learning involves reusing a pre-trained model on a related task to improve performance and reduce training time. Effectiveness: Pre-trained models like ResNet and BERT already learn general features, reducing the need for large labeled datasets. Fine-tuning adapts these features to specific tasks. Examples: Using ImageNet-trained CNNs for medical imaging. Adapting BERT for sentiment analysis.

118. What is the difference between gradient clipping and gradient normalization?
Answer: Gradient Clipping: Limits the magnitude of gradients to prevent exploding gradients. If ||gradient|| > threshold: gradient = gradient * (threshold / ||gradient||). Gradient Normalization: Adjusts gradients by dividing them by their norm. Use Cases: Clipping is common in RNNs with vanishing/exploding gradients. Normalization is used for ensuring smooth optimization.

119. How do you calculate the receptive field of a convolutional layer?
Answer: The receptive field is the area of input pixels that influence a particular output feature. Formula: For n layers: R_n = R_(n-1) + (K_n - 1) * J_(n-1), where R_n is the receptive field after layer n, K_n is the kernel size of layer n, and J_(n-1) is the product of the strides of all earlier layers (the "jump"). Importance: Determines the spatial context captured by a convolutional layer.

120. What are the limitations of backpropagation?
Answer: Vanishing/exploding gradients: Can hinder optimization in deep networks. High computation cost: Requires significant memory and computation for large networks. Dependence on labeled data: Backpropagation requires labeled datasets, which can be expensive to acquire. Non-convexity: Optimization often converges to local minima or saddle points. Solutions: Advanced optimizers (Adam, RMSProp), careful weight initialization.

Amar Sharma, AI Engineer. Follow me on LinkedIn for more informative content.
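As a closing worked example, the receptive-field recursion from question 119 can be checked numerically. This is a sketch for 1-D layers described as (kernel, stride) pairs; the layer stack below is hypothetical.

```python
def receptive_field(layers):
    """Receptive field for a stack of (kernel, stride) layers, using
    R_n = R_(n-1) + (K_n - 1) * J_(n-1), where the jump J is the
    running product of all earlier strides (starting at 1)."""
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

# Hypothetical stack: 3-wide conv (stride 1), 2-wide pool (stride 2),
# then another 3-wide conv (stride 1).
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # → 8
```

Note how the stride-2 pooling doubles the contribution of every later kernel: three stacked 3-wide stride-1 convolutions alone would give a receptive field of only 7.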
