Deep Learning Revision Guide

The document covers various modules related to deep learning, including neural networks, loss functions, learning paradigms, and regularization techniques. It provides quick revision notes and multiple-choice questions (MCQs) for each module to test understanding of key concepts. Topics include supervised and unsupervised learning, optimization methods, and recent trends in deep learning research.

Uploaded by

sandip08.dev
Revision Topics

Module 1: Introduction to Deep Learning

Quick Revision Notes:

• Learning Paradigms:
  o Supervised: Data + labels (e.g., classification).
  o Unsupervised: No labels (e.g., clustering, dimensionality reduction).
  o Reinforcement: Agent learns via rewards.
• Perspectives & Issues in Deep Learning:
  o DL allows hierarchical feature learning.
  o Needs huge data, high computation, prone to overfitting.
  o Interpretability is low; generalization and convergence are issues.
• Fundamental Learning Techniques:
  o Linear regression, logistic regression, decision trees, SVM, etc.
  o DL builds upon these using neural network layers.

Module 2: Feedforward Neural Networks

Quick Revision:

• ANN (Artificial Neural Network): Composed of neurons (nodes), connected with weights.
• Activation Functions: Sigmoid, Tanh, ReLU, Leaky ReLU, Softmax.
• Multi-layer NN: Multiple hidden layers enable deep learning.
• Fuzzy Relations: Deals with uncertainty and imprecision.
• Cardinality: Number of elements in a fuzzy set.

Module 3: Training Neural Networks

Quick Revision:

• Loss Functions: MSE, Cross-Entropy
• Backpropagation: Gradient descent + chain rule
• Regularization: L1, L2, Dropout
• Optimization: SGD, Adam, RMSProp
• Model Selection: Choosing best architecture/hyperparams

Module 4: Conditional Random Fields

Quick Revision:

• CRFs: Discriminative models used for structured prediction
• Linear Chain CRF: For sequence data
• Partition Function (Z): Normalizes probabilities
• Belief Propagation: Inference technique
• HMM vs CRF: HMM is generative, CRF is discriminative
• Entropy: Measure of uncertainty
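The neuron and activation-function notes above can be sketched in a few lines of Python (a minimal illustration; the weights and inputs are invented for this example):

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum + bias, passed through a nonlinearity (here: sigmoid)
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def relu(x):
    # ReLU returns max(0, x)
    return max(0.0, x)

def softmax(logits):
    # Subtract the max for numerical stability; outputs sum to 1
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

out = neuron([1.0, 2.0], [0.5, -0.25], 0.1)   # sigmoid(0.1) ≈ 0.525
probs = softmax([2.0, 1.0, 0.1])              # a probability distribution
```

Stacking many such neurons into layers, with a nonlinearity between layers, is what makes a multi-layer network "deep".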


Module 5: Deep Learning

Quick Revision:

• Deep Feedforward Networks: Neural networks with multiple layers where data flows in one direction.
• Regularization Techniques: Methods like L1, L2, and dropout to prevent overfitting.
• Training Deep Models: Involves techniques like batch normalization and careful weight initialization.
• Dropout: A regularization method where random neurons are ignored during training to prevent overfitting.
• Convolutional Neural Networks (CNNs): Specialized for processing grid-like data such as images.
• Recurrent Neural Networks (RNNs): Designed for sequential data, capturing temporal dependencies.
• Deep Belief Networks (DBNs): Composed of multiple layers of stochastic, latent variables.

Module 6: Deep Learning Research

Quick Revision:

• Recent Trends: Transformers, Diffusion Models, Vision Transformers (ViTs), GNNs.
• Self-Supervised Learning (SSL): Learning representations from unlabeled data.
• Ethics in DL: Bias, fairness, explainability, energy usage.
• Zero-Shot / Few-Shot Learning: Model performs tasks it wasn't explicitly trained on.
• Foundation Models: Large models like GPT, BERT, CLIP, trained on massive data.
• Transfer & Multitask Learning: Sharing knowledge across tasks/domains.
• Model Compression: Pruning, quantization, knowledge distillation.
• Explainability Tools: SHAP, LIME, saliency maps.
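The dropout bullet above can be sketched as "inverted dropout", a common formulation: zero each unit with some probability during training and scale the survivors so the expected activation is unchanged (the rate and activation values here are invented):

```python
import random

def dropout(activations, rate, training=True):
    # Inverted dropout: drop each unit with probability `rate`; scale
    # survivors by 1/(1-rate) so the expected value stays the same.
    if not training or rate == 0.0:
        return activations[:]          # no-op at inference time
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0
            for a in activations]

random.seed(0)
h = [0.5, 1.2, -0.3, 0.8]
h_train = dropout(h, rate=0.5)                  # some units zeroed, rest doubled
h_eval = dropout(h, rate=0.5, training=False)   # unchanged at inference
```

Because the scaling happens at training time, the network needs no adjustment when dropout is switched off for evaluation.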
MCQs: Module 1 (Total: 20)

1. Which of the following is a supervised learning algorithm?
A) K-means
B) PCA
C) Decision Tree
D) Autoencoder
Answer: C
Explanation: Decision Trees need labeled data.

2. Unsupervised learning is best suited for:
A) Sentiment classification
B) Stock price prediction
C) Dimensionality reduction
D) Email spam detection
Answer: C

3. Reinforcement learning involves:
A) Data with labels
B) Learning by rewards
C) Learning without feedback
D) Pre-labeled clusters
Answer: B

4. Which one is not a fundamental issue in deep learning?
A) Overfitting
B) Data scarcity
C) Low computation cost
D) Interpretability
Answer: C

5. Which of the following is a key feature of deep learning models?
A) Shallow layers
B) Manual feature extraction
C) Hierarchical feature learning
D) Low data dependency
Answer: C

6. Which paradigm fits clustering problems?
A) Supervised
B) Unsupervised
C) Reinforcement
D) Semi-supervised
Answer: B

7. The main challenge in training deep networks is:
A) High bias
B) Underfitting
C) Vanishing gradients
D) Fast convergence
Answer: C

8. Which learning type is used in ChatGPT?
A) Supervised
B) Unsupervised
C) Reinforcement
D) Both A & C
Answer: D
Explanation: Uses RLHF (Reinforcement Learning from Human Feedback) and supervised fine-tuning.

9. Generalization in ML refers to:
A) Fitting the training data exactly
B) Performing well on test/unseen data
C) Overfitting the model
D) Avoiding any learning
Answer: B

10. Which of the following is not a machine learning paradigm?
A) Supervised
B) Controlled
C) Reinforcement
D) Unsupervised
Answer: B

11. A key limitation of traditional ML compared to DL is:
A) Accuracy
B) Human-designed features
C) Data storage
D) Output speed
Answer: B

12. A neuron in a neural network computes:
A) Only output
B) Weighted sum + activation
C) Gradient only
D) Bias only
Answer: B

13. What is the bias in neural networks?
A) Irrelevant data
B) Constant added to weighted input
C) Noise in output
D) Penalty for error
Answer: B

14. Deep learning often suffers from:
A) Data explosion
B) Interpretability issues
C) Low accuracy
D) Constant learning rate
Answer: B

15. Overfitting occurs when:
A) Model is too simple
B) Model is too complex for data
C) Too little training
D) No bias
Answer: B

16. Which of the following is NOT a traditional ML technique?
A) SVM
B) Decision Tree
C) CNN
D) Naive Bayes
Answer: C

17. One major advantage of DL over ML:
A) Smaller models
B) Less training time
C) Feature engineering is automated
D) More memory needed
Answer: C

18. An AI system that plays chess using rewards is using:
A) Supervised Learning
B) Unsupervised Learning
C) Reinforcement Learning
D) Regression
Answer: C

19. "Epoch" in deep learning refers to:
A) Time to build the model
B) Full pass over training data
C) Layer depth
D) Regularization rate
Answer: B

20. Interpretability in DL is low because:
A) It's too accurate
B) It uses simpler models
C) It has many complex layers
D) It's easy to visualize
Answer: C
MCQs: Module 2 (20 Questions)

1. What does a neuron in ANN compute?
A) Gradient
B) Activation function only
C) Weighted sum + activation
D) Data labels
Answer: C

2. The purpose of an activation function is to:
A) Increase training time
B) Normalize input
C) Introduce non-linearity
D) Reduce accuracy
Answer: C

3. Which of these is not an activation function?
A) Sigmoid
B) ReLU
C) Softmax
D) SVM
Answer: D

4. Which activation function is used for binary classification?
A) ReLU
B) Sigmoid
C) Softmax
D) Tanh
Answer: B

5. ReLU returns:
A) Only 1 or -1
B) Linear values
C) Max(0, x)
D) Negative values only
Answer: C

6. Cardinality in fuzzy sets refers to:
A) Degree of fuzziness
B) Total elements
C) Number of neurons
D) Shape of fuzzy graph
Answer: B

7. A perceptron is:
A) Linear classifier
B) Deep network
C) Optimization algorithm
D) Error correction mechanism
Answer: A

8. A multilayer perceptron is called deep if:
A) It has over 100 neurons
B) It uses sigmoid
C) It has more than 1 hidden layer
D) It uses unsupervised learning
Answer: C

9. In a neural network, weights are:
A) Static values
B) Updated during training
C) Random noise
D) Biases
Answer: B

10. Tanh activation function ranges from:
A) 0 to 1
B) -1 to 1
C) -∞ to ∞
D) 0 to ∞
Answer: B

11. A Softmax function is used in:
A) Regression
B) Binary classification
C) Multi-class classification
D) Clustering
Answer: C

12. Which is NOT a property of fuzzy relations?
A) Reflexivity
B) Transitivity
C) Symmetry
D) Supervised learning
Answer: D

13. What is the primary challenge with sigmoid?
A) Always 1
B) Not differentiable
C) Vanishing gradient
D) Exploding output
Answer: C

14. Multi-layer neural networks solve which major problem of perceptrons?
A) Random output
B) Linearly inseparable data
C) Speed
D) Cost
Answer: B

15. Leaky ReLU helps with:
A) Bias issues
B) Overfitting
C) Dying ReLU problem
D) Vanishing gradient
Answer: C

16. Output of sigmoid function is:
A) Discrete
B) Probabilistic
C) Deterministic
D) Undefined
Answer: B

17. In fuzzy logic, membership value lies between:
A) -1 and 1
B) 0 and 10
C) 0 and 1
D) 1 and 100
Answer: C

18. The number of output neurons in classification = ?
A) 1
B) Number of classes
C) Number of samples
D) Infinite
Answer: B

19. ANN is inspired by:
A) Animal fur
B) Electrical circuits
C) Human brain
D) DNA
Answer: C

20. The final layer of a binary classifier typically uses:
A) ReLU
B) Softmax
C) Sigmoid
D) Linear
Answer: C
MCQs: Module 3 (20 Questions)

1. Which is a loss function used for regression?
A) Cross-entropy
B) Hinge loss
C) Mean Squared Error
D) Negative log-likelihood
Answer: C

2. Cross-entropy loss is used in:
A) Clustering
B) Regression
C) Binary/Multi-class classification
D) Reinforcement learning
Answer: C

3. Backpropagation uses:
A) Newton's law
B) Chain rule
C) Euler's method
D) L'Hospital's Rule
Answer: B

4. Regularization is used to:
A) Increase training error
B) Overfit model
C) Reduce overfitting
D) Improve underfitting
Answer: C

5. L1 regularization promotes:
A) Complex models
B) Larger weights
C) Sparsity
D) Overfitting
Answer: C

6. L2 regularization is also known as:
A) Ridge
B) Lasso
C) Dropout
D) Bias
Answer: A

7. Dropout helps by:
A) Increasing loss
B) Reducing variance
C) Increasing gradient
D) Stopping backprop
Answer: B

8. Gradient descent works by:
A) Increasing error
B) Maximizing loss
C) Minimizing loss
D) Random updates
Answer: C

9. Adam optimizer uses:
A) Momentum + RMS
B) Only gradient
C) Bias correction only
D) L2 norm only
Answer: A

10. Model selection involves:
A) Data labeling
B) Choosing optimizers
C) Tuning architecture and hyperparameters
D) Data cleaning
Answer: C

11. Backpropagation adjusts:
A) Output
B) Weights and biases
C) Epochs
D) Inputs
Answer: B

12. Risk minimization refers to:
A) Increasing training size
B) Regularization
C) Reducing expected loss
D) More epochs
Answer: C

13. Training loss is:
A) Always zero
B) Only for test data
C) Computed on training data
D) Ignored
Answer: C

14. Overfitting happens when:
A) Model is simple
B) Model memorizes training data
C) Model ignores training data
D) Gradient is large
Answer: B

15. Epoch refers to:
A) One iteration
B) One complete training pass
C) Only one batch
D) Loss value
Answer: B

16. Validation loss helps to detect:
A) Training convergence
B) Overfitting
C) Input noise
D) Batch size
Answer: B

17. Hyperparameters include:
A) Bias
B) Weights
C) Learning rate, batch size
D) Activation outputs
Answer: C

18. Learning rate too high causes:
A) Fast convergence
B) Better accuracy
C) Overshooting minimum
D) Underfitting
Answer: C

19. Which optimizer adapts learning rate per parameter?
A) SGD
B) RMSProp
C) Gradient Boost
D) Naive Bayes
Answer: B

20. Backpropagation stops at:
A) Output layer
B) Input layer
C) Mid layer
D) Randomly
Answer: B
MCQs: Module 4 (15 Questions)

1. CRF is a:
A) Generative model
B) Discriminative model
C) Reinforcement model
D) Unsupervised model
Answer: B

2. Linear chain CRF is suitable for:
A) Tabular data
B) Images
C) Sequence labeling
D) Clustering
Answer: C

3. Partition function in CRF is used to:
A) Update weights
B) Normalize probability
C) Calculate entropy
D) Define classes
Answer: B

4. Difference between HMM and CRF:
A) CRF models joint probability
B) HMM uses hidden states
C) HMM is discriminative
D) CRF is generative
Answer: B

5. Which is an inference technique in CRFs?
A) Dropout
B) Forward pass
C) Belief propagation
D) Stochastic sampling
Answer: C

6. Entropy is defined as:
A) Measure of certainty
B) Data redundancy
C) Uncertainty in distribution
D) Gradient
Answer: C

7. CRFs are best used in:
A) Image classification
B) Reinforcement learning
C) Named Entity Recognition
D) Clustering
Answer: C

8. Training CRFs involves maximizing:
A) Joint likelihood
B) Posterior probability
C) Entropy
D) Partition loss
Answer: B

9. Hidden states in HMM are:
A) Directly observable
B) Learned through gradient descent
C) Latent
D) Always constant
Answer: C

10. Which models rely on partition functions?
A) Linear regression
B) CRFs
C) KNN
D) Decision Trees
Answer: B

11. Which of the following is a key limitation of HMMs?
A) Scalability
B) Inability to handle overlapping features
C) Fast inference
D) Non-sequential learning
Answer: B

12. What helps compute marginals in CRFs?
A) Backpropagation
B) Message passing
C) Pooling
D) Loss calculation
Answer: B

13. What improves CRF performance?
A) Fewer parameters
B) More hidden states
C) Richer features
D) Larger loss
Answer: C

14. Markov network is:
A) Directed
B) Undirected graphical model
C) Tree-based model
D) Decision rule
Answer: B

15. Which is better for structured output problems?
A) CRF
B) SVM
C) Naive Bayes
D) Perceptron
Answer: A

MCQs: Module 5 (20 Questions)

1. What distinguishes a deep feedforward network from a shallow one?
A) Use of convolutional layers
B) Presence of recurrent connections
C) Multiple hidden layers
D) Use of dropout
Answer: C

2. Which regularization technique involves adding a penalty equal to the absolute value of the magnitude of coefficients?
A) L1 Regularization
B) L2 Regularization
C) Dropout
D) Batch Normalization
Answer: A

3. Dropout helps prevent overfitting by:
A) Increasing the learning rate
B) Reducing the number of layers
C) Randomly setting a fraction of input units to zero during training
D) Adding more neurons
Answer: C

4. Which layer is primarily responsible for feature extraction in CNNs?
A) Fully connected layer
B) Convolutional layer
C) Pooling layer
D) ReLU layer
Answer: B

5. RNNs are particularly suited for:
A) Image classification
B) Sequential data processing
C) Tabular data analysis
D) Clustering tasks
Answer: B

6. What is the primary function of pooling layers in CNNs?
A) To increase the size of the feature maps
B) To reduce the spatial dimensions of the feature maps
C) To apply activation functions
D) To normalize the data
Answer: B
7. Which activation function is commonly used in deep neural networks due to its simplicity and effectiveness?
A) Sigmoid
B) Tanh
C) ReLU
D) Softmax
Answer: C

8. In the context of deep learning, what does 'vanishing gradient' refer to?
A) Gradients that become too large
B) Gradients that become too small, hindering learning
C) Loss of data during training
D) Overfitting of the model
Answer: B

9. Deep Belief Networks are composed of multiple layers of:
A) Convolutional layers
B) Recurrent layers
C) Restricted Boltzmann Machines
D) Decision trees
Answer: C

10. Which technique is used to prevent exploding gradients in deep networks?
A) Dropout
B) Gradient Clipping
C) Batch Normalization
D) Weight Decay
Answer: B

11. Batch normalization helps in:
A) Reducing internal covariate shift
B) Increasing overfitting
C) Decreasing model complexity
D) Eliminating the need for activation functions
Answer: A

12. The main advantage of using CNNs in image processing is:
A) Their ability to handle sequential data
B) Parameter sharing and spatial invariance
C) Requirement of less data
D) Simpler architecture
Answer: B

13. Which of the following is a common issue when training deep neural networks?
A) Overfitting
B) Underfitting
C) High bias
D) All of the above
Answer: D

14. What is the role of the Softmax function in neural networks?
A) To introduce non-linearity
B) To normalize outputs into probability distributions
C) To reduce dimensionality
D) To prevent overfitting
Answer: B

15. Which technique involves training a model on one task and then fine-tuning it on another related task?
A) Transfer Learning
B) Regularization
C) Data Augmentation
D) Ensemble Learning
Answer: A

16. In RNNs, the problem of long-term dependencies is addressed by:
A) Using more layers
B) Applying dropout
C) Implementing LSTM or GRU units
D) Reducing the learning rate
Answer: C

17. Which of the following is NOT a characteristic of Deep Belief Networks?
A) Unsupervised pre-training
B) Stacked Restricted Boltzmann Machines
C) Feedforward architecture
D) Use of convolutional layers
Answer: D

18. The primary purpose of using activation functions in neural networks is to:
A) Introduce non-linearity
B) Reduce computation time
C) Normalize the output
D) Increase the number of parameters
Answer: A
19. Which of the following is a benefit of using dropout during training?
A) Faster convergence
B) Reduced training time
C) Improved generalization
D) Increased model complexity
Answer: C

20. In CNNs, the function of the ReLU activation is to:
A) Apply dropout
B) Normalize data
C) Remove negative values by outputting zero
D) Reduce overfitting
Answer: C

MCQs: Module 6 (30 Questions)

1. What is the core idea behind self-supervised learning?
A) Use labeled data only
B) Learn from noisy data
C) Generate supervisory signals from the data itself
D) Use reinforcement learning rewards
Answer: C

2. Which architecture is the backbone of models like BERT and GPT?
A) CNN
B) RNN
C) Transformer
D) DBN
Answer: C

3. A Vision Transformer (ViT) differs from CNNs in that it:
A) Uses convolution for patches
B) Ignores positional encoding
C) Uses self-attention to process image patches
D) Cannot be used for classification
Answer: C

4. What is Zero-Shot Learning?
A) Learning without training data
B) Predicting classes never seen during training
C) Training on zero epochs
D) Training with infinite data
Answer: B

5. CLIP from OpenAI learns a joint representation of:
A) Audio and video
B) Image and text
C) Text and speech
D) Image and audio
Answer: B

6. Which method reduces model size without significant accuracy loss?
A) Fine-tuning
B) Pruning
C) BatchNorm
D) Data Augmentation
Answer: B

7. What does quantization do?
A) Converts models to float64
B) Increases number of layers
C) Reduces model precision to save space
D) Adds noise to training data
Answer: C

8. In knowledge distillation, the smaller model is called the:
A) Student
B) Teacher
C) Assistant
D) Expert
Answer: A

9. Which task benefits most from GNNs (Graph Neural Networks)?
A) Image recognition
B) Sequential tagging
C) Node classification
D) Language translation
Answer: C

10. Which of these is a tool for model explainability?
A) Adam Optimizer
B) SHAP
C) RMSProp
D) KNN
Answer: B
11. Transformers rely on which mechanism to process sequences?
A) Convolution
B) Recurrent cells
C) Attention
D) Pooling
Answer: C

12. What is the main challenge of training large deep models?
A) Too much speed
B) Overgeneralization
C) Compute and memory constraints
D) Lack of activation functions
Answer: C

13. Which of the following is a foundation model?
A) SVM
B) CNN
C) GPT
D) LDA
Answer: C

14. In few-shot learning, models:
A) Train without supervision
B) Learn from a large number of examples
C) Adapt to new tasks with few examples
D) Only predict classes seen in training
Answer: C

15. Model distillation is used to:
A) Enlarge models
B) Increase training data
C) Transfer knowledge to a smaller model
D) Remove layers
Answer: C

16. The BERT model was trained using:
A) Masked Language Modeling
B) Sequence-to-sequence learning
C) One-shot learning
D) GNN-based training
Answer: A

17. What does the "transformer" eliminate compared to RNNs?
A) Fully connected layers
B) Need for sequential processing
C) Backpropagation
D) Attention mechanism
Answer: B

18. Which loss function is commonly used in contrastive learning?
A) Cross-entropy
B) Hinge loss
C) Triplet loss
D) Contrastive loss
Answer: D

19. In explainable AI, a saliency map is used to:
A) Visualize neuron weights
B) Highlight important input regions
C) Add noise
D) Reduce training time
Answer: B

20. What is the role of positional encoding in Transformers?
A) Normalize input
B) Capture spatial info
C) Retain word order
D) Enhance speed
Answer: C

21. Which company introduced the ViT model?
A) Meta
B) OpenAI
C) DeepMind
D) Google
Answer: D

22. What is a major ethical concern in deep learning research?
A) Overfitting
B) Floating point precision
C) Model fairness and bias
D) High dropout rates
Answer: C
23. GPT models are trained using:
A) Masked tokens
B) Next-token prediction
C) Multimodal inputs
D) Vision transformers
Answer: B

24. Few-shot learning typically uses which algorithmic technique?
A) Meta-learning
B) SGD
C) MLP
D) DBNs
Answer: A

25. Which is a challenge in deploying large models like GPT?
A) Too few parameters
B) Model compression
C) Latency and hardware requirements
D) Inflexible architecture
Answer: C

26. What is "catastrophic forgetting" in multitask learning?
A) Losing training data
B) Forgetting old tasks while learning new ones
C) Overfitting
D) Saturated neurons
Answer: B

27. A major advantage of transformer-based models is:
A) Low compute cost
B) Training on small datasets
C) Parallelization
D) Recurrence
Answer: C

28. In diffusion models (like Stable Diffusion), output is generated by:
A) Upscaling inputs
B) Reversing noise process
C) Using RNNs
D) Hash encoding
Answer: B

29. SHAP values explain:
A) Gradients
B) Feature importance for predictions
C) Model weights
D) Accuracy
Answer: B

30. Which of these is a method to improve inference efficiency in large models?
A) Ensemble models
B) Pruning
C) Increasing hidden layers
D) Decreasing batch size
Answer: B

Common questions

Message passing in CRFs is crucial for computing marginals, as it allows the model to efficiently propagate information through the nodes, facilitating marginal probability computations and enabling efficient inference. This process is integral for tasks like sequence labeling, ensuring accurate probability distributions are calculated for sequences based on input data.
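The message passing described above can be illustrated with the forward recursion for a linear-chain model: the partition function Z is computed by passing summed scores left to right instead of enumerating every label sequence (a toy sketch; all scores below are invented):

```python
import math
from itertools import product

# Toy linear-chain scores: 2 labels over 3 positions.
# emit[t][y]: score of label y at position t; trans[a][b]: transition score.
emit = [[1.0, 0.5], [0.2, 1.5], [0.8, 0.3]]
trans = [[0.6, 0.4], [0.1, 0.9]]

def z_brute_force():
    # Enumerate all 2**3 label sequences and sum their exponentiated scores.
    total = 0.0
    for seq in product(range(2), repeat=3):
        s = sum(emit[t][y] for t, y in enumerate(seq))
        s += sum(trans[a][b] for a, b in zip(seq, seq[1:]))
        total += math.exp(s)
    return total

def z_forward():
    # Forward message passing: alpha[y] accumulates all paths ending in
    # label y, so Z is computed in time linear in the sequence length.
    alpha = [math.exp(e) for e in emit[0]]
    for t in range(1, len(emit)):
        alpha = [sum(alpha[a] * math.exp(trans[a][b] + emit[t][b])
                     for a in range(2)) for b in range(2)]
    return sum(alpha)
```

Both routines return the same Z; belief propagation generalizes this recursion beyond chains to tree-structured graphs.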

Deep Belief Networks (DBNs) typically use unsupervised pre-training, which involves training the network layer by layer using Restricted Boltzmann Machines (RBMs) to learn the underlying structure of the input data without labels. This pre-training helps in improving convergence during the subsequent supervised training phase, thereby minimizing overfitting and enhancing the learning of complex patterns in large networks .

The attention mechanism in transformers allows the model to weigh the influence of different parts of the input sequence dynamically, providing direct, contextually informed connections between distant sequence elements. Unlike RNNs, which process sequences sequentially and often struggle with long-term dependencies, transformers handle sequences in parallel, facilitating efficient processing and capturing complex dependencies in the data.
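The dynamic weighting described above can be sketched as scaled dot-product attention over a tiny sequence (a pure-Python toy; the query, key, and value vectors are invented):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    # Each output is a softmax(q·k / sqrt(d))-weighted average of the values,
    # so every position can attend directly to every other position.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
ctx = attention(q, k, v)   # the query attends mostly to the first key/value
```

All positions are processed with the same few matrix products, which is why attention parallelizes where recurrence cannot.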

Regularization helps prevent overfitting by adding a penalty to the loss function, which discourages complex models. L1 regularization (Lasso) promotes sparsity by adding the absolute value of the weights, leading to some weights being zeroed out, while L2 regularization (Ridge) adds the square of the weights, which tends to distribute weights more evenly and prevents large weights.
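The two penalties can be written down directly; a minimal sketch of adding L1 and L2 terms to a loss, with an invented weight vector, data loss, and strength lambda:

```python
def l1_penalty(weights, lam):
    # Lasso: lambda * sum(|w|) -- pushes small weights exactly to zero
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # Ridge: lambda * sum(w^2) -- shrinks all weights smoothly
    return lam * sum(w * w for w in weights)

w = [0.5, -2.0, 0.0, 1.5]
data_loss = 1.0                                 # stand-in for MSE/cross-entropy
loss_l1 = data_loss + l1_penalty(w, lam=0.1)    # 1.0 + 0.1 * 4.0  = 1.4
loss_l2 = data_loss + l2_penalty(w, lam=0.1)    # 1.0 + 0.1 * 6.5  = 1.65
```

Because |w| penalizes small and large weights at the same rate while w² barely penalizes small ones, L1 drives weights to exactly zero where L2 merely shrinks them.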

Knowledge distillation is a process where a smaller model (the student) is trained to replicate the output of a larger model (the teacher). The benefits include reduced model size and improved inference speed, without significantly sacrificing accuracy. It achieves this by transferring the knowledge learned by the complex teacher model, thus retaining its predictive power in a more compact form.
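A common mechanism for this transfer is training the student against temperature-softened teacher outputs; the sketch below (with invented logits and temperature) shows how a higher temperature exposes the teacher's relative class preferences:

```python
import math

def softmax(logits, T=1.0):
    # Higher temperature T softens the distribution
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

teacher_logits = [8.0, 2.0, 1.0]
hard = softmax(teacher_logits, T=1.0)   # nearly one-hot: little to imitate
soft = softmax(teacher_logits, T=4.0)   # reveals relative class similarities

def cross_entropy(p, q):
    # The student is trained to match the teacher's soft targets
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

student_probs = softmax([5.0, 2.5, 1.0], T=4.0)
distill_loss = cross_entropy(soft, student_probs)
```

Minimizing this loss pulls the student's softened distribution toward the teacher's, which carries more information than the hard labels alone.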

The primary function of pooling layers in CNNs is to reduce the spatial dimensions of feature maps, which helps in controlling overfitting, reducing computation and memory costs, and making the network invariant to minor changes in the position of features in the input image. This is crucial for maintaining relevant spatial hierarchies in image processing tasks.
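The dimension reduction above is easy to see in code: a 2x2 max-pooling pass over a small invented feature map halves each spatial dimension while keeping the strongest activation in each window:

```python
def max_pool_2x2(fmap):
    # 2x2 max pooling with stride 2: each output cell is the maximum of a
    # non-overlapping 2x2 window, so spatial dimensions are halved.
    rows, cols = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, cols, 2)]
            for i in range(0, rows, 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool_2x2(fmap)   # 4x4 -> 2x2: [[4, 2], [2, 8]]
```

Shifting a feature by one pixel inside its window leaves the pooled output unchanged, which is the source of the small translation invariance mentioned above.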

Graph Neural Networks (GNNs) are specifically designed to operate on graph-structured data, which allows them to capture the dependencies between nodes directly. This ability to leverage the inherent graph structure is what sets them apart from traditional neural networks, making GNNs particularly effective for node classification tasks, where relationships between nodes contribute significantly to the task.

Transfer learning enhances training efficiency by utilizing the pre-trained knowledge from a related task, thereby requiring less data and fewer computational resources to achieve high performance on the new task. This is especially beneficial in scenarios with limited data, as the model can leverage generalized features learned previously to adapt rapidly to new but related tasks.

The primary challenge associated with the sigmoid activation function is its susceptibility to the vanishing gradient problem: its gradients become very small during backpropagation, hindering learning in deep networks. This is typically addressed by using alternative activation functions such as ReLU (Rectified Linear Unit), whose gradient does not saturate for positive inputs and so largely avoids the problem.
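The effect is visible with a few lines of arithmetic: the sigmoid's derivative peaks at 0.25, so chaining it through ten layers (even in the best case) shrinks the gradient by 0.25¹⁰ (a toy illustration, not a full backpropagation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value 0.25, at x = 0

# Chain rule through 10 sigmoid layers, all at the best case x = 0:
factor = 1.0
for _ in range(10):
    factor *= sigmoid_grad(0.0)   # multiply in one layer's local gradient

# factor == 0.25**10 ≈ 9.5e-7: the gradient has effectively vanished.

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # ReLU passes the gradient unchanged for x > 0
```

Away from x = 0 the sigmoid's gradient is even smaller, so real networks fare worse than this best-case product suggests.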

Convolutional layers in CNNs are parameter-efficient because they share weights across spatial locations, significantly reducing the total number of parameters compared to fully connected layers. This makes them highly effective at capturing spatial hierarchies in the input data, such as identifying patterns and textures. Fully connected layers, by contrast, have one weight per input-output pair, which often renders them far less efficient for spatial data.
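The parameter savings above can be checked with simple counting; the layer sizes in this sketch (a 32x32 RGB input and 64 output channels) are chosen purely for illustration:

```python
def conv_params(in_ch, out_ch, k):
    # A conv layer reuses the same k x k kernel at every spatial location:
    # parameters = out_ch * (in_ch * k * k + 1 bias per output channel).
    return out_ch * (in_ch * k * k + 1)

def fc_params(in_features, out_features):
    # A fully connected layer has one weight per input-output pair, plus biases.
    return out_features * (in_features + 1)

# 32x32 RGB input; 64 output channels vs. 64*32*32 fully connected units:
conv = conv_params(3, 64, k=3)             # 64 * (27 + 1) = 1,792 parameters
fc = fc_params(3 * 32 * 32, 64 * 32 * 32)  # 65,536 * 3,073 ≈ 201 million
```

The gap of five orders of magnitude is exactly the weight sharing the paragraph describes: the convolution pays for one small kernel per channel pair, not one weight per pixel pair.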
