Understanding Recurrent Neural Networks
Uploaded by YASH AHIRRAO

Unit IV

Convolutional Neural Network (CNN)


• Recurrent and Recursive Nets: Unfolding Computational Graphs, Recurrent Neural
Networks, Bidirectional RNNs, Encoder-Decoder Sequence-to-Sequence Architectures,
Deep Recurrent Networks, Recursive Neural Networks,
• The Challenge of Long-Term Dependencies,
• Echo State Networks, Leaky Units and Other Strategies for Multiple Time Scales,
• The Long Short-Term Memory and Other Gated RNNs,
• Optimization for Long-Term Dependencies, Explicit Memory.
• Practical Methodology: Performance Metrics, Default Baseline Models, Determining
Whether to Gather More Data, Selecting Hyperparameters.
A Recurrent Neural Network (RNN) is a type of artificial neural network designed to process
sequential data, where the order of inputs matters. It's like a neural network with memory, as it
uses previous outputs as inputs for the current step, allowing it to learn and predict patterns in
sequences. RNNs are essential for tasks like natural language processing, speech recognition,
and time series prediction.

Need for RNN:


•Sequential Data Processing: RNNs are specifically designed to handle sequential data, where
the order of inputs is crucial for understanding the overall pattern.
•Contextual Understanding: By incorporating previous outputs as inputs, RNNs can capture
the context and dependencies within a sequence.
•Memory: RNNs possess an internal memory that allows them to remember previous inputs,
making them ideal for tasks where prior information is important.
•Time Series Prediction: RNNs excel at predicting future values in time series data by learning patterns from past data.
Brief Working of RNN:
1. Input Layer:
The input layer receives the current input at a specific time step.
2. Hidden Layer:
The hidden layer processes the input and incorporates information from the previous time step
through a feedback loop, acting as the "memory" of the network.
3. Output Layer:
The output layer generates the output for the current time step, which is then fed back into the
hidden layer as input for the next time step.
4. Iteration:
This process is repeated for each time step in the sequence, allowing the RNN to learn and
predict patterns based on the entire sequence.
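The input-hidden-output cycle described above can be sketched as follows. This is a minimal illustration of one recurrent step, assuming a plain (Elman-style) RNN; the weight names (U, W, V, b, c) and shapes follow a common textbook convention, not any specific library's API.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    """Hidden layer: combine the current input with the previous hidden state."""
    return np.tanh(U @ x_t + W @ h_prev + b)

def rnn_output(h_t, V, c):
    """Output layer: read out the current hidden state."""
    return V @ h_t + c

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
U = rng.normal(size=(n_hid, n_in))
W = rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)

h = np.zeros(n_hid)                  # initial "memory"
xs = rng.normal(size=(5, n_in))      # a toy sequence of 5 time steps
for x_t in xs:                       # iterate over time steps
    h = rnn_step(x_t, h, U, W, b)    # hidden state carries history forward
    o = rnn_output(h, V, c)          # output for the current step
print(o.shape)  # (2,)
```

The feedback loop is simply the reuse of `h` across loop iterations: the hidden state computed at one step becomes an input at the next.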

Imagine you're reading a sentence. You need to understand the meaning of each word not just
in isolation, but also in relation to the words before it. An RNN is like a reader that remembers
the previous words and uses that information to understand the current word and predict the
next.
What are the types of RNN (Recurrent Neural Network)? How is an RNN trained? Explain in brief.
Recurrent Neural Networks (RNNs) come in different types based on their input and output
sequences, including one-to-one, one-to-many, many-to-one, and many-to-many. They are
trained using backpropagation through time (BPTT) which is an extension of the standard
backpropagation algorithm.
Types of RNNs:
One-to-One: This is a simple neural network architecture where a single input produces a
single output.
One-to-Many: In this type, a single input generates a sequence of outputs.
Many-to-One: Here, a sequence of inputs is processed, and a single output is produced.
Many-to-Many: This architecture handles both input and output as sequences, making it
suitable for tasks like machine translation.
Training RNNs:
RNNs are trained using backpropagation through time (BPTT). This involves:
1. Forward Pass:
The input sequence is fed through the network, and the outputs are generated.
2. Calculating Loss:
The difference between the predicted and actual outputs is calculated.
3. Backward Propagation:
The error is propagated backward through the network's layers, adjusting the weights to
minimize the loss.
4. Repeating:
This process is repeated for multiple iterations (epochs) until the model converges and
performs well on the training data.
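The four BPTT steps above can be sketched end-to-end on a tiny many-to-one RNN. This is only an illustrative sketch: all names, shapes, and the learning rate are assumptions, and squared error stands in for the loss.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, T, lr = 2, 3, 4, 0.1
U = rng.normal(size=(n_hid, n_in)) * 0.5
W = rng.normal(size=(n_hid, n_hid)) * 0.5
v = rng.normal(size=n_hid) * 0.5
xs = rng.normal(size=(T, n_in))
y = 1.0                                   # target for the whole sequence

for epoch in range(200):                  # 4. repeat until the model converges
    # 1. forward pass: run the sequence, keeping every hidden state
    hs = [np.zeros(n_hid)]
    for x_t in xs:
        hs.append(np.tanh(U @ x_t + W @ hs[-1]))
    pred = v @ hs[-1]
    # 2. calculate loss (squared error here, for simplicity)
    loss = 0.5 * (pred - y) ** 2
    # 3. backward propagation: push the error through every time step
    dU, dW = np.zeros_like(U), np.zeros_like(W)
    dv = (pred - y) * hs[-1]
    dh = (pred - y) * v
    for t in range(T, 0, -1):
        da = dh * (1.0 - hs[t] ** 2)      # back through tanh
        dW += np.outer(da, hs[t - 1])
        dU += np.outer(da, xs[t - 1])
        dh = W.T @ da                     # pass the gradient to earlier steps
    U -= lr * dU; W -= lr * dW; v -= lr * dv
print(round(float(loss), 4))
```

The inner backward loop is the "through time" part of BPTT: the same weight matrices receive gradient contributions from every time step.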

Key Considerations:
Vanishing/Exploding Gradients:
A common challenge in training RNNs is the vanishing or exploding gradients, which can
hinder the learning process. Techniques like gradient clipping and LSTM/GRU architectures
are used to mitigate these issues.
LSTM and GRU:
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are specialized RNN
architectures that effectively address the vanishing gradient problem and are commonly used
in various applications.
Recurrent Neural Networks (RNNs) are designed for handling sequential data.
RNNs share parameters across different positions / index of time / time steps of the
sequence, which makes it possible to generalize well to examples of different sequence length.
RNN is usually a better alternative to position-independent classifiers and sequential models
that treat each position differently.
How does an RNN share parameters? Each member of the output is produced using the
same update rule applied to the previous outputs. Such an update rule is often the same NN
layer, like the “A” in the figure below.

Notation: We refer to RNNs as operating on a sequence that contains vectors x(t) with the
time step index t ranging from 1 to τ. Usually, there is also a hidden state vector h(t) for each
time step t.
Unfolding Computational Graphs
The basic formula of the RNN shown on the previous slide is:

    h(t) = f(h(t-1), x(t); θ)

It says the current hidden state h(t) is a function f of the previous hidden state h(t-1) and the
current input x(t); θ denotes the parameters of f. The network typically learns to use h(t) as a
kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t.
Unfolding maps the left to the right in the figure below (both are computational graphs of a
RNN without output o) where the black square indicates that an interaction takes place with a
delay of 1 time step, from the state at time t to the state at time t + 1.

Unfolding/parameter sharing is better than using different parameters per position: fewer
parameters to estimate, and generalization to sequences of various lengths.
Fig. A recurrent network with no outputs, which just processes the input x into its state h.
Recurrent Neural Network
Variation 1 of RNN (basic form): hidden-to-hidden connections, sequence output, as in the Fig. below.

The computational graph to compute the training loss of a recurrent network that maps an
input sequence of x values to a corresponding sequence of output o values.
The basic equations that define the above RNN are:

    a(t) = b + W h(t-1) + U x(t)
    h(t) = tanh(a(t))
    o(t) = c + V h(t)
    ŷ(t) = softmax(o(t))

The total loss for a given sequence of x values paired with a sequence of y values is just
the sum of the losses over all the time steps. For example, if L(t) is the negative
log-likelihood of y(t) given x(1), …, x(t), then summing them up gives the loss for the
sequence:

    L({x(1),…,x(τ)}, {y(1),…,y(τ)}) = Σ_t L(t) = − Σ_t log p_model( y(t) | x(1),…,x(t) )

•Forward Pass: The runtime is O(τ) and cannot be reduced by parallelization because the
forward propagation graph is inherently sequential; each time step may only be computed after
the previous one.
•Backward Pass: Use the back-propagation through time (BPTT) algorithm on the unrolled
graph. Basically, it is the application of the chain rule on the unrolled graph for the parameters U, V,
W, b and c, as well as the sequence of nodes indexed by t for x(t), h(t), o(t) and L(t).
The derivations are w.r.t. the basic form of RNN, same Fig. and equation. We copy the Fig. again
here:
Fig. The computational graph to compute the training loss of a recurrent network that maps an
input sequence of x values to a corresponding sequence of output o values.
Bidirectional RNNs
In many applications we want to output a
prediction of y (t) which may depend on the whole
input sequence. E.g. co-articulation in speech
recognition, right neighbors in POS tagging, etc.
Bidirectional RNNs combine an RNN that moves
forward through time beginning from the start of
the sequence with another RNN that moves
backward through time beginning from the end of
the sequence.
Fig. illustrates the typical bidirectional RNN, where h(t) and g(t) stand for the (hidden) state
of the sub-RNN that moves forward and backward through time, respectively. This allows the
output units o(t) to compute a representation that depends on both the past and the future, but
is most sensitive to the input values around time t.
Fig. Computation of a typical bidirectional recurrent neural network, meant to learn to map
input sequences x to target sequences y, with loss L(t) at each step t.
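The forward/backward combination can be sketched as two independent passes whose states are joined at each step. All weights below are random placeholders, purely to show the data flow; concatenation is one of several ways to combine the two directions.

```python
import numpy as np

def run(xs, U, W):
    """Run a simple tanh RNN over a sequence, returning all hidden states."""
    h, out = np.zeros(W.shape[0]), []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        out.append(h)
    return out

rng = np.random.default_rng(2)
n_in, n_hid, T = 3, 4, 5
Uf, Wf = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
Ub, Wb = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
xs = rng.normal(size=(T, n_in))

hs = run(xs, Uf, Wf)                # forward sub-RNN: h(1)..h(T)
gs = run(xs[::-1], Ub, Wb)[::-1]    # backward sub-RNN: g(T)..g(1), re-reversed
# o(t) sees both the past (via h) and the future (via g), e.g. by concatenation:
os = [np.concatenate([h, g]) for h, g in zip(hs, gs)]
print(len(os), os[0].shape)  # 5 (8,)
```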
Encoder-Decoder, or Sequence-to-Sequence Architectures
Encoder-Decoder architecture. Basic idea:
(1) an encoder or reader or input RNN
processes the input sequence. The
encoder emits the context C , usually
as a simple function of its final hidden
state.
(2) a decoder or writer or output RNN
is conditioned on that fixed-length
vector to generate the output sequence
Y = ( y(1) , . . . , y(ny ) ).
Highlight: the lengths of the input and output sequences can vary from each other. Now
widely used in machine translation, question answering, etc.
Fig. Encoder-Decoder or sequence-to-sequence RNN architecture for learning to generate an
output sequence of y variables for a given input sequence of x variables.
Training: the two RNNs are trained jointly to maximize the average of log P(y(1),…,y(ny) | x(1),…,x(nx))
over all the pairs of x and y sequences in the training set.
Variations: If the context C is a vector, then the decoder RNN is simply a vector-to-
sequence RNN. As we have seen there are at least two ways for a vector-to-sequence RNN to
receive input. The input can be provided as the initial state of the RNN, or the input can be
connected to the hidden units at each time step. These two ways can also be combined.
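The encoder-decoder idea can be sketched with toy weights: the encoder's final hidden state becomes the context C, which is then provided as the decoder's initial state (the first of the two input options just mentioned). Names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid, n_out = 3, 4, 2
Ue, We = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
Wd, Vd = rng.normal(size=(n_hid, n_hid)), rng.normal(size=(n_out, n_hid))

def encode(xs):
    h = np.zeros(n_hid)
    for x in xs:                       # reader: process the whole input sequence
        h = np.tanh(Ue @ x + We @ h)
    return h                           # context C = final hidden state

def decode(C, n_steps):
    h, ys = C, []                      # context provided as the initial state
    for _ in range(n_steps):           # writer: generate the output sequence
        h = np.tanh(Wd @ h)
        ys.append(Vd @ h)
    return ys

xs = rng.normal(size=(6, n_in))        # input length nx = 6
ys = decode(encode(xs), n_steps=4)     # output length ny = 4: lengths differ
print(len(ys), ys[0].shape)  # 4 (2,)
```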
Deep Recurrent Networks
The computation in most RNNs can be decomposed into three blocks of parameters and
associated transformations:
1. from the input to the hidden state, x(t) → h(t)
2. from the previous hidden state to the next hidden state, h(t-1) → h(t)
3. from the hidden state to the output, h(t) → o(t)
These transformations are each represented as a single layer within a deep MLP in the previously
discussed models. However, we can use multiple layers for each of the above transformations,
which results in deep recurrent networks.
Fig. (below) shows the resulting deep RNN, if we
(a) break down the hidden-to-hidden transformation,
(b) introduce a deeper architecture for all of the transformations 1, 2 and 3 above, and
(c) add “skip connections” for RNNs that have deep hidden-to-hidden transformations.
Recursive Neural Network
A recursive network has a computational graph
that generalizes that of the recurrent network from
a chain to a tree.

A variable-length sequence x(1), x(2), …, x(t) can be mapped to a fixed-size representation
(the output o), with a fixed set of parameters (the weight matrices U, V, W).
The figure illustrates the supervised learning case, in which a target y is provided that is
associated with the whole sequence.
Pro: Compared with a RNN, for a sequence of the same length τ, the depth (measured as the
number of compositions of nonlinear operations) can be drastically reduced from τ to O(logτ).
Con: how to best structure the tree? A balanced binary tree is one option, but it is not optimal for
much data. For natural sentences, one can use a parser to yield the tree structure, but this is
both expensive and inaccurate. Thus recursive NNs are not popular.
Compare Recurrent and Recursive Neural Networks

Architecture: RNN — chain-like, sequential structure. Recursive NN — hierarchical, tree-like structure.
Data Processing: RNN — processes sequential and time-series data. Recursive NN — processes hierarchical data.
Memory Handling: RNN — captures context through sequential memory. Recursive NN — limited context handling.
Connections: RNN — connections based on sequential order. Recursive NN — connections based on hierarchical structure.
Training Complexity: RNN — trained with backpropagation through time. Recursive NN — requires specific tree traversal algorithms for training.
Dependency Understanding: RNN — implicitly captures dependencies in sequences. Recursive NN — explicitly models dependencies in the tree structure.
Use Cases: RNN — language modeling, speech recognition. Recursive NN — image parsing, document structure analysis.
The challenge of Long-Term Dependency
The long-term dependency challenge motivates various solutions such as echo state networks,
leaky units and the well-known LSTM, as well as gradient clipping and the neural Turing machine.

Recurrent networks involve the composition of the same function multiple times, once per time
step. These compositions can result in extremely nonlinear behavior. But let’s focus on a linear
simplification of the RNN, where all the non-linearities are removed, for an easier demonstration of
why long-term dependency can be problematic.

Without non-linearity, the recurrent relation for h(t) w.r.t. h(t-1) is now simply matrix
multiplication:

    h(t) = W h(t-1)

If we recurrently apply this until we reach h(0), we get:

    h(t) = W^t h(0)

and if W admits an eigendecomposition

    W = Q Λ Q^T   (with orthogonal Q)

the recurrence may be simplified further to:

    h(t) = Q Λ^t Q^T h(0)

In other words, the recurrence raises the eigenvalues to the power of t. Eigenvalues with
magnitude less than one vanish to zero, and eigenvalues with magnitude greater than one
explode. This analysis shows the essence of the vanishing and exploding gradient problem for RNNs.
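The analysis above can be checked numerically. A diagonal W is used here only so that the eigenvalues are visible directly; the choice of eigenvalues and t = 50 is arbitrary.

```python
import numpy as np

W = np.diag([0.5, 1.0, 1.5])   # eigenvalues 0.5, 1.0, 1.5
h = np.ones(3)
for t in range(50):            # h(50) = W^50 h(0): eigenvalues raised to the 50th power
    h = W @ h
print(h)  # first component ~0 (vanished), last component huge (exploded)
```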
Echo State Networks
Basic Idea: Since the recurrence causes all the vanishing/exploding problems, we can set the
recurrent weights such that the recurrent hidden units do a good job of capturing the history of
past inputs (thus “echo”), and only learn the output weights.
Specifics: The original idea was to make the eigenvalues of the Jacobian of the state-to-state
transition function close to 1. But that is under the assumption of no non-linearity, so
the modern strategy is simply to fix the weights to have some spectral radius such as 3,
where information is carried forward through time but does not explode, due to the stabilizing
effect of saturating nonlinearities like tanh.
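A toy sketch of the echo-state idea: fix random recurrent weights rescaled to a chosen spectral radius, run the "reservoir", and learn only the output weights (here with ordinary least squares). The spectral radius of 1.2, the reservoir size, and the sine-wave next-step task are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_res, T = 1, 30, 200
W = rng.normal(size=(n_res, n_res))
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius 1.2
U = rng.normal(size=(n_res, n_in))

xs = np.sin(np.arange(T) * 0.2).reshape(T, 1)     # toy input signal

h, states = np.zeros(n_res), []
for x in xs:
    h = np.tanh(U @ x + W @ h)                    # recurrent weights stay FIXED
    states.append(h)
H = np.array(states[:-1])                         # reservoir states at t = 1..T-1
targets = xs[1:, 0]                               # task: predict the next value

w_out, *_ = np.linalg.lstsq(H, targets, rcond=None)  # train ONLY the readout
mse = float(np.mean((H @ w_out - targets) ** 2))
print(round(mse, 4))
```

Only `w_out` is learned; the reservoir "echoes" the input history, and a linear readout of that echo suffices for this simple task.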
Leaky Units and Other Strategies for Multiple Time Scales
A common idea shared by the various methods in the following sections: design a model that
operates on both a fine time scale (handling small details) and a coarse time scale (transferring
information over long time spans).
•Adding skip connections. One way to obtain coarse time scales is to add direct connections
from variables in the distant past to variables in the present. This is not an ideal solution.
Leaky Units
Idea: each hidden state u(t) is now a “summary of history”, set to memorize both a
coarse-grained summary of the immediate past, u(t-1), and some “new stuff” v(t) of the present
time:

    u(t) = α u(t-1) + (1 − α) v(t)

where α is a parameter. This introduces a linear self-connection from u(t-1) → u(t),
with a weight of α.
In this case, α substitutes for the recurrent weight matrix of the plain RNN. So if α ends up near 1,
the repeated multiplications will not lead to a vanished or exploded value.
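The slow decay can be seen directly from the update rule above. With α = 0.95 (an arbitrary illustrative value) and no new input, information from t = 0 survives 50 steps at a usable magnitude, unlike a plain multiplicative recurrence with a small weight.

```python
alpha = 0.95
u = 1.0                        # some information stored at t = 0
for t in range(50):
    v_t = 0.0                  # no reinforcement from new inputs
    u = alpha * u + (1 - alpha) * v_t   # leaky update: u(t) = a*u(t-1) + (1-a)*v(t)
print(round(u, 3))  # 0.95**50 ~= 0.077: decays slowly, not to ~0 like 0.1**50 would
```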
Removing Connections: actively removing length-one connections and replacing them with
longer connections.
Compare Recurrent and Recursive Neural Networks

Purpose: RNNs — ideal for tasks involving temporal dependencies and sequential data, such as language modeling, time series prediction, and machine translation. RvNNs — best suited for hierarchical data structures like trees or graphs, where the relationships between nodes are important.
Structure: RNNs — process data sequentially, with each element in the sequence influencing subsequent calculations; they maintain a hidden state that stores information from previous time steps, allowing them to “remember” past inputs. RvNNs — apply the same neural function recursively across the data structure, combining information from child nodes to create parent node representations.
Strengths: RNNs — well-suited for problems where the order of data is crucial; they can capture long-range dependencies in sequential data. RvNNs — well-suited for tasks involving structured data, such as sentiment analysis, parsing, and scene understanding, where the relationships between parts are crucial.
Example: RNNs — predicting the next word in a sentence, as the meaning of a sentence is heavily influenced by the preceding words. RvNNs — analyzing the grammatical structure of a sentence to understand its meaning, where the relationships between words and phrases are important.
In essence: RNNs focus on sequential data and temporal dynamics; RvNNs focus on hierarchical structures and on combining information from different parts of the data.
The Long Short-Term Memory and Other Gated RNNs
LSTMs (Long Short-Term Memory) are a type of recurrent neural network (RNN) that address
the vanishing gradient problem, which is a common issue in standard RNNs. They achieve this
by using a cell and three gates: an input gate, an output gate, and a forget gate. These gates
control the flow of information in and out of the cell, allowing LSTMs to retain information
over long sequences.
Here's a more detailed explanation:
1. The Cell State:
LSTMs have a "cell state" that acts like a memory conveyor belt, carrying information through
the network.
This cell state is essentially a long-term memory that can be updated or changed, but it is
designed to preserve information across multiple time steps.
2. The Gates:
Forget Gate: Determines which information from the previous hidden state and current input
should be discarded from the cell state.
Input Gate: Decides how much of the new input and previous hidden state should be added to
the cell state.
Output Gate: Controls which parts of the cell state should be used to calculate the next hidden
state.
These gates are essentially neural networks themselves, learning how to selectively retain or
forget information.
They use sigmoid functions to output values between 0 and 1, determining how much of the
information should pass through.
3. How Information Flows:
The current input and previous hidden state are used as inputs to the gates and the cell.
The gates decide what information to keep, discard, or add to the cell state.
The cell state is updated based on the input and the gates' decisions.
The output of the cell, along with the current input and previous hidden state, is used to
calculate the next hidden state.
This process repeats for each time step in the sequence.
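The gate-and-cell flow described above can be sketched as one LSTM step. The weight matrices are random placeholders, and the shapes follow the common convention of stacking [h_prev, x_t] as the gate input; bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, Wc, Wo):
    z = np.concatenate([h_prev, x_t])   # current input + previous hidden state
    f = sigmoid(Wf @ z)                 # forget gate: what to discard from the cell
    i = sigmoid(Wi @ z)                 # input gate: how much new stuff to add
    C_tilde = np.tanh(Wc @ z)           # candidate "new stuff"
    C = f * C_prev + i * C_tilde        # update the cell state (conveyor belt)
    o = sigmoid(Wo @ z)                 # output gate: what to expose
    h = o * np.tanh(C)                  # next hidden state
    return h, C

rng = np.random.default_rng(5)
n_in, n_hid = 3, 4
Wf, Wi, Wc, Wo = (rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4))
h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):  # repeat for each time step
    h, C = lstm_step(x_t, h, C, Wf, Wi, Wc, Wo)
print(h.shape, C.shape)  # (4,) (4,)
```

Note how the cell update `C = f * C_prev + i * C_tilde` is additive rather than purely multiplicative, which is what lets gradients survive many steps.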
4. Solving the Vanishing Gradient Problem:
By carefully controlling the flow of information using gates, LSTMs can prevent gradients
from vanishing or exploding during training.
This allows them to learn long-term dependencies in sequential data more effectively than
traditional RNNs.
5. Applications:
LSTMs are widely used in various applications that involve sequential data, including:
Natural Language Processing (NLP): language modeling, machine translation
Speech Recognition: audio data analysis, speech synthesis
Time Series Forecasting: predicting future values from past observations
Bidirectional LSTM
LSTM and Bidirectional LSTM are types of Recurrent Neural Networks (RNNs) used for
processing sequential data like text or time series. LSTM addresses the vanishing gradient
problem in traditional RNNs, while Bidirectional LSTM leverages information from both past
and future context within a sequence.
Bidirectional LSTM (BiLSTM)
•Two LSTM Networks:
•BiLSTMs consist of two LSTM networks: one processing the input forward and the other
backward.
•Forward and Backward Context:
•This architecture allows the model to capture context from both the past and future of the
input sequence.
•Enhanced Modeling:
•BiLSTMs are particularly effective in scenarios where the order and context of the sequence
are crucial, like in Natural Language Processing (NLP).
•Combining Outputs:
•The outputs from the forward and backward LSTMs can be combined in various ways, such
as concatenation or averaging.

•In essence: LSTMs are powerful for handling sequential data and capturing long-term
dependencies, while BiLSTMs extend this by incorporating information from both directions
within the sequence to enhance the model's understanding of context.
What’s new in LSTM
By contrast, the LSTM uses a group of layers to generate the current hidden state h(t). In
short, the extra elements in LSTM include:
•an extra state C(t) is introduced to keep track of the current “history”.
•this C(t) is the sum of the weighted previous history C(t-1) and the weighted “new stuff”, the
latter of which is generated similarly to the h(t) in a plain RNN.
•the weighted C(t) yields the current hidden state h(t).
As mentioned earlier, such weighting is conditioned on the context. In the LSTM, this is done by
the following gates, whose signals flow out of the yellow-colored sigmoid nodes in
the following figure (Fig: LSTM), from left to right:
1. The forget gate f, which controls how much of the previous “history” C(t-1) flows into the current
“history”.
2. The external input gate g, which controls how much “new stuff” flows into the new “history”.
3. The output gate q, which controls how much of the current “history” C(t) flows into the current
hidden state h(t).
The actual control by the gate signals occurs at the blue-colored element-wise multiplication
nodes, where the control signals from the gates are element-wise multiplied with the states to
be weighted: the “previous history” C(t-1), the “new stuff” (the output of the left tanh node)
and the “current history” C(t) after the right tanh non-linearity, respectively, in left-to-right order.
Other Gated RNN
If you find LSTM too complicated, gated recurrent units or GRUs might be your cup of tea.
The update rule of a GRU can be described in a one-liner:

    h(t) = u(t) ⊙ h(t-1) + (1 − u(t)) ⊙ tanh(b + U x(t) + W (r(t) ⊙ h(t-1)))

• Obviously, u(t) is a gate (called update gate) here, which decides how much the previous
hidden state h(t-1) goes into the current one h(t) and at the same time how much the “new
stuff” (the rightmost term in the formula) goes into the current hidden state.
• There is yet another gate r(t), called reset gate, which decides how much the previous
state h(t-1) goes into the “new stuff”.

Note: If the reset gate were absent, the GRU would look very similar to the leaky unit, although (1)
u(t) is a vector that can weight each dimension separately, while α in the leaky unit is likely a
scalar, and (2) u(t) is context-dependent, a function of h(t-1) and x(t).
•This ability to forget via the reset gate is found to be essential, which is also true for the
forget gate of LSTM.
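One GRU step matching the description above can be sketched as follows. As with the LSTM sketch, the weight matrices are random placeholders and biases are omitted; the point is the roles of u(t) and r(t).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wu, Wr, Wh):
    z = np.concatenate([h_prev, x_t])
    u = sigmoid(Wu @ z)                 # update gate: a VECTOR, weights each dim
    r = sigmoid(Wr @ z)                 # reset gate: how much h(t-1) enters the new stuff
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # "new stuff"
    return u * h_prev + (1.0 - u) * h_tilde   # context-dependent leaky mix

rng = np.random.default_rng(6)
n_in, n_hid = 3, 4
Wu, Wr, Wh = (rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(3))
h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h = gru_step(x_t, h, Wu, Wr, Wh)
print(h.shape)  # (4,)
```

The final line of `gru_step` is exactly the leaky-unit blend, with the scalar α replaced by the learned, context-dependent vector u(t).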
Practical tips: LSTM is still the best-performing RNN so far. GRU performs slightly worse
than LSTM but better than a plain RNN in many applications. It is often good practice to set
the bias of the forget gate of the LSTM to 1 (others say 0.5 will do for initialization).

Optimization for Long-Term Dependencies


Notes: these techniques here are not quite useful nowadays, since most of the time using
LSTM will solve the long-term dependency problem. Nevertheless, it is good to know the old
tricks.
Take-away: It is often much easier to design a model that is easy to optimize than it is to
design a more powerful optimization algorithm.
Clipping gradients avoids gradient explosion but NOT gradient vanishing. One option is to clip the
parameter gradient from a minibatch element-wise just before the parameter update. Another
is to clip the norm ||g|| of the gradient g just before the parameter update.
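The two clipping options can be sketched directly; the threshold of 1.0 and the toy gradient are arbitrary illustrative values.

```python
import numpy as np

def clip_elementwise(g, v):
    return np.clip(g, -v, v)              # clip each component into [-v, v]

def clip_by_norm(g, threshold):
    norm = np.linalg.norm(g)
    if norm > threshold:                  # rescale so that ||g|| == threshold
        g = g * (threshold / norm)
    return g

g = np.array([3.0, -4.0])                 # an "exploded" gradient with ||g|| = 5
g_elem = clip_elementwise(g, 1.0)         # -> [ 1. -1.]
g_norm = clip_by_norm(g, 1.0)             # -> [ 0.6 -0.8]
print(g_elem, g_norm)
```

Note that norm clipping preserves the gradient's direction while element-wise clipping can change it.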
Regularizing to encourage the information flow. Favor gradient vector being back-
propagated to maintain its magnitude, i.e. penalize the L2 norm differences between such
vectors.
Explicit Memory

Neural networks excel at storing implicit knowledge. However, they struggle to memorize
facts. So it is often helpful to introduce an explicit memory component, not only to rapidly and
“intentionally” store and retrieve specific facts but also to sequentially reason with them.
• Memory networks include a set of memory cells that can be accessed via an addressing
mechanism, which requires a supervision signal instructing them how to use their memory
cells.
• Neural Turing machine which is able to learn to read from and write arbitrary content to
memory cells without explicit supervision about which actions to undertake.
The neural Turing machine (NTM) allows end-to-end training without an external supervision signal,
thanks to the use of a content-based soft attention mechanism. It is difficult to optimize
functions that produce exact, integer addresses; to alleviate this problem, NTMs actually read
from or write to many memory cells simultaneously.
Note: It may make better sense to read the original paper to understand the NTM. Nevertheless,
here is an illustrative graph:
A schematic of Explicit Memory with key elements of neural turing machine.
The task network learns to control the memory, deciding where to read from and where to
write to within memory through the reading and writing mechanism indicated by bold
arrows at the reading and writing address.
Compare Implicit and Explicit Memory

Cell Type: Implicit — relies on the network's inherent architecture to maintain state. Explicit — introduces dedicated memory cells and control mechanisms.
Vanishing Gradient: Implicit — standard RNNs with only implicit memory suffer from the vanishing gradient problem, making it difficult to learn long-term dependencies. Explicit — LSTMs mitigate this problem with their explicit memory cells and gating mechanisms.
Long-Term Dependency Handling: Implicit — standard RNNs relying on implicit memory alone struggle with long-term dependencies. Explicit — LSTMs are generally more effective at capturing long-term dependencies.
Computational Cost: Implicit — low computational cost. Explicit — architectures like LSTMs have higher computational costs due to the additional memory cells and gating mechanisms.
Essence: Implicit memory is a fundamental aspect of RNNs; explicit memory, as implemented in architectures like LSTMs, provides a more robust and effective solution for learning and retaining information over long time spans.
Practical Methodology of Convolutional Neural Networks (CNNs)
The practical methodology of CNNs includes:
• Use of performance metrics to evaluate model performance,
• Establishing baseline models for comparison,
• Determining if more data is needed,
• Careful selection of hyperparameters to optimize model training.
Here is more detail on each aspect:
1. Performance Metrics:
•Purpose:
•These metrics quantify how well your CNN model is performing, allowing you to assess its
accuracy and identify areas for improvement.
•Common Metrics for CNNs:
•Accuracy: The percentage of correctly classified images.
•Precision: The ability to avoid false positives (e.g., not reporting an object that is not
actually there).
•Recall: The ability to find all relevant instances (e.g., correctly identifying all instances of an
object).
•F1-score: The harmonic mean of precision and recall, providing a balanced measure.
•AUC-ROC: Area under the receiver operating characteristic curve, useful for evaluating
binary classification models.
Considerations:
•Choose metrics that are relevant to your specific CNN task and dataset.
Example:
•If you're building a CNN for object detection, you might use metrics like mean Average
Precision (mAP).
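The core metrics above can be computed directly from predictions and labels for a binary classifier; the toy label lists below are made-up illustrative data.

```python
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)                # ability to avoid false positives
recall = tp / (tp + fn)                   # ability to find all relevant instances
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(accuracy, precision, recall, f1)    # 0.75 0.75 0.75 0.75
```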

2. Default Baseline Models:


Purpose:
•Baseline models provide a simple starting point for comparison, allowing you to determine if
your more complex CNN model offers significant improvements.
Types of Baseline Models:
•Majority Class Classifier: Predicts the most frequent class in the training data for all
instances.
•Random Classifier: Randomly assigns class labels based on the class distribution.
•Simple Linear Models: For tasks like regression or classification, you can start with a simple
linear model.
Considerations:
•The choice of baseline model should be relevant to your problem and data.
Example:
•If you're doing image classification, you could use a simple classifier like a logistic regression
model as your baseline.
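The simplest of the baselines above, the majority-class classifier, can be sketched in a few lines; the label lists are made-up illustrative data. Any more complex model should beat this number to justify its cost.

```python
from collections import Counter

train_labels = ["cat", "cat", "dog", "cat", "bird", "cat", "dog"]
majority = Counter(train_labels).most_common(1)[0][0]   # most frequent class

test_labels = ["cat", "dog", "cat", "cat"]
preds = [majority] * len(test_labels)                   # predict it every time
baseline_acc = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
print(majority, baseline_acc)  # cat 0.75
```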

3. Determining Whether to Gather More Data:


Purpose:
•Evaluate if the current dataset size is sufficient for your CNN model to achieve the desired
performance.
Methods:
•Analyze Performance Metrics: If your model's performance plateaus or shows signs of
overfitting, more data might be needed.
•Cross-Validation: Use techniques like k-fold cross-validation to assess how well your model
generalizes to unseen data.
•Data Augmentation: If obtaining more data is difficult, consider using data augmentation
techniques to artificially increase the size of your dataset.
Considerations:
•The need for more data depends on the complexity of your task, the size of your dataset, and
the performance of your model.

4. Selecting Hyperparameters:
Purpose:
•Hyperparameters are settings that control the CNN's training process and architecture, and
their optimal values are crucial for achieving good performance.
Common CNN Hyperparameters:
•Number of Layers: The depth of the CNN network.
•Filter Size: The size of the convolutional kernels.
•Stride: The step size of the convolutional kernels.
•Padding: The strategy for handling the boundaries of the input image.
•Learning Rate: Controls how quickly the model learns.
•Batch Size: The number of training examples used in each iteration.
•Number of Epochs: The number of times the entire dataset is used for training.
•Activation Functions: The non-linear functions used to introduce non-linearity into the
network.
Methods for Hyperparameter Optimization:
•Grid Search: Exhaustively tries all combinations of hyperparameters within a specified
range.
•Random Search: Randomly samples combinations of hyperparameters.
•Bayesian Optimization: Uses a probabilistic model to guide the search for optimal
hyperparameters.
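Random search can be sketched as: sample hyperparameter combinations, evaluate each, keep the best. The `evaluate` function below is a hypothetical stand-in for training the CNN and returning a validation score, and the search space is an illustrative assumption (a log-uniform range is typical for learning rates).

```python
import random

def evaluate(lr, batch_size):
    # placeholder score; in practice: train the model and return validation accuracy
    return 1.0 - abs(lr - 0.01) * 10 - abs(batch_size - 64) / 1000

random.seed(0)
space = {"lr": lambda: 10 ** random.uniform(-4, -1),       # log-uniform sampling
         "batch_size": lambda: random.choice([16, 32, 64, 128])}

best, best_score = None, float("-inf")
for _ in range(20):                                        # 20 random trials
    params = {name: sample() for name, sample in space.items()}
    score = evaluate(**params)
    if score > best_score:                                 # keep the best trial
        best, best_score = params, score
print(best["batch_size"] in (16, 32, 64, 128), 1e-4 <= best["lr"] <= 1e-1)
```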

Considerations:
•The choice of hyperparameters depends on your specific CNN task and dataset.
