LSTM AND BILSTM:


So far, the sequence learning task tackled by our deep learning models has been language modeling,
where we aim to predict the next token given all previous tokens in a sequence. In this
scenario, we wish to condition only on the leftward context, and thus the unidirectional
chaining of a standard RNN seems appropriate.
However, there are many other sequence learning tasks where it is perfectly fine to
condition the prediction at every time step on both the leftward and the rightward context.
Consider, for example, part-of-speech tagging. Why shouldn't we take the context in both
directions into account when assessing the part of speech associated with a given word?

Another common task, often useful as a pretraining exercise prior to fine-tuning a model on
an actual task of interest, is to mask out random tokens in a text document and then train a
sequence model to predict the values of the missing tokens. Note that depending on what
comes after the blank, the likely value of the missing token changes dramatically:

 I am ___.
 I am ___ hungry.
 I am ___ hungry, and I can eat 2 pizzas.
In the first sentence “happy” seems to be a likely candidate. The words “not” and “very”
seem plausible in the second sentence, but “not” seems incompatible with the third sentence.
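
This fill-in-the-blank objective is essentially the masked language modeling task used to pretrain BERT. As a quick illustration, here is a minimal sketch, assuming the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint are available (this is separate from the Keras examples that follow):

from transformers import pipeline

# A masked language model predicts the most likely token for each [MASK]
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for text in ["I am [MASK].",
             "I am [MASK] hungry.",
             "I am [MASK] hungry, and I can eat 2 pizzas."]:
    best = fill_mask(text)[0]              # highest-scoring prediction
    print(text, "->", best["token_str"])

Running a sketch like this shows how the predicted token shifts with the surrounding context, just as in the three sentences above.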

Bidirectional Recurrent Neural Networks (BiRNNs) are an extension of traditional Recurrent
Neural Networks (RNNs) that can improve model performance on various sequential data
tasks. Unlike standard RNNs, which process sequences in a single direction (from past to
future), BiRNNs process the sequence in both directions with two separate hidden layers,
which are then fed forward to the same output layer.
The architecture of a BiRNN allows it to capture information from both the past and the
future at any point in the sequence. This is particularly useful for tasks where context
from the future is as important as the past for making predictions, such as natural language
processing tasks (e.g., translation, sentiment analysis) or any sequence classification task.
Overview of how BiRNNs work:
1. Forward Pass: In one direction, the RNN processes the sequence from the start to the
end, much like a conventional RNN. This forward pass captures the past context.
2. Backward Pass: Simultaneously, another RNN processes the sequence in the reverse
direction, from end to start, capturing future context.
3. Combining Contexts: At each time step, the hidden states of both the forward and
backward passes are typically concatenated or summed and then passed to the output
layer to make a prediction (see the explicit sketch after this list).
4. Training: During training, both the forward and backward networks are trained
simultaneously. The error is backpropagated through both networks, updating the
weights in both directions.
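Before using Keras's built-in Bidirectional wrapper (shown next), the forward/backward/combine steps above can be sketched explicitly with the functional API. This is only an illustrative sketch; the dimensions are hypothetical, and tf and layers are the standard TensorFlow import names rather than part of the original notes:

import tensorflow as tf
from tensorflow.keras import layers

sequence_length, feature_dim, units = 20, 8, 50    # hypothetical sizes

inputs = tf.keras.Input(shape=(sequence_length, feature_dim))

# 1. Forward pass: read the sequence from start to end
forward = layers.LSTM(units, return_sequences=True)(inputs)

# 2. Backward pass: read the sequence from end to start;
# go_backwards=True emits hidden states in reverse time order,
# so they are flipped back to align with the forward states
backward = layers.LSTM(units, return_sequences=True, go_backwards=True)(inputs)
backward = layers.Lambda(lambda t: tf.reverse(t, axis=[1]))(backward)

# 3. Combining contexts: concatenate both hidden states at every time step
combined = layers.Concatenate()([forward, backward])   # (batch, time, 2 * units)

bi_sketch = tf.keras.Model(inputs, combined)

The Bidirectional wrapper used below performs essentially the same reversal and combination internally, with concatenation as its default merge mode.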
Example of how to use a simple BiRNN (here a bidirectional LSTM) with Keras:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

model = Sequential()

# First bidirectional LSTM layer; return_sequences=True so the next recurrent
# layer receives the full sequence of hidden states
model.add(Bidirectional(LSTM(units=50, return_sequences=True),
                        input_shape=(sequence_length, feature_dim)))

# Second bidirectional LSTM layer; returns only the final hidden state
model.add(Bidirectional(LSTM(units=50)))

# Output layer
model.add(Dense(output_dim, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

In this example, units=50 defines the number of LSTM units in each
direction, sequence_length is the length of the input sequences, feature_dim is the number of
features in each timestep, and output_dim is the dimensionality of the output space (e.g.,
number of classes for classification tasks).
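
A minimal usage sketch for the model above, assuming sequence_length = 20, feature_dim = 8, and output_dim = 3 were set before building it (these numbers and the random data are placeholders, not part of the original example):

import numpy as np

num_samples = 100
X = np.random.rand(num_samples, sequence_length, feature_dim).astype("float32")    # random inputs
y = np.eye(output_dim)[np.random.randint(0, output_dim, num_samples)]              # one-hot labels

model.fit(X, y, epochs=5, batch_size=16)
probs = model.predict(X[:5])    # class probabilities for the first 5 sequences

With categorical_crossentropy as the loss, the labels must be one-hot encoded, which is why np.eye is used here.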

Advantages of BiRNNs:
 Dual Context: They can capture information from both past and future contexts
simultaneously.
 Improved Accuracy: For many tasks, this leads to better performance than
unidirectional RNNs.
Disadvantages of BiRNNs:
 Increased Computational Load: They essentially double the computation because
they process the sequence twice.
 Increased Memory Usage: BiRNNs require more memory to store the intermediate
states for both directions.
 Not Suitable for Real-Time Processing: Since future context is needed, a BiRNN
can't be used in real-time applications where full sequences are not available
immediately.
It's important to note that while BiRNNs can provide more context for making predictions,
they are not always the best choice. The suitability of BiRNNs largely depends on the nature
of the task and the data.
Code to use a standard LSTM:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()

# First LSTM layer; return_sequences=True because another recurrent layer follows
model.add(LSTM(units=50, return_sequences=True,
               input_shape=(sequence_length, feature_dim)))

# Second LSTM layer; Keras infers its input shape from the previous layer's output
model.add(LSTM(units=50))

# Output layer
model.add(Dense(output_dim, activation='softmax'))

# Compiling the model
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

In this code, units=50 refers to the number of neurons in the LSTM cell. sequence_length is
the length of your input sequences, feature_dim is the number of features for each time step
in the input data, and output_dim is the size of the output layer (which often corresponds to
the number of classes in a classification problem).
The key differences between this LSTM example and the previous Bidirectional LSTM
(BiLSTM) one are:
 Directionality: This LSTM model processes the data in a single direction, from the
start of the sequence to the end, while the BiLSTM processes it in both directions.
 Layer Connections: In the unidirectional LSTM model, each layer feeds only into
the next layer moving forward, whereas in the BiLSTM model, there are forward and
backward passes that feed into the next layer.

 Use Cases: In tasks where only past data is available to predict the future, a
unidirectional LSTM is more appropriate. However, in tasks where the context in both
directions is beneficial, a Bidirectional LSTM may perform better.
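
A quick way to see the directionality difference is to compare output shapes: with its default merge mode (concatenation), the Bidirectional wrapper doubles the feature dimension of the wrapped LSTM. A minimal sketch with hypothetical sizes:

import numpy as np
from tensorflow.keras.layers import LSTM, Bidirectional

x = np.random.rand(1, 20, 8).astype("float32")    # (batch, timesteps, features)

uni = LSTM(50, return_sequences=True)(x)
bi = Bidirectional(LSTM(50, return_sequences=True))(x)

print(uni.shape)    # (1, 20, 50)  - forward states only
print(bi.shape)     # (1, 20, 100) - forward and backward states concatenated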
