CISC 867 Deep Learning
14. Text Classification with Recurrent Neural Networks and Word Embeddings
Credits: Vassilis Athitsos, Yu Li
Learning Sequence-Based Features
• Bigrams are manually-crafted features that preserve
some information about the order of words.
• Can we have the model learn to construct its own
features that contain information about word order?
• This is what recurrent models are designed to do:
– They process a sequence one step at a time.
– The units of a recurrent layer receive information both from
previous steps and from the current step, and combine that
information in computing their output.
– Compared to SimpleRNN units, LSTM units have a greater capacity to preserve information from previous steps, including steps much earlier in the sequence.
Preprocessing Text for an RNN
• A text document should be converted to a time series before it
is given as an input to an RNN.
– We first tokenize the document.
– Then, each token is mapped to a number or vector.
• What would each element of this time series be?
– What should each token map to?
• We have already seen two options:
– An integer, indicating the position of the token in the vocabulary.
– A one-hot vector, whose dimensions equal the size of the vocabulary.
• We have discussed why one-hot vectors are a better idea.
– Integer representations of tokens can map tokens with very different
meanings to integers close to each other.
– With one-hot vectors, each token is mapped to a vector that is equally far from all the other tokens’ vectors.
Preprocessing Text for an RNN
train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=32)
val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=32)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=32)
text_vectorization = TextVectorization(max_tokens=20000, output_mode="int")
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
• This code maps each document into a sequence of integers.
• We have used every part of this code before, but not all together.
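• As a quick sanity check, here is a minimal sketch (assuming the datasets above have been created) that looks at one batch of int_train_ds:

for inputs, targets in int_train_ds.take(1):
    print("inputs.shape:", inputs.shape)    # (32, sequence_length): integer token indices
    print("inputs.dtype:", inputs.dtype)    # int64
    print("targets.shape:", targets.shape)  # (32,): binary labels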
From Integers to One-Hot Vectors
text_vectorization = TextVectorization(max_tokens=20000,
output_mode="int")
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
• Our preprocessing code converts each document into a
sequence of integers.
• As we have discussed several times before, eventually we
want to map each integer to a one-hot vector.
• Why don’t we do that as part of preprocessing?
Preprocessing Text for an RNN
• If we map each document to a sequence of one-hot vectors, and we
store the results, we hit a problem: memory.
• We have:
– 50,000 documents (20,000 training, 5,000 validation, 25,000 test).
– 230 words per document on average.
– 20,000 dimensions per one-hot vector (since we have set our vocabulary to be
20,000 tokens).
• The resulting one-hot vectors consist of 230 billion ones and zeros.
• Even if we save them as bits, it requires about 28 gigabytes.
• This may or may not fit in a modern computer’s main memory.
• A choice that reduces memory requirements dramatically is to:
– Preprocess the documents to sequences of integers (<50MB needed).
– Convert each document to a sequence of one-hot vectors on the fly, as needed.
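• A quick back-of-the-envelope check of these numbers (a sketch using the counts listed above):

docs = 50_000        # total documents
avg_words = 230      # average tokens per document
vocab = 20_000       # dimensionality of each one-hot vector

onehot_bits = docs * avg_words * vocab    # 230 billion zeros and ones
print(onehot_bits / 8 / 1e9)              # about 28.75 GB, even at one bit per entry

int_bytes = docs * avg_words * 4          # one 4-byte integer per token
print(int_bytes / 1e6)                    # about 46 MB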
An RNN Model for Our Dataset
inputs = keras.Input(shape=(None,), dtype="int64")
oh_vec = tf.one_hot(inputs, depth=max_tokens)
x1 = layers.Bidirectional(layers.LSTM(32))(oh_vec)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• This code creates an RNN model, using a Keras style that we have
not seen before: the Functional API. We will explain how it works.
• The main steps of the model are shown below.
[Diagram: Input Layer (inputs) → one_hot (oh_vec) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
An RNN Model for Our Dataset
inputs = keras.Input(shape=(None,), dtype="int64")
oh_vec = tf.one_hot(inputs, depth=max_tokens)
x1 = layers.Bidirectional(layers.LSTM(32))(oh_vec)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• This code creates an RNN model, using a Keras style that we
have not seen before: the Functional API.
• Up to now, we have created all our models by calling the Sequential() function.
• The Functional API provides more flexibility.
Why Use the Functional API
inputs = keras.Input(shape=(None,), dtype="int64")
oh_vec = tf.one_hot(inputs, depth=max_tokens)
x1 = layers.Bidirectional(layers.LSTM(32))(oh_vec)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• In this model, we have these layers:
– An input layer, which outputs a sequence of integers.
– A layer converting that sequence into a sequence of one-hot vectors.
– A bidirectional LSTM layer.
– A fully connected output layer, preceded by 50% dropout.
• Why not use the Sequential() method to create this model?
– Because there is no predefined Keras layer to produce one-hot vectors.
Why Use the Functional API
inputs = keras.Input(shape=(None,), dtype="int64")
oh_vec = tf.one_hot(inputs, depth=max_tokens)
x1 = layers.Bidirectional(layers.LSTM(32))(oh_vec)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• With the Functional API, we can convert each input, which
is a sequence of integers, to a sequence of one-hot vectors
using the tf.one_hot() function.
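• As a small illustration, here is a sketch of tf.one_hot() on a toy vocabulary of 5 tokens (our real vocabulary has 20,000):

import tensorflow as tf

seq = tf.constant([[3, 1, 4]])     # a batch containing one integer-encoded sequence
oh = tf.one_hot(seq, depth=5)      # shape (1, 3, 5): one 5-dimensional one-hot vector per token
print(oh)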
RNN with One-Hot Vectors: Results
• Training this model is much slower than what we are used to.
• On my computer:
– About 1.5 hours per epoch.
– 15 hours for 10 epochs.
• Accuracy: about 87%.
– Bigrams with bag-of-words vectors gave us about 90% on average.
• Why is it so slow?
• The average document is represented using 230 one-hot vectors.
• Each one-hot vector is 20,000-dimensional.
• So, the average document is represented by 4.6 million numbers.
• The model itself has about 5 million trainable parameters.
– 64 LSTM units, each with about 80,000 weights.
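• The parameter count can be verified with a quick calculation (a sketch using the standard LSTM parameter formula):

vocab, units = 20000, 32

# Each LSTM direction has 4 gates, each with input weights, recurrent weights, and biases.
per_direction = 4 * (units * vocab + units * units + units)
bidirectional_lstm = 2 * per_direction    # forward and backward LSTMs: 64 units total
dense = 2 * units + 1                     # Dense layer on the 64-dimensional LSTM output

print(bidirectional_lstm + dense)         # about 5.1 million parameters
print(bidirectional_lstm / 64)            # about 80,000 weights per LSTM unit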
Representing Words as Vectors
• If we map each word to a one-hot vector, then all resulting
vectors are equally far from each other.
– The Euclidean distance between any two such vectors is √2.
• Mapping words to vectors that are equally far from each other
has its own conceptual disadvantages.
• Suppose that M is the function mapping words to vectors.
• Some words have meanings very similar to each other.
– For example, “excellent” and “outstanding”.
• We would like M to capture that relationship, so that
M(“excellent”) is very close to M(“outstanding”).
• That would simplify the learning problem.
– If the model learns that “excellent movie” is associated with a positive
review, then it automatically treats “outstanding movie” the same
way.
Representing Words as Vectors
• It would also be useful if the differences between word
vectors had meaning in themselves.
• For example, consider these pairs:
– “boy” and “girl”.
– “man” and “woman”.
– “male” and “female”.
• The difference between these pairs is the gender, going
from male in the first element of each pair to female in
the second element.
• So, intuitively, we would like a mapping M such that:
M(“boy”) – M(“girl”) = M(“man”) – M(“woman”) = M(“male”) – M(“female”)
Word Embeddings
• To recap, we would like a mapping M such that:
M(“boy”) – M(“girl”) = M(“man”) – M(“woman”) = M(“male”) – M(“female”)
M(“large”) is similar to M(“big”)
M(“buy”) is similar to M(“purchase”)
• One-hot vectors are, by definition, incapable of such behavior.
– They do not depend in any way on the meaning of each word.
• A word embedding is a function that maps words to vectors and aims to capture semantic relationships like the ones above.
• We can learn such a function as part of training our model.
Learning a Word Embedding
• The word embedding can be implemented as a
multiplication of one-hot vector 𝒗 by a matrix 𝑾:
– 𝒗 = one_hot(token)
– M(token) = 𝑾 × 𝒗.
[Diagram 1: RNN model not using word embeddings: Input Layer (inputs) → one_hot (oh) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
[Diagram 2: RNN model using word embeddings: Input Layer (inputs) → one_hot (oh) → embedding matrix multiplication (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
Learning a Word Embedding
• The word embedding can be implemented as a
multiplication of one-hot vector 𝒗 by a matrix 𝑾:
– 𝒗 = one_hot(token)
– M(token) = 𝑾 × 𝒗.
• If the one-hot vector 𝒗 is 𝐾-dimensional, and the word embedding is 𝐿-dimensional, then matrix 𝑾 is of size 𝐿 × 𝐾.
– The model learns all 𝐿·𝐾 entries of matrix 𝑾 during training.
[Diagram: RNN model using word embeddings: Input Layer (inputs) → one_hot (oh) → embedding matrix multiplication (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
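• A tiny sketch with toy sizes (not our real dimensions) shows that multiplying 𝑾 by a one-hot vector simply selects one column of 𝑾, which is exactly what an embedding lookup does:

import numpy as np

K, L = 5, 3                        # toy vocabulary size and embedding dimension
W = np.random.randn(L, K)          # embedding matrix of size L x K
token_id = 2
v = np.eye(K)[token_id]            # K-dimensional one-hot vector for this token
embedding = W @ v                  # M(token) = W x v
print(np.allclose(embedding, W[:, token_id]))   # True: just column selection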
Word Embeddings in Keras
• The keras.layers.Embedding layer can be used directly for word embeddings.
– It maps each integer directly to its embedding vector, so no explicit one-hot step is needed.
[Diagram 1: RNN model using word embeddings, NOT using the Keras Embedding layer: Input Layer (inputs) → one_hot (oh) → embedding matrix multiplication (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
[Diagram 2: RNN model using word embeddings, using the Keras Embedding layer: Input Layer (inputs) → Embedding (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
Word Embeddings in Keras
inputs = keras.Input(shape=(None,), dtype="int64")
em = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x1 = layers.Bidirectional(layers.LSTM(32))(em)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• This code creates an RNN model that uses word embeddings.
– Setting output_dim=256 specifies that each embedding is 256-
dimensional.
[Diagram: RNN model using word embeddings, using the Keras Embedding layer: Input Layer (inputs) → Embedding (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
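• A minimal sketch of compiling and training this model (assuming the int_train_ds, int_val_ds, and int_test_ds datasets from the earlier preprocessing slides; the optimizer and number of epochs are just reasonable choices):

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",   # binary sentiment labels
              metrics=["accuracy"])
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10)
model.evaluate(int_test_ds)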
Results for Movie Reviews
• For movie review classification, the results do not
improve much.
• We still get around 87% accuracy, same as with the
previous RNN model that did not use word embeddings.
– As a reminder, bag-of-words with bigrams gave us around 90%
accuracy.
• Nonetheless, word embeddings are very commonly used
in text processing models.
– We will use them again for our English-to-Spanish translation
system.
Playing with Word Embeddings
• We can compute the distance between the vectors corresponding to two words using this code:

def we_diff(model, tv_layer, s1, s2):
    em_model = keras.Sequential(model.layers[0:2])
    v1 = em_model(tv_layer([s1]))
    v2 = em_model(tv_layer([s2]))
    diff = v2[0,0,:] - v1[0,0,:]
    return diff

def we_distance(model, tv_layer, s1, s2):
    diff = we_diff(model, tv_layer, s1, s2)
    dist = np.linalg.norm(diff)
    return dist

• Key idea: em_model contains only the first two layers of our RNN model (the input layer and the embedding layer), and thus maps a sequence of words to the sequence of the corresponding embedding vectors.
Playing with Word Embeddings
• Using the code from the previous slide, we try out various pairs of words:
we_distance(model, text_vectorization, "great", "excellent")
we_distance(model, text_vectorization, "great", "awful")
Output:
distance from "great" to "excellent" = 1.90
distance from "great" to "awful" = 3.63
• Reasonable result:
– In the word embedding space, “great” is mapped closer to
“excellent” than to “awful”.
Playing with Word Embeddings
• Using the same code, we try out more pairs of words:
we_distance(model, text_vectorization, "big", "large")
we_distance(model, text_vectorization, "big", "small")
Output:
distance from "big" to "large" = 0.91
distance from "big" to "small" = 0.79
• Unexpected result:
– In the word embedding space, “big” is mapped closer to “small” than
to “large”.
• Perhaps for the purposes of separating positive and negative
reviews, distinguishing these three words is not important.
Using Pretrained Word Embeddings
• Instead of learning word embeddings from our training data,
we can use pre-trained embeddings.
• This is another form of transfer learning:
– Learn word embeddings from a larger dataset.
– Use those pre-learned embeddings in a smaller dataset.
• Some popular pre-trained word embeddings include:
– GloVe:
Paper: “Global Vectors for Word Representation.” J. Pennington, R. Socher, C.
D. Manning. EMNLP 2014.
Link: https://nlp.stanford.edu/projects/glove/
– word2vec:
Paper: “Distributed Representations of Words and Phrases and their
Compositionality.” T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean.
NeurIPS 2013.
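• As a rough illustration, here is a sketch of one common way to load pre-trained GloVe vectors into a Keras Embedding layer (assuming a downloaded glove.6B.100d.txt file and the text_vectorization layer from earlier; this is one possible approach, not necessarily the exact textbook code):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

embedding_dim = 100
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *coefs = line.split()
        glove[word] = np.asarray(coefs, dtype="float32")

# Build an embedding matrix whose row i holds the GloVe vector
# for token i of our TextVectorization vocabulary.
vocab = text_vectorization.get_vocabulary()
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for i, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[i] = glove[word]

# A frozen Embedding layer initialized with the pre-trained vectors.
embedding_layer = layers.Embedding(
    input_dim=len(vocab),
    output_dim=embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)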
GloVe Embeddings
• You can download pre-trained GloVe embeddings from
here:
https://nlp.stanford.edu/projects/glove
• On my computer, using Anaconda, I got errors running
the textbook code with those files.
• The problem was that some characters (both in the GloVe embedding files and in the movie reviews dataset) had character codes greater than 127, i.e., they were not plain ASCII.
– Some functions complained when encountering these characters.
• I wrote code that replaces all those problematic
characters with SPACE (ASCII code 32).
Results with GloVe Embeddings
• On the movie review dataset, test accuracy using pre-
trained GloVe embeddings drops to 80.5%.
– We got about 87% using word embeddings that were learned
together with the rest of the model.
• Likely reasons that accuracy drops:
– The embeddings that were learned together with the rest of the model focused on words that indicate whether a review is positive or negative.
– It looks like the movie review dataset had enough training data to
learn word embeddings that were more useful than the pre-trained
ones.
Comparing the Two Embeddings
Output using word embeddings learned from the movie reviews:
distance from "buy" to "purchase" = 0.85
distance from "buy" to "shop" = 0.73
distance from "buy" to "study" = 0.77
distance from "buy" to "swim" = 1.05
Output using pre-trained GloVe embeddings:
distance from "buy" to "purchase" = 3.31
distance from "buy" to "shop" = 5.86
distance from "buy" to "study" = 6.83
distance from "buy" to "swim" = 7.16
• Words “buy”, “purchase”, “shop”, “study”, “swim” are not relevant for
classifying movie reviews.
• The GloVe embeddings capture that “buy” is closer to “purchase” and “shop” than to “study” and “swim”.
Comparing the Two Embeddings
Output using word embeddings learned from the movie reviews:
distance from "big" to "large" = 0.91
distance from "big" to "small" = 0.79
Output using pre-trained GloVe embeddings:
distance from "big" to "large" = 4.37
distance from "big" to "small" = 4.25
• Surprisingly, “big” is mapped closer to “small” than to “large” with both approaches.
• Once again, we have models that give reasonably good results in end-to-end
systems, but do not exhibit a level of understanding that resembles human
intelligence.
Next Lecture
Generative Adversarial Networks (GANs)