CISC 867 Deep Learning
14. Text Classification with Recurrent Neural Networks and Word Embeddings
Credits: Vassilis Athitsos, Yu Li
Learning Sequence-Based Features
• Bigrams are manually-crafted features that preserve
some information about the order of words.
• Can we have the model learn to construct its own
features that contain information about word order?
• This is what recurrent models are designed to do:
– They process a sequence one step at a time.
– The units of a recurrent layer receive information both from
previous steps and from the current step, and combine that
information in computing their output.
– Compared to SimpleRNN units, LSTM units have a greater capacity to preserve information from previous steps, including steps much earlier in the sequence.
Preprocessing Text for an RNN
• A text document should be converted to a time series before it
is given as an input to an RNN.
– We first tokenize the document.
– Then, each token is mapped to a number or vector.
• What would each element of this time series be?
– What should each token map to?
• We have already seen two options:
– An integer, indicating the position of the token in the vocabulary.
– A one-hot vector, whose dimensions equal the size of the vocabulary.
• We have discussed why one-hot vectors are a better idea.
– Integer representations of tokens can map tokens with very different
meanings to integers close to each other.
– With one-hot vectors, each token is mapped to a vector that is equally far from all the other tokens’ vectors.
Preprocessing Text for an RNN
train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=32)
val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=32)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=32)
text_vectorization = TextVectorization(max_tokens=20000, output_mode="int")
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
• This code maps each document into a sequence of integers.
• We have used every part of this code before, but not all together.
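• As a quick sanity check, here is a minimal sketch (assuming the datasets above have been created) that looks at one batch of int_train_ds:

for inputs, targets in int_train_ds.take(1):
    print("inputs.shape:", inputs.shape)    # (32, sequence_length): integer token indices
    print("inputs.dtype:", inputs.dtype)    # int64
    print("targets.shape:", targets.shape)  # (32,): binary labels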
From Integers to One-Hot Vectors
text_vectorization = TextVectorization(max_tokens=20000,
output_mode="int")
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
• Our preprocessing code converts each document into a
sequence of integers.
• As we have discussed several times before, eventually we
want to map each integer to a one-hot vector.
• Why don’t we do that as part of preprocessing?
Preprocessing Text for an RNN
• If we map each document to a sequence of one-hot vectors, and we
store the results, we hit a problem: memory.
• We have:
– 50,000 documents (20,000 training, 5,000 validation, 25,000 test).
– 230 words per document on average.
– 20,000 dimensions per one-hot vector (since we have set our vocabulary to be
20,000 tokens).
• The resulting one-hot vectors consist of 230 billion ones and zeros.
• Even if we save them as bits, it requires about 28 gigabytes.
• This may or may not fit in a modern computer’s main memory.
• A choice that reduces memory requirements dramatically is to:
– Preprocess the documents to sequences of integers (<50MB needed).
– Convert each document to a sequence of one-hot vectors on the fly, as needed.
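• A quick back-of-the-envelope check of these numbers (a sketch using the counts listed above):

docs = 50_000        # total documents
avg_words = 230      # average tokens per document
vocab = 20_000       # dimensionality of each one-hot vector

onehot_bits = docs * avg_words * vocab    # 230 billion zeros and ones
print(onehot_bits / 8 / 1e9)              # about 28.75 GB, even at one bit per entry

int_bytes = docs * avg_words * 4          # one 4-byte integer per token
print(int_bytes / 1e6)                    # about 46 MB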
An RNN Model for Our Dataset
inputs = keras.Input(shape=(None,), dtype="int64")
oh_vec = tf.one_hot(inputs, depth=max_tokens)
x1 = layers.Bidirectional(layers.LSTM(32))(oh_vec)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• This code creates an RNN model, using a Keras style that we have
not seen before: the Functional API. We will explain how it works.
• The main steps of the model are shown below.
[Diagram: Input Layer (inputs) → one_hot (oh_vec) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
An RNN Model for Our Dataset
inputs = keras.Input(shape=(None,), dtype="int64")
oh_vec = tf.one_hot(inputs, depth=max_tokens)
x1 = layers.Bidirectional(layers.LSTM(32))(oh_vec)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• This code creates an RNN model, using a Keras style that we
have not seen before: the Functional API.
• Up to now, we have created all our models by calling the Sequential() function.
• The Functional API provides more flexibility.
Why Use the Functional API
inputs = keras.Input(shape=(None,), dtype="int64")
oh_vec = tf.one_hot(inputs, depth=max_tokens)
x1 = layers.Bidirectional(layers.LSTM(32))(oh_vec)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• In this model, we have these layers:
– An input layer, which outputs a sequence of integers.
– A layer converting that sequence into a sequence of one-hot vectors.
– A bidirectional LSTM layer.
– A fully connected output layer, preceded by 50% dropout.
• Why not use the Sequential() method to create this model?
– Because there is no predefined Keras layer to produce one-hot vectors.
Why Use the Functional API
inputs = keras.Input(shape=(None,), dtype="int64")
oh_vec = tf.one_hot(inputs, depth=max_tokens)
x1 = layers.Bidirectional(layers.LSTM(32))(oh_vec)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• With the Functional API, we can convert each input, which
is a sequence of integers, to a sequence of one-hot vectors
using the tf.one_hot() function.
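• As a small illustration, here is a sketch of tf.one_hot() on a toy vocabulary of 5 tokens (our real vocabulary has 20,000):

import tensorflow as tf

seq = tf.constant([[3, 1, 4]])     # a batch containing one integer-encoded sequence
oh = tf.one_hot(seq, depth=5)      # shape (1, 3, 5): one 5-dimensional one-hot vector per token
print(oh)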
RNN with One-Hot Vectors: Results
• Training this model is much slower than what we are used to.
• On my computer:
– About 1.5 hours per epoch.
– 15 hours for 10 epochs.
• Accuracy: about 87%.
– Bigrams with bag-of-words vectors gave us about 90% on average.
• Why is it so slow?
• The average document is represented using 230 one-hot vectors.
• Each one-hot vector is 20,000-dimensional.
• So, the average document is represented by 4.6 million numbers.
• The model itself has about 5 million trainable parameters.
– 64 LSTM units, each with about 80,000 weights.
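• The parameter count can be verified with a quick calculation (a sketch using the standard LSTM parameter formula):

vocab, units = 20000, 32

# Each LSTM direction has 4 gates, each with input weights, recurrent weights, and biases.
per_direction = 4 * (units * vocab + units * units + units)
bidirectional_lstm = 2 * per_direction    # forward and backward LSTMs: 64 units total
dense = 2 * units + 1                     # Dense layer on the 64-dimensional LSTM output

print(bidirectional_lstm + dense)         # about 5.1 million parameters
print(bidirectional_lstm / 64)            # about 80,000 weights per LSTM unit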
Representing Words as Vectors
• If we map each word to a one-hot vector, then all resulting
vectors are equally far from each other.
– The Euclidean distance between any two such vectors is √2.
• Mapping words to vectors that are equally far from each other
has its own conceptual disadvantages.
• Suppose that M is the function mapping words to vectors.
• Some words have meanings very similar to each other.
– For example, “excellent” and “outstanding”.
• We would like M to capture that relationship, so that
M(“excellent”) is very close to M(“outstanding”).
• That would simplify the learning problem.
– If the model learns that “excellent movie” is associated with a positive
review, then it automatically treats “outstanding movie” the same
way.
Representing Words as Vectors
• It would also be useful if the differences between word
vectors had meaning in themselves.
• For example, consider these pairs:
– “boy” and “girl”.
– “man” and “woman”.
– “male” and “female”.
• The difference between these pairs is the gender, going
from male in the first element of each pair to female in
the second element.
• So, intuitively, we would like a mapping M such that:
M(“boy”) – M(“girl”) = M(“man”) – M(“woman”) = M(“male”) – M(“female”)
Word Embeddings
• To recap, we would like a mapping M such that:
M(“boy”) – M(“girl”) = M(“man”) – M(“woman”) = M(“male”) – M(“female”)
M(“large”) is similar to M(“big”)
M(“buy”) is similar to M(“purchase”)
• One-hot vectors are, by definition, incapable of such behavior.
– They do not depend in any way on the meaning of each word.
• A word embedding is a function that maps words to vectors and aims to capture semantic relationships like the ones above.
• We can learn such a function as part of training our model.
Learning a Word Embedding
• The word embedding can be implemented as a
multiplication of one-hot vector 𝒗 by a matrix 𝑾:
– 𝒗 = one_hot(token)
– M(token) = 𝑾 × 𝒗.
[Diagram 1: RNN model not using word embeddings: Input Layer (inputs) → one_hot (oh) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
[Diagram 2: RNN model using word embeddings: Input Layer (inputs) → one_hot (oh) → embedding matrix multiplication (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
Learning a Word Embedding
• The word embedding can be implemented as a
multiplication of one-hot vector 𝒗 by a matrix 𝑾:
– 𝒗 = one_hot(token)
– M(token) = 𝑾 × 𝒗.
• If the one-hot vector 𝒗 is 𝐾-dimensional, and the word embedding is 𝐿-dimensional, then matrix 𝑾 is of size 𝐿 × 𝐾.
– The model learns all 𝐿·𝐾 entries of matrix 𝑾 during training.
[Diagram: RNN model using word embeddings: Input Layer (inputs) → one_hot (oh) → embedding matrix multiplication (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
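• A tiny sketch with toy sizes (not our real dimensions) shows that multiplying 𝑾 by a one-hot vector simply selects one column of 𝑾, which is exactly what an embedding lookup does:

import numpy as np

K, L = 5, 3                        # toy vocabulary size and embedding dimension
W = np.random.randn(L, K)          # embedding matrix of size L x K
token_id = 2
v = np.eye(K)[token_id]            # K-dimensional one-hot vector for this token
embedding = W @ v                  # M(token) = W x v
print(np.allclose(embedding, W[:, token_id]))   # True: just column selection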
Word Embeddings in Keras
• The keras.layers.Embedding layer can be used directly for word embeddings.
– It maps each integer directly to its embedding vector, so no explicit one-hot step is needed.
[Diagram 1: RNN model using word embeddings, NOT using the Keras Embedding layer: Input Layer (inputs) → one_hot (oh) → embedding matrix multiplication (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
[Diagram 2: RNN model using word embeddings, using the Keras Embedding layer: Input Layer (inputs) → Embedding (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
Word Embeddings in Keras
inputs = keras.Input(shape=(None,), dtype="int64")
em = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x1 = layers.Bidirectional(layers.LSTM(32))(em)
x2 = layers.Dropout(0.5)(x1)
outputs = layers.Dense(1, activation="sigmoid")(x2)
model = keras.Model(inputs, outputs)
• This code creates an RNN model that uses word embeddings.
– Setting output_dim=256 specifies that each embedding is 256-
dimensional.
[Diagram: RNN model using word embeddings, using the Keras Embedding layer: Input Layer (inputs) → Embedding (em) → Bidirectional LSTM + dropout (x2) → Dense output layer (outputs)]
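• A minimal sketch of compiling and training this model (assuming the int_train_ds, int_val_ds, and int_test_ds datasets from the earlier preprocessing slides; the optimizer and number of epochs are just reasonable choices):

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",   # binary sentiment labels
              metrics=["accuracy"])
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10)
model.evaluate(int_test_ds)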
Results for Movie Reviews
• For movie review classification, the results do not
improve much.
• We still get around 87% accuracy, same as with the
previous RNN model that did not use word embeddings.
– As a reminder, bag-of-words with bigrams gave us around 90%
accuracy.
• Nonetheless, word embeddings are very commonly used
in text processing models.
– We will use them again for our English-to-Spanish translation
system.
Playing with Word Embeddings
• We can compute the distance between the vectors corresponding to two words using this code:

def we_diff(model, tv_layer, s1, s2):
    em_model = keras.Sequential(model.layers[0:2])
    v1 = em_model(tv_layer([s1]))
    v2 = em_model(tv_layer([s2]))
    diff = v2[0,0,:] - v1[0,0,:]
    return diff

def we_distance(model, tv_layer, s1, s2):
    diff = we_diff(model, tv_layer, s1, s2)
    dist = np.linalg.norm(diff)
    return dist

• Key idea: em_model contains only the first two layers of our RNN model (the input layer and the embedding layer), and thus maps a sequence of words to the sequence of the corresponding embedding vectors.
Playing with Word Embeddings
• Using the code from the previous slide, we try out various pairs of words:
we_distance(model, text_vectorization, "great", "excellent")
we_distance(model, text_vectorization, "great", "awful")
Output:
distance from "great" to "excellent" = 1.90
distance from "great" to "awful" = 3.63
• Reasonable result:
– In the word embedding space, “great” is mapped closer to
“excellent” than to “awful”.
Playing with Word Embeddings
• Using the same code, we try out more pairs of words:
we_distance(model, text_vectorization, "big", "large")
we_distance(model, text_vectorization, "big", "small")
Output:
distance from "big" to "large" = 0.91
distance from "big" to "small" = 0.79
• Unexpected result:
– In the word embedding space, “big” is mapped closer to “small” than
to “large”.
• Perhaps for the purposes of separating positive and negative
reviews, distinguishing these three words is not important.
Using Pretrained Word Embeddings
• Instead of learning word embeddings from our training data,
we can use pre-trained embeddings.
• This is another form of transfer learning:
– Learn word embeddings from a larger dataset.
– Use those pre-learned embeddings in a smaller dataset.
• Some popular pre-trained word embeddings include:
– GloVe:
Paper: “Global Vectors for Word Representation.” J. Pennington, R. Socher, C.
D. Manning. EMNLP 2014.
Link: https://nlp.stanford.edu/projects/glove/
– word2vec:
Paper: “Distributed Representations of Words and Phrases and their
Compositionality.” T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean.
NeurIPS 2013.
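• As a rough illustration, here is a sketch of one common way to load pre-trained GloVe vectors into a Keras Embedding layer (assuming a downloaded glove.6B.100d.txt file and the text_vectorization layer from earlier; this is one possible approach, not necessarily the exact textbook code):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

embedding_dim = 100
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *coefs = line.split()
        glove[word] = np.asarray(coefs, dtype="float32")

# Build an embedding matrix whose row i holds the GloVe vector
# for token i of our TextVectorization vocabulary.
vocab = text_vectorization.get_vocabulary()
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for i, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[i] = glove[word]

# A frozen Embedding layer initialized with the pre-trained vectors.
embedding_layer = layers.Embedding(
    input_dim=len(vocab),
    output_dim=embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)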
GloVe Embeddings
• You can download pre-trained GloVe embeddings from
here:
https://nlp.stanford.edu/projects/glove
• On my computer, using Anaconda, I got errors running
the textbook code with those files.
• The problem was that some characters (both in the GloVe embedding files and in the movie reviews dataset) had character codes greater than 127, i.e., they were not plain ASCII.
– Some functions complained when encountering these characters.
• I wrote code that replaces all those problematic
characters with SPACE (ASCII code 32).
Results with GloVe Embeddings
• On the movie review dataset, test accuracy using pre-
trained GloVe embeddings drops to 80.5%.
– We got about 87% using word embeddings that were learned
together with the rest of the model.
• Likely reasons that accuracy drops:
– The embeddings that were learned together with the rest of the model focused on words that indicate whether a review is positive or negative.
– It looks like the movie review dataset had enough training data to
learn word embeddings that were more useful than the pre-trained
ones.
Comparing the Two Embeddings
Output using word embeddings learned from the movie reviews:
distance from "buy" to "purchase" = 0.85
distance from "buy" to "shop" = 0.73
distance from "buy" to "study" = 0.77
distance from "buy" to "swim" = 1.05
Output using pre-trained GloVe embeddings:
distance from "buy" to "purchase" = 3.31
distance from "buy" to "shop" = 5.86
distance from "buy" to "study" = 6.83
distance from "buy" to "swim" = 7.16
• Words “buy”, “purchase”, “shop”, “study”, “swim” are not relevant for
classifying movie reviews.
• The GloVe embeddings capture that “buy” is closer to “purchase” and “shop” than to “study” and “swim”.
Comparing the Two Embeddings
Output using word embeddings learned from the movie reviews:
distance from "big" to "large" = 0.91
distance from "big" to "small" = 0.79
Output using pre-trained GloVe embeddings:
distance from "big" to "large" = 4.37
distance from "big" to "small" = 4.25
• Surprisingly, “big” is mapped closer to “small” than to “large” with both approaches.
• Once again, we have models that give reasonably good results in end-to-end
systems, but do not exhibit a level of understanding that resembles human
intelligence.
Next Lecture
Generative Adversarial Networks (GANs)