Harvard CS197 Lecture 4 Notes
Abstract
I’ve found that building is the most effective way of learning when it comes to AI/ML
engineering. Instead of a typical theoretical introduction to deep learning, I want to start our
first dive into deep learning through engineering using Huggingface, which has created a set
of libraries that are being rapidly adopted in the AI community. We’ll focus today on natural
language processing, which has seen some of the biggest AI advancements, most recently
through large language models. This lecture is structured as a live coding walkthrough: we will
fine-tune a pre-trained language model on a dataset. Through an engineering lens, this
walkthrough will cover dataset loading, tokenization, and fine-tuning.
Learning outcomes:
- Load up and process a natural language processing dataset using the datasets library.
- Tokenize a text sequence, and understand the steps used in tokenization.
- Construct a dataset and training step for causal language modeling.
HuggingFace
For this example, we are going to work with libraries from Hugging Face. Hugging Face has
become a community and data science center for building, training and deploying ML models
based on open source (OS) software. Fun fact: Hugging Face was initially a chatbot, and is named
after the emoji that looks like a smiling face with jazz hands – 🤗.
We’re going to use Huggingface to fine-tune a language model on a dataset. You may have to
follow the installation instructions here later in the lecture. Our lecture today will closely follow
this, and this, but with some of my own spin on things.
Loading up a dataset
We are going to use the 🤗 Datasets library. This library has three main features: (1) an efficient
way to load and process data from raw files (CSV/JSON/text) or in-memory data (Python dict,
pandas DataFrame), (2) a simple way to access and share datasets with the research and
practitioner communities (over 1,000 datasets are already accessible in one line), and (3)
interoperability with libraries like pandas, NumPy, PyTorch and TensorFlow.
For this demo, we are going to work with the SQuAD dataset. Briefly, the Stanford Question
Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions
posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a
segment of text, or span, from the corresponding reading passage, or the question might be
unanswerable. Fun fact: SQuAD came out of one of my first projects in my PhD.
Today, we’re going to see whether we can fine-tune a GPT-style model (DistilGPT2) on the
questions posed in SQuAD, so that we end up with a question-completion agent. We will load the
dataset from here:
https://huggingface.co/datasets/squad
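A minimal sketch of that call with the 🤗 Datasets library (load_dataset) might look like this:

```python
from datasets import load_dataset

# "squad" is the dataset identifier from the hub page linked above.
datasets = load_dataset("squad")
print(datasets)
# DatasetDict with a "train" and a "validation" split
```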
The load_dataset() method (1) downloads and imports the dataset loading script from the path, if
it’s not already cached inside the library, (2) runs the dataset loading script, which downloads
the dataset files from the original URL (if they’re not already downloaded and cached) and then
processes and caches the dataset, and (3) returns a dataset built from the requested splits in split (default: all).
The method returns a dictionary (datasets.DatasetDict) with a train and a validation subset;
what you get here will vary per dataset.
We can remove columns that we are not going to use, and use the map function to add a
special <|endoftext|> token that GPT2 uses to mark the end of a document.
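Here is a sketch of what that could look like; the helper name add_end_of_text is mine, and the
column names follow the SQuAD schema (id, title, context, question, answers):

```python
# Keep only the "question" column; the other columns are not needed
# for question completion.
datasets = datasets.remove_columns(["id", "title", "context", "answers"])

def add_end_of_text(example):
    # Append GPT-2's end-of-document marker to each question.
    example["question"] = example["question"] + "<|endoftext|>"
    return example

datasets = datasets.map(add_end_of_text)
```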
Note the use of the map() function. As specified here, map() allows you to apply a processing
function to each example in a dataset, and its primary purpose is to speed up such processing.
Tokenizer
Before we can use this data, we need to process it to be in an acceptable format for the
model. So how do we feed in text data into the model? We are going to use a tokenizer. A
tokenizer prepares the inputs for a model.
A tokenizer converts raw text into tokens. There are multiple rules that govern the process, and
they are specific to particular models. For tokenization, there are three main
subword tokenization algorithms: BPE (used by GPT-2 and others), WordPiece (used for
example by BERT), and Unigram (used by T5 and others); we won’t go into any of these, but if
you’re curious, you can learn about them here.
Since tokenization processes are model-specific, if we want to fine-tune the model on new
data, we need to instantiate the tokenizer using the name of the model, to make sure we use
the same rules that were used when the model was pretrained. This is all done by the
AutoTokenizer class:
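For example, assuming the DistilGPT2 checkpoint that the exercises refer to:

```python
from transformers import AutoTokenizer

model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```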
Pro-tip: The Hugging Face library contains tokenizers for all of its models. Tokenizers are available
in a Python implementation or a “Fast” implementation backed by the Rust language.
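For example, tokenizing an illustrative sentence (not from the dataset):

```python
tokens = tokenizer.tokenize("Where was the first CS197 lecture held?")
print(tokens)
# e.g. ['Where', 'Ġwas', 'Ġthe', 'Ġfirst', 'ĠCS', '197', 'Ġlecture', 'Ġheld', '?']
# (the exact subword splits depend on the pretrained vocabulary)
```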
Here, you can see the sentence broken into subwords. In GPT2 and other model tokenizers,
the space before a word is part of that word; spaces are converted into a special character (the Ġ)
in the tokenizer.
Once we have split text into tokens (what we’ve seen above), we now need to convert tokens
into numbers. To do this, the tokenizer has a vocabulary, which is the part we download when
we instantiate it with the from_pretrained() method. Again, we need to use the same
vocabulary used when the model was pretrained.
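Continuing the illustrative example:

```python
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # one vocabulary index per token
```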
The tokenizer actually automatically chains these operations for us when we use __call__:
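For example (the exact ids depend on the vocabulary):

```python
encoded = tokenizer("Where was the first CS197 lecture held?")
print(encoded)
# {'input_ids': [...], 'attention_mask': [1, 1, ..., 1]}
```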
The tokenizer returns a dictionary with 2 important items: (1) input_ids are the indices
corresponding to each token in the sentence, and (2) attention_mask indicates whether a
token should be attended to or not. We are going to ignore the attention_mask for now; if
you’re curious, you can read more about it here.
We are going to now tokenize our dataset. We apply a tokenize function to all the splits in our
“datasets” object.
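A sketch of such a tokenize function, assuming the text lives in the “question” column after the
earlier preprocessing:

```python
def tokenize_function(examples):
    # With batched=True, examples["question"] is a list of strings.
    return tokenizer(examples["question"])

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=["question"],
)
```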
We use the 🤗 Datasets map function to apply the preprocessing function over the entire
dataset. By setting batched=True, we process multiple elements of the dataset at once and
increase the number of processes with num_proc=4. Finally, we remove the “question”
column because we won’t need it now.
Data Processing
For causal language modeling (CLM), one of the data preparation steps often used is to
concatenate the different examples together, and then split them into chunks of equal size.
This is so that we can have a common length across all examples without needing to pad.
Let’s implement this transformation. We are going to use chunks defined by block_size of 128
(although GPT-2 should be able to process a length of 1024, we might not have the capacity to
do that locally).
We need to concatenate all our texts together then split the result in small chunks of a certain
block_size. To do this, we will use the map method again, with the option batched=True. This
option actually lets us change the number of examples in the datasets by returning a different
number of examples than we got. This way, we can create our new samples from a batch of
examples.
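Here is a sketch of that transformation, closely following the Hugging Face language-modeling
example:

```python
block_size = 128

def group_texts(examples):
    # Concatenate all texts in the batch.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Duplicate the inputs as labels for causal language modeling.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True, num_proc=4)
```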
Note that we duplicate the inputs for our labels. The 🤗 Transformers library will automatically
use these labels to set up the causal language modeling task (by shifting all tokens to the right).
Note how we can use the tokenizer’s decode function to go from our encoded ids back to the
text.
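For example:

```python
# Sanity-check one chunk by decoding its ids back into text.
print(tokenizer.decode(lm_datasets["train"][0]["input_ids"]))
```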
Finally, we will make smaller versions of our training and validation sets so we can fine-tune our
model in a reasonable amount of time.
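For instance (the subset sizes here are arbitrary):

```python
small_train = lm_datasets["train"].select(range(4000))
small_eval = lm_datasets["validation"].select(range(400))
```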
As part of our training args, we specify that we will push this model to the Hub. The Hub is a
huggingface platform where anyone can share and explore models, datasets, and demos.
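A sketch of the fine-tuning setup, assuming DistilGPT2 and illustrative hyperparameters (pushing
to the Hub requires being logged in to your Hugging Face account):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

training_args = TrainingArguments(
    output_dir="distilgpt2-finetuned-squad-questions",  # illustrative repo name
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
)
trainer.train()
```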
We can now evaluate the model. Because we want our model to assign high probabilities to
sentences that are real, and low probabilities to fake sentences, we seek a model that assigns
the highest probability to the test set. The metric we use is ‘perplexity’, which we can think of
as the inverse probability of the test set normalized by the number of words in the test set.
Therefore, a lower perplexity is better.
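Concretely, perplexity is the exponential of the average cross-entropy loss on the evaluation set,
so we can read it off the Trainer’s evaluation loss:

```python
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```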
We can now upload our final model and tokenizer to the hub.
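For example (the repository name is illustrative):

```python
trainer.push_to_hub()
tokenizer.push_to_hub("distilgpt2-finetuned-squad-questions")
```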
Exercises
Exercise 1: Now rather than starting with a pre-trained model, start with a model from scratch.
Exercise 2: Replace DistilGPT2 with a non-GPT causal language model.
Exercise 3: Replace the SQuAD dataset with another dataset (except for wikitext).
We can now tokenize some text, including some context and the start of a question:
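For example (the context and question prefix below are made up for illustration):

```python
prompt = (
    "The first CS197 lecture introduced the structure of the course. "
    "What is"
)
inputs = tokenizer(prompt, return_tensors="pt")
```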
Finally, we can now pass this input into the model for generation:
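A sketch of the call (the sampling arguments are illustrative):

```python
# Move the inputs onto the same device as the model (the Trainer may have
# placed the model on a GPU).
inputs = {k: v.to(model.device) for k, v in inputs.items()}

output_ids = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```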
The generate function is one we haven’t seen before, and it takes a lot of arguments.
Generation isn’t the main focus of our lecture, but if you’re curious, Huggingface has
great walkthroughs here & here.
And there we have it – our own model used for autocompleting a question! Awesome!