
Do Massively Pretrained Language Models Make Better Storytellers?

Abigail See, Aneesh Pappu∗, Rohun Saxena∗, Akhila Yerukola∗, Christopher D. Manning
Stanford University
{abisee,apappu,rohun,akhilay,manning}@cs.stanford.edu

∗ Equal contribution.

Abstract

Large neural language models trained on massive amounts of text have emerged as a formidable strategy for Natural Language Understanding tasks. However, the strength of these models as Natural Language Generators is less clear. Though anecdotal evidence suggests that these models generate better quality text, there has been no detailed study characterizing their generation abilities. In this work, we compare the performance of an extensively pretrained model, OpenAI GPT2-117 (Radford et al., 2019), to a state-of-the-art neural story generation model (Fan et al., 2018). By evaluating the generated text across a wide variety of automatic metrics, we characterize the ways in which pretrained models do, and do not, make better storytellers. We find that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.

1 Introduction

In 2018, large-scale neural models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and OpenAI GPT (Radford et al., 2018) emerged as a dominant approach in NLP. By pretraining on massive amounts of unlabeled text (often orders of magnitude larger than the target task's labeled dataset), these models achieve state-of-the-art performance across a variety of Natural Language Understanding benchmarks. In particular, the OpenAI GPT2 language model (Radford et al., 2019) achieves state-of-the-art performance on several language modeling benchmarks, even in a zero-shot setting. While GPT2's performance as a language model is undeniable, its performance as a text generator is much less clear. Though the model has generated certain impressive samples of text – such as a widely-circulated passage about Ovid's Unicorn (Radford et al., 2019) – there has been no detailed study to formalize these observations.

In this work, we perform an in-depth study of the properties of text generated by GPT2-117 (the smallest version of GPT2) in the context of story generation. By comparing to a state-of-the-art, specialized-architecture neural story generation model (Fan et al., 2018), we ask the following questions. In what ways does a large amount of open-domain pretraining data change the characteristics of generated text? In what ways does it make no difference? And is a task-specific architecture necessary?

For any probabilistic language model, the generated text is strongly affected by the choice of decoding algorithm – this is especially true for open-ended text generation tasks such as storytelling and chitchat dialogue (Kulikov et al., 2018; Holtzman et al., 2019). Nevertheless, most natural language generation papers evaluate only one decoding algorithm – this is often due to the time and expense required for human evaluation. For example, Fan et al. use top-k sampling (a decoding algorithm in which k governs the quality-diversity tradeoff), but only evaluate one value of k. However, evaluating one k gives an incomplete view of the generation system – several researchers have emphasized the importance of evaluating generation systems over the entire quality-diversity spectrum, rather than a single point on it (Caccia et al., 2018; Hashimoto et al., 2019).

In this study, we prioritize evaluating text across the whole k spectrum, and measuring many different automatic metrics, rather than a few human metrics. Though the lack of human evaluation limits our ability to measure overall quality (Liu et al., 2016; Novikova et al., 2017; Hashimoto et al., 2019), we are able to produce an objectively defined, richly detailed and reproducible evaluation of the generated text. To our knowledge, this work is the first comprehensive analysis of the characteristics of GPT2-generated text. Our study provides insight into the effect of large-scale pretraining on open-ended natural language generation, as well as the effect of k on text generated with top-k sampling. We hope our results will inform other researchers' choice of models, pretraining schemes, and decoding algorithms – decisions that can often feel like blind choices. To enable readers to browse the generated text, conduct their own evaluations, or run our evaluations on their own text, we publicly release our generated stories and evaluation code.[1]

[1] Code and generated stories available at https://github.com/abisee/story-generation-eval

2 Background

WritingPrompts dataset   WritingPrompts (Fan et al., 2018) is a story generation dataset containing 303,358 human-written (prompt, story) pairs collected from the /r/WritingPrompts subreddit – a forum where Reddit users compose short stories inspired by other users' prompts. An example can be seen at the top of Table 2. The mean prompt length is 28.4 words and the mean story length is 734.5 words. The dataset is 887MB of text in total, contains 200 million story words, and is divided into 90% train, 5% validation and 5% test splits.

The Fusion Model   The Fusion Model is a state-of-the-art neural story generation architecture trained on the WritingPrompts dataset (Fan et al., 2018). It is based on the Convolutional Seq2seq model of Gehring et al. (2017) and aims to improve two aspects of story generation: modeling long-range context and increasing relevance of the story to the prompt. To achieve the former, the model uses a multi-scale gated self-attention mechanism. For the latter, the model uses a fusion mechanism (Sriram et al., 2018) in which one seq2seq model is trained on the task, then frozen, and a second seq2seq model is trained on the task with access to the first model's hidden states. Compared to the Convolutional Seq2seq model and other baselines, the Fusion Model achieves improved perplexity, story-prompt relevance and human preference scores. The Fusion Model has a vocabulary of 104,960 words, a 3-layer encoder and 8-layer decoder in the first seq2seq model, and a 5-layer encoder and 5-layer decoder in the second model – in total, 255.4 million parameters.

GPT2-117   GPT2 (Radford et al., 2019) is a large Transformer language model trained on WebText, a diverse corpus of internet text (not publicly released) containing over 8 million documents equalling 40GB of text in total. The full-size GPT2 model, which has 1542 million parameters, obtains state-of-the-art results on a variety of language modeling and other Natural Language Understanding benchmarks. At the time of our experiments, Radford et al. had only released the smallest of the models, known as GPT2-117.[2] This model, which we use for our experiments, has 12 layers and 117 million parameters. Like the full-size GPT2 model, it has a vocabulary of 50,257 byte-pair-encoding (BPE) tokens. The BPE encoding allows the model to encode and generate any Unicode string, regardless of preprocessing, tokenization, or vocabulary size. The model has a context size of 1024, meaning it can process text up to 1024 BPE tokens in length.

[2] Since conducting our experiments, larger models have been publicly released. At the time of writing, the full-size GPT2 model has not been publicly released.

Decoding algorithms   Inspired by Neural Machine Translation, most early attempts at open-ended neural text generation (such as conversational response generation) used the beam search decoding algorithm (Shang et al., 2015; Serban et al., 2016). Like greedy decoding, beam search is a likelihood-maximizing decoding algorithm – given the input sequence x, the objective is to find an output sequence y which maximizes P(y|x). However, researchers have shown that for open-ended generation tasks (including storytelling), beam search produces repetitive, generic and degenerate text (Holtzman et al., 2019).

More recently, top-k sampling has emerged as a primary decoding algorithm for open-ended text generation (Fan et al., 2018; Radford et al., 2019). In top-k sampling, on each step of the decoder the probability distribution over the vocabulary is truncated to the top k tokens, then re-normalized. The next token is sampled from the new distribution. Top-k sampling can be regarded as somewhere between a likelihood maximizing algorithm (when k = 1; greedy decoding) and an unbiased sampling algorithm (when k = vocabulary size). Fan et al. use top-k sampling (with k = 10) to generate stories, and Radford et al. show impressive samples of generated text (primarily from the full-size GPT2 model) for k = 40.
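
To make the truncation-and-renormalization step concrete, the following is a minimal PyTorch sketch of a single top-k sampling step (illustrative only; the function name and tensor shapes are assumptions, not the models' actual decoding code):

```python
import torch

def top_k_sample_step(logits: torch.Tensor, k: int) -> int:
    """Sample the next token id from a top-k truncated distribution.

    logits: tensor of shape (vocab_size,) for the current decoder step.
    k: number of highest-probability tokens to keep.
    """
    top_values, top_indices = torch.topk(logits, k)    # keep the k largest logits
    probs = torch.softmax(top_values, dim=-1)          # re-normalize over the top k
    choice = torch.multinomial(probs, num_samples=1)   # sample from the truncated distribution
    return int(top_indices[choice])                    # map back to a vocabulary id
```

Applying the softmax only to the retained logits is equivalent to zeroing out the rest of the distribution and re-normalizing, so k = 1 reduces to greedy decoding and k equal to the vocabulary size recovers unbiased sampling from the full distribution.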

3 Experimental Details

Preprocessing   Fan et al. truncate WritingPrompts stories to 1000 words before training and testing. Due to the limited context size of GPT2-117, we additionally exclude (prompt, story) examples that are longer than 1024 BPE tokens when concatenated. The resulting dataset, which we call WritingPrompts-1024, has 192,364 training, 11,115 validation, and 10,686 test examples.

The Fusion Model   We use the pretrained version of the Fusion Model, which is available in the Fairseq framework (Ott et al., 2019). For comparability with GPT2-117, we evaluate the Fusion Model on WritingPrompts-1024 (see Table 1), obtaining perplexities similar to those reported by Fan et al. on the full WritingPrompts dataset.

GPT2-117   In order for the model to condition on prompts and generate stylistically correct stories, we finetune GPT2-117 on WritingPrompts-1024.[3] We frame WritingPrompts as a language modeling task, representing the prompt and story as a single sequence separated by a delimiter token. We finetune the pretrained model until convergence using the default hyperparameters provided in the HuggingFace repository (though we reduce batch size to fit on a single GPU), and use the finetuned model for all further evaluations.

We compute the word-level perplexity of the finetuned GPT2-117 on the WritingPrompts-1024 dataset. That is, we normalize the total negative log probability of the target text by the number of word-level (i.e. Fusion Model) tokens, not the number of BPE tokens. This enables us to compare the perplexities of the two models, despite the tokenization difference (Radford et al., 2019). The finetuned GPT2-117 obtains a test set word-perplexity of 31.54[4] – six points lower than the Fusion Model.

Table 1: Word-level perplexities on WritingPrompts-1024 for the Fusion Model and finetuned GPT2-117.

Model        | Valid ppl | Test ppl
Fusion Model | 37.05     | 37.54
GPT2-117     | 31.13     | 31.54

Generation settings   For both models, we generate stories using top-k sampling, obtaining 1000 stories (from 1000 different test set prompts) for several values of k ranging from 1 to vocabulary size. We use softmax temperature 1. Like Fan et al., we generate exactly 150-word stories and block the Fusion Model from generating <UNK>. To obtain human-written stories for comparison, we truncate WritingPrompts-1024 test set stories to 150 words (discarding those shorter than 150 words). To reduce variance, measurements for human stories are computed over this entire set (rather than just 1000 stories).

[3] We use the PyTorch re-implementation of GPT2-117 available at https://github.com/huggingface/pytorch-transformers
[4] This is similar to other GPT2-117 WritingPrompts finetuning experiments (Mao et al., 2019; Ziegler et al., 2019).

4 Story-prompt relatedness

Prior research has observed that seq2seq systems frequently produce text that is unrelated to the provided context – particularly under likelihood-maximizing decoding algorithms such as beam search. The issue has inspired multiple explanations (Jiang and de Rijke, 2018) and multiple solutions – such as alternative training objectives (Li et al., 2016), decoding objectives (Baheti et al., 2018; See et al., 2019), and architectural changes (Fan et al., 2018). In this section, we measure how strongly the models condition on the prompt.

Prompt ranking accuracy   For both models, we compute prompt ranking accuracy (Fan et al., 2018), which measures the language model's sensitivity to the provided prompt. Following the methodology of Fan et al., we randomly select 1000 human-written stories from the test set, and measure the probability (according to the model) of each story conditioned on 10 different prompts – the true prompt, plus nine randomly selected prompts. The prompt ranking accuracy of a model is the percentage of cases in which the model assigns a higher probability to the story under its true prompt than under all of the other nine. We find that GPT2-117 scores 80.16% on this task, while the Fusion Model scores 39.8%.[5] Random chance scores 10%. This striking result indicates that GPT2-117 conditions on the prompt much more strongly than the Fusion Model. This is notable, especially because the fusion technique is intended to improve story-prompt relevance.

[5] Fan et al. (2018) report a prompt ranking accuracy of 16.3% for the Fusion Model. We provided the authors with our prompt ranking accuracy code (which was built on top of the authors' code). The authors indicated that the discrepancy may be due to some code version changes between the time of their original experiments and their code release.
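
The prompt ranking computation can be sketched as follows; the `story_logprob(prompt, story)` scoring helper, which should return the model's log probability of the story conditioned on a prompt, is an assumed interface rather than the released evaluation code:

```python
import random

def prompt_ranking_accuracy(pairs, story_logprob, n_distractors=9, seed=0):
    """pairs: list of (prompt, story) tuples from the test set.
    story_logprob(prompt, story) -> float: model log P(story | prompt).
    Returns the fraction of stories scored highest under their true prompt."""
    rng = random.Random(seed)
    all_prompts = [p for p, _ in pairs]
    correct = 0
    for true_prompt, story in pairs:
        # Nine randomly chosen distractor prompts, excluding the true one.
        distractors = rng.sample([p for p in all_prompts if p != true_prompt], n_distractors)
        true_score = story_logprob(true_prompt, story)
        if all(true_score > story_logprob(p, story) for p in distractors):
            correct += 1
    return correct / len(pairs)
```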

[Figure 1: Story-prompt sentence similarity vs. k (top-k sampling) for human text, the Fusion Model, and GPT2-117. Compared to the Fusion Model, GPT2-117 produces stories that are more semantically similar to the prompt. Similarity decreases as k increases.]

N-gram similarity   For n = 1, 2, 3, we measure the percentage of generated n-grams that also appear in the prompt. For all n and k, we find that GPT2-117 has a higher overlap (i.e. copies more from the prompt) than the Fusion Model – see Figure 6 in the Appendix. Furthermore, for k < 100, the GPT2-117 overlap is generally much higher than human levels. Both these phenomena can be seen in Table 2, where, for k = 10, GPT2-117 copies words such as queen more often than both the Fusion Model and the human-written story.

Sentence embedding similarity   To capture a higher-level notion of semantic similarity, we measure story-prompt sentence similarity – the cosine similarity of story-prompt sentence pairs, averaged by taking the mean over all pairs. Sentences are represented by the embedding method of Arora et al. (2017) – a weighted average of the GloVe embeddings (Pennington et al., 2014) of the words, with the first principal component removed. As shown in Figure 1, we find a similar pattern as for n-gram similarity: GPT2-117 generates sentences that are more similar to the prompt than the Fusion Model for all k, and both models' prompt similarity decreases as k increases.

Named entity usage   Generally, most named entities mentioned in the prompt (such as Queen and England in Table 2) should also be mentioned in the story. Using the spaCy named entity recognizer,[6] we measure the prompt entity usage rate, which is the percentage of all prompt named entities that appear in the story.[7] As shown in Figure 7 in the Appendix, we find that GPT2-117 uses more of the prompt named entities than the Fusion Model (as well as more named entities overall), but both models use fewer named entities than humans when k is less than vocabulary size.

These patterns can be seen in Table 2: GPT2-117 uses the prompt entities Queen and England whereas the Fusion Model does not (for either k), and GPT2-117 uses specific time entities such as Thursday and 3:26 PM. While the human story introduces highly-related entities such as Charles Windsor and Prince of Wales that were not in the prompt, neither model does this (for either k).

Conclusion   In this section, we found that GPT2-117 conditions on the prompt much more strongly than the Fusion Model – a result which holds both in language modeling and generation settings. The latter result supports Radford et al.'s informal observation that GPT2 has a 'chameleon-like' ability to 'adapt to the style and content of the conditioning text'.[8] We speculate that GPT2-117's stronger conditioning ability may derive from its Transformer decoder architecture, whose powerful self-attention is used for story-prompt attention. Though the Fusion Model uses a similar self-attention mechanism in the decoder (i.e., story side), the prompt-story attention has a simpler formulation – for example, there are no separate key and value vectors (Gehring et al., 2017). Lastly, we note that very strong prompt-conditioning is not always a good thing – GPT2-117 often generates stories that copy too much or too literally from the prompt when k is small (this can be seen in Figure 6 in the Appendix).

[6] https://spacy.io
[7] Given that we limit stories to 150 words, this percentage is lower than it would be if we generated longer stories.
[8] https://openai.com/blog/better-language-models/
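
A simplified sketch of the story-prompt sentence similarity described in this section, using the Arora et al. (2017) weighted-average embedding. The `glove` and `unigram_prob` lookups and the weighting constant are assumptions, and the principal component is estimated here from the sentences being compared rather than from a larger corpus, so this is an approximation of the published method:

```python
import numpy as np

def sif_embeddings(sentences, glove, unigram_prob, a=1e-3):
    """sentences: list of tokenized sentences (lists of lowercased words).
    glove: dict word -> np.ndarray; unigram_prob: dict word -> float.
    Returns an array of smoothed-inverse-frequency sentence embeddings."""
    dim = len(next(iter(glove.values())))
    vecs = []
    for sent in sentences:
        words = [w for w in sent if w in glove]
        if not words:
            vecs.append(np.zeros(dim))
            continue
        weights = [a / (a + unigram_prob.get(w, 0.0)) for w in words]
        vecs.append(np.average([glove[w] for w in words], axis=0, weights=weights))
    vecs = np.stack(vecs)
    u, _, _ = np.linalg.svd(vecs.T, full_matrices=False)
    pc = u[:, 0]                                   # first principal component
    return vecs - vecs @ np.outer(pc, pc)          # remove its projection

def story_prompt_similarity(prompt_sents, story_sents, glove, unigram_prob):
    """Mean cosine similarity over all (prompt sentence, story sentence) pairs."""
    embs = sif_embeddings(prompt_sents + story_sents, glove, unigram_prob)
    p_embs, s_embs = embs[:len(prompt_sents)], embs[len(prompt_sents):]
    sims = []
    for p in p_embs:
        for s in s_embs:
            denom = np.linalg.norm(p) * np.linalg.norm(s)
            sims.append(float(p @ s / denom) if denom > 0 else 0.0)
    return float(np.mean(sims))
```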

5 Coherence

A good story generation model should produce coherent text with a logical ordering of events. Similarly, the underlying language model should be a good coherence scorer – assigning higher probability to coherent text than incoherent text. Barzilay and Lapata (2008) evaluate a coherence scorer by measuring its ability to rank shuffled human-written text as less coherent than the original unshuffled text. We use this method to evaluate our story generation models.

For each story in the test set, we select the first 15 sentences. We then produce 14 corrupted versions of the story by switching each pair of adjacent sentences. We use the language model to compute the probability of each of the 14 corrupted stories, as well as the original story. The model's error rate is the percentage of cases in which it rates any of the 14 corrupted candidates better than the original candidate. Random guessing yields 93.33% error. Both models perform well on this task – the Fusion Model has an error rate of 3.44% and GPT2-117 an error rate of 2.17%. This 36.92% error reduction indicates that GPT2-117 is more sensitive to ordering of events.

We also investigate how the position of the swap affects its plausibility (relative to other positions). Figure 2 shows, for each swap position, the mean rank assigned to that swap by the model (where rank 1 is the most probable of the 14 corrupted candidates, and rank 14 the least probable). GPT2-117 assigns a much lower rank to the first few swap positions (i.e., rates them more probable) than the later positions. The Fusion Model shows a similar but less pronounced pattern. This shows that both models are less sensitive to out-of-order sentences that occur at the beginning of the text, than those occurring later.[9] The stronger pattern for GPT2-117 may be due to its stronger context conditioning (as shown in Section 4) – thus becoming more sensitive as context increases. However, even for the first three swaps, GPT2-117 is more accurate than the Fusion Model at distinguishing the swapped text from the original.

[9] It's also possible that out-of-order sentences are inherently harder to detect at the beginning of text.

[Figure 2: Mean rank (1-14) assigned to each swapped-sentence position, for the Fusion Model and GPT2-117. A higher mean rank indicates higher sensitivity (i.e. the model assigns lower probability) relative to other positions. Both models are less sensitive to swapped sentences at the beginning of the text, compared to later. GPT2-117 shows this pattern more strongly, indicating greater use of context.]
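
The sentence-swap evaluation can be sketched as below; `text_logprob` is an assumed helper that returns the model's log probability of a candidate text given the prompt:

```python
def swap_error_rate(stories, text_logprob, n_sents=15):
    """stories: list of (prompt, sentences) pairs, where sentences is a list of strings.
    text_logprob(prompt, text) -> float: model log probability of the text.
    Returns the fraction of stories for which some adjacent-sentence swap
    is scored more probable than the original ordering."""
    errors = 0
    for prompt, sentences in stories:
        sents = sentences[:n_sents]
        original_score = text_logprob(prompt, " ".join(sents))
        corrupted_scores = []
        for i in range(len(sents) - 1):                  # 14 swaps for 15 sentences
            swapped = sents[:i] + [sents[i + 1], sents[i]] + sents[i + 2:]
            corrupted_scores.append(text_logprob(prompt, " ".join(swapped)))
        if any(score > original_score for score in corrupted_scores):
            errors += 1
    return errors / len(stories)
```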

6 Repetition and rareness

Generic, under-diverse and repetitive text is a well-documented problem in neural text generation (Jiang and de Rijke, 2018). While there are many proposed solutions to the problem (Li and Jurafsky, 2016; Vijayakumar et al., 2018; Baheti et al., 2018; Zhang et al., 2018; See et al., 2019), it has been shown that a primary cause is likelihood-maximizing decoding algorithms such as greedy decoding and beam search (Holtzman et al., 2019). In this section we investigate the role of large-scale pretraining, and the role of k, in this problem.

N-gram repetition   The distinct-n metric of a piece of text is the number of unique n-grams divided by the total number of generated n-grams (Li et al., 2016). We measure distinct-n of the generated stories for n = 1, 2, 3. A high ratio indicates a high level of within-story lexical diversity, while a low ratio indicates a large amount of within-story repetition. As shown in Figure 3, both models' unigram diversity is far below that of human text when k is small. For example, at k = 10 (the setting used by Fan et al.), the Fusion Model obtains a distinct-1 of 42.4%; much less than the human level of 60.0%. This results in a high level of repetition, as shown in Table 2: for k = 10, both models repeat many phrases (such as always, so scared, and finally).

[Figure 3: Distinct-1 vs. k (top-k sampling) for human text, the Fusion Model, and GPT2-117. Repetition (low distinct-1) is primarily caused by choice of decoding algorithm (here low k), not insufficient training data. GPT2-117 is trained on 45× more data than the Fusion Model, but is similarly repetitive for all k.]

For bigrams and trigrams, the pattern is similar to unigrams (see Figure 9 in the Appendix). For both models, distinct-n increases as k increases, converging to a value close to the human level as k approaches vocabulary size. Though GPT2-117 has a slightly higher distinct-n than the Fusion Model for most values of k, the difference is negligible compared to the influence of k. We make three conclusions from these patterns: (1) Our findings support Holtzman et al.'s observation that repetition is strongly related to choice of decoding algorithm, and that likelihood-maximizing algorithms (such as top-k sampling with low k) are a primary cause of repetition. (2) The models have in fact learned the correct rate of repetition in human text – they are able to match this rate when they sample from their full (untruncated) distribution. (3) Repetition is unlikely to be solved by more pretraining data alone – even though GPT2-117 is trained on 45 times as much data as the Fusion Model, it produces text that is almost equally repetitive (for equal k).

Rare word usage   We compute the mean log unigram probability of the words in the generated story[10] – a high value indicates using fewer rare words while a low value indicates more rare words. As shown in Figure 12 in the Appendix, word rareness is primarily governed by k – however, GPT2-117 has a lower mean log unigram probability (i.e., uses more rare words) than the Fusion Model for all equal values of k ≥ 2. This can be seen for example in Table 2, where GPT2-117 generates rarer words such as idle and copious for k = 1000. GPT2-117 also generates fewer stopwords than the Fusion Model, for all equal k.

GPT2-117's slightly higher rare word usage (compared to the Fusion Model) might be explained by: (1) its BPE encoding, which allows it to generate new words, not just those in a fixed vocabulary; (2) pretraining on a large amount of diverse text, allowing it to learn to produce a greater variety of words; (3) stronger conditioning on the prompt as described in Section 4 – which may inject more rareness into the generated text.

Conclusion   Choice of decoding algorithm is a primary factor in diversity and repetition problems, with likelihood-maximizing algorithms the main culprit. Although GPT2-117 generates more rare words and is very slightly less repetitive than the Fusion Model, the difference is small compared to the effect of k, indicating that training data alone is unlikely to solve these problems.

7 Syntactic style and complexity

A well-trained story generation model should match both the syntactic style and complexity of its training data. Low complexity can be a sign of less sophisticated writing, while high complexity can be a sign of poor readability (Beers and Nagy, 2009; McNamara et al., 2010). In this section, we measure some features related to the syntactic style and complexity of the generated stories.

Sentence length   Sentence length is a simple but effective feature to estimate readability and syntactic complexity of text (Kincaid et al., 1975; Roemmele et al., 2017). We find that both models generate sentences that are on average shorter than human sentences when k is small, but converge to approximately human length as k increases (see Figure 8 in the Appendix).

Part-of-speech usage   It has been shown that the distribution of parts-of-speech (POS), and more generally the distribution of POS n-grams,[11] is a useful feature to represent the style of a piece of text (Argamon et al., 1998; Ireland and Pennebaker, 2010; Roemmele et al., 2017).

Firstly, we compare the part-of-speech distributions of the model-generated text and the human text (see Figure 11 in the Appendix). Both models (especially GPT2-117) closely fit the human POS distribution as k approaches vocabulary size.[12] This implies that, as with lexical diversity, the models have no difficulty fitting the statistical distribution of human syntax. However, under likelihood-maximizing decoding algorithms such as top-k sampling with low k, a completely different distribution emerges, in which text contains more verbs and pronouns than human text, and fewer nouns, adjectives and proper nouns.

Secondly, we measure the syntactic diversity of the text using the distinct-n metric for POS n-grams (n = 1, 2, 3) – see Figure 10 in the Appendix. As with lexical diversity (see Section 6), we find that syntactic diversity is similar for the two models, is very low when k is small, and matches human level as k approaches vocabulary size. It's likely that for low k, the syntactic under-diversity of the text is largely caused by lexical under-diversity (i.e. repetition). However, we note that as k increases, lexical diversity reaches human level sooner than syntactic diversity – for example, GPT2-117's lexical distinct-3 reaches human level at k = 600 (Figure 9c), but its POS distinct-3 reaches human level at k = 6000 (Figure 10c). This implies that, even when the text is no more repetitive than human text, it may still be syntactically repetitive (using the same part-of-speech patterns repeatedly).

Conclusion   We find that when k is small, the syntactic complexity of generated text is low, consisting of shorter sentences and a narrower range of syntactic patterns. However, as k approaches vocabulary size, the syntactic style of generated text closely matches human syntactic patterns. As with n-gram diversity in Section 6, our results show that syntactic under-diversity is primarily caused by low k, not insufficient training data.

[10] The unigram probability distribution was calculated with respect to the WritingPrompts training set.
[11] For example, the sentence I like cats has the POS bigrams PRONOUN VERB and VERB NOUN.
[12] One exception is Proper Noun: both models fail to produce enough of these even as k approaches vocabulary size.

8 The element of surprise

Model confidence over time   Several researchers have observed that model over-confidence (the model placing high probability on a small range of tokens) can cause poor quality generation (Jiang and de Rijke, 2018; Holtzman et al., 2019). In particular, they show that for likelihood-maximizing decoding algorithms such as beam search, model confidence can increase in a snowball-like effect, getting stuck in a loop of repetitive but increasingly self-confident text. We observe this problem in both our models when k is small. For example, in Figure 4, both models fall into self-reinforcing repetitive loops with rising confidence. The loop is difficult to break – the Fusion Model briefly escapes (shown as a sudden downwards spike), but quickly returns. By contrast, the human text does not show a strong rising trend in probability, and intermittently uses low probability words throughout.[13]

[Figure 4: Token probability over the first 150 tokens of three passages. Under top-k sampling with small k (k = 2), the two models (panels a and c) produce text that falls into increasingly confident repeating loops. By contrast, human text (panel b) maintains an irregular pattern of surprising (low probability) tokens. The human text probabilities are measured with respect to the Fusion Model, but similar patterns hold for GPT2-117. Inspired by Holtzman et al. 2019's figure showing probabilities under beam search.]
(a) Fusion Model (k = 2): I had never seen a man so young before. I had never seen him before, but he had always seemed to be a man of a man. He was young, and he was young. He was a man of a man, and a man who was young, and a man who was [...]
(b) Human Text: "Looks like the rain's stopped." I peered out the window. Art was right; time to get to work. "Alright, let's move out." I could hear the scraping of the stone armor as the men slowly stood. Despite the training, [...]
(c) GPT2-117 (k = 2): I've always been a man of the people. I've always been a strong man. I've always been a strong man. I was born in the city, I was raised in the country. I was raised in a family that wasn't very good. I 'm not a good man. [...]

We formalize these anecdotal observations by measuring the average probability of each of the first 150 word-level tokens in the story (Figure 5). We find that even when teacher-forcing on human text, the token probabilities increase slightly as the story progresses. This is likely due to the usefulness of additional context, which increases the model's prediction accuracy. By comparison, we find that when generating with top-k sampling, the probabilities increase more rapidly, and the increase is even more rapid for smaller k. This confirms that likelihood-maximizing decoding algorithms (such as top-k sampling with small k) lead to more rapidly increasing model over-confidence. Furthermore, we find this pattern holds for both models, with probabilities increasing at a similar rate for equal k. This indicates that, like repetition, model over-confidence is unlikely to be solved by more training data, and is largely governed by choice of k.

[13] Gehrmann et al. (2019) also identify presence of low probability words as an indicator of human-generated text.

[Figure 5: Mean probability of each of the first 150 word-level story tokens, for top-k sampling with k = 5 and k = 20 and for teacher-forcing on human text. When teacher-forcing the model on human text, probability increases slowly. When generating with top-k sampling, probability increases faster, especially for smaller k. This plot is for the Fusion Model; similar patterns hold for GPT2-117.]

Overall model confidence   We also measure the models' overall confidence, as represented by the total log probability (according to the model) of the generated story. For both models, we find that story probability decreases as k increases – see Figure 13 in the Appendix. This makes sense, as higher k means sampling tokens with lower probability. As k approaches the vocabulary size, the Fusion Model's generated story probability matches the probability it assigns to human-written WritingPrompts stories. Interestingly however, the same is not true for GPT2-117, which converges to a story probability that is lower than the probability it assigns the human stories. This means that under full (non-truncated) sampling, the Fusion Model produces text that is equally surprising (to itself) as the WritingPrompts stories, whereas GPT2-117 produces text that is more surprising to itself. Explaining this observation is an open question – we speculate that GPT2-117's WebText pretraining may cause it to generate (under high k) text in a style or genre that is less predictable than WritingPrompts stories.

9 Concreteness

Brysbaert et al. (2014) define the concreteness of a word as 'the degree to which the concept denoted by a word refers to a perceptible entity'. Concrete words are generally easier to remember than abstract words, and psycholinguists have theorized they may be learned differently (i.e., concrete words by direct experience and abstract words by text and discourse). Brysbaert et al. provide human concreteness ratings for 40,000 common English lemmas rated on a scale from 1 to 5.[14] We use these ratings to measure the mean concreteness of the nouns and verbs in the story text – see Figure 14 in the Appendix.

We find that, for the same k, GPT2-117 tends to generate more concrete words than the Fusion Model, and that for both models, concreteness converges to approximately human levels as k increases. Interestingly, however, when k is small, the noun concreteness is much higher than human levels, whereas the verb concreteness is much lower. This indicates that for small k, both models produce stories that, compared to human-written stories, have too many physical objects (as opposed to abstract nouns), and too few physical actions (as opposed to abstract verbs). This reflects the trend demonstrated in Table 2: when k is small, the models tend to generate descriptive sentences with mostly is verbs (e.g. I was always so excited), and physical nouns (e.g. mother, father, queen). Only when k increases do we see more tangible actions (e.g. The bar patrons snickered) and abstract nouns (e.g. pain, glances). A detailed example, with all nouns and verbs annotated with concreteness, is in the Appendix (Table 3).

[14] For example, the nouns television, darkness, and idea are rated 4.83, 3.85 and 1.61 respectively, and the verbs talk, see, and hope are rated 4.07, 3.21 and 1.25 respectively.
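
The concreteness measurement can be sketched as follows, assuming the Brysbaert et al. (2014) ratings have been loaded into a `concreteness` dictionary keyed by lowercased lemma; the spaCy pipeline name is the standard small English model, not necessarily the one used in the paper:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def mean_concreteness(text, concreteness):
    """Mean Brysbaert concreteness rating (1-5) of the nouns and verbs in `text`.
    concreteness: dict mapping lowercased lemma -> rating."""
    noun_scores, verb_scores = [], []
    for tok in nlp(text):
        rating = concreteness.get(tok.lemma_.lower())
        if rating is None:
            continue                      # skip words without a human rating
        if tok.pos_ == "NOUN":
            noun_scores.append(rating)
        elif tok.pos_ == "VERB":
            verb_scores.append(rating)
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(noun_scores), mean(verb_scores)
```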

Table 2: A prompt and human story from the dataset, plus the models' top-k generated stories, for two values of k.

Prompt: You just discovered England's biggest secret: every time they say long live the queen it extends the queens life.

Human-Written Story: I am immortal, I have inside me blood of kings. My name in Charles Windsor and I am heir to a throne. Prince of Wales, they've called me that my entire life and now I lay on my death bed and my mother - still going strong. I'm not sure how it's possible but she seems ageless, almost timeless. My whole life I've waited for that [...]

Fusion Model (k = 10): I have no idea when my father would come to visit this place, I was always so excited. I was always so scared. I had to go back to work. I was a little scared of him, but when I was younger, he had never come in for a visit. His mother told me, "I'm so scared", as he always did. She never talked to me about anything [...]

GPT2-117 (k = 10): * " So what are we going to do? " * It was the first time she asked me this question that I ever had asked her. It has taken years but finally I am finally able to do it. It seems that this queen is the queen of England and she's finally the one to tell me that this queen is the reason why she's not being allowed to die. [...]

Fusion Model (k = 1000): "Where am I? What happened?" "Having been working on my job for over 6 hours now, I do not know how you worked!" "I have been working for the last three years. Surely I am an ancient god now." The bar patrons snickered. "Hello?" "Those last three years have been worse than a year ago." Pain. [...]

GPT2-117 (k = 1000): It was an odd occasion for the Queen of England to meet with her. The plane sat idle at 3:26 PM on a Thursday night. Yesterday, the Queen had taken it upon herself to try and get a good look at the plane which had recently been found abandoned. A copious amount of curious glances from around the room until [...]

10 Conclusions

The effect of massive pretraining   In this study, we find that GPT2-117 is a better story generation model than the Fusion Model in several specific ways: it conditions much more strongly on the provided context, is more sensitive to correct ordering of events, and generates text that is more contentful (using more rare words, concrete words, and named entities). In particular, the stronger conditioning result is striking, as the Fusion Model is a complex task-specific architecture designed to increase story-prompt relevance. This demonstrates that sometimes, a general-purpose model architecture can outperform a complex task-specific architecture when provided with enough pretraining data.

However, we find that in other aspects, GPT2-117 performs no better than the Fusion Model: when k is small, the models generate text that is equally lexically under-diverse, syntactically under-complex, and repetitive – with a tendency to fall into a snowball effect of increasing over-confidence. These problems correct themselves (i.e., the metrics match human levels) when the models generate from their untruncated distribution. Our results show that these oft-cited neural generation problems are not the fault of the models themselves (which are in fact statistically well-trained to match human text for these metrics), nor caused by too little training data (as these problems are not improved by GPT2-117's extensive pretraining). Instead, they are primarily caused by likelihood-maximizing decoding algorithms – such as greedy decoding, beam search, and top-k sampling with low k.

The effect of k   This study detailed the typical characteristics of long-form text generated by neural language models in open-ended settings, under both high entropy (large k) and low entropy (small k) decoding algorithms. The negative characteristics of low k output (genericness, repetition, over-simplicity) are by now familiar to researchers. However, we also uncovered some less obvious characteristics of low-k generated text: compared to human-written text, it tends to copy more from the provided context (particularly GPT2-117); it contains more verbs and pronouns but fewer nouns and adjectives; its nouns are more concrete but its verbs are less concrete; and it uses a smaller range of syntactic patterns (a phenomenon that can't be entirely attributed to n-gram repetition).

As k increases to vocabulary size, we find that the model-generated text closely fits the human text on most of the metrics we measured. However, it is clear by inspection that the high-k model-generated text lacks many crucial aspects such as commonsense reasoning, world knowledge and multi-sentence coherence – an example of this superficially fluent but nonsensical text can be seen in Table 4 in the Appendix. We believe that true progress in open-ended Natural Language Generation will come from attempting to address these high k problems – i.e., strategies to imbue the language model with better reasoning, knowledge and planning abilities – rather than continuing to seek ways to mitigate the diversity and repetition problems of the low k setting.

Limitations of this study   This study uses only the smallest version of GPT2. It is likely that the larger versions of GPT2 may exhibit stronger statistical differences for the metrics we examine. Such a study would illustrate the effect of larger model capacity, and more fully reveal the possible benefits of massive pretraining. We release our annotation code so that other researchers may repeat our study on more models and datasets.

This study did not include human evaluation, which is currently the only reliable way to assess overall text quality, as well as quantify the deficiencies of high k output described above (coherence, reasoning, and world knowledge). As such, this study quantifies the diversity side more than the quality side of the quality-diversity tradeoff. Consequently, this study demonstrates the importance of developing better methods to computationally quantify notions such as text coherence, logicality and commonsense correctness – an effort that may ultimately hold the key to generating text with those desirable attributes.

11 Acknowledgments

This work was funded by the Gerald J. Lieberman Fellowship, Tencent, and the DARPA CwC program under ARO prime contract no. W911NF-15-1-0462. We also thank the reviewers for their helpful comments.

References

Shlomo Argamon, Moshe Koppel, and Galit Avneri. 1998. Routing documents according to style. In First International Workshop on Innovative Information Systems.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.

Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan. 2018. Generating more interesting responses in neural conversation models with distributional constraints. In Empirical Methods in Natural Language Processing.

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.

Scott F Beers and William E Nagy. 2009. Syntactic complexity as a predictor of adolescent writing quality: Which measures? Which genre? Reading and Writing, 22(2):185–200.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language GANs falling short. In NeurIPS Workshop on Critiquing and Correcting Trends in Machine Learning.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Association for Computational Linguistics.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference on Machine Learning.

Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. 2019. GLTR: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Molly E Ireland and James W Pennebaker. 2010. Language style matching in writing: Synchrony in essays, correspondence, and poetry. Journal of Personality and Social Psychology, 99(3):549.

Shaojie Jiang and Maarten de Rijke. 2018. Why are sequence-to-sequence models so dull? Understanding the low-diversity problem of chatbots. In EMNLP Workshop on Search-Oriented Conversational AI.

J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, Fog count and Flesch reading ease formula) for navy enlisted personnel.

Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Jiwei Li and Dan Jurafsky. 2016. Mutual information and diverse decoding improve neural machine translation. arXiv preprint arXiv:1601.00372.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Empirical Methods in Natural Language Processing.

Huanru Henry Mao, Bodhisattwa Prasad Majumder, Julian McAuley, and Garrison W. Cottrell. 2019. Improving neural story generation by targeted common sense grounding. In Empirical Methods in Natural Language Processing.

Danielle S McNamara, Scott A Crossley, and Philip M McCarthy. 2010. Linguistic features of writing quality. Written Communication, 27(1):57–86.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Empirical Methods in Natural Language Processing.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. Fairseq: A fast, extensible toolkit for sequence modeling. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI tech report.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI tech report.

Melissa Roemmele, Andrew S Gordon, and Reid Swanson. 2017. Evaluating story generation systems using automated linguistic analyses. In KDD Workshop on Machine Learning for Creativity.

Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI Conference on Artificial Intelligence.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Association for Computational Linguistics.

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. 2018. Cold fusion: Training seq2seq models together with language models. In Proc. Interspeech 2018, pages 387–391.

Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search: Decoding diverse solutions from neural sequence models. In AAAI Conference on Artificial Intelligence.

Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2018. Learning to control the specificity in neural response generation. In Association for Computational Linguistics.

Zachary M Ziegler, Luke Melas-Kyriazi, Sebastian Gehrmann, and Alexander M Rush. 2019. Encoder-agnostic adaptation for conditional language generation. arXiv preprint arXiv:1908.06938.
Appendix

[Figure 6: n-gram similarity between prompt and story, for n = 1, 2, 3, for both models and all k. Panels: (a) percent of all story unigrams that are in the prompt; (b) percent of all story bigrams that are in the prompt; (c) percent of all story trigrams that are in the prompt. GPT2-117 copies many more n-grams from the prompt than the Fusion Model. See Section 4 for discussion.]
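
A sketch of the n-gram overlap measurement plotted in Figure 6 (whitespace tokenization and lowercasing are assumptions; the paper does not specify the exact matching details):

```python
def ngrams(tokens, n):
    """All length-n subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def story_prompt_overlap(prompt, story, n):
    """Percent of story n-grams that also appear in the prompt."""
    prompt_ngrams = set(ngrams(prompt.lower().split(), n))
    story_ngrams = ngrams(story.lower().split(), n)
    if not story_ngrams:
        return 0.0
    matches = sum(1 for g in story_ngrams if g in prompt_ngrams)
    return 100.0 * matches / len(story_ngrams)
```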

[Figure 7: Prompt entity usage rate (left) and mean number of unique named entities in the story (right), for both models and all k. Panels: (a) the proportion of all prompt named entities that are used in the story; (b) the number of unique named entities that appear in the story. GPT2-117 generally uses a larger proportion of the prompt named entities, and more named entities overall, than the Fusion Model. Both models generally use fewer named entities than human text when k is less than vocabulary size. See Section 4 for discussion.]
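
A sketch of the prompt entity usage rate plotted in Figure 7a, using spaCy's default named entity recognizer; the string-containment test used to decide whether a prompt entity appears in the story is an assumption about the matching rule, which the paper does not spell out:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def prompt_entity_usage_rate(prompt, story):
    """Fraction of named entities in the prompt whose text also appears in the story.
    Returns 0.0 when the prompt contains no named entities."""
    prompt_ents = {ent.text for ent in nlp(prompt).ents}
    if not prompt_ents:
        return 0.0
    story_text = story.lower()
    used = sum(1 for ent in prompt_ents if ent.lower() in story_text)
    return used / len(prompt_ents)
```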

[Figure 8: Mean sentence length (in words) for both models and all k. For both models, sentence length increases as k increases. The spike at k = 1 is due to long repeating sequences with no sentence-ending token. See Section 7 for discussion.]

[Figure 9: Distinct-n for n = 1, 2, 3, for both models and all k. Panels: (a) distinct-1 (ratio of unique unigrams in the story to total generated unigrams); (b) distinct-2 (same ratio for bigrams); (c) distinct-3 (same ratio for trigrams). The ratios, which represent lexical diversity, increase as k increases, with GPT2-117 reaching human levels at k = 2000 for unigrams, k = 800 for bigrams and k = 600 for trigrams. Lexical diversity is slightly higher for GPT2-117 than for the Fusion Model for equal k, but the primary determining factor is k. See Section 6 for discussion.]

[Figure 10: POS tag distinct-n for n = 1, 2, 3, for both models and all k. Panels: (a) POS tag distinct-1 (ratio of unique POS unigrams to total POS unigrams); (b) POS tag distinct-2; (c) POS tag distinct-3. The ratios, which represent syntactic diversity, increase as k increases, with GPT2-117 reaching human levels at k = 6000 for unigrams, k = 9000 for bigrams, and k = 6000 for trigrams. Syntactic diversity is slightly higher for GPT2-117 than for the Fusion Model for equal k, but the primary determining factor is k. See Section 7 for discussion.]
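
The POS-tag diversity plotted in Figure 10 can be obtained by tagging the story and applying the same distinct-n statistic used for words; a sketch using spaCy's coarse POS tags (the choice of tagger is an assumption):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_distinct_n(text, n):
    """distinct-n computed over the story's part-of-speech tag sequence."""
    tags = [tok.pos_ for tok in nlp(text) if not tok.is_space]
    ngrams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```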

[Figure 11: Usage of different POS tags (verb, noun, adjective, adverb, pronoun, determiner, numeral, proper noun) in the generated stories, for both models and all k. GPT2-117 tends to fit the human distribution more closely than the Fusion Model as k approaches vocabulary size, in particular producing more specific POS categories such as Numeral and Proper Noun. When k is small, generated text is characterized by more verbs and pronouns, and fewer nouns, adjectives, numerals and proper nouns, than human text. See Section 7 for discussion.]

[Figure 12: Rare word usage metrics for both models and all k. Panels: (a) the mean log unigram probability of generated words (higher values indicate fewer rare words, lower values more rare words); (b) the percent of generated words that are stopwords, using the NLTK English stopword list. GPT2-117 produces slightly more rare words (left) and slightly fewer stopwords (right) than the Fusion Model, for equal values of k. These rareness metrics do not reach human levels until k is close to vocabulary size. See Section 6 for discussion.]
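
A sketch of the two rareness metrics plotted in Figure 12. The unigram distribution is assumed to be estimated from the WritingPrompts training set (as stated in footnote 10), and tokens unseen in that distribution are simply skipped, which is an assumption about how such cases are handled:

```python
import math
from collections import Counter

from nltk.corpus import stopwords   # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def unigram_log_probs(training_tokens):
    """Log unigram probabilities estimated from a list of training-set tokens."""
    counts = Counter(training_tokens)
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}

def rareness_metrics(story_tokens, log_unigram):
    """Mean log unigram probability and stopword fraction of a generated story."""
    log_ps = [log_unigram[w] for w in story_tokens if w in log_unigram]
    mean_log_p = sum(log_ps) / len(log_ps) if log_ps else float("nan")
    stopword_frac = sum(1 for w in story_tokens if w.lower() in STOPWORDS) / len(story_tokens)
    return mean_log_p, stopword_frac
```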

[Figure 13: The mean total log probability of the story (150 words), as measured by the models on their own generated output (for all k) and under teacher-forcing on human-written stories. Interestingly, the Fusion Model (left) converges to the same probability it assigns to human-written stories as k approaches vocabulary size, whereas GPT2-117 (right) converges to a lower probability. See Section 8 for discussion.]

Prompt: A man finally discovers his superpower... well into his 80's.

Fusion Model (k = 10): I be a child . bear into a world of darkness , and in the world , there be only a few people . My mother and I be always alone . She have never be like this . But , she have no idea what we be together to be . We be always alone . We be always together . And , of course , she have never exist . But , I never think about it , be cause she be always alone . In the world , it was n't like she have a power . I be always alone , and there be a time when the light would turn on . There be a time when I could see the light , and I could see it in her eye s , and I could see the light , and I could see it have no
Mean noun concreteness: 3.892; mean verb concreteness: 2.173

GPT2-117 (k = 10): A man in an old white coat sit in a corner . He wear a black jacket with a blue tie . His hair be grey. His eye s be black, but his voice be deep and clear. He look to where the man be sit , and he see a smile on his lip s. It be a smile he know he would see from his own eye s. But he be too late. He be on the sidewalk by the river when the man come . He be wear a black coat with a purple tie . He have a black tie and a white shirt . But he be still wear a white suit . And it seem he would look back at him. A smile on his face . A look his friend do n't recognize . He
Mean noun concreteness: 4.720; mean verb concreteness: 2.488

Fusion Model (k = 1000): For a brief moment , the dark green of his eye s flash the last line s of reality into existence , finally fade slowly into existence so that he could not feel it. Only the million s of pixel s float in his peripheral vision almost disappear . His radio respond to the pinging very hard silence of the outside world . Seven people have visit his main internal network for what seem like a lifetime but this time , the only switch to an external supply system that he could simply take advantage of. Unable to convey feeling s about the last word s he would have to endure , but it have respond to the innumerable message s and countless sleepless hour s. Most of them be always available on its surface , just to make sure. In his quest for to spend eternity on death , he send
Mean noun concreteness: 3.201; mean verb concreteness: 2.435

GPT2-117 (k = 1000): ( First time poster , hope its ok ) The young boy , watch tv , spot the television onscreen , before glance around to see the screen start the countdown on the tv , point to the screen in " It 's both the same. " "... let 's... let 's try this and... we will team up so that... we can ... have the same power ....like... so we can use this superpower over and over again. " A brief silence . Only a familiar conversation , interrupt his mad dash movement , follow with his high pitch slurred and wither voice : " I ca n't stand anyone talk like that son*s*. " More casual conversation that interrupt his childish step be rush to the scissor s.
Mean noun concreteness: 3.793; mean verb concreteness: 3.162

Table 3: Generated stories from both models, under k = 10 and k = 1000, with the mean noun and verb concreteness of each story. For equal k, GPT2-117 generally generates more concrete words than the Fusion Model. For both models, low k is characterized by high noun concreteness (e.g. physical objects such as jacket) and low verb concreteness (e.g. non-physical actions such as be). Conversely, high k is characterized by low noun concreteness (e.g. abstract concepts such as reality) and high verb concreteness (e.g. physical actions such as talk). See Section 9 for discussion.

[Figure 14: Mean concreteness rating (1-5) of the nouns (left) and verbs (right) in the story, for both models and all k. GPT2-117 generally produces nouns and verbs that are more concrete than the Fusion Model for the same k. For both models, as k increases, noun concreteness reduces and verb concreteness increases. See Section 9 for discussion.]

Prompt: In an alternative reality where sleep is non-existent among living beings, our protagonist (spontaneously or after an event) falls asleep in which he/she experiences for the first time in human history what a dream is.

GPT2-117 (k = 1000): I sat in my bed as my girlfriend sat behind me, buzzing into her e-reader, letting the day's stories write themselves on her's monitor. Like a blur, all the usual high-asyllabic drivel and senseless ramblings that normally attracted the attention of a horrid sleeping creature huddled about me like a faucet. She did not know how I placed this car with her. But I tried, first tried to ignore that I had hired the services of a dog to help and then quietly used it in a desperate bid to drive the car through the lawn. Each and every day, I watched her drool down an old dusty hardwood mattress her beady eyes trying desperately to think of this rotting dream. [...]

Table 4: An example of syntactically natural but nonsensical text, generated with high k. Though the text is fluent and readable, it is surreal and bizarre. See Section 10 for discussion.
