2023 - Fidelity-Enriched Contrastive Search
Abstract

In this paper, we address the hallucination problem commonly found in natural language generation tasks. Language models often generate fluent and convincing content but lack consistency with the provided source, resulting in potential inaccuracies. We propose a new decoding method called Fidelity-Enriched Contrastive Search (FECS), which augments the Contrastive Search framework with context-aware regularization terms. FECS promotes tokens that are semantically similar to the provided source while penalizing repetitiveness in the generated text. We demonstrate its effectiveness across two tasks prone to hallucination: abstractive summarization and dialogue generation. Results show that FECS consistently enhances faithfulness across various language model sizes while maintaining output diversity comparable to well-performing decoding algorithms.1

Figure 1: Results on CNN-DailyMail show our proposed FECS mitigates hallucination (i.e., improves factuality) while maintaining diversity of the generated summarization. (The plot shows Factuality against Diversity for greedy, beam, nucleus, contrastive, and FECS (ours) at the 1.3B, 2.7B, and 6.7B scales.)

1 Introduction

Language models (LMs) have achieved remarkable success in generating human-like text, fostering advancements across numerous Natural Language Processing (NLP) applications. Despite the fluent and seemingly convincing outputs produced by LMs, these models can occasionally generate content that is factually inconsistent with the provided source (Koehn and Knowles, 2017; Rohrbach et al., 2018; Raunak et al., 2021), an issue known as the hallucination problem (Maynez et al., 2020; Ji et al., 2023). Methods to mitigate hallucination have been explored from various facets, including data perspectives (Wang, 2019; Filippova, 2020; Shuster et al., 2021), model architectures (Cao et al., 2018; Aralikatte et al., 2021; Xiao and Wang, 2021), and training strategies (Huang et al., 2020; Chen et al., 2021; Li et al., 2021). In this work, we turn to a less investigated lens, decoding, to improve faithfulness,2 and introduce a novel decoding method named Fidelity-Enriched Contrastive Search (FECS).

Decoding algorithms can be categorized into deterministic and stochastic groups. Deterministic methods such as beam search and greedy decoding aim to generate the most probable text continuations. While these methods might appear to be less unfaithful, they often suffer from degeneration; that is, the outputs are uninformative, monotonous, or repetitive (Li et al., 2016; Holtzman et al., 2019; Welleck et al., 2019). Conversely, stochastic methods such as top-k (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) inject randomness into the generation process, thereby promoting diversity. Yet, these sampling-based approaches often come at the cost of coherency and semantic consistency (Basu et al., 2020; Su et al., 2022; Su and Collier, 2023), where increasing the output diversity positively correlates with hallucinating (Dziri et al., 2021).
* Work done during an internship at AIST.
1 https://github.com/ntunlplab/FECS
2 We follow (Ji et al., 2023) and refer to faithfulness as an antonym to hallucination, i.e., maximizing faithfulness equals minimizing hallucination.
To reconcile this faithfulness-diversity trade-off, we propose FECS, a simple yet effective decoding strategy which extends the Contrastive Search framework (Su et al., 2022) and introduces context-aware regularization terms to enhance faithfulness and penalize degeneration. Specifically, a candidate token which exhibits (1) a high semantic similarity with tokens from the provided source and (2) a low semantic similarity with previously generated tokens is rewarded with a higher score to promote its selection. Importantly, FECS can be readily applied to existing LMs off-the-shelf, without requiring further training.

We evaluate FECS on two tasks particularly prone to text hallucination: abstractive summarization and dialogue generation (Ji et al., 2023). Experimental results show that FECS consistently improves faithfulness across various LM sizes while preserving a level of diversity comparable to predominant decoding algorithms.

2 Methodology

In this section, we present preliminary information on Contrastive Search (Su et al., 2022) before detailing our proposed FECS.

2.1 Preliminary

To address shortcomings in existing decoding methods, Su et al. (2022) propose Contrastive Search, a new decoding approach capable of generating diverse content without compromising coherency. At time step t, given an input x0:c+t, where x0:c signifies the prefix context and xc:c+t represents the previously generated tokens, Contrastive Search generates the next token xc+t via the following formula:

$$x_{c+t} = \arg\max_{v \in V^{(k)}} \Big\{ (1-\alpha) \times \underbrace{p_\theta(v \mid x_{0:c+t})}_{\text{model confidence}} \;-\; \alpha \times \underbrace{\max_{c \le j \le c+t-1} \mathrm{sim}(h_v, h_{x_j})}_{\text{degeneration penalty}} \Big\}$$

Here, V(k) denotes the set of k candidate tokens with the top-k probability from the model's prediction distribution pθ(·|x0:c+t). The model confidence term represents the probability of the candidate token v, while the degeneration penalty term signifies the maximum value of the cosine similarity sim(·, ·) between candidate token v and all previously generated tokens {xc, ..., xc+t−1}. Specifically, sim(·, ·) employs the token representations hxi and hv from the model's last hidden state, calculated by appending v to x0:c+t as model input. α serves as a pre-determined, non-negative hyper-parameter; when α equals 0, Contrastive Search reduces to greedy decoding. Essentially, Contrastive Search preserves coherence by choosing outputs from the top-k probable candidates while also curbing degeneration behaviors such as repetitions, thereby promoting diversity.
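To make the selection rule concrete, the step above can be sketched in PyTorch-style Python. This is an illustrative re-implementation of the published formula rather than the released code; the tensor shapes and helper names (e.g., candidate_hidden, context_hidden) are our own assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_search_step(logits, candidate_hidden, context_hidden, k=4, alpha=0.6):
    """One Contrastive Search step (illustrative sketch of Su et al., 2022).

    logits:           (vocab_size,) next-token logits given the prefix x_{0:c+t}
    candidate_hidden: (k, d) last-layer hidden state of each top-k candidate,
                      ordered by probability and obtained by appending the
                      candidate to the prefix
    context_hidden:   (t, d) hidden states of the previously generated tokens,
                      assumed non-empty here
    """
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(k)                 # model confidence over V^(k)

    # Degeneration penalty: max cosine similarity to previously generated tokens.
    cand = F.normalize(candidate_hidden, dim=-1)       # (k, d)
    ctx = F.normalize(context_hidden, dim=-1)          # (t, d)
    penalty = (cand @ ctx.T).max(dim=-1).values        # (k,)

    scores = (1 - alpha) * top_probs - alpha * penalty
    return top_ids[scores.argmax()]                    # selected token x_{c+t}
```

With alpha = 0 the penalty term vanishes and the rule falls back to greedy decoding, matching the description above.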
2.2 Fidelity-Enriched Contrastive Search

Motivated by Contrastive Search, we extend this framework by integrating a faithfulness term that encourages factuality and reduces hallucination. Using the notations from Section 2.1, we define FECS as follows. Consider an input x0:c+t at time step t, where x0:c represents the prefix context and xc:c+t denotes the previously generated tokens. We further decompose x0:c into (1) the prompts x0:s and (2) the provided source xs:c, which the output is expected to remain faithful to. FECS generates the next token xc+t via the following formula:

$$x_{c+t} = \arg\max_{v \in V^{(k)}} \Big\{ (1-\alpha-\beta) \times \underbrace{p_\theta(v \mid x_{0:c+t})}_{\text{model confidence}} \;-\; \alpha \times \underbrace{\max_{c \le i \le c+t-1} \mathrm{sim}(h_v, h_{x_i})}_{\text{degeneration penalty}} \;+\; \beta \times \underbrace{\max_{s \le j \le c-1} \mathrm{sim}(h_v, h_{x_j})}_{\text{faithfulness reward}} \Big\}$$

The newly introduced faithfulness term rewards candidate tokens exhibiting high semantic similarity to tokens in the source content. Specifically, the faithfulness term denotes the maximum value of the cosine similarity sim(·, ·) between the candidate token v and all source tokens {xs, ..., xc−1}. Here, β is also a pre-determined, non-negative hyper-parameter.
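In code, the difference from the Contrastive Search sketch in Section 2.1 is a single additional similarity term computed against the source span xs:c. The following is again a minimal sketch under our own assumptions about how hidden states are gathered, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def fecs_step(logits, candidate_hidden, generated_hidden, source_hidden,
              k=4, alpha=0.3, beta=0.3):
    """One FECS step (illustrative sketch of the formula above).

    candidate_hidden: (k, d) hidden states of the top-k candidates, in probability order
    generated_hidden: (t, d) hidden states of previously generated tokens x_{c:c+t}
    source_hidden:    (c-s, d) hidden states of the provided source x_{s:c}
    """
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(k)

    cand = F.normalize(candidate_hidden, dim=-1)
    # Faithfulness reward: max cosine similarity to any source token.
    faith = (cand @ F.normalize(source_hidden, dim=-1).T).max(dim=-1).values
    # Degeneration penalty: max cosine similarity to anything generated so far.
    if generated_hidden.numel() > 0:
        degen = (cand @ F.normalize(generated_hidden, dim=-1).T).max(dim=-1).values
    else:
        degen = torch.zeros_like(top_probs)  # first step: nothing generated yet

    scores = (1 - alpha - beta) * top_probs - alpha * degen + beta * faith
    return top_ids[scores.argmax()]
```

Setting beta = 0 recovers Contrastive Search, and setting alpha = beta = 0 recovers greedy decoding, mirroring how the hyper-parameters are described above.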
                           CNN-DM                                   WoW
Model Size   Method        R-1    R-2    R-L    BERTSc.  FEQA       B-4   R-L    BERTSc.  Q2
1.3B         Greedy        27.89  12.14  20.37  86.54    32.38      3.76  11.44  74.40    24.37
             Beam          28.10  14.14  20.35  84.34    23.59      7.65  17.33  76.51    36.10
             Nucleus       20.58   5.25  13.82  84.34    15.54      1.54  10.72  72.27    12.97
             Contrastive   30.06  11.74  20.80  86.70    32.73      4.50  15.89  74.57    25.42
             FECS (ours)   30.06  13.07  21.80  87.02    39.87      5.37  14.73  77.59    32.08
2.7B         Greedy        28.61  12.15  20.99  86.81    37.78      4.14  13.33  70.71    26.39
             Beam          28.83  14.28  20.71  86.63    20.89      7.64  18.79  76.58    41.26
             Nucleus       24.48   7.14  16.73  85.62    22.62      1.46  11.19  72.19    12.60
             Contrastive   30.33  12.17  21.38  87.08    38.38      3.80  16.32  73.63    27.52
             FECS (ours)   28.74  12.56  21.45  87.49    45.75      9.32  22.42  75.27    45.10
6.7B / 6B    Greedy        33.77  14.59  23.95  87.47    42.46      0.27   4.48  67.79     7.14
             Beam          29.99  14.77  21.18  86.70    24.59      0.15   4.46  74.86     9.15
             Nucleus       27.14   8.11  17.93  85.96    22.75      1.31   9.06  71.21    13.22
             Contrastive   33.45  13.08  23.07  87.33    40.75      0.87   9.89  72.60    14.13
             FECS (ours)   34.80  15.08  24.86  87.75    52.01      2.48  10.32  75.03    23.12
Table 1: Experimental results comparing FECS with other decoding methods across model scales.
3 Experimental Setup

3.1 Datasets, Models, and Configurations

We evaluate our method, FECS, on two tasks known for their susceptibility to hallucination issues: abstractive summarization and dialogue generation. For the abstractive summarization task, we adopt the CNN-DailyMail (CNN-DM) dataset (Nallapati et al., 2016), a widely-used benchmark in several recent studies (Dong et al., 2020; Cao and Wang, 2021; Cao et al., 2020). The dialogue generation task employs the popular Wizard of Wikipedia (WoW) dataset (Dinan et al., 2018). The objective here is to generate responses based on given knowledge snippets, taken from Wikipedia, that are pertinent to the conversation topic.

For the abstractive summarization experiments, we adopt OPT (Zhang et al., 2022) at three scales: 1.3B, 2.7B, and 6.7B. For dialogue generation, we follow the Few-Shot Bot approach (Madotto et al., 2021), using GPT-Neo 1.3B and 2.7B (Black et al., 2021), along with GPT-J 6B (Wang and Komatsuzaki, 2021). All experiments are conducted with few-shot prompting, using two shots.3 We compare FECS with Contrastive Search, Greedy Decoding, Beam Search, and Nucleus Sampling. For Beam Search, we set the beam size to 4; for Nucleus Sampling, p = 0.95; and for Contrastive Search, (k, α) = (4, 0.6). For FECS, we retain the same α value as Contrastive Search, setting (k, α, β) = (4, 0.3, 0.3) without hyper-parameter tuning.
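For reference, the baseline settings above map directly onto standard Hugging Face transformers generation arguments; FECS itself is not part of that API, so its β-weighted faithfulness reward has to come from the released code (footnote 1) or a custom decoding loop such as the sketch in Section 2.2. The snippet below is only an illustrative mapping of the baseline hyper-parameters; the model choice, prompt handling, and generation length are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"   # one of the OPT scales used for CNN-DM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "..."                     # two-shot prompt followed by the test article
inputs = tok(prompt, return_tensors="pt")

greedy = model.generate(**inputs, do_sample=False, max_new_tokens=128)
beam = model.generate(**inputs, num_beams=4, max_new_tokens=128)        # beam size 4
nucleus = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=0,
                         max_new_tokens=128)                            # p = 0.95
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4,
                             max_new_tokens=128)                        # (k, alpha) = (4, 0.6)
```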
BLEU-4 (Papineni et al., 2002). In addition, we also report BERTScore (Zhang et al., 2019) on both tasks for a more advanced soft metric.

Faithfulness Metrics. To measure factuality in summarization, we use FEQA (Durmus et al., 2020), following prior studies (Aralikatte et al., 2021; Chen et al., 2021). Higher FEQA scores indicate greater faithfulness of the summary to the source article. For evaluating dialogue, we employ Q2 (Honovich et al., 2021), a question-answering (QA) based metric designed for assessing factual consistency in knowledge-grounded dialogue generation. Both FEQA and Q2 exhibit strong correlations with human judgments.

Diversity Metric. For both summarization and dialogue tasks, we evaluate the diversity of the generated text x by calculating

$$\mathrm{diversity}(x) = \prod_{n=2}^{4} \Big(1.0 - \frac{\text{Rep-}n(x)}{100}\Big)$$
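The term Rep-n(x) is not defined in this excerpt; the sketch below assumes the standard formulation from Su et al. (2022), where Rep-n(x) = 100 × (1 − |unique n-grams| / |total n-grams|), and the tables appear to report the resulting value on a 0-100 scale.

```python
def rep_n(tokens, n):
    """Percentage of repeated n-grams: 100 * (1 - unique n-grams / total n-grams)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

def diversity(tokens):
    """diversity(x) = product over n = 2..4 of (1 - Rep-n(x) / 100)."""
    score = 1.0
    for n in range(2, 5):
        score *= 1.0 - rep_n(tokens, n) / 100.0
    return score

# Repetition lowers the score; fully varied text scores 1.0.
print(diversity("the cat sat on the mat the cat sat on the mat".split()))  # about 0.22
print(diversity("the quick brown fox jumps over the lazy dog".split()))    # 1.0
```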
              CNN-DM                    WoW
Method        1.3B   2.7B   6.7B       1.3B   2.7B   6B
Greedy        1.32   2.66   2.42       1.79   2.58   3.84
Beam          3.32   5.73   5.15       2.41   3.41   4.76
Nucleus       1.31   2.52   2.34       1.78   2.69   3.79
Contrastive   3.55   6.47   6.53       2.84   4.34   5.27
FECS (ours)   4.20   7.47   8.16       2.91   4.29   5.28

Table 4: The averaged decoding speed (sec) per instance using different decoding methods across model scales. As observed, FECS is comparable to Contrastive Search.

             Contrastive Search (α)            FECS (α, β)
Metric       0.6     0.4     0.2     0.0       (0.3, 0.3)
R-1          33.45   34.14   33.92   33.77     34.80
R-2          13.08   14.17   14.43   14.59     15.08
R-L          23.07   23.91   23.97   23.95     24.86
Diversity    94.21   90.13   88.07   83.57     93.18
FEQA         40.75   41.12   42.37   42.46     52.01

Table 5: Comparison of FECS and Contrastive Search with different values of α.
4.3 Analysis

Latency. To assess the decoding latency of our proposed FECS objective, we report the average decoding time (sec) per instance in Table 4. The results are averaged across 100 randomly selected instances. As observed in both the dialogue generation and abstractive summarization tasks, FECS and Contrastive Search perform comparably, and both are slightly slower than beam search. Greedy and nucleus are the fastest.
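The paper does not specify its exact timing harness, so the wrapper below is only an illustrative sketch of how such per-instance latency numbers can be collected.

```python
import time

def avg_decoding_seconds(decode_fn, instances):
    """Average wall-clock decoding time (seconds) per instance.

    decode_fn: callable running one full decoding pass for one instance,
               e.g. a wrapper around model.generate(...) with a fixed method.
    instances: the evaluation inputs (here, 100 randomly sampled instances).
    """
    total = 0.0
    for inst in instances:
        start = time.perf_counter()
        decode_fn(inst)
        total += time.perf_counter() - start
    return total / len(instances)
```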
The role of α. To establish a more comprehensive baseline, we evaluate FECS against Contrastive Search with different values of α on the 6.7B model. Intuitively, a smaller α value (i.e., a lower degree of diversity) might contribute to more factual outputs. However, as shown in Table 5, lowering α only improves faithfulness marginally, with essentially the same ROUGE scores. In contrast, FECS retains a high level of diversity and achieves superior performance on both FEQA and the standard metrics, indicating the effectiveness of our newly introduced β term.

5 Human Evaluation

In addition to the automatic evaluation, we also perform human evaluation to assess the faithfulness of our proposed FECS on the abstractive summarization task. We compare FECS against Contrastive Search and ask annotators to vote for the response they consider more faithful to the provided source (i.e., the text to be summarized). Specifically, we randomly sample 20 instances for each of the three model sizes, for a total of 60 instances. More details, including the full evaluation protocol, are provided in Appendix A.2. We present the results in Figure 2. As observed, FECS shows superior results, recording more than 60% of the votes and receiving more than twice as many votes as Contrastive Search. The results support the outcome of the automatic evaluation, suggesting that our proposed FECS generates content that is more faithful to the provided source.

Figure 2: Human evaluation results comparing the faithfulness of FECS against Contrastive Search (CS) on the abstractive summarization task. FECS outperforms Contrastive Search, receiving more than twice the votes.

6 Conclusion

This paper introduces a novel decoding approach, Fidelity-Enriched Contrastive Search (FECS), designed to enhance faithfulness in text generation. Our experimental results on abstractive summarization and dialogue generation demonstrate the efficacy of FECS: it consistently improves faithfulness across various LM scales while preserving a level of diversity comparable to other leading decoding algorithms. Particularly when larger LMs are used, it notably enhances faithfulness with only a minor impact on diversity. This indicates that FECS performs effectively when larger LMs are employed in dialogue generation tasks. In the future, we plan to explore how FECS performs with different kinds of source content, including erroneous or ambiguous inputs.

Limitations
Firstly, while FECS presents an improvement in the faithfulness-diversity trade-off, its performance could be influenced by the quality of the source content. The assumption that source content is always correct and complete may not hold true in all scenarios, particularly in cases where the input data is ambiguous, incomplete, or erroneous. Secondly, the faithfulness assessment is primarily quantitative, based on the established FEQA and Q2 metrics. Although these metrics provide an essential standard for comparing models, they may not capture all nuanced aspects of faithfulness, such as the preservation of subtle implications or subjective information.

Acknowledgments

We thank the reviewers for their insightful comments. This research was supported by JSPS KAKENHI Grant Number 23K16956 and a project JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work was also partially supported by the National Science and Technology Council, Taiwan, under grants MOST 110-2221-E-002-128-MY3, 110-2634-F-002-050-, and NSTC 111-2634-F-002-023-, and the Ministry of Education (MOE) in Taiwan, under grant NTU-112L900901.

References

Rahul Aralikatte, Shashi Narayan, Joshua Maynez, Sascha Rothe, and Ryan McDonald. 2021. Focus attention: Promoting faithfulness and diversity in summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6078–6095, Online. Association for Computational Linguistics.

Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, and Lav R. Varshney. 2020. Mirostat: A neural text decoding algorithm that directly controls perplexity. arXiv preprint arXiv:2007.14966.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow.

Meng Cao, Yue Dong, Jiapeng Wu, and Jackie Chi Kit Cheung. 2020. Factual error correction for abstractive summarization models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6251–6258, Online. Association for Computational Linguistics.

Shuyang Cao and Lu Wang. 2021. CLIFF: Contrastive learning for improving faithfulness and factuality in abstractive summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6633–6649, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Sihao Chen, Fan Zhang, Kazoo Sone, and Dan Roth. 2021. Improving faithfulness in abstractive summarization with contrast candidate generation and selection. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5935–5941, Online. Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.

Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, and Jingjing Liu. 2020. Multi-fact correction in abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9320–9331, Online. Association for Computational Linguistics.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Nouha Dziri, Andrea Madotto, Osmar Zaïane, and Avishek Joey Bose. 2021. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2197–2214, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Katja Filippova. 2020. Controlled hallucinations: Learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 864–870, Online. Association for Computational Linguistics.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. 2021. Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7856–7870, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Luyang Huang, Lingfei Wu, and Lu Wang. 2020. Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5094–5107, Online. Association for Computational Linguistics.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Chenliang Li, Bin Bi, Ming Yan, Wei Wang, and Songfang Huang. 2021. Addressing semantic drift in generative question answering with auxiliary extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 942–947, Online. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Andrea Madotto, Zhaojiang Lin, Genta Indra Winata, and Pascale Fung. 2021. Few-shot bot: Prompt-based learning for dialogue systems. arXiv preprint arXiv:2110.08118.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183, Online. Association for Computational Linguistics.

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium. Association for Computational Linguistics.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yixuan Su and Nigel Collier. 2023. Contrastive search is what you need for neural text generation. Transactions on Machine Learning Research.

Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. 2022. A contrastive framework for neural text generation. In Advances in Neural Information Processing Systems.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model.

Hongmin Wang. 2019. Revisiting challenges in data-to-text generation with fact grounding. In Proceedings of the 12th International Conference on Natural Language Generation, pages 311–322, Tokyo, Japan. Association for Computational Linguistics.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
Yijun Xiao and William Yang Wang. 2021. On hallucination and predictive uncertainty in conditional language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2734–2744, Online. Association for Computational Linguistics.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Prompt for Dialogue Generation

Topic: Green Eggs and Ham
System: I have, and we made green eggs and ham for the kids when I did. Dr. Seuss wrote it.
User: yeah. nice! what other cool facts can you tell me?
We know that: Green Eggs and Ham. As of 2016, the book has sold 8 million copies worldwide.
System replies: It has sold 8 million copies in many languages. Hebrew is one because I bought it as a gift in that one.

Topic: Neil Brooks
System: Yes, I do. Have you heard of Neil Brooks. He is a sprint freestyle swimmer that won the 100 m medley relay at the 1980 Olympics in Moscow
User: I have never heard of him but he sounds like he was a very good swimmer.

Article:
Tiger Woods will be wondering if he can ever catch a break after suffering a bizarre injury on the ninth hole at the Masters on Sunday. [...] this was Woods' best finish in over a year.
Summarization:

Table 6: The evaluation results of repetition and diversity on FECS and other decoding methods across model scales.

Given two summaries (Summary_A and Summary_B), you should determine which one is more faithful to the provided Source, and fill in "A" or "B" in the Faithful column.
2. The summary contains information which can not be supported by the source.
○ If there is a tie, choose the one with less information that can not be supported by the source.

Figure 5: The human evaluation protocol for the abstractive summarization task.