2023 - Fidelity-Enriched Contrastive Search
Abstract

In this paper, we address the hallucination problem commonly found in natural language generation tasks. Language models often generate fluent and convincing content but lack consistency with the provided source, resulting in potential inaccuracies. We propose a new decoding method called Fidelity-Enriched Contrastive Search (FECS), which augments the Contrastive Search framework with context-aware regularization terms. FECS promotes tokens that are semantically similar to the provided source while penalizing repetitiveness in the generated text. We demonstrate its effectiveness across two tasks prone to hallucination: abstractive summarization and dialogue generation. Results show that FECS consistently enhances faithfulness across various language model sizes while maintaining output diversity comparable to well-performing decoding algorithms.1

Figure 1: Results on CNN-DailyMail show our proposed FECS mitigates hallucination (i.e., improves factuality) while maintaining diversity of the generated summarization. (The plot shows Factuality against Diversity for greedy, beam, nucleus, contrastive, and FECS (ours) at the 1.3B, 2.7B, and 6.7B scales.)

1 Introduction

Language models (LMs) have achieved remarkable success in generating human-like text, fostering advancements across numerous Natural Language Processing (NLP) applications. Despite the fluent and seemingly convincing outputs produced by LMs, these models can occasionally generate content that is factually inconsistent with the provided source (Koehn and Knowles, 2017; Rohrbach et al., 2018; Raunak et al., 2021), an issue known as the hallucination problem (Maynez et al., 2020; Ji et al., 2023). Methods to mitigate hallucination have been explored from various facets, including data perspectives (Wang, 2019; Filippova, 2020; Shuster et al., 2021), model architectures (Cao et al., 2018; Aralikatte et al., 2021; Xiao and Wang, 2021), and training strategies (Huang et al., 2020; Chen et al., 2021; Li et al., 2021). In this work, we turn to a less investigated lens, decoding, to improve faithfulness,2 and introduce a novel decoding method named Fidelity-Enriched Contrastive Search (FECS).

Decoding algorithms can be categorized into deterministic and stochastic groups. Deterministic methods such as beam search and greedy decoding aim to generate the most probable text continuations. While these methods might appear to be less unfaithful, they often suffer from degeneration; that is, the outputs are uninformative, monotonous, or repetitive (Li et al., 2016; Holtzman et al., 2019; Welleck et al., 2019). Conversely, stochastic methods such as top-k (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) inject randomness into the generation process, thereby promoting diversity. Yet, these sampling-based approaches often come at the cost of coherency and semantic consistency (Basu et al., 2020; Su et al., 2022; Su and Collier, 2023), where increasing the output diversity positively correlates with hallucinating (Dziri et al., 2021).
* Work done during an internship at AIST.
1 https://github.com/ntunlplab/FECS
2 We follow (Ji et al., 2023) and refer to faithfulness as an antonym to hallucination, i.e., maximizing faithfulness equals minimizing hallucination.
To reconcile this faithfulness-diversity trade-off, we propose FECS, a simple yet effective decoding strategy which extends the Contrastive Search framework (Su et al., 2022) and introduces context-aware regularization terms to enhance faithfulness and penalize degeneration. Specifically, a candidate token which exhibits (1) a high semantic similarity with tokens from the provided source and (2) a low semantic similarity with previously generated tokens is rewarded with a higher score to promote its selection. Importantly, FECS can be readily applied to existing LMs off-the-shelf, without requiring further training.

We evaluate FECS on two tasks particularly prone to text hallucination: abstractive summarization and dialogue generation (Ji et al., 2023). Experimental results show that FECS consistently improves faithfulness across various LM sizes while preserving a level of diversity comparable to predominant decoding algorithms.

2 Methodology

In this section, we present preliminary information on Contrastive Search (Su et al., 2022) before detailing our proposed FECS.

2.1 Preliminary

To address shortcomings in existing decoding methods, Su et al. (2022) propose Contrastive Search, a new decoding approach capable of generating diverse content without compromising coherency. At time step t, given an input x0:c+t, where x0:c signifies the prefix context and xc:c+t represents the previously generated tokens, Contrastive Search generates the next token xc+t via the following formula:

$$x_{c+t} = \arg\max_{v \in V^{(k)}} \Big\{ (1-\alpha) \times \underbrace{p_\theta(v \mid x_{0:c+t})}_{\text{model confidence}} \;-\; \alpha \times \underbrace{\max_{c \le j \le c+t-1} \mathrm{sim}(h_v, h_{x_j})}_{\text{degeneration penalty}} \Big\}$$

Here, V(k) denotes the set of k candidate tokens with the top-k probability from the model's prediction distribution pθ(·|x0:c+t). The model confidence term represents the probability of the candidate token v, while the degeneration penalty term signifies the maximum value of the cosine similarity sim(·, ·) between candidate token v and all previously generated tokens {xc, ..., xc+t−1}. Specifically, sim(·, ·) employs the token representations hxi and hv from the model's last hidden state, calculated by appending v to x0:c+t as model input. α serves as a pre-determined, non-negative hyper-parameter; when α equals 0, Contrastive Search reduces to greedy decoding. Essentially, Contrastive Search preserves coherence by choosing outputs from the top-k probable candidates while also curbing degeneration behaviors such as repetitions, thereby promoting diversity.
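To make the selection rule concrete, the step above can be sketched in PyTorch-style Python. This is an illustrative re-implementation of the published formula rather than the released code; the tensor shapes and helper names (e.g., candidate_hidden, context_hidden) are our own assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_search_step(logits, candidate_hidden, context_hidden, k=4, alpha=0.6):
    """One Contrastive Search step (illustrative sketch of Su et al., 2022).

    logits:           (vocab_size,) next-token logits given the prefix x_{0:c+t}
    candidate_hidden: (k, d) last-layer hidden state of each top-k candidate,
                      ordered by probability and obtained by appending the
                      candidate to the prefix
    context_hidden:   (t, d) hidden states of the previously generated tokens,
                      assumed non-empty here
    """
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(k)                 # model confidence over V^(k)

    # Degeneration penalty: max cosine similarity to previously generated tokens.
    cand = F.normalize(candidate_hidden, dim=-1)       # (k, d)
    ctx = F.normalize(context_hidden, dim=-1)          # (t, d)
    penalty = (cand @ ctx.T).max(dim=-1).values        # (k,)

    scores = (1 - alpha) * top_probs - alpha * penalty
    return top_ids[scores.argmax()]                    # selected token x_{c+t}
```

With alpha = 0 the penalty term vanishes and the rule falls back to greedy decoding, matching the description above.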
2.2 Fidelity-Enriched Contrastive Search

Motivated by Contrastive Search, we extend this framework by integrating a faithfulness term that encourages factuality and reduces hallucination. Using the notations from Section 2.1, we define FECS as follows. Consider an input x0:c+t at time step t, where x0:c represents the prefix context and xc:c+t denotes the previously generated tokens. We further decompose x0:c into (1) the prompts x0:s and (2) the provided source xs:c, which the output is expected to remain faithful to. FECS generates the next token xc+t via the following formula:

$$x_{c+t} = \arg\max_{v \in V^{(k)}} \Big\{ (1-\alpha-\beta) \times \underbrace{p_\theta(v \mid x_{0:c+t})}_{\text{model confidence}} \;-\; \alpha \times \underbrace{\max_{c \le i \le c+t-1} \mathrm{sim}(h_v, h_{x_i})}_{\text{degeneration penalty}} \;+\; \beta \times \underbrace{\max_{s \le j \le c-1} \mathrm{sim}(h_v, h_{x_j})}_{\text{faithfulness reward}} \Big\}$$

The newly introduced faithfulness term rewards candidate tokens exhibiting high semantic similarity to tokens in the source content. Specifically, the faithfulness term denotes the maximum value of the cosine similarity sim(·, ·) between the candidate token v and all source tokens {xs, ..., xc−1}. Here, β is also a pre-determined, non-negative hyper-parameter.
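In code, the difference from the Contrastive Search sketch in Section 2.1 is a single additional similarity term computed against the source span xs:c. The following is again a minimal sketch under our own assumptions about how hidden states are gathered, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def fecs_step(logits, candidate_hidden, generated_hidden, source_hidden,
              k=4, alpha=0.3, beta=0.3):
    """One FECS step (illustrative sketch of the formula above).

    candidate_hidden: (k, d) hidden states of the top-k candidates, in probability order
    generated_hidden: (t, d) hidden states of previously generated tokens x_{c:c+t}
    source_hidden:    (c-s, d) hidden states of the provided source x_{s:c}
    """
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(k)

    cand = F.normalize(candidate_hidden, dim=-1)
    # Faithfulness reward: max cosine similarity to any source token.
    faith = (cand @ F.normalize(source_hidden, dim=-1).T).max(dim=-1).values
    # Degeneration penalty: max cosine similarity to anything generated so far.
    if generated_hidden.numel() > 0:
        degen = (cand @ F.normalize(generated_hidden, dim=-1).T).max(dim=-1).values
    else:
        degen = torch.zeros_like(top_probs)  # first step: nothing generated yet

    scores = (1 - alpha - beta) * top_probs - alpha * degen + beta * faith
    return top_ids[scores.argmax()]
```

Setting beta = 0 recovers Contrastive Search, and setting alpha = beta = 0 recovers greedy decoding, mirroring how the hyper-parameters are described above.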
                           CNN-DM                                   WoW
Model Size   Method        R-1    R-2    R-L    BERTSc.  FEQA       B-4   R-L    BERTSc.  Q2
1.3B         Greedy        27.89  12.14  20.37  86.54    32.38      3.76  11.44  74.40    24.37
             Beam          28.10  14.14  20.35  84.34    23.59      7.65  17.33  76.51    36.10
             Nucleus       20.58   5.25  13.82  84.34    15.54      1.54  10.72  72.27    12.97
             Contrastive   30.06  11.74  20.80  86.70    32.73      4.50  15.89  74.57    25.42
             FECS (ours)   30.06  13.07  21.80  87.02    39.87      5.37  14.73  77.59    32.08
2.7B         Greedy        28.61  12.15  20.99  86.81    37.78      4.14  13.33  70.71    26.39
             Beam          28.83  14.28  20.71  86.63    20.89      7.64  18.79  76.58    41.26
             Nucleus       24.48   7.14  16.73  85.62    22.62      1.46  11.19  72.19    12.60
             Contrastive   30.33  12.17  21.38  87.08    38.38      3.80  16.32  73.63    27.52
             FECS (ours)   28.74  12.56  21.45  87.49    45.75      9.32  22.42  75.27    45.10
6.7B / 6B    Greedy        33.77  14.59  23.95  87.47    42.46      0.27   4.48  67.79     7.14
             Beam          29.99  14.77  21.18  86.70    24.59      0.15   4.46  74.86     9.15
             Nucleus       27.14   8.11  17.93  85.96    22.75      1.31   9.06  71.21    13.22
             Contrastive   33.45  13.08  23.07  87.33    40.75      0.87   9.89  72.60    14.13
             FECS (ours)   34.80  15.08  24.86  87.75    52.01      2.48  10.32  75.03    23.12
Table 1: Experimental results comparing FECS with other decoding methods across model scales.
3 Experimental Setup

3.1 Datasets, Models, and Configurations

We evaluate our method, FECS, on two tasks known for their susceptibility to hallucination issues: abstractive summarization and dialogue generation. For the abstractive summarization task, we adopt the CNN-DailyMail (CNN-DM) dataset (Nallapati et al., 2016), a widely-used benchmark in several recent studies (Dong et al., 2020; Cao and Wang, 2021; Cao et al., 2020). The dialogue generation task employs the popular Wizard of Wikipedia (WoW) dataset (Dinan et al., 2018). The objective here is to generate responses based on given knowledge snippets, taken from Wikipedia, that are pertinent to the conversation topic.

For the abstractive summarization experiments, we adopt OPT (Zhang et al., 2022) at three scales: 1.3B, 2.7B, and 6.7B. For dialogue generation, we follow the Few-Shot Bot approach (Madotto et al., 2021), using GPT-Neo 1.3B and 2.7B (Black et al., 2021), along with GPT-J 6B (Wang and Komatsuzaki, 2021). All experiments are conducted with few-shot prompting, using two shots.3 We compare FECS with Contrastive Search, Greedy Decoding, Beam Search, and Nucleus Sampling. For Beam Search, we set the beam size to 4; for Nucleus Sampling, p = 0.95; and for Contrastive Search, (k, α) = (4, 0.6). For FECS, we retain the same α value as Contrastive Search, setting (k, α, β) = (4, 0.3, 0.3) without hyper-parameter tuning.
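For reference, the baseline settings above map directly onto standard Hugging Face transformers generation arguments; FECS itself is not part of that API, so its β-weighted faithfulness reward has to come from the released code (footnote 1) or a custom decoding loop such as the sketch in Section 2.2. The snippet below is only an illustrative mapping of the baseline hyper-parameters; the model choice, prompt handling, and generation length are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"   # one of the OPT scales used for CNN-DM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "..."                     # two-shot prompt followed by the test article
inputs = tok(prompt, return_tensors="pt")

greedy = model.generate(**inputs, do_sample=False, max_new_tokens=128)
beam = model.generate(**inputs, num_beams=4, max_new_tokens=128)        # beam size 4
nucleus = model.generate(**inputs, do_sample=True, top_p=0.95, top_k=0,
                         max_new_tokens=128)                            # p = 0.95
contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4,
                             max_new_tokens=128)                        # (k, alpha) = (4, 0.6)
```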
BLEU-4 (Papineni et al., 2002). In addition, we also report BERTScore (Zhang et al., 2019) on both tasks for a more advanced soft metric.

Faithfulness Metrics. To measure factuality in summarization, we use FEQA (Durmus et al., 2020), following prior studies (Aralikatte et al., 2021; Chen et al., 2021). Higher FEQA scores indicate greater faithfulness of the summary to the source article. For evaluating dialogue, we employ Q2 (Honovich et al., 2021), a question-answering (QA) based metric designed for assessing factual consistency in knowledge-grounded dialogue generation. Both FEQA and Q2 exhibit strong correlations with human judgments.

Diversity Metric. For both summarization and dialogue tasks, we evaluate the diversity of the generated text x by calculating

$$\mathrm{diversity}(x) = \prod_{n=2}^{4} \Big(1.0 - \frac{\text{Rep-}n(x)}{100}\Big)$$
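The term Rep-n(x) is not defined in this excerpt; the sketch below assumes the standard formulation from Su et al. (2022), where Rep-n(x) = 100 × (1 − |unique n-grams| / |total n-grams|), and the tables appear to report the resulting value on a 0-100 scale.

```python
def rep_n(tokens, n):
    """Percentage of repeated n-grams: 100 * (1 - unique n-grams / total n-grams)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

def diversity(tokens):
    """diversity(x) = product over n = 2..4 of (1 - Rep-n(x) / 100)."""
    score = 1.0
    for n in range(2, 5):
        score *= 1.0 - rep_n(tokens, n) / 100.0
    return score

# Repetition lowers the score; fully varied text scores 1.0.
print(diversity("the cat sat on the mat the cat sat on the mat".split()))  # about 0.22
print(diversity("the quick brown fox jumps over the lazy dog".split()))    # 1.0
```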
              CNN-DM                    WoW
Method        1.3B   2.7B   6.7B       1.3B   2.7B   6B
Greedy        1.32   2.66   2.42       1.79   2.58   3.84
Beam          3.32   5.73   5.15       2.41   3.41   4.76
Nucleus       1.31   2.52   2.34       1.78   2.69   3.79
Contrastive   3.55   6.47   6.53       2.84   4.34   5.27
FECS (ours)   4.20   7.47   8.16       2.91   4.29   5.28

Table 4: The averaged decoding speed (sec) per instance using different decoding methods across model scales. As observed, FECS is comparable to Contrastive Search.

             Contrastive Search (α)            FECS (α, β)
Metric       0.6     0.4     0.2     0.0       (0.3, 0.3)
R-1          33.45   34.14   33.92   33.77     34.80
R-2          13.08   14.17   14.43   14.59     15.08
R-L          23.07   23.91   23.97   23.95     24.86
Diversity    94.21   90.13   88.07   83.57     93.18
FEQA         40.75   41.12   42.37   42.46     52.01

Table 5: Comparison of FECS and Contrastive Search with different values of α.
4.3 Analysis

Latency. To assess the decoding latency of our proposed FECS objective, we report the average decoding time (sec) per instance in Table 4. The results are averaged across 100 randomly selected instances. As observed in both the dialogue generation and abstractive summarization tasks, FECS and Contrastive Search perform comparably, and both are slightly slower than beam search. Greedy and nucleus are the fastest.
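The paper does not specify its exact timing harness, so the wrapper below is only an illustrative sketch of how such per-instance latency numbers can be collected.

```python
import time

def avg_decoding_seconds(decode_fn, instances):
    """Average wall-clock decoding time (seconds) per instance.

    decode_fn: callable running one full decoding pass for one instance,
               e.g. a wrapper around model.generate(...) with a fixed method.
    instances: the evaluation inputs (here, 100 randomly sampled instances).
    """
    total = 0.0
    for inst in instances:
        start = time.perf_counter()
        decode_fn(inst)
        total += time.perf_counter() - start
    return total / len(instances)
```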
The role of α. To establish a more comprehensive baseline, we evaluate FECS against Contrastive Search with different values of α on the 6.7B model. Intuitively, a smaller α value (i.e., a lower degree of diversity) might contribute to more factual outputs. However, as shown in Table 5, lowering α only improves faithfulness marginally, with essentially the same ROUGE scores. In contrast, FECS retains a high level of diversity and achieves superior performance on both FEQA and the standard metrics, indicating the effectiveness of our newly introduced β term.

5 Human Evaluation

In addition to the automatic evaluation, we also perform human evaluation to assess the faithfulness of our proposed FECS on the abstractive summarization task. We compare FECS against Contrastive Search and ask annotators to vote for the response they consider more faithful to the provided source (i.e., the text to be summarized). Specifically, we randomly sample 20 instances for each of the three model sizes, for a total of 60 instances. More details, including the full evaluation protocol, are provided in Appendix A.2. We present the results in Figure 2. As observed, FECS shows superior results, recording more than 60% of the votes and receiving more than twice as many votes as Contrastive Search. The results support the outcome of the automatic evaluation, suggesting that our proposed FECS generates content that is more faithful to the provided source.

Figure 2: Human evaluation results comparing the faithfulness of FECS against Contrastive Search (CS) on the abstractive summarization task. FECS outperforms Contrastive Search, receiving more than twice the votes.

6 Conclusion

This paper introduces a novel decoding approach, Fidelity-Enriched Contrastive Search (FECS), designed to enhance faithfulness in text generation. Our experimental results on abstractive summarization and dialogue generation demonstrate the efficacy of FECS: it consistently improves faithfulness across various LM scales while preserving a level of diversity comparable to other leading decoding algorithms. Particularly when larger LMs are used, it notably enhances faithfulness with only a minor impact on diversity. This indicates that FECS performs effectively when larger LMs are employed in dialogue generation tasks. In the future, we plan to explore how FECS performs with different kinds of source content, including erroneous or ambiguous inputs.

Limitations
Firstly, while FECS presents an improvement in the faithfulness-diversity trade-off, its performance could be influenced by the quality of the source content. The assumption that source content is always correct and complete may not hold true in all scenarios, particularly in cases where the input data is ambiguous, incomplete, or erroneous. Secondly, the faithfulness assessment is primarily quantitative, based on the established FEQA and Q2 metrics. Although these metrics provide an essential standard for comparing models, they may not capture all nuanced aspects of faithfulness, such as the preservation of subtle implications or subjective information.

Acknowledgments

We thank the reviewers for their insightful comments. This research was supported by JSPS KAKENHI Grant Number 23K16956 and a project JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work was also partially supported by the National Science and Technology Council, Taiwan, under grants MOST 110-2221-E-002-128-MY3, 110-2634-F-002-050-, and NSTC 111-2634-F-002-023-, and the Ministry of Education (MOE) in Taiwan, under grant NTU-112L900901.

References

Rahul Aralikatte, Shashi Narayan, Joshua Maynez, Sascha Rothe, and Ryan McDonald. 2021. Focus attention: Promoting faithfulness and diversity in summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6078–6095, Online. Association for Computational Linguistics.

Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, and Lav R. Varshney. 2020. Mirostat: A neural text decoding algorithm that directly controls perplexity. arXiv preprint arXiv:2007.14966.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow.

Meng Cao, Yue Dong, Jiapeng Wu, and Jackie Chi Kit Cheung. 2020. Factual error correction for abstractive summarization models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6251–6258, Online. Association for Computational Linguistics.

Shuyang Cao and Lu Wang. 2021. CLIFF: Contrastive learning for improving faithfulness and factuality in abstractive summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6633–6649, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Sihao Chen, Fan Zhang, Kazoo Sone, and Dan Roth. 2021. Improving faithfulness in abstractive summarization with contrast candidate generation and selection. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5935–5941, Online. Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.

Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, and Jingjing Liu. 2020. Multi-fact correction in abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9320–9331, Online. Association for Computational Linguistics.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Nouha Dziri, Andrea Madotto, Osmar Zaïane, and Avishek Joey Bose. 2021. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2197–2214, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Katja Filippova. 2020. Controlled hallucinations: Learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 864–870, Online. Association for Computational Linguistics.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. 2021. Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7856–7870, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Luyang Huang, Lingfei Wu, and Lu Wang. 2020. Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5094–5107, Online. Association for Computational Linguistics.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Chenliang Li, Bin Bi, Ming Yan, Wei Wang, and Songfang Huang. 2021. Addressing semantic drift in generative question answering with auxiliary extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 942–947, Online. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Andrea Madotto, Zhaojiang Lin, Genta Indra Winata, and Pascale Fung. 2021. Few-shot bot: Prompt-based learning for dialogue systems. arXiv preprint arXiv:2110.08118.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183, Online. Association for Computational Linguistics.

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium. Association for Computational Linguistics.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yixuan Su and Nigel Collier. 2023. Contrastive search is what you need for neural text generation. Transactions on Machine Learning Research.

Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. 2022. A contrastive framework for neural text generation. In Advances in Neural Information Processing Systems.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model.

Hongmin Wang. 2019. Revisiting challenges in data-to-text generation with fact grounding. In Proceedings of the 12th International Conference on Natural Language Generation, pages 311–322, Tokyo, Japan. Association for Computational Linguistics.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.
Yijun Xiao and William Yang Wang. 2021. On hallucination and predictive uncertainty in conditional language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2734–2744, Online. Association for Computational Linguistics.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Prompt for Dialogue Generation

Topic: Green Eggs and Ham
System: I have, and we made green eggs and ham for the kids when I did. Dr. Seuss wrote it.
User: yeah. nice! what other cool facts can you tell me?
We know that: Green Eggs and Ham. As of 2016, the book has sold 8 million copies worldwide.
System replies: It has sold 8 million copies in many languages. Hebrew is one because I bought it as a gift in that one.

Topic: Neil Brooks
System: Yes, I do. Have you heard of Neil Brooks. He is a sprint freestyle swimmer that won the 100 m medley relay at the 1980 Olympics in Moscow
User: I have never heard of him but he sounds like he was a very good swimmer.

Article:
Tiger Woods will be wondering if he can ever catch a break after suffering a bizarre injury on the ninth hole at the Masters on Sunday. [...] this was Woods' best finish in over a year.
Summarization:

Table 6: The evaluation results of repetition and diversity on FECS and other decoding methods across model scales.

Given two summaries (Summary_A and Summary_B), you should determine which one is more faithful to the provided Source, and fill in "A" or "B" in the Faithful column.
2. The summary contains information which can not be supported by the source.
○ If there is a tie, choose the one with less information that can not be supported by the source.

Figure 5: The human evaluation protocol for the abstractive summarization task.