How to Unleash the Power of Large Language Models for

Few-shot Relation Extraction?


Xin Xu, Yuqi Zhu, Xiaohan Wang, Ningyu Zhang
Zhejiang University & AZFT Joint Lab for Knowledge Engine
{xxucs, wangxh07, zhangningyu}@zju.edu.cn

∗ Corresponding author.
Proceedings of the Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 190–200, July 13, 2023. ©2023 Association for Computational Linguistics.

Abstract

Scaling language models has revolutionized widespread NLP tasks, yet few-shot relation extraction with large language models has received little comprehensive exploration. In this paper, we investigate two principal methodologies, in-context learning and data generation, for few-shot relation extraction via GPT-3.5 through exhaustive experiments. To enhance few-shot performance, we further propose task-related instructions and schema-constrained data generation. We observe that in-context learning can achieve performance on par with previous prompt learning approaches, and that data generation with the large language model can boost previous solutions to obtain new state-of-the-art few-shot results on four widely studied relation extraction datasets. We hope our work can inspire future research on the capabilities of large language models for few-shot relation extraction. Code is available at https://github.com/zjunlp/DeepKE/tree/main/example/llm.

1 Introduction

Few-shot Relation Extraction (RE) appeals to many researchers in Natural Language Processing (NLP) due to its capability to extract textual information when only a few support examples are given (Han et al., 2018; Yang et al., 2021; Han et al., 2021a; Brody et al., 2021; Ma et al., 2023). Most previous works focus on fine-tuning (Soares et al., 2019; Ye et al., 2022) or prompt-tuning (Chen et al., 2022; Han et al., 2021b) relatively small language models, e.g., RoBERTa (Liu et al., 2019). Recently, with the scaling of model size and corpus size, large language models (LLMs) such as ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023a) have demonstrated powerful abilities when shown only a few demonstration instances, a.k.a. In-Context Learning (Dong et al., 2023). Although LLMs have achieved remarkable results in many NLP tasks, their potential for few-shot relation extraction has not been fully explored yet.

In this paper, we take GPT-3.5 (OpenAI, 2023b) as an exemplary LLM to investigate how to maximize the utilization of LLMs for the few-shot relation extraction task with in-context learning and data generation. Different from text classification, the relation extraction task contains rich pre-defined schemas (e.g., entity and relation type constraints) and a relatively large and complex classification space with noisy data. We further design two simple yet effective strategies to better unleash the power of large language models: task-related instructions and schema-constrained data generation. We conduct exhaustive experiments on four well-known relation extraction datasets. Empirical results indicate that LLMs can be advantageous for few-shot relation extraction and can boost previous prompt learning performance.

2 Background

2.1 Few-shot Relation Extraction

The relation extraction task aims to extract the relationship between head and tail entities within a plain context. Specifically, one instance for the relation extraction task consists of a context x = {x1, x2, ..., h, ..., t, ..., x|x|}, head and tail entity mentions h and t, entity types th and tt, and the relation y ∈ Y between h and t, where Y is the set of candidate relations. RE systems predict y given x, h, t, th, and tt. For few-shot relation extraction, fine-tuning pre-trained language models (PLMs) is a direct solution (Han et al., 2019; Yamada et al., 2020; Joshi et al., 2020; Lyu and Chen, 2021; Zhou and Chen, 2022). To alleviate the gap between pre-training objectives and downstream applications, prompt tuning has recently been applied to relation extraction, especially for low-resource scenarios (Chen et al., 2022; Han et al., 2021b, 2022).
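To make the instance definition in §2.1 concrete, the sketch below models one labeled RE example as a small Python data container; the field names are illustrative and not taken from the authors' released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class REInstance:
    """One labeled relation extraction example: context, entity mentions,
    entity types, and the gold relation y from the candidate set Y."""
    tokens: List[str]   # the context x = {x1, ..., h, ..., t, ..., x|x|}
    head: str           # head entity mention h
    tail: str           # tail entity mention t
    head_type: str      # entity type t_h, e.g., "PERSON"
    tail_type: str      # entity type t_t, e.g., "ORGANIZATION"
    relation: str       # gold relation y, e.g., "per:employee_of"

# A TACRED-style example (the sentence is invented for illustration).
example = REInstance(
    tokens="Cain worked at the NRA for three years .".split(),
    head="Cain", tail="NRA",
    head_type="PERSON", tail_type="ORGANIZATION",
    relation="per:employee_of",
)
```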
Figure 1: Strategies to unleash the power of LLMs for few-shot relation extraction. HEAD TYPE and TAIL TYPE are schemas; HEAD ENTITY and TAIL ENTITY are entity mentions; RELATION refers to the verbalized relation label words.

Most of those approaches utilize relatively small language models (RoBERTa (Liu et al., 2019), GPT-2 (Radford et al., 2019)) and demonstrate empirical success in few-shot relation extraction performance. To date, large language models have demonstrated powerful abilities when prompted with only a few instances and without tuning (Ding et al., 2022); however, the power of LLMs for few-shot relation extraction remains little known.

2.2 Large Language Models

Large language models, trained on exceedingly large corpora and often with a great number of parameters (≥10B), have achieved excellent performance on numerous downstream NLP tasks (Taylor et al., 2022; Zhang et al., 2022; Zeng et al., 2022; Chowdhery et al., 2022; Ouyang et al., 2022). Compared to relatively small language models (SLMs), LLMs are usually not open-source and cannot be fine-tuned, which is challenging for downstream task adaptation. Therefore, in-context learning (Brown et al., 2020) has been proposed to utilize prompts with a few demonstrations for few-shot learning. Previous studies (Yoo et al., 2021; Wang et al., 2021) have investigated using LLMs for text classification and generation. In this work, we take the first step to study few-shot RE with large language models, which brings new challenges and insights.

3 LLMs for Few-shot Relation Extraction

In this section, we introduce two strategies to utilize LLMs for relation extraction: 1) in-context learning (§3.1) and 2) data generation (§3.2), as shown in Figure 1.

3.1 In-Context Learning with LLMs

The first strategy applies in-context learning (ICL) by providing LLMs with demonstrations in the prompt to elicit comprehension of the relation extraction task. To this end, specific and compelling prompts for RE with demonstrations are manually constructed and designed to instruct LLMs to understand the relation extraction task and how to execute it. Considering the aspects and characteristics of the relation extraction task, including the task definition, candidate relation (label) words, entity types (schemas), and so on, we design prompts of different articulation and complexity to investigate how prompts help LLMs release the power of few-shot RE. First, TEXT PROMPT only contains the essential elements for RE, including relation categories, contexts, and the corresponding head and tail entities. Inspired by the strong performance of InstructGPT (Ouyang et al., 2022) and ChatGPT (OpenAI, 2022), we design a task-related instruction describing the relation extraction task and add it to the prompt, which is named INSTRUCT PROMPT. Meanwhile, according to previous few-shot RE works (Zhou and Chen, 2022), entity types (schemas) are helpful; therefore, we also explore the effectiveness of schemas in prompts.
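As an illustration of how such prompts might be assembled, the sketch below builds a TEXT- or INSTRUCT-style prompt with optional schema information from a few demonstrations. The exact prompt wording used in the paper is only shown in Figure 1, so the template strings here are assumptions.

```python
def build_prompt(relations, demos, query, use_instruction=True, use_schema=True):
    """Assemble an in-context learning prompt for one test instance.
    `demos` and `query` are dicts with context/head/tail (and type) fields."""
    lines = []
    if use_instruction:
        # task-related instruction (INSTRUCT PROMPT); wording is illustrative
        lines.append("Given a context with a head entity and a tail entity, "
                     "classify their relation into one of the candidate relations.")
    lines.append("Candidate relations: " + ", ".join(relations))
    for d in demos + [query]:
        lines.append("Context: " + d["context"])
        if use_schema:
            lines.append(f"Head type: {d['head_type']}. Tail type: {d['tail_type']}.")
        lines.append(f"Head entity: {d['head']}. Tail entity: {d['tail']}.")
        lines.append("Relation: " + d.get("relation", ""))  # left empty for the query
    return "\n".join(lines)
```

Setting `use_instruction=False` and `use_schema=False` corresponds to the plain TEXT PROMPT variant.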
3.2 Data Generation with LLMs

To remedy the scarcity of labeled data, we introduce another strategy: data generation via LLMs. Specifically, we utilize prompts with descriptions of the data format to guide LLMs to generate more in-domain labeled data autonomously, which is subsequently used, together with the existing few-shot labeled training data, to fine-tune a relatively small language model. We design the prompt to describe the essential components (x, h, t, th, tt and y) of one RE training instance and show few-shot instances as demonstrations to teach LLMs to comprehend the features of labeled RE data.
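A minimal sketch of this generation step through the legacy (pre-1.0) OpenAI completions interface, with text-davinci-003 and temperature 1 as reported in Appendix A.3; the prompt wording and function name are illustrative assumptions rather than the authors' exact implementation, and the entity-type constraint anticipates the schema-constrained variant described after Table 2.

```python
import openai  # legacy openai-python interface, used here for illustration only

def generate_instances(relation, head_type, tail_type, demos, n=8):
    """Ask GPT-3.5 for n new labeled examples of one relation, constrained by
    the entity-type schema; demos are few-shot examples in the target format."""
    prompt = (
        f"Generate {n} new relation extraction examples for the relation "
        f"'{relation}'. The head entity must be of type {head_type} and the "
        f"tail entity of type {tail_type}. Follow the format of the examples.\n\n"
        + "\n\n".join(demos)
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=1024,
        temperature=1.0,  # Appendix A.3: temperature 1 for diverse generated RE data
    )
    return response["choices"][0]["text"]
```

The raw completions would then be parsed back into the instance format above before being mixed with the gold few-shot data.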
                                       TACRED       TACREV       RE-TACRED    SciERC
Method                                 K=8   K=16   K=8   K=16   K=8   K=16   K=8   K=16
Baselines
SpanBERT (Joshi et al., 2020)          8.4   17.5   5.2   5.7    14.2  29.3   29.0  38.7
LUKE (Yamada et al., 2020)             9.5   21.5   9.8   22.0   14.1  37.5   33.2  48.9
GDPNet (Xue et al., 2021)              11.8  22.5   8.3   20.8   18.8  48.0   33.5  42.3
TANL (Paolini et al., 2021)            18.1  27.6   18.6  28.8   26.7  50.4   32.4  38.7
TYP Marker (Zhou and Chen, 2022)       26.5  29.9   26.7  29.5   44.8  54.1   50.4  59.0
KnowPrompt (Chen et al., 2022)         29.4  32.1   29.8  34.1   56.1  61.4   50.2  57.1
GPT-3.5
In-context Learning†                   31.9         32.4         49.9         46.6
In-context Learning† (w/ Instruction)  31.0         31.9         51.8         48.8
Data Generation (TYP Marker)           35.8  36.6   36.7  36.5   58.4  60.6   63.2  64.3
Data Generation (KnowPrompt)           37.9  37.4   42.6  41.0   62.7  66.2   58.6  67.8

Table 1: Micro F1 (%) of few-shot performance. † refers to performance with one-shot demonstrations, for which a single score is reported per dataset.
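All scores in Tables 1 and 2 are micro F1. A minimal sketch of the metric is shown below, under the assumption of the usual TACRED-style convention of scoring only the positive relation classes (i.e., excluding no_relation); the paper does not spell out its scoring script, so this convention is an assumption.

```python
from sklearn.metrics import f1_score

def micro_f1(y_true, y_pred, negative_label="no_relation"):
    # restrict the micro average to positive relation classes
    labels = sorted({r for r in y_true + y_pred if r != negative_label})
    return f1_score(y_true, y_pred, average="micro", labels=labels)

print(micro_f1(["per:title", "no_relation", "org:founded_by"],
               ["per:title", "per:title", "no_relation"]))
```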

Prompts             TACRED  TACREV  RE-TACRED  SciERC
TEXT                31.9    32.4    49.9       46.6
TEXT + Schema       36.9    37.7    54.3       45.9
INSTRUCT            31.0    31.9    51.8       48.8
INSTRUCT + Schema   38.3    36.7    58.5       50.2

Table 2: Micro F1 (%) of in-context learning with different prompts: TEXT PROMPT and INSTRUCT PROMPT, with and without schema information.

Note that schemas, such as the types of relations and entities, are significant structural information in RE data. Therefore, we propose schema-constrained data generation, adding entity types as schema guidance to the prompt (Figure 1) to boost performance. The prompt is then used to guide LLMs to create augmented relation extraction data, which are converted into the expected format for later use.

4 Experimental Setups

4.1 Methods and Datasets

GPT-3.5 is utilized via the OpenAI API (https://platform.openai.com/docs/models/gpt-3-5) as the large language model in our experiments. We implement experiments on four relation extraction datasets: TACRED (Zhang et al., 2017), TACREV (Alt et al., 2020), RE-TACRED (Stoica et al., 2021), and SciERC (Luan et al., 2018). The LLM is compared against six baseline methods built on relatively small models (details in Appendix A).

4.2 Few-shot Settings

K instances per relation (K-shot) are sampled for training and validation. For all baselines, we use randomly sampled 8-shot and 16-shot datasets for training and validation. For in-context learning, because GPT-3.5 has a maximum request length of 4,097 tokens and the TACRED-series datasets have more than 40 relations, only one-shot demonstrations can be used, and the one-shot performance is reported in Table 1. For the same reason, when generating labeled data for each relation independently, only three demonstrations of that relation are added to the prompt. In-context learning is evaluated on the four whole test sets. Demonstrations are randomly sampled from the shuffled training set each time to avoid effects from the permutation of demonstrations (Lu et al., 2021). For data generation, the data generated by GPT-3.5 and the original few-shot training data are combined to fine-tune two baselines, TYP Marker (Zhou and Chen, 2022) and KnowPrompt (Chen et al., 2022). Different amounts of generated data lead to different results; therefore, we incrementally add generated k-shot (k ∈ {8, 16, 32, 48}) data to the original 8-shot and 16-shot training data, respectively, and report the best performance over k in Table 1. More details are given in Appendix A.3.

5 Results and Discussion

5.1 Main Findings for Relation Extraction

In-context learning with LLMs can achieve performance comparable to tuning relatively small PLMs for RE. From Table 1, we notice that ICL with only one-shot demonstrations obtains competitive performance compared with fully tuned prompt learning baselines. Using LLMs via ICL does not necessitate any parameter updates, which has the potential of making models scenario-adaptable, unlike supervised learning, which requires parameter optimization.
Figure 2: Micro F1 (%) with k in-context demonstrations on SciERC.

Figure 3: Performance of data generation with LLMs and different data augmentation methods. RoBERTa and SciBERT are used on RE-TACRED and SciERC, respectively, in the contextual word embedding-based DA method.

Data generation with LLMs can boost previous solutions to new state-of-the-art few-shot RE results. From Table 1, we find that previous baselines improve significantly, by 10.7% for 16-shot SciERC and 6.6% for 16-shot RE-TACRED, simply by using data generated by GPT-3.5. Notably, data generation is a simple yet effective approach to transfer the power of the LLM to previous methods, and we demonstrate that schema-constrained generation with LLMs can benefit all previous approaches built on SLMs.

5.2 Prompts in In-context Learning with LLMs

Instructions and schemas play an essential role in in-context learning for RE with LLMs. From Table 2, we notice that INSTRUCT PROMPT obtains better performance than TEXT PROMPT in most cases, indicating that task-related information indeed helps to unlock more of the ability of LLMs for RE. Aberrant results appear on TACRED and TACREV because incorrectly labeled demonstrations from these two datasets contradict the instruction fed into the LLM, which confuses the model and results in worse performance than ICL without the instruction. Moreover, adding schema information yields much better performance, exhibiting the importance of pre-defined structural information for relation extraction.

More demonstrations, counter-intuitively, may not improve RE performance with LLMs. From Figure 2, we find that performance does not improve, and may even drop, as the number of in-context demonstrations increases, and that the gap between INSTRUCT PROMPT and TEXT PROMPT becomes relatively smaller. We argue that there may be two reasons: 1) it is challenging to select representative demonstrations; 2) it is non-trivial for LLMs to understand structured prediction tasks with a large output (relation) space. More case studies for GPT-3.5 can be found in Appendix B.1.

5.3 Utility of Generated Data from LLMs

Combining data generated by LLMs with the original training data yields better RE performance than traditional data augmentation approaches. In Figure 3, we compare data generation through the LLM with previously widely used data augmentation approaches, such as substituting words in the training set with WordNet synonyms or contextual word embeddings (details in Appendix A.3). Data generation with LLMs obtains better performance than all the others, indicating that guiding LLMs to generate data is an effective way to compensate for the lack of labeled data.

Using more and more generated data from LLMs boosts RE performance only up to a certain point, not continuously. From Figure 4, we observe that with more generated data, the result first climbs and then declines, while always remaining higher than without generated data. We think low-quality generated data introduces noise into training, according to the analysis of generated data in Appendix B.2, and LMs may have a certain anti-noise capacity (Song et al., 2020).

6 Discussion and Conclusion

In this paper, we take the first step to investigate how to utilize large language models for few-shot relation extraction. We observe that task-related information, including instructions or schemas, helps to elicit the capability of LLMs and boost few-shot relation extraction performance.
Figure 4: Micro F1 (%) of KnowPrompt with generated training data added to the original 8-shot data.

At this stage, using LLMs to generate data may be a simple yet effective solution to extend the power of foundation models (relatively small PLMs) for practical applications. We hope this work can deliver the benefits of using LLMs to the NLP community. Note that LLMs can make predictions based only on contexts combined with a few training examples as demonstrations. We argue that this has the potential to enable sophisticated, human-readable prompts for scenario-adaptable (e.g., low-shot and any-domain) relation extraction.

Acknowledgment

We would like to express gratitude to the anonymous reviewers for their kind comments. This work was supported by the National Natural Science Foundation of China (No. 62206246), the Zhejiang Provincial Natural Science Foundation of China (No. LGG22F030011), the Ningbo Natural Science Foundation (2021J190), the Yongjiang Talent Introduction Programme (2021A-156-G), and the CAAI-Huawei MindSpore Open Fund.

Limitations

Despite our best efforts, some limitations may remain in this paper.

LLMs: Due to limited budgets, we cannot afford all kinds of LLMs, so we only evaluate GPT-3.5 (text-davinci-003). We will try to investigate relation extraction with more LLMs, such as OPT (Zhang et al., 2022), GLM-130B (Zeng et al., 2022), or code language models (Bi et al., 2023) like Codex.

Other methods to utilize LLMs: There are several other techniques to leverage LLMs, such as black-box optimization (Sun et al., 2022) and feature-based learning (Lang et al., 2022); however, we find that most of those approaches cannot be directly applied to relation extraction due to the large label space and complex schema structures. We leave leveraging other methods with LLMs for relation extraction to future work.

Datasets: We only evaluate four relation extraction datasets and will try to investigate relation extraction performance with LLMs on more diverse datasets across different domains and languages.

References

Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In Proceedings of ACL.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of EMNLP-IJCNLP 2019, pages 3615–3620.

Zhen Bi, Jing Chen, Yinuo Jiang, Feiyu Xiong, Wei Guo, Huajun Chen, and Ningyu Zhang. 2023. CodeKGC: Code language model for generative knowledge graph construction. CoRR, abs/2304.09048.

Sam Brody, Sichao Wu, and Adrian Benton. 2021. Towards realistic few-shot relation extraction. In Proceedings of EMNLP 2021, pages 5338–5345.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.

Xiang Chen, Ningyu Zhang, Xin Xie, Shumin Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022. KnowPrompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In Proceedings of the ACM Web Conference (WWW 2022), pages 2778–2788.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint, abs/2204.02311.

Bosheng Ding, Chengwei Qin, Linlin Liu, Lidong Bing, Shafiq R. Joty, and Boyang Li. 2022. Is GPT-3 a good data annotator? arXiv preprint, abs/2212.10450.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A survey for in-context learning. arXiv preprint, abs/2301.00234.

Jiale Han, Bo Cheng, and Wei Lu. 2021a. Exploring task difficulty for few-shot relation extraction. In Proceedings of EMNLP 2021, pages 2605–2616.

Jiale Han, Shuai Zhao, Bo Cheng, Shengkun Ma, and Wei Lu. 2022. Generative prompt tuning for relation classification. In Findings of EMNLP 2022, pages 3170–3185.

Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Liu, and Maosong Sun. 2019. OpenNRE: An open and extensible toolkit for neural relation extraction. In Proceedings of EMNLP-IJCNLP 2019: System Demonstrations, pages 169–174.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021b. PTR: Prompt tuning with rules for text classification. arXiv preprint, abs/2105.11259.

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of EMNLP 2018, pages 4803–4809.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Hunter Lang, Monica N. Agrawal, Yoon Kim, and David A. Sontag. 2022. Co-training improves prompt-based learning for large language models. In Proceedings of ICML 2022, pages 11985–12003.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint, abs/1907.11692.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint, abs/2104.08786.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of EMNLP 2018.

Shengfei Lyu and Huanhuan Chen. 2021. Relation classification with entity type restriction. In Findings of ACL-IJCNLP 2021, pages 390–395.

Yubo Ma, Yixin Cao, YongChing Hong, and Aixin Sun. 2023. Large language model is not a good few-shot information extractor, but a good reranker for hard samples! CoRR, abs/2303.08559.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/.

OpenAI. 2023a. GPT-4 technical report. arXiv preprint, abs/2303.08774.

OpenAI. 2023b. Text-davinci-003. https://platform.openai.com/docs/models/text-davinci-003.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint, abs/2203.02155.

Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cícero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2021. Structured prediction as translation between augmented natural languages. In Proceedings of ICLR 2021.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of ACL 2019, pages 2895–2905.

Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. 2020. Learning from noisy labels with deep neural networks: A survey. arXiv preprint, abs/2007.08199.

George Stoica, Emmanouil Antonios Platanios, and Barnabás Póczos. 2021. Re-TACRED: Addressing shortcomings of the TACRED dataset. In Proceedings of AAAI 2021, pages 13843–13850.

Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. 2022. Black-box tuning for language-model-as-a-service. In Proceedings of ICML 2022, pages 20841–20855.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. arXiv preprint, abs/2211.09085.

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? GPT-3 can help. In Findings of EMNLP 2021, pages 4195–4205.

Fuzhao Xue, Aixin Sun, Hao Zhang, and Eng Siong Chng. 2021. GDPNet: Refining latent multi-view graph for relation extraction. In Proceedings of AAAI 2021, pages 14194–14202.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of EMNLP 2020, pages 6442–6454.

Shan Yang, Yongfei Zhang, Guanglin Niu, Qinghua Zhao, and Shiliang Pu. 2021. Entity concept-enhanced few-shot relation extraction. In Proceedings of ACL-IJCNLP 2021 (Volume 2: Short Papers), pages 987–991.

Hongbin Ye, Ningyu Zhang, Hui Chen, and Huajun Chen. 2022. Generative knowledge graph construction: A review. In Proceedings of EMNLP 2022, pages 1–17.

Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woo-Myoung Park. 2021. GPT3Mix: Leveraging large-scale language models for text augmentation. In Findings of EMNLP 2021, pages 2225–2239.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, et al. 2022. GLM-130B: An open bilingual pre-trained model. CoRR, abs/2210.02414.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, et al. 2022. OPT: Open pre-trained transformer language models. CoRR, abs/2205.01068.
Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of EMNLP 2017, pages 35–45.

Wenxuan Zhou and Muhao Chen. 2022. An improved baseline for sentence-level relation extraction. In Proceedings of AACL-IJCNLP 2022 (Volume 2: Short Papers), pages 161–168.
A Experimental Details

A.1 Datasets

TACRED (https://nlp.stanford.edu/projects/tacred/) is a widely used RE dataset. It has 42 relation labels, including no_relation, which means that no relation is found. TACREV (https://github.com/DFKI-NLP/tacrev) keeps the same training set as TACRED but relabels the development and test sets. RE-TACRED (https://github.com/gstoica27/Re-TACRED) is a re-annotated version of TACRED with 40 relations. SciERC (http://nlp.cs.washington.edu/sciIE/) has seven relation categories and is constructed in the scientific domain. All datasets are taken from their official websites without modification, including contents and train/test/dev splits.

A.2 Baselines

We compare LLMs with recent baseline methods using relatively small models. 1) Normal fine-tuning methods: SpanBERT (Joshi et al., 2020), a span-based PLM; LUKE (Yamada et al., 2020), pre-trained contextualized representations of words and entities based on a bidirectional transformer; GDPNet (Xue et al., 2021), a Gaussian dynamic time warping pooling net able to select important words for relation prediction; and TYP Marker (Zhou and Chen, 2022), fine-tuning with typed entity markers. 2) Generative method: TANL (Paolini et al., 2021), which frames structured prediction as a translation task between augmented natural languages. 3) Prompt-tuning method: KnowPrompt (Chen et al., 2022), knowledge-aware continuous prompt-based tuning with synergistic optimization.

A.3 Implementation Details

Generated data combined with the existing training data is then evaluated on KnowPrompt. The data augmentation baselines with WordNet synonyms and contextual word embeddings are implemented with nlpaug (https://github.com/makcedward/nlpaug). The temperature parameter of the OpenAI API is set to 0 for precision in ICL and to 1 for generating diverse RE data. One NVIDIA GeForce RTX 3090 GPU with 24GB memory is used to run all experiments. We rerun the official code of the baselines with their original settings, except on the SciERC dataset: due to the vertical (scientific) domain of SciERC, SciBERT (Beltagy et al., 2019) is used in TYP Marker and KnowPrompt for fairness, while RoBERTa-large is utilized for the other three datasets.
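For the two augmentation baselines above, a minimal nlpaug sketch is shown below; the model names and parameters are assumptions, since the exact settings are not listed, and in practice the head and tail entity mentions would need to be kept intact.

```python
import nlpaug.augmenter.word as naw

# WordNet synonym substitution
synonym_aug = naw.SynonymAug(aug_src="wordnet")

# Contextual word embedding substitution; per Figure 3, a RoBERTa model is used
# for RE-TACRED and SciBERT for SciERC (the model name here is illustrative).
context_aug = naw.ContextualWordEmbsAug(model_path="roberta-base",
                                        action="substitute")

text = "Cain worked at the NRA for three years ."
print(synonym_aug.augment(text))
print(context_aug.augment(text))
```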
B Case Analysis

B.1 Wrong Cases from ICL

From Table 4, we notice that some RE instances are challenging for LLMs, and several limitations emerge: 1) LLMs are not good at clearly distinguishing the order of head and tail entities. 2) Identical mentions of the head and tail entities confuse LLMs. 3) If the distance between the head and tail entities in the context is long, it is difficult for LLMs to decide the relation correctly. 4) Semantically similar relation label words and entity mentions puzzle LLMs because their embeddings are similar. 5) LLMs cannot afford very long instances, since the label space for relation extraction is large. 6) LLMs mostly fail to extract ambiguous or wrongly labeled relations, which are also challenging for humans. More high-quality demonstrations may help mitigate these issues, and we think it is necessary to develop step-by-step (chat-style) approaches with LLMs that extract a limited set of relations at each stage.

B.2 Generated Data from LLMs

Table 5 shows some cases of data generated by GPT-3.5. Through human checks on 100 generated samples per dataset, about 78% of the generated data are correctly labeled and of high quality (85% for TACRED, 82.5% for TACREV, 72% for RE-TACRED, 75% for SciERC). Meanwhile, we add generated data and original gold training data, respectively, to the 8-shot datasets and fine-tune KnowPrompt to evaluate the quality of the generated data, as shown in Table 3. We observe that labeled data generated by GPT-3.5 are mostly correct. For TACRED and TACREV, generated data achieve larger improvements than gold labeled data.

TACRED TACREV RE-TACRED SciERC
8-shot Dataset generated gold generated gold generated gold generated gold
add 0-shot 29.35 29.35 29.77 29.77 56.05 56.05 45.80 45.80
add 8-shot 31.63 30.73 34.30 33.16 59.85 60.92 48.30 57.08
add 16-shot 34.78 31.88 36.33 33.49 59.59 61.30 58.62 65.15
add 32-shot 36.45 33.35 38.19 33.98 60.06 64.65 57.70 72.11
add 48-shot 37.89 33.97 38.80 35.06 62.67 65.56 51.64 74.29
add 64-shot 36.67 34.36 42.61 35.57 61.07 67.28 54.52 75.36
add 72-shot 35.69 34.58 41.72 35.96 59.09 67.43 49.59 75.87

Table 3: Micro F1 (%) of KnowPrompt after adding labeled data generated by GPT-3.5 or gold labeled data to
8-shot datasets.
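The setup behind Table 3, incrementally adding k generated (or gold) instances per relation to the original 8-shot training set before fine-tuning KnowPrompt, can be sketched as follows; the function and field names are illustrative.

```python
import random
from collections import defaultdict

def add_k_shot(base_train, extra_pool, k, seed=0):
    """Add k extra instances per relation (generated by GPT-3.5, or gold) to the
    base 8-shot training set, mirroring the rows of Table 3."""
    random.seed(seed)
    by_relation = defaultdict(list)
    for ins in extra_pool:
        by_relation[ins["relation"]].append(ins)
    out = list(base_train)
    for relation, pool in by_relation.items():
        out.extend(random.sample(pool, min(k, len(pool))))
    random.shuffle(out)
    return out
```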

Since there are many incorrectly labeled instances in TACRED and TACREV (Zhang et al., 2017; Alt et al., 2020), we think the better performance results from GPT-3.5's help. However, we also find that some generated data from GPT-3.5 are of lower quality than gold data: for RE-TACRED and SciERC, using more gold data performs better than using generated data. Through human checks, some generated samples are too short or are concatenations of semantically irrelevant sentences. Meanwhile, the large performance gap on SciERC shows that GPT-3.5 is not good at vertical domains such as science.

Dataset Case Gold Relation In-context Learning
Context: And strangely enough , Cain’s short , three-year org:top_members/employees per:employee_of
TACRED tenure at the NRA is evidently the only period in his
decades-long career during which he ’s alleged to have
been a sexual predator.
Head Type: ORGANIZATION. Head Entity: NRA.
Tail Type: PERSON. Tail Entity: Cain.
Context: "I learn from students and I challenge them," per:parents per:alternate_names
says Heloise, 58, who took over the family hints business
when her mother, also named Heloise, died in 1977.
Head Type: PERSON. Head Entity: Heloise.
Tail Type: PERSON. Tail Entity: Heloise.
Context: Anna Mae Pictou Aquash, a Mi ‘ kmaq Indian per:country_of_birth per:countries_of_residence
TACREV from Canada, was brutally murdered in 1975.
Head Type: PERSON. Head Entity: Anna Mae Pictou
Aquash.
Tail Type: COUNTRY. Tail Entity: Canada.
Context: Messina Denaro has been trying to impose his no_relation per:cities_of_residence
power in Palermo, the Sicilian capital, and become the
new head of the Sicilian Mafia, weakened by the arrest
of Provenzano in April 2006.
Head Type: PERSON. Head Entity: his.
Tail Type: CITY. Tail Entity: Palermo.
Context: They say Vladimir Ladyzhenskiy died late Sat- per:identity per:date_of_death
RE-TACRED urday during the contest in southern Finland, while his
Finnish rival Timo Kaukonen was rushed to a hospital.
Head Type: PERSON. Head Entity: Vladimir Ladyzhen-
skiy.
Tail Type: PERSON. Tail Entity: his.
President of the Central American Parliament (Parlacen) org:top_members/employees per:title
Jacinto Suarez said on Monday that the presidents of the
Central American countries did not support Panama ’s
request of withdrawal from the Parlacen.
Head Type: ORGANIZATION. Head Entity: Central
American Parliament.
Tail Type: PERSON. Tail Entity: Jacinto Suarez.
Context: We evaluate across two corpora (conversational CONJUNCTION COMPARE
SciERC telephone speech and broadcast news speech) on both
human transcriptions and speech recognition output.
Head Type: OtherScientificTerm. Head Entity: transcrip-
tions.
Tail Type: OtherScientificTerm. Tail Entity: output.
Context: We validate this new method on nine standard HYPONYM-OF PART-OF
person re-identification datasets including two large scale
Market-1501 and CUHK03 datasets and show that we
improve upon the current state-of-the-art methods on all
of them.
Head Type: Material. Head Entity: CUHK03 datasets.
Tail Type: Material. Tail Entity: datasets.

Table 4: Wrong cases predicted by GPT-3.5. The gold relation categories are listed in the third column and the
results predicted by in-context learning are in the fourth column.

Dataset Case Corrective Data
Context: The American Cancer Society is headquartered in Atlanta and was Tail Type: PERSON
TACRED founded in 1913 by 15 trained laywomen.
Head Type: ORGANIZATION. Head Entity: American Cancer Society.
Tail Type: ORGANIZATION. Tail Entity: 15 trained laywomen.
Relation: org:founded_by.
Context: Mary Brown, CEO of Brown Corp and renowned businesswoman, Tail Type: TITLE
is a regular speaker at industry conferences and events.
Head Type: PERSON. Head Entity: Mary Brown.
Tail Type: PERSON. Tail Entity: CEO.
Relation: per:title.
Context: Gustav Mahler was born in Kalischt, Bohemia on July 7th, 1860. Tail Type: DATE
TACREV Head Type: PERSON. Head Entity: Gustav Mahler. Relation: per:date_of_birth
Tail Type: PERSON. Tail Entity: 1860.
Relation: per:country_of_birth.
Context: MTN Nigeria, a subsidiary of South African-based MTN Group, has -
begun to list its shares on the Nigerian Stock Exchange.
Head Type: ORGANIZATION. Head Entity: MTN Group.
Tail Type: ORGANIZATION. Tail Entity: MTN Nigeria.
Relation: org:subsidiaries.
Context: Pope John Paul II was a hugely popular Catholic leader who was Tail Type: CITY
RE-TACRED based in the Vatican City for most of his papacy. Relation:
Head Type: PERSON. Head Entity: Pope John Paul II. per:cities_of_residence
Tail Type: PERSON. Tail Entity: Vatican City.
Relation: per:countries_of_residence.
Context: French drug manufacturer Sanofi-Aventis dissolved its Chinese -
subsidiary Guangzhou Pharma following a bribery scandal.
Head Type: ORGANIZATION. Head Entity: Sanofi-Aventis.
Tail Type: ORGANIZATION. Tail Entity: Guangzhou Pharma.
Relation: org:dissolved.
Context: The comparison between the two approaches indicates that the neural -
SciERC method produces far better results than the rule-based system.
Head Type: Method. Head Entity: neural method.
Tail Type: Method. Tail Entity: rule-based system.
Relation: COMPARE.
Context: The combination of chromatography and mass spectrometry has Relation: CONJUNCTION
enabled scientists to achieve unparalleled levels of proteome analysis.
Head Type: Method. Head Entity: mass spectrometry.
Tail Type: Method. Tail Entity: chromatography.
Relation: FEATURE-OF.

Table 5: Data generated by LLMs. Errors in the second column are corrected in the third column.
