
Knowledge Solver: Teaching LLMs to Search for Domain Knowledge from Knowledge Graphs

Chao Feng, Xinyu Zhang, Zichu Fei


arXiv:2309.03118v1 [cs.CL] 6 Sep 2023

Abstract
Large language models (LLMs), such as ChatGPT and GPT-4, are versatile and can solve different
tasks due to their emergent ability and generalizability. However, LLMs sometimes lack domain-
specific knowledge to perform tasks, which would also cause hallucination during inference. In
some previous works, additional modules like graph neural networks (GNNs) are trained on retrieved
knowledge from external knowledge bases, aiming to mitigate the problem of lacking domain-
specific knowledge. However, incorporating additional modules 1) requires retraining those modules when encountering novel domains, and 2) becomes a bottleneck since LLMs' strong abilities are not fully utilized for retrieval. In this paper, we propose a paradigm, termed Knowledge Solver (KSL), to teach LLMs to search for essential knowledge from external knowledge bases by harnessing their own strong generalizability. Specifically, we design a simple yet effective prompt to transform retrieval into a multi-hop decision sequence, which empowers LLMs with the ability to search for knowledge in a zero-shot manner. Additionally, KSL is able to provide complete retrieval paths and therefore increases the explainability of LLMs' reasoning processes. We conduct experiments on three datasets: CommonsenseQA (Talmor et al., 2018), OpenbookQA (Mihaylov et al., 2018), and MedQA-USMLE (Jin et al., 2021), and find that our approach improves LLM baseline performance by a relatively large margin.

1 Introduction

Recently, large language models (LLMs) like ChatGPT have drawn considerable attention from researchers and practitioners due to their generalist capabilities (Qin et al., 2023). For instance, sufficiently large language models can perform well on different tasks in a zero-shot manner, such as text summarization (Yang et al., 2023; Zhang et al.,
2023), machine translation (Moslem et al., 2023), and question answering (Singhal et al., 2023). However, in some
scenarios, LLMs lack domain-specific knowledge or are not able to recall facts and knowledge correctly, which causes
hallucination (Bang et al., 2023). Hallucination refers to models generating text that is nonsensical, or unfaithful to the
provided source input (Ji et al., 2023; Koehn and Knowles, 2017; Raunak et al., 2021; Rohrbach et al., 2018; Vinyals
and Le, 2015; Maynez et al., 2020).
Retrieving relevant texts from knowledge bases is a classic way to improve language models' performance, e.g., generation quality (Borgeaud et al., 2022; Lewis et al., 2020a; Levine et al., 2022; Guu et al., 2020). Besides, it can also help improve the factuality of generated texts. Typically, retrieval modules are employed to find the documents with the highest similarity scores to the query; the input text and the retrieved documents are then combined in a specific way and fed into the model. Motivated by this, some methods (Ram et al., 2023; Peng et al., 2023b) utilize retrieved texts to augment LLMs. Ram et al. (2023) directly prepend retrieved documents to the input to obtain a performance gain for LLMs. Peng et al. (2023b) design an LLM-Augmenter to retrieve and merge evidence from external knowledge to alleviate hallucination. However, relying on similarity between embeddings only makes the model learn shallow features instead of understanding semantics, which in turn hinders the model from finding truly useful knowledge. In contrast, Knowledge Graphs (KGs) are clear, logical, and effective mediums of knowledge. Thus, effectively leveraging KGs should benefit LLMs' performance on knowledge-required tasks.
For this reason, there is a line of work (Yasunaga et al., 2021; Lin et al., 2019; Feng et al., 2020) using KGs to help LLMs
make predictions. KagNet (Lin et al., 2019) proposes a graph neural network module to model relational graphs for
relational reasoning under the context of both knowledge symbolic space and language semantic space. MHGRN (Feng
[Figure 1 illustration: the question "Where is a business restaurant likely to be located?" with answer choices A. town, B. at hotel, C. mall, D. business sector, E. yellow pages. (a) The vanilla LLM's reasoning process ends with "I don't have enough information and knowledge to answer your question accurately." (b) With interactive knowledge search over the KG (containing nodes such as restaurant, guests, city, business, place, and business sector, connected by relations like UsedFor, IsA, AtLocation, and RelatedTo), the LLM answers D. business sector.]

Figure 1: Knowledge Solver. An example comparing the vanilla LLM in (a) and zero-shot knowledge solver in (b) for
question-answering tasks. Our approach helps LLMs search for necessary knowledge to perform tasks by harnessing
LLMs’ own generalizability. Purple represents nodes and relations in LLMs’ chosen correct path.

et al., 2020) equips pretrained language models with a multi-hop relational reasoning module, which unifies path-based
reasoning methods and graph neural networks. QA-GNN (Yasunaga et al., 2021) learns representations over joint graphs formed by connecting the QA context and the KG. However, these methods (Yasunaga et al., 2021; Lin et al., 2019; Feng et al., 2020) all require training additional knowledge-aware modules like graph neural networks (GNNs) on retrieved knowledge. There are two shortcomings of training additional modules: 1) they require retraining when encountering novel domains; 2) they become a bottleneck since LLMs' strong abilities are not fully utilized for retrieval.
In this paper, we propose a paradigm, termed Knowledge Solver (KSL), to solve these shortcomings, which teaches
LLMs themselves to search for knowledge from external knowledge bases. To be specific, we simplify the process of
searching for necessary knowledge from KGs into a multi-hop decision sequence. At each step, we transform local
information within KGs into text prompts (including the historical path selected by LLMs), based on which LLMs
select relevant knowledge in the context to perform tasks, as shown in Figure 1. The whole process is similar to humans
searching over the Internet for achieving some goals. Furthermore, based on the complete paths chosen by LLMs, we
can explain the whole decision-making process of LLMs. It allows for analysis when bad cases arise, a capability not
present in previous black-box retrieval methods.
We evaluate our approach, Knowledge Solver (KSL), with three LLMs (GPT-3.5, LLaMA (Touvron et al., 2023a), and
LLaMA 2 (Touvron et al., 2023b)) on three datasets: CommonsenseQA, OpenbookQA, and MedQA-USMLE, where
reasoning with knowledge is required. KSL improves two LLM baselines’ performance across these three datasets in
zero-shot and finetuning settings.
Our main contributions are summarized as follows:

• We propose Knowledge Solver (KSL), which is the first paradigm employing LLMs to search for relevant
knowledge on KGs by themselves.
• Our proposed paradigm Knowledge Solver can boost LLMs’ performance on knowledge-required tasks by a
relatively large margin in a zero-shot manner, without additional modules or training.
• Knowledge Solver can provide explainability for LLMs’ whole reasoning processes.
• When the computational burden is affordable, finetuning LLMs on our specially constructed dataset, with the
help of KGs, can benefit LLMs further.


Question: What type of person typically contracts illness?
A. hospital  B. head  C. sick person  D. elderly person  E. doctor's office

External KG (retrieved subgraph) entities include: head, human, type, patient, nurse, animal, money, nursing, office, home, typically, doctor, elderly person, hospital, contract, condition, illness, sick person, health, sick.

KSL dialogue:
User: Given a question and an answer entity list, our goal is to choose the subsequent entity based on their relations shown in the brackets () from the provided entity list, until we reach the correct answer entity. The question is: What type of person typically contracts illness? The answer entities are: ['hospital', 'head', 'person', 'sick', 'sick_person', …]. Given a head entity contract, please pick the next entity: [sicken(has subevent), condition(is related to), …].
Assistant: The next entity is condition.
User: Given a head entity condition, please pick the next entity: [contract(is related to), illness(is a kind of), hospital(is related to), well(is related to) …].
Assistant: The next entity is illness.
User: Given a head entity illness, please pick the next entity: [hospital(is at location of), elderly_person(is at location of), sick_person(is at location of) …].
Assistant: The next entity is elderly_person.

Figure 2: Method Overview. For each question answer choice pair, we retrieve relevant knowledge subgraph and
encode it into text prompt, which is injected into LLMs directly to help them perform knowledge-required tasks. In this
question-answering scenario, LLMs interact with provided external knowledge to choose the path for answering the
question correctly.

2 Related Work

Large Language Models. Pre-trained language models (PLMs) are trained on massive datasets, which enables them
to understand contexts and generate texts. Pre-trained LMs like GPT-1 (Radford et al., 2018), BERT (Devlin et al.,
2018), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019) have been widely applied
to various natural language processing (NLP) tasks in recent years. For the task of question answering, models are
leveraged in a large number of existing frameworks, such as (Lin et al., 2019; Lv et al., 2020; Feng et al., 2020;
Yasunaga et al., 2021; Zhang et al., 2022) to encode the QA contexts as statement vectors.
The current burst of development in large language models (LLMs) brings new innovations owing to their immense size and capacity. Base LLMs like T5 (Raffel et al., 2020), GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), GPT-
J (Wang, 2021), LLaMA (Touvron et al., 2023a), GLM (Du et al., 2022; Zeng et al., 2022), BLOOM (Scao et al., 2022),
RWKV (Peng et al., 2023a), MOSS (Sun et al., 2023) and LLaMA 2 (Touvron et al., 2023b) are trained on large datasets
to capture general language patterns. Additionally, instruction fine-tuned LLMs like InstructGPT (Ouyang et al., 2022),
Flan-PaLM (Chung et al., 2022), Flan-T5 (Chung et al., 2022), BLOOMZ (Muennighoff et al., 2022), Alpaca (Taori
et al., 2023) and Vicuna (Chiang et al., 2023) are designed to follow user instructions. RLHF (Reinforcement Learning
from Human Feedback) LLMs, such as ChatGPT1 and GPT-4 (OpenAI, 2023a), incorporate reinforcement learning
techniques to optimize model performance based on human feedback. However, in some scenarios, LLMs lack
domain-specific knowledge to perform relevant tasks. Our proposed paradigm, KSL, teaches LLMs themselves to
search for knowledge from external knowledge bases to help LLMs achieve goals.

Knowledge Base Question Answering. Question answering over knowledge base (KBQA) focuses on enabling
machines to answer questions using relevant knowledge retrieved from knowledge bases (KBs). Approaches in KBQA
can be broadly categorized into two groups: (i) text retrieval-based methods and (ii) Knowledge Graph-based methods.
Our research aligns with the second group, with an emphasis on integrating Knowledge Graphs into LLMs.
Text retrieval-based methods have been applied to a wide range of NLP tasks. Generative models, augmented
with retrieval capabilities in question answering, are studied (and finetuned) in (Min et al., 2020; Lewis et al., 2020b;
Izacard and Grave, 2020). Rather than directly finetuning pretrained LMs to enhance language task performance, a
growing number of researchers are moving towards lighter-weight approaches, where they freeze model parameters
1 https://openai.com/blog/chatgpt/


Algorithm 1 Knowledge Solver Zero-Shot Reasoning.


Require: Question entities Vq = {vq1 , vq2 , · · · , vqn }; corresponding answer entities Va = {va1 , va2 , · · · , van }.
1: function REL _ EXTR(vh , Gsub )
2: tail_relation_list = []
3: for each tail entity vti of vh in Gsub do
4: relation rhti = Gsub (vh , vti )
5: tail_relation_list.append((vti , rhti ))
6: end for
7: return tail_relation_list
8: end function
9: retrieve subgraph Gsub given Vq and Va
10: vq is randomly selected from Vq as vh1
11: round = 0
12: for each head entity vhi do
13: if vhi ∈ Va then
14: break
15: end if
16: if round == round_maximum then
17: break
18: end if
19: tail_relation_list = REL_EXTR(vhi , Gsub )
20: vh(i+1) = LLM(tail_relation_list)
21: round += 1
22: end for
23: return vhi

and augment the model with small trainable modules. Such lightweight finetuning techniques include adapter tun-
ing (Houlsby et al., 2019; Lin et al., 2020), prompt tuning (Lester et al., 2021), prefix tuning (Li and Liang, 2021),
and more complex architectures like input-dependent prompt tuning, frozen readers, and LM recursion as presented in
(Levine et al., 2022).
Knowledge Graph-based methods are also widely applied in the question answering domain. KagNet (Lin et al.,
2019) constructs schema graphs representing paths between question and answer entities, which are then encoded
with GCN-LSTM-HPA architecture. To achieve both high accuracy and effective model scalability, Multi-hop Graph
Relation Network (Feng et al., 2020) combines path-based reasoning interpretability with GNN scalability, adding a
structured relational attention mechanism. Distinctly, QA-GNN (Yasunaga et al., 2021) links QA context vectors to
topic entities in the schema graph. DRAGON (Yasunaga et al., 2022) proposes a self-supervised model for bidirectional
text and KG integration, while GreaseLM (Zhang et al., 2022) fuses PLMs and GNN representations through layered
modality interactions. Unlike prior works training additional modules like GNNs, our method KSL encourages LLMs
to search for essential knowledge from external knowledge bases by themselves.

3 Problem Definition

Our paper aims to help LLMs perform better on knowledge-required tasks when they lack domain-specific knowledge.
We choose question answering as the evaluated knowledge-required task. To mitigate the issue of lacking knowledge,
we inspire LLMs to interact with provided external knowledge and spontaneously identify the appropriate pathway
to derive the correct answer. Following prior work (Yasunaga et al., 2021), we define the Knowledge Graph as a
multi-relational graph G = (V, E). Here V is the set of entity nodes in the KG, and E ⊆ V × R × V is the set of edges that connect nodes in V, where R represents the set of relation types.

Given a question and answer choices pair [q, A], we link entities mentioned in the question and answer choices to the given KG G, following prior work (Feng et al., 2020). We denote the set of all question entities as Vq ⊆ V and the set of answer entities as Va ⊆ V. Then we retrieve a subgraph Gsub = (Vsub^{q,a}, Esub^{q,a}) from the KG G, where Gsub contains all nodes on the k-hop paths between nodes in Vq and Va.
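The subgraph retrieval step can be pictured with a short sketch. This is not the authors' code; it assumes the KG has been loaded as a networkx graph whose edges carry a "relation" attribute and that entity linking has already mapped the question and answer choices to node ids.

```python
# Minimal sketch of k-hop subgraph retrieval, assuming a networkx KG whose edges
# store their relation type under the "relation" attribute (illustrative only).
import itertools
import networkx as nx

def retrieve_subgraph(kg: nx.Graph, v_q, v_a, k: int = 2) -> nx.Graph:
    """Return G_sub: the subgraph induced by all nodes lying on paths of length
    at most k between any question entity in v_q and any answer entity in v_a."""
    keep = {v for v in set(v_q) | set(v_a) if v in kg}
    for source, target in itertools.product(v_q, v_a):
        if source not in kg or target not in kg:
            continue  # a linked entity may be missing from the KG
        for path in nx.all_simple_paths(kg, source=source, target=target, cutoff=k):
            keep.update(path)
    return kg.subgraph(keep).copy()
```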

4 Method

As shown in Figure 2, our method KSL first retrieves a relevant subgraph Gsub from the KG for a given question and answer choices pair [q, A]. Then we encode Gsub into a text prompt TK to inject knowledge into LLMs, which initializes a dialogue-like inference process that encourages LLMs to search for necessary knowledge by utilizing their own abilities and guide themselves to achieve the final goal.


Algorithm 2 Generating Training Instruction Dataset.


Require: A sequence of all question answer choices pairs Q = {[q1 , A1 ], · · · , [qN , AN ]}; structured knowledge source (Knowledge
Graph) G; encoder E to transform Gsub into text prompt TK
1: total_paths = []
2: for each [qj , Aj ] in Q do
3: extract question and answer choices entities Vq and Va
4: retrieve subgraph Gsub from G
5: for each question entity vqi ∈ Vq do
6: randomly select correct answer choice entity vca
7: path = find_shortest_path(Gsub , source=vqi , target=vca )
8: total_paths.append(path)
9: remove all nodes on the path except for vca from Gsub
10: end for
11: end for
12: training_data = []
13: for each path pi in total_paths do
14: hist = []
15: for each node nj in pi except for the last node do
16: instance = {}
17: instance[“instruction"] = instruction
18: head entity vhj = nj
19: tail_entity vtj = entity_extract(Gsub , vhj )
20: relation rhtj = relation_extract(Gsub , vhj , vtj )
21: instance[“input"] = E(vhj , hist)
22: instance[“output"] = E(vtj )
23: training_data.append(instance)
24: hist.append([instance[“input"], instance[“output"]])
25: end for
26: end for
27: return training_data

4.1 Knowledge Solver Zero-Shot Reasoning

In order to help models perform tasks that require domain-specific knowledge, like question answering, we inject
external knowledge into LLMs. For each retrieved subgraph Gsub , we transform it into text prompt TK fed into LLMs,
and utilize LLMs’ strong generalizability to incentivize them to search for necessary information by themselves.
Given the question q and the set of answer choices A = [a1 , ..., aN ], where N is the total number of answer choices, we
retrieve Gsub and view it as external knowledge. The Gsub contains all question entities Vq , all answer choice entities
Va , intermediate entities, and corresponding relations R between entities. To initialize the reasoning process of LLMs
for question answering, we first randomly select a question entity vq ∈ Vq for LLMs, and then encourage LLMs to
choose a path based on their own judgment until they finally reach one of the answer entities va ∈ Va . Concretely, we
can break down the reasoning process of LLMs for question answering into several rounds like CoT (Wei et al., 2022)
(the total number of rounds depends on LLMs’ own judgment. In practice, we set the limit of rounds to Nr ). For each
question and answer choices pair [q, A], the chain of rounds would form an explicit reasoning path, which not only
augments LLMs with domain-specific external knowledge, but also increases LLMs’ explainability.
During each round, we put the current head entity vh and all linked tail entities Vt = [vt1 , ..., vtN ] and their
corresponding relations Rht = [rht1 , ..., rhtN ] in the text prompt to inform LLMs of the existence of external
knowledge. LLMs will pick the most likely tail entity as the head entity for the next round, based on the prior knowledge
implicitly stored in their parameters and explicit external knowledge in the form of text prompts, like relations, for
question answering. Then, this entity selection process will repeat until one of the answer entities va is chosen.
Ultimately, we find the LLMs’ selected answer choice based on the mapping between answer entity va and answer
choice a. The whole reasoning process is purely done by text generation instead of classification over predefined
entities since in many scenarios, we are not able to access the logits of LLMs. For each round, the input text prompt
also includes the whole history of entity selection, similar to dialogue. The overall reasoning process is also illustrated
in Algorithm 1.
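For concreteness, the round-by-round interaction of Algorithm 1 can be sketched as follows. The prompt wording paraphrases Figure 2, `llm` stands for any text-in/text-out model call, and the relation lookup assumes the subgraph produced by the retrieval step above; this is a sketch under those assumptions, not the released implementation.

```python
# Sketch of the interactive zero-shot reasoning loop (Algorithm 1); `llm` is any
# callable mapping a prompt string to a completion string.
import random

def ksl_zero_shot(llm, g_sub, question, question_entities, answer_entities, n_rounds=5):
    head = random.choice([v for v in question_entities if v in g_sub])
    history = []
    for _ in range(n_rounds):
        if head in answer_entities:
            break  # reached an answer entity
        # local KG information around the current head entity
        candidates = [(t, g_sub[head][t]["relation"]) for t in g_sub.neighbors(head)]
        options = ", ".join(f"{t}({r})" for t, r in candidates)
        prompt = (
            f"The question is: {question} "
            f"The answer entities are: {sorted(answer_entities)}. "
            + " ".join(history)
            + f" Given a head entity {head}, please pick the next entity: [{options}]."
        )
        reply = llm(prompt)                   # e.g. "The next entity is condition."
        history.append(prompt + " " + reply)  # dialogue-style history of the chosen path
        chosen = next((t for t, _ in candidates if t in reply), None)
        if chosen is None:
            break  # the model picked nothing from the local neighborhood
        head = chosen
    return head  # mapped back to an answer choice via the entity-to-choice mapping
```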

4.2 Knowledge Solver Finetuning

When LLMs are accessible, we can finetune them on external knowledge to transform this knowledge into LLMs’
parameters. Following Alpaca (Taori et al., 2023), we leverage instruction tuning (Wei et al., 2021) to finetune LLMs.
To be specific, we use a template similar to that of Alpaca (Taori et al., 2023). Different from general instruction tuning (Wei et al., 2021), where LLMs are trained to follow users' instructions in a zero-shot manner, our main goal is to encourage


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

Instruction
Given a question and an answer entity list, our goal is to
choose the subsequent entity based on their relations
shown in the brackets () from the provided entity list, until
we reach the correct answer entity.

Input
The question is: What does playing soccer for a long time
lead to? The answer entities are: ['excitement', 'fatigue',
'anger', 'hurt', 'hurting', 'get', 'get_tired', 'getting',
'getting_tired', 'tired']. Given a head entity lead, please pick
the next entity: [action(is related to), run(is related to),
compete(has subevent) …]

Output
The next entity is run.

Figure 3: Training example. Instance in our constructed instruction tuning dataset.

LLMs to learn domain-specific knowledge. Thus, we fix instructions, which are used to inform LLMs to select the
correct path, across all instances (in reality, the instructions can be modified according to domain-specific knowledge).
The input and response formats are the same as we stated in Knowledge Solver Zero-Shot Reasoning, where we
transform each retrieved subgraph Gsub into multiple input-response pairs starting from question entity vq to answer
entity va in the correct answer choice. Each input contains entity selection history like the dialogue between the user and
LLMs, the current head entity, all connected tail entities, and corresponding relations. The response includes the next
tail entity of the correct path. Concretely, for each question and answer choices pair [q, A], we iterate over all question
entities vq ∈ Vq while keeping all extracted paths separated. The whole process of constructing instruction-tuning
dataset is also illustrated in Algorithm 2. The example of our instruction tuning dataset can be seen in Figure 3. We
utilize LoRA (Hu et al., 2021) to tune LLMs since it can help greatly reduce GPU memory burden.
For inference, finetuned KSL uses the same way as zero-shot Knowledge Solver. For each question and answer choices
pair [q, A], we randomly select a question entity vq to initialize the reasoning process. We leave averaging results of all
question entities for future research.
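A rough sketch of the dataset construction in Algorithm 2 is given below. It assumes helper functions `encode_input` / `encode_output` that render prompts in the Figure 3 format, takes shortest paths from the retrieved subgraph with networkx, and omits the node-removal step of Algorithm 2 for brevity; it is illustrative, not the authors' released code.

```python
# Sketch of building Alpaca-style {"instruction", "input", "output"} instances
# from a retrieved subgraph, following Algorithm 2.
import networkx as nx

INSTRUCTION = ("Given a question and an answer entity list, our goal is to choose "
               "the subsequent entity based on their relations shown in the brackets () "
               "from the provided entity list, until we reach the correct answer entity.")

def build_instances(g_sub, question_entities, correct_answer_entity, encode_input, encode_output):
    instances = []
    for v_q in question_entities:
        try:
            path = nx.shortest_path(g_sub, source=v_q, target=correct_answer_entity)
        except (nx.NodeNotFound, nx.NetworkXNoPath):
            continue  # no usable path for this question entity
        history = []
        for head, tail in zip(path[:-1], path[1:]):
            instance = {
                "instruction": INSTRUCTION,
                "input": encode_input(g_sub, head, history),  # head, neighbors, relations, history
                "output": encode_output(tail),                # "The next entity is <tail>."
            }
            instances.append(instance)
            history.append((instance["input"], instance["output"]))
    return instances
```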

5 Experiment

5.1 Datasets

We evaluate our approach Knowledge Solver on three question-answering datasets: CommonsenseQA (Talmor et al.,
2018), OpenbookQA (Mihaylov et al., 2018), and MedQA-USMLE (Jin et al., 2021).

CommonsenseQA is a question-answering dataset for commonsense reasoning, comprising a total of 12102 questions.
The methodology for question generation involves sampling three target concepts related to a source concept from
ConceptNet (Speer et al., 2017). Each question has five choices. Three of these are authored by crowd workers based on
the target concepts, with an additional two serving as distractors. CommonsenseQA serves as one of the most common
benchmark datasets for KGQA, as shown in (Lin et al., 2019; Lv et al., 2020; Feng et al., 2020; Yasunaga et al., 2021,
2022). Our paper preprocesses data with the original data splits in KagNet (Lin et al., 2019).


Models CSQA OBQA MedQA


GPT-3.5 (zero-shot) 72.9 74.8 55.8
GPT-3.5 + KSL (zero-shot) 79.6 (+9.19%) 81.6 (+9.09%) 58.4 (+4.66%)
LLaMA-7B (zero-shot) 20.5 26.8 22.7
LLaMA-7B + KSL (zero-shot) 28.4 (+38.54%) 34.0 (+26.87%) 23.6 (+3.96%)
LLaMA2-7B (zero-shot) 19.7 25.6 25.1
LLaMA2-7B + KSL (zero-shot) 26.3 (+33.50%) 32.2 (+25.78%) 25.8 (+2.79%)
LLaMA-7B (finetuned) 38.0 29.8 25.0
LLaMA-7B + KSL (finetuned) 47.4 (+24.74%) 45.8 (+53.69%) 25.7 (+2.80%)

Table 1: Performance Evaluation. We report the accuracy of LLM baselines and (zero-shot and finetuned) KSL on
three datasets: CommonsenseQA, OpenBookQA, and MedQA-USMLE.

OpenbookQA contains approximately 6000 multiple-choice questions and an open book of over 1000 elementary-
level science facts. The question-answering process requires a combination of scientific facts, commonsense knowledge,
and multi-hop reasoning abilities. Our paper follows the original data splits (Mihaylov et al., 2018).

MedQA is a multilingual dataset designed for solving real-world medical problems. All questions and answers are
gathered from professional medical board exams. In our paper, we focus on the USMLE subset, where data is from the
National Medical Board Examination in the USA, and follow the original data splits (Jin et al., 2021).

5.2 Knowledge Graphs

ConceptNet (Speer et al., 2017) is used for CommonsenseQA and OpenbookQA. It links words and phrases from common human language via labeled relationships. We adopt the relation setup from MHGRN (Feng et al., 2020), which includes a total of 34 multi-directional relation types. The paths between all topic entities mentioned in the question-answer pair are found and grounded as the subgraphs.
In the context of the USMLE dataset of MedQA, we incorporate the Knowledge Graph constructed in QA-GNN (Ya-
sunaga et al., 2021), which contains biomedical vocabularies from Unified Medical Language System (UMLS) (Boden-
reider, 2004) and DrugBank (Wishart et al., 2018).
Given each question and answer choices pair [q, A], we retrieve subgraph Gsub from structured Knowledge Graph G
following the preprocessing step described in MHGRN (Feng et al., 2020), with hop size k = 2.

5.3 Implementation & training details

Zero-shot. We mainly use three LLMs (GPT-3.5, LLaMA-7B (Touvron et al., 2023a), and LLaMA 2-7B (Touvron
et al., 2023b)) as baselines. For GPT-3.5, we call OpenAI API to use gpt-3.5-turbo-16k. The limit of the total number
of rounds Nr is set to 5 during inference.
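As an illustration of how a single round could be issued against the API (the exact client code is not given in the paper), the call below uses the pre-1.0 `openai` Python package with gpt-3.5-turbo-16k; the message content is an assumption that follows the Figure 2 prompts.

```python
# Hypothetical one-round call to gpt-3.5-turbo-16k (pre-1.0 openai client assumed).
import openai

N_ROUNDS = 5  # round limit N_r used during inference

def llm_round(messages):
    # `messages` holds the dialogue so far, e.g.
    # [{"role": "user", "content": "... Given a head entity contract, please pick the next entity: [...]"}]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=messages,
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```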

Finetuning. We use LoRA (Hu et al., 2021) to finetune LLaMA-7B (Touvron et al., 2023a) on 8 NVIDIA A40 GPUs, each with 48 GB of memory. For CommonsenseQA (Talmor et al., 2018), the training set contains 114,552 instances and
the development set consists of 14,391 instances. For OpenbookQA (Mihaylov et al., 2018), the training set includes
57,458 instances and the development set contains 5814 examples. For MedQA-USMLE (Jin et al., 2021), there are
13,561 instances in the training set and 1677 instances in the development set. The global batch size is 128 and the learning rate is set to 3e-4. We set the rank r in LoRA (Hu et al., 2021) to 16 and α to 16. The dropout probability (Srivastava
et al., 2014) is 0.05. We finetune query, key, value, and output projection matrices Wq , Wk , Wv , Wo in self-attention
modules of transformers (Vaswani et al., 2017). The maximum input sequence length is 1152. The total number of
finetuning epochs for CommonsenseQA (Talmor et al., 2018) and OpenbookQA (Mihaylov et al., 2018) is 3, and for
MedQA-USMLE (Jin et al., 2021) is 5. We use checkpoints with the lowest validation loss for final inference on test
sets.
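These hyperparameters translate into a LoRA configuration roughly like the following, written with the Hugging Face peft library. The LLaMA module names (q_proj, k_proj, v_proj, o_proj) and the checkpoint path are assumptions, since the paper does not list its exact training script.

```python
# Hedged sketch of the reported LoRA setup using Hugging Face peft/transformers.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # hypothetical path
lora_config = LoraConfig(
    r=16,                       # rank r = 16
    lora_alpha=16,              # alpha = 16
    lora_dropout=0.05,          # dropout probability 0.05
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # W_q, W_k, W_v, W_o
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Remaining settings from the text: learning rate 3e-4, global batch size 128,
# maximum input length 1152, 3 epochs (CSQA/OBQA) or 5 epochs (MedQA-USMLE).
```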

Evaluation metric. For three question-answering datasets: CommonsenseQA (Talmor et al., 2018), OpenbookQA (Mi-
haylov et al., 2018), and MedQA-USMLE (Jin et al., 2021), we use accuracy as evaluation metric following prior
work (Yasunaga et al., 2021). However, since we only perform text generation instead of classification over a predefined set, it is hard to calculate accuracy in the traditional way. Instead, we call the OpenAI API and input hand-crafted prompts (see details in the supplementary) to GPT-4 (OpenAI, 2023b) to judge whether the LLMs' generation matches the ground


Question: Where are a lot of offices in New York? (A. school building, B. skyscraper, C. business, D. grocery store, E. work)
GPT-3.5 (zero-shot): The answer is C.
KSL (zero-shot): offices --AtLocation--> skyscraper

Question: What would you use to find a company? (A. market place, B. internet, C. yellow pages, D. phone book, E. armed forces)
GPT-3.5 (zero-shot): The answer is B.
KSL (zero-shot): find --UsedFor*--> telephone_directory --RelatedTo--> yellow_pages

Question: What causes someone to stop driving immediately? (A. traffic jams, B. wheels turning, C. lack of fuel, D. illness, E. tire wear)
GPT-3.5 (zero-shot): The answer is D.
KSL (zero-shot): stop --RelatedTo--> driving --Causes--> lack_of_fuel

Figure 4: Qualitative Results of KSL (GPT-3.5). Generated responses on some examples of GPT-3.5 and zero-shot
KSL (GPT-3.5). The bold choice represents the correct answer. An asterisk (*) denotes a reversed relation.

truth. In the end, we use the score from GPT-4 (OpenAI, 2023b) to calculate accuracy (0 means the LLM's output is totally irrelevant, while 1 means the LLM's generated answer correctly matches the ground truth).
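A minimal sketch of this GPT-4-based scoring is shown below. The actual hand-crafted judging prompt is in the supplementary, so the wording here is a hypothetical stand-in, again using the pre-1.0 `openai` client.

```python
# Hypothetical GPT-4 judge for matching generated answers against ground truth.
import openai

def judge(question, generated_answer, ground_truth):
    prompt = (
        f"Question: {question}\n"
        f"Model answer: {generated_answer}\n"
        f"Ground truth: {ground_truth}\n"
        "Reply with 1 if the model answer matches the ground truth, otherwise reply with 0."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response["choices"][0]["message"]["content"].strip()[0])

# accuracy = sum(judge(q, a, gt) for q, a, gt in eval_set) / len(eval_set)
```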

5.4 Result

Question: Where would a human expect to find manufacturing operations? (A. factory, B. school, C. grocery store, D. band, E. aircraft)
LLaMA-7B (zero-shot): The chosen option is: C. grocery store
KSL (zero-shot): manufacturing --RelatedTo--> factory
KSL (finetuned): manufacturing --RelatedTo--> factory

Question: The team was able to communicate effectively, they knew what each other would what? (A. send email, B. talk with people, C. ring up the president of bermuda, D. think, E. speak to)
LLaMA-7B (zero-shot): The chosen option is: A. send email
KSL (zero-shot): team --RelatedTo--> work --RelatedTo--> think
KSL (finetuned): able --RelatedTo--> do --HasSubevent*--> think

Question: In what region is a centavo uses? (A. colon, B. austral, C. cordoba, D. indian, E. mexican peso)
LLaMA-7B (zero-shot): The chosen option is: A. colon
KSL (zero-shot): centavo --RelatedTo--> peso
KSL (finetuned): region --IsA*--> south --RelatedTo--> austral

Figure 5: Qualitative Results of KSL (LLaMA-7B). Generated responses on some examples of LLaMA-7B and
zero-shot/finetuned KSL (LLaMA-7B). The bold choice represents the correct answer. An asterisk (*) denotes a
reversed relation.

Knowledge Solver zero-shot reasoning. As shown in Table 1, our Knowledge Solver (KSL) can boost LLM baselines
(GPT-3.5, LLaMA-7B (Touvron et al., 2023a), and LLaMA 2-7B (Touvron et al., 2023b)) by a relatively large margin,
indicating that: 1) our approach can benefit models in performing knowledge-required tasks; 2) LLMs possess a certain ability to search for necessary information by themselves when external knowledge is provided. Training an adapter for each scenario where domain-specific knowledge is required would incur large computational and time costs. In contrast,
our zero-shot Knowledge Solver can harness LLMs’ own emergent ability to perform domain knowledge-required tasks
by only providing external knowledge. This teaches LLMs to interact with external knowledge to achieve final goals.


[Figure 6 bar chart: accuracy on CommonsenseQA, OpenBookQA, and MedQA-USMLE for LLaMA (zero-shot), Alpaca-LoRA, KSL (zero-shot), and KSL (finetuned).]
Figure 6: Ablation Experiments on Finetuned KSL (LLaMA-7B). We compare our KSL with LLaMA and Alpaca-
LoRA.

Knowledge Solver finetuning. Unlike training separate adapters like GNNs, our approach can also finetune LLMs on
provided external knowledge to inject knowledge into LLMs’ parameters. As shown in Table 1, finetuned KSL (LLaMA-
7B) can improve performance further and surpass finetuned LLaMA-7B (see finetuning details in the supplementary) on three
datasets. This suggests that our method can effectively help LLMs memorize knowledge to perform domain-specific
knowledge-required tasks when the computational burden is affordable. Interestingly, the improvement on MedQA-
USMLE (Jin et al., 2021) is not as substantial as on CommonsenseQA (Talmor et al., 2018) and OpenBookQA (Mihaylov
et al., 2018). The problem might be due to the fact that Knowledge Graph (Yasunaga et al., 2021) is not large enough,
where for many question and answer choices pairs, it is difficult to retrieve complete subgraphs. In many cases, several
answer entities are not included in subgraphs or there is no path from question entities to answer entities.

5.5 Qualitative result

We show some qualitative results in Figure 4 and Figure 5. It shows that our zero-shot KSL can help LLMs perform
knowledge-required tasks without any additional training. Provided with external knowledge, LLMs can look up
necessary knowledge to achieve final goals by themselves. Our approach can help LLMs correct their mistakes when
they lack relevant domain-specific knowledge. For example, vanilla LLaMA-7B (Touvron et al., 2023a) doesn’t know
where the manufacturing operations can be found while zero-shot KSL (LLaMA-7B) can correctly answer the question.
Finetuned KSL (LLaMA-7B) can further improve LLMs’ ability to solve knowledge-required tasks like answering the
question “in what region is a centavo uses?". These demonstrate the effectiveness of KSL.

5.6 Ablation study

As shown in Table 1, finetuned KSL (LLaMA-7B) can improve performance substantially. In order to investigate
whether this boost mainly comes from instruction tuning itself or our specially constructed knowledge datasets, we
also evaluate Alpaca-LoRA (Taori et al., 2023) on CommonsenseQA (Talmor et al., 2018), OpenBookQA (Mihaylov
et al., 2018), and MedQA-USMLE (Jin et al., 2021), using the same inference method as vanilla LLaMA-7B (Touvron et al., 2023a). It is worth noting that Alpaca-LoRA's maximum sequence length is 512, while for our interactive inference method, the input sequence length is generally longer than 512. As shown in Figure 6, Alpaca-LoRA, which uses the same LoRA (Hu et al., 2021) technique to tune LLaMA-7B (Touvron et al., 2023a), works on par with vanilla LLaMA-7B (Touvron et al., 2023a), suggesting that our specially designed knowledge dataset is the main source of benefit for LLMs on knowledge-required tasks. Alpaca-LoRA (Taori et al., 2023) also underperforms our zero-shot KSL (LLaMA-7B), indicating that encouraging LLMs to search for relevant knowledge by harnessing their own abilities is an effective and efficient way to help models on knowledge-required tasks.

6 Conclusion
In this paper, we propose Knowledge Solver (KSL), which can help LLMs perform better on domain-specific knowledge-required tasks in both zero-shot and finetuning settings. Provided with external knowledge, LLMs can harness their own


ability to search for necessary knowledge and information to perform relevant tasks without additional training or
modules. Our interactive inference method can not only explicitly inject knowledge into LLMs but also guide LLMs to
solve tasks. We also demonstrate that the performance improvement comes mainly from our specially designed inference method (for zero-shot) and task (for finetuning) rather than from instruction tuning itself. Currently, the initial question entity for our interactive inference method is randomly chosen; we leave the choice of the first entity for future research.

References
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., et al. (2023).
A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv
preprint arXiv:2302.04023.
Bodenreider, O. (2004). The unified medical language system (umls): integrating biomedical terminology. Nucleic
acids research, 32(suppl_1):D267–D270.
Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche, G. B., Lespiau,
J.-B., Damoc, B., Clark, A., et al. (2022). Improving language models by retrieving from trillions of tokens. In
International conference on machine learning, pages 2206–2240. PMLR.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems,
33:1877–1901.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica,
I., and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C.,
Gehrmann, S., et al. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al.
(2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805.
Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., and Tang, J. (2022). GLM: general language model pretraining
with autoregressive blank infilling. pages 320–335.
Feng, Y., Chen, X., Lin, B. Y., Wang, P., Yan, J., and Ren, X. (2020). Scalable multi-hop relational reasoning for
knowledge-aware question answering. arXiv preprint arXiv:2005.00646.
Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. (2020). Retrieval augmented language model pre-training. In
International conference on machine learning, pages 3929–3938. PMLR.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly,
S. (2019). Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages
2790–2799. PMLR.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). Lora: Low-rank
adaptation of large language models. arXiv preprint arXiv:2106.09685.
Izacard, G. and Grave, E. (2020). Leveraging passage retrieval with generative models for open domain question
answering. arXiv preprint arXiv:2007.01282.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey of
hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., and Szolovits, P. (2021). What disease does this patient have? a
large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
Koehn, P. and Knowles, R. (2017). Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.


Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2019). Albert: A lite bert for self-supervised
learning of language representations. arXiv preprint arXiv:1909.11942.
Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv
preprint arXiv:2104.08691.
Levine, Y., Dalmedigos, I., Ram, O., Zeldes, Y., Jannai, D., Muhlgay, D., Osin, Y., Lieber, O., Lenz, B., Shalev-Shwartz,
S., et al. (2022). Standing on the shoulders of giant frozen language models. arXiv preprint arXiv:2204.10019.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T.,
et al. (2020a). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information
Processing Systems, 33:9459–9474.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T.,
et al. (2020b). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information
Processing Systems, 33:9459–9474.
Li, X. L. and Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint
arXiv:2101.00190.
Lin, B. Y., Chen, X., Chen, J., and Ren, X. (2019). Kagnet: Knowledge-aware graph networks for commonsense
reasoning. arXiv preprint arXiv:1909.02151.
Lin, Z., Madotto, A., and Fung, P. (2020). Exploring versatile generative language model via parameter-efficient transfer
learning. arXiv preprint arXiv:2004.03829.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019).
Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Lv, S., Guo, D., Xu, J., Tang, D., Duan, N., Gong, M., Shou, L., Jiang, D., Cao, G., and Hu, S. (2020). Graph-based
reasoning over heterogeneous external knowledge for commonsense question answering. In Proceedings of the AAAI
conference on artificial intelligence, volume 34, pages 8449–8456.
Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. (2020). On faithfulness and factuality in abstractive summariza-
tion. arXiv preprint arXiv:2005.00661.
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018). Can a suit of armor conduct electricity? a new dataset for
open book question answering. arXiv preprint arXiv:1809.02789.
Min, S., Michael, J., Hajishirzi, H., and Zettlemoyer, L. (2020). Ambigqa: Answering ambiguous open-domain
questions. arXiv preprint arXiv:2004.10645.
Moslem, Y., Haque, R., and Way, A. (2023). Adaptive machine translation with large language models. arXiv preprint
arXiv:2301.13294.
Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Scao, T. L., Bari, M. S., Shen, S., Yong,
Z.-X., Schoelkopf, H., et al. (2022). Crosslingual generalization through multitask finetuning. arXiv preprint
arXiv:2211.01786.
OpenAI (2023a). Gpt-4 technical report.
OpenAI (2023b). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A.,
et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information
Processing Systems, 35:27730–27744.
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., GV, K. K.,
et al. (2023a). Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048.
Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., et al. (2023b). Check
your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv
preprint arXiv:2302.12813.


Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., and Yang, D. (2023). Is chatgpt a general-purpose natural
language processing task solver? arXiv preprint arXiv:2302.06476.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative
pre-training.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020). Exploring
the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research,
21(1):5485–5551.
Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., and Shoham, Y. (2023). In-context
retrieval-augmented language models. arXiv preprint arXiv:2302.00083.
Raunak, V., Menezes, A., and Junczys-Dowmunt, M. (2021). The curious case of hallucinations in neural machine
translation. arXiv preprint arXiv:2104.06683.
Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, K. (2018). Object hallucination in image captioning.
arXiv preprint arXiv:1809.02156.
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al.
(2022). Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., Clark, K., Pfohl, S., Cole-Lewis, H., Neal,
D., et al. (2023). Towards expert-level medical question answering with large language models. arXiv preprint
arXiv:2305.09617.
Speer, R., Chin, J., and Havasi, C. (2017). Conceptnet 5.5: An open multilingual graph of general knowledge. In
Proceedings of the AAAI conference on artificial intelligence, volume 31.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to
prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.
Sun, T., Zhang, X., He, Z., Li, P., Cheng, Q., Yan, H., Liu, X., Shao, Y., Tang, Q., Zhao, X., Chen, K., Zheng, Y., Zhou,
Z., Li, R., Zhan, J., Zhou, Y., Li, L., Yang, X., Wu, L., Yin, Z., Huang, X., and Qiu, X. (2023). Moss: Training
conversational language models from synthetic data.
Talmor, A., Herzig, J., Lourie, N., and Berant, J. (2018). Commonsenseqa: A question answering challenge targeting
commonsense knowledge. arXiv preprint arXiv:1811.00937.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Stanford
alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E.,
Azhar, F., et al. (2023a). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Touvron, H., Martin, L., Stone, K. R., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P.,
Bhosale, S., Bikel, D. M., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W.,
Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A. S., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V.,
Khabsa, M., Kloumann, I. M., Korenev, A. V., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y.,
Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi,
K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X., Tang, B., Taylor, R., Williams, A., Kuan, J. X.,
Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and
Scialom, T. (2023b). Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017).
Attention is all you need. Advances in neural information processing systems, 30.
Vinyals, O. and Le, Q. (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.
Wang, B. (2021). Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX.
https://github.com/kingoflolz/mesh-transformer-jax.
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. (2021). Finetuned
language models are zero-shot learners. arXiv preprint arXiv:2109.01652.


Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-
thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems,
35:24824–24837.
Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda,
Z., et al. (2018). Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research,
46(D1):D1074–D1082.
Yang, X., Li, Y., Zhang, X., Chen, H., and Cheng, W. (2023). Exploring the limits of chatgpt for query or aspect-based
text summarization. arXiv preprint arXiv:2302.08081.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. (2019). Xlnet: Generalized autoregressive
pretraining for language understanding. Advances in neural information processing systems, 32.
Yasunaga, M., Bosselut, A., Ren, H., Zhang, X., Manning, C. D., Liang, P. S., and Leskovec, J. (2022). Deep
bidirectional language-knowledge graph pretraining. Advances in Neural Information Processing Systems, 35:37309–
37323.
Yasunaga, M., Ren, H., Bosselut, A., Liang, P., and Leskovec, J. (2021). Qa-gnn: Reasoning with language models and
knowledge graphs for question answering. arXiv preprint arXiv:2104.06378.
Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. (2022). Glm-130b:
An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., and Hashimoto, T. B. (2023). Benchmarking large language
models for news summarization. arXiv preprint arXiv:2301.13848.
Zhang, X., Bosselut, A., Yasunaga, M., Ren, H., Liang, P., Manning, C. D., and Leskovec, J. (2022). Greaselm: Graph
reasoning enhanced language models for question answering. arXiv preprint arXiv:2201.08860.

