
Can Large Language Models Unlock Novel Scientific Research Ideas?

Sandeep Kumar†, Tirthankar Ghosal‡, Vinayak Goyal†, Asif Ekbal†


†Indian Institute of Technology Patna, India
‡National Center for Computational Sciences, Oak Ridge National Laboratory, USA
†(sandeep_2121cs29,2201ai52_vinayak,asif)@iitp.ac.in
[email protected]

Abstract

"An idea is nothing more nor less than a new combination of old elements" (Young, 2019). The widespread adoption of Large Language Models (LLMs) and the public availability of ChatGPT have marked a significant turning point in the integration of Artificial Intelligence (AI) into people's everyday lives. This study explores the capability of LLMs to generate novel research ideas based on information from research papers. We conduct a thorough examination of four LLMs in five domains (Chemistry, Computer Science, Economics, Medicine, and Physics). We found that the future research ideas generated by Claude-2 and GPT-4 are more aligned with the authors' perspective than those of GPT-3.5 and Gemini. We also found that Claude-2 generates more diverse future research ideas than GPT-4, GPT-3.5, and Gemini 1.0. We further performed a human evaluation of the novelty, relevancy, and feasibility of the generated future research ideas. This investigation offers insights into the evolving role of LLMs in idea generation, highlighting both their capabilities and limitations. Our work contributes to the ongoing efforts in evaluating and utilizing language models for generating future research ideas. We make our datasets and code publicly available.¹

¹https://github.com/sandeep82945/Future-Idea-Generation

"Innovation is seeing what everybody has seen and thinking what nobody has thought" —Dr. Albert Szent-Györgyi

1 Introduction

[Figure 1: Large language model suggesting future research ideas after reading a research paper]

An idea can be defined as a thought or suggestion aimed at solving a problem or considering a possibility. This concept is central to fields ranging from philosophy to science and economics. According to Plato et al. (2000), ideas are archetypal forms that represent the most accurate reality. In the context of scientific research, Kuhn and Hawkins (1963), in "The Structure of Scientific Revolutions", describe an idea as a realization or hypothesis that can challenge and shift paradigms within a scientific community. An idea can therefore be understood as a cognitive construct that arises from the human mind's ability to process information, reflect, and imagine, serving as a cornerstone for creativity, problem-solving, and innovation. Idea generation, in turn, can be understood as a state of focused, internally directed attention involving controlled semantic retrieval (Benedek et al., 2014).

As technology improves, new capabilities emerge. Ever since the Turing Test was proposed in the 1950s, humans have explored the mastery of language intelligence by machines (Zhao et al., 2023). Technological advancements serve two key functions in innovation: first, they influence the goals of generating and selecting ideas; second, they shape the methodology by which ideas are generated and chosen (Kornish and Hutchison-Krupat, 2017). LLMs have exhibited unparalleled mastery of natural language processing (NLP). As these models have become increasingly powerful, researchers have begun to investigate their reasoning ability in problem-solving tasks (Yao et al., 2022; Brahman et al., 2023). An idea is essentially a new combination of old elements, and LLMs have access to a broad spectrum of knowledge due to their extensive training on vast amounts of text data. However, how information extracted from a research paper can give rise to new ideas has not yet been explored much. This leads us to ponder:

Can Large Language Models read a scientific paper and suggest new research ideas or directions?
Motivated by this, in this paper we analyze the potential of LLMs in generating future research directions/ideas. As LLMs possess knowledge across various domains, we investigate five specific areas, viz. Computer Science, Physics, Chemistry, Economics, and Medicine. To address this task, we create a dataset of papers published after the year 2022 from these five domains and annotate the papers with their future research ideas. To evaluate the novelty and relevance of the ideas generated by the LLMs, we propose an Idea Alignment Score (IAScore), which reflects how well the generated ideas align with those proposed by the authors. To study the models' ability to generate diverse ideas, we propose an Idea Distinctness Index. We analyze and discuss the performance and limitations of four LLMs: Gemini (Anil et al., 2023), Claude-2 (Anthropic, 2023), GPT-3.5, and GPT-4 (OpenAI, 2023). We further conduct a human evaluation of 460 generated ideas in computer science to study the novelty, relevance, and feasibility of these research ideas. This paper demonstrates that LLMs have the potential to generate relevant, distinct, feasible, and novel ideas to some extent.

To summarize, our main contributions in this paper are:

• We contribute to the ongoing exploration of LLMs' capabilities in generating future research ideas.

• To address the task, we create a novel dataset of recent papers from five domains (Computer Science, Economics, Chemistry, Physics, and Medicine).

• To assess the quality of the ideas generated by LLMs, we propose the Idea Alignment Score and the Idea Distinctness Index to evaluate idea-generation capability.

• We discuss the challenges associated with human evaluation and conduct a human evaluation of 460 generated ideas.

We hope that this work serves as a foundation for future studies focused on accelerating scientific research by automatically generating research ideas.

2 Related Work

Recently, LLMs have shown emergent abilities to perform tasks they were not explicitly trained for (Wei et al., 2022; Bubeck et al., 2023). This includes commonsense question answering, code generation, and cross-domain problem solving, enriching their utility across unforeseen domains (Chen et al., 2021; Sarsa et al., 2022). Their capability extends to advanced scientific domains such as computer science, physics, medicine, and mathematics (Romera-Paredes et al., 2023; Huang et al., 2023). The Technology Semantic Network (TechNet) was proposed to stimulate idea generation in engineering design (Sarica et al., 2021). There have also been a few works on the discovery of new proteins to accelerate scientific discovery: the prior work reported in Spangler et al. (2014) utilizes published studies to find new protein kinases that phosphorylate the tumor suppressor protein p53.

A hypothesis is a hunch, assumption, suspicion, assertion, or idea about a phenomenon, relationship, or situation, the reality or truth of which you do not know (Kumar, 1996). There have been some works on hypothesis generation. Initial studies on automated hypothesis generation begin by constructing a corpus of distinct concepts and then explore the relationships between these concepts using machine learning techniques, such as analyzing the similarities among vectors representing different words or concepts (Tshitoyan et al., 2019), or applying link prediction methods over a graph in which concepts are nodes (Nadkarni et al., 2021). Recently, Qi et al. (2023) used LLMs and extensive pre-existing knowledge of various scientific fields for hypothesis generation. PaperRobot (Wang et al., 2019) predicts related entities for an input title and writes key elements of a new paper, including the abstract, conclusion, and future work, and predicts a new title.

Xu et al. (2023) developed a framework that leverages concept co-occurrence graphs and a masked language model to explore and verbalize academic ideas. Their method involves constructing evolving concept graphs across various disciplines and utilizing temporal link prediction to identify potential interdisciplinary connections. The framework also incorporates pre-trained language models to articulate these connections in a coherent academic context. SciMON (Wang et al., 2023) showed that LLMs can be guided by seed terms to generate specific ideas. They applied a pre-trained sentence classifier to classify sentences from the title and abstract into the categories Background, Method, and Objective; sentences labeled as problems or motivations were treated as background, and the remaining sentences were treated as target outputs. Additionally, a pre-trained entity extractor was used to extract salient seed terms.

However, previous works primarily focused on developing methods for idea generation (linking and explaining entities, which may not sufficiently capture the complexity of how LLMs can solve real-world problems), whereas our work focuses exhaustively on evaluating the capability of LLMs in generating research ideas. Our goal is to assess the inherent ability of LLMs to generate future research ideas/directions.
3 Dataset

Our dataset creation involves three steps: (1) dataset collection, (2) FRI identification and removal, and (3) FRI generation.

3.1 Dataset Collection

We construct a corpus D from S2ORC (Lo et al., 2020), collecting 100 papers from each of the domains of Computer Science, Economics, Physics, Chemistry, and Medicine. To ensure the quality and relevance of the data, and to utilize the future research ideas mentioned in a paper, the selected papers must meet the following requirements: (1) the paper must contain the full content, and (2) the paper must include a section on future work.

3.2 FRI Identification and Removal

We first identify and remove any potential research ideas mentioned in the paper. By doing this, we ensure that the LLMs have no prior access to these ideas, which could otherwise affect the objectivity of the analysis.

3.2.1 Annotation Guidelines

Inspired by Hao et al. (2020), we define a future research idea as a discussion of work that the authors believe they will conduct in the future or believe needs to be investigated in future research. We discuss more details about the annotation guidelines in Appendix A.

3.2.2 Annotator Training

Given the complexity of the papers and their frequent use of technical terminology, we hired two doctoral students, each with over four years of experience in scientific research publishing. To facilitate their training, an expert with more than ten years of experience in scientific publishing annotated 20 random papers from each domain, adhering to our guidelines. After this initial round of annotation, we reviewed and corrected any misinterpretations with the annotators, further refining their training and enhancing the clarity of our annotation guidelines. To assess the effectiveness of the initial training, we compiled another 20 papers from each domain. From the second round onwards, the annotators demonstrated improved proficiency, accurately identifying at least 95% of the future research ideas on average.

We discuss more details about the annotation process and the annotators' pay in Appendix A.

3.2.3 Future Work Removal

[Figure 2: An example of FRI editing; the struck-through text is removed from the paper text]

We observed two types of future research ideas (FRIs)²: Direct FRI and Mixed FRI. We discuss them in detail in Appendix H.

AP-FRI Corpus: We removed a sentence from the paper's input text if it pertains to a Direct FRI. In the case of a Mixed FRI, however, we did not remove the sentence entirely; instead, we eliminated only the parts of the sentence or markers indicating future research ideas. We added the removed future ideas to a corpus, which we refer to as the AP-FRI (Author Perspective Future Research Idea) Corpus. This corpus contains the future research ideas proposed by the authors of the papers. Before adding ideas to the AP-FRI corpus, we also merged sentences about the same topic into a single group.

²In this paper, we use the terms 'ideas,' 'research ideas,' 'future research ideas,' and 'FRI' interchangeably to refer to future research ideas.
3.3 FRI Generation using LLMs

We investigate various prompts and utilize the following prompt to generate FRIs for the papers:

System: You are a research scientist.
User: Imagine you are a research scientist. After reading the following paper, brainstorm to generate potential future research ideas:

[paper text]

Potential future research ideas from the paper in bullet points are:

Here, '[paper text]' contains the full content of the paper after removal of the future work sections.
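For concreteness, a minimal sketch of this generation step using the OpenAI chat-completions API is shown below. The helper function and default model name are illustrative assumptions (the paper also queries Claude-2, Gemini, and GPT-3.5 through their own APIs); the decoding settings follow the paper's reported configuration of temperature 0 and a 512-token limit.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_fris(paper_text: str, model: str = "gpt-4") -> str:
    """Generate future research ideas (FRIs) for one paper with the Section 3.3 prompt."""
    response = client.chat.completions.create(
        model=model,          # illustrative; one of the four evaluated LLMs
        temperature=0,        # paper-reported decoding settings
        max_tokens=512,
        messages=[
            {"role": "system", "content": "You are a research scientist."},
            {"role": "user", "content": (
                "Imagine you are a research scientist. After reading the "
                "following paper, brainstorm to generate potential future "
                f"research ideas:\n\n{paper_text}\n\n"
                "Potential future research ideas from the paper in bullet points are:"
            )},
        ],
    )
    return response.choices[0].message.content
```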
3.4 Data Statistics

[Figure 3: Domain vs. average number of words in a paper without the future work (FWK) section]

[Figure 4: Domain vs. average number of words in the FWK section]

Figure 3 provides a domain-wise distribution of the average word count in academic papers, excluding discussions of future work (FWK). It can be observed that the length of papers across all fields falls within a range of 7,000 to 8,000 words. Additionally, we calculated the average word count of the extracted future work within each domain, providing comparative insight into how different fields prioritize discussions of future research directions. Figure 4 compares the average word count of future work text across the scholarly domains. We observed that the literature in Computer Science notably prioritizes extensive discourse on future research, with an average word count significantly higher than that of the other disciplines. In contrast, the literature in Chemistry takes a more concise approach to discussions of future research, as evidenced by its lower average word count.

4 Experiments

4.1 Challenges

To accurately assess the novelty, relevance, and applicability of ideas generated by LLMs, evaluators must possess a high level of expertise in the specific domain and a deep understanding of the research topic to fully grasp the context. Additionally, they need knowledge of the related literature to evaluate the ideas' future potential and the broader implications of their implementation.

4.2 Idea Alignment Score (IAScore)

Given the above challenges, the evaluation of ideas generated by LLMs is a demanding process that requires a large number of domain-specific experts. We therefore propose an Idea Alignment Score (IAScore), which reflects how well the generated ideas align with those proposed by the authors. The underlying intuition is that the authors of accepted papers can be regarded as experts in their respective subjects: they possess thorough background knowledge and have conducted deep analyses of the research topic before getting the paper accepted. Consequently, they are well acquainted with the pertinent challenges, which may also have been discussed by expert reviewers. We therefore propose that the future ideas mentioned by the authors in a paper can be utilized as good-quality reference FRIs.

The IAScore quantifies the alignment of newly generated ideas with the authors' perspectives within a specific domain, and is computed via a two-step process, detailed in Equations 1 and 2.

Initially, we compute the average alignment score $\mathrm{AvgScore}_j$ for each paper's ideas. The IdeaMatcher model (IM) measures the alignment between the paper's author future research ideas ($\text{AP-FRI}_j$) and each generated idea $I_{ij}$. The subscript $i$ indexes the $i$-th idea within the $j$-th paper, where $N_j$ represents the total number of ideas for that paper.

$$\mathrm{AvgScore}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} \mathrm{IM}(\text{AP-FRI}_j,\, I_{ij}) \tag{1}$$

Here, we refer to IM as the 'IdeaMatcher'. LLMs may generate new ideas that even the authors may not have thought of, and they can also generate additional future ideas, which may or may not be useful. Our goal with this score is to check that the LLM has generated at least the authors' proposed potential future ideas. Therefore, in our formula for $\mathrm{AvgScore}_j$, the sum of the alignment scores for a paper's ideas is divided by $N_j$ to normalize the score.

$$\mathrm{IAScore}_{\mathrm{domain},\,M} = \frac{1}{P} \sum_{j=1}^{P} \mathrm{AvgScore}_j \tag{2}$$

Subsequently, we aggregate the individual paper scores to calculate the domain-wise IAScore. This aggregation, presented in Equation 2, averages the $\mathrm{AvgScore}_j$ values across all $P$ papers within the domain. A higher value of $\mathrm{IAScore}_{\mathrm{domain},\,M}$ signifies greater alignment of the ideas generated by model $M$ with the authors' perspectives across all papers in the domain.
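The two-step computation in Equations 1 and 2 can be sketched as follows; `idea_matcher` stands in for the GPT-based IdeaMatcher of Section 4.2.1, and the data layout (a list of per-paper dicts with illustrative keys) is an assumption for exposition.

```python
from typing import Callable, List

def avg_score(ap_fri: str, generated_ideas: List[str],
              idea_matcher: Callable[[str, str], float]) -> float:
    """Equation 1: mean IdeaMatcher score over one paper's generated ideas."""
    scores = [idea_matcher(ap_fri, idea) for idea in generated_ideas]
    return sum(scores) / len(scores)

def ia_score(papers: List[dict],
             idea_matcher: Callable[[str, str], float]) -> float:
    """Equation 2: domain-level IAScore for one model.

    Each paper dict holds the author-written FRIs ('ap_fri') and the
    model-generated ideas ('generated'); both key names are illustrative.
    """
    per_paper = [avg_score(p["ap_fri"], p["generated"], idea_matcher)
                 for p in papers]
    return sum(per_paper) / len(per_paper)
```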
4.2.1 IdeaMatcher

To select an effective IdeaMatcher, we created a small annotated corpus, divided into validation and test sets using a standard 30:70 ratio. Since our study involves comparing two ideas using a pre-trained model, we did not require a separate training set. We first manually searched for matching pairs of ideas between the generated ideas and the AP-FRIs of each paper. After obtaining 61 matching pairs, we collected non-matching pairs, which is straightforward because at most one generated idea will match a given idea from the AP-FRI while the others will not; we picked an equal number of non-matching pairs.

We first treated the idea-matching task as a Natural Language Inference (NLI) task: we considered the generated FRIs to be hypotheses and the corresponding AP-FRIs of the paper to be premises. If the idea matches, the hypothesis should be entailed by the premise. Specifically, we used a pre-trained RoBERTa MNLI model (Liu et al., 2019) for this task. We found that this technique produces many false negatives, resulting in an accuracy of 65.5%.

We also evaluated the idea-matching capability of BERTScore (Zhang et al., 2020), as it utilizes BERT embeddings for comparison; we discuss the details in Appendix F. We found that BERTScore performed better than the entailment technique, resulting in an accuracy of 75.4%. We also tried GPT by prompting it with various questions and found that it reached 91.8% accuracy when prompted as follows:

Prompt: Your task is to examine whether a particular idea is incorporated within a set of ideas and to what degree.
Collection of ideas: {AP-FRIs}
Single idea: {A generated idea}
Is the single idea contained within the collection of ideas?
If yes, quantify the degree of presence or relevance of the single idea in the collection of ideas on a scale from 0 to 1.

We found that GPT performs better than the NLI technique and similarity measures such as BERTScore. Therefore, we chose GPT for this task³.

³We used the OpenAI model GPT-3.5-turbo-0125 via the OpenAI API.
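As a concrete illustration of the entailment baseline above, the sketch below scores a premise-hypothesis idea pair with the pre-trained roberta-large-mnli model. Treating the entailment probability as the match score is our assumption of one reasonable decision rule, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that the AP-FRI (premise) entails the generated idea (hypothesis)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    # Label order for roberta-large-mnli: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()
```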
4.3 Idea Distinctness Index

Distinct-N (Li et al., 2015) is a metric that measures the diversity of a sentence. It focuses on the number of distinct n-grams in a sentence, and thus penalizes sentences with many repeated words. However, comparing two ideas requires semantic comparison rather than just syntactic differences, so we introduce a method to evaluate the distinctness of the generated ideas semantically. This method leverages semantic embeddings to capture the essence of each idea and computes their distinctness based on semantic similarity measures.

Given a set of generated ideas $I = \{id_1, id_2, \ldots, id_n\}$, representing individual ideas, we first encode each idea into a high-dimensional vector space using a pre-trained BERT model (Devlin et al., 2019)⁴, $\mathrm{BERT}: id_i \mapsto v_i$, where $v_i \in \mathbb{R}^d$ is the embedding of idea $id_i$ and $d$ is the dimensionality of the embedding space.

To quantify the distinctness between pairs of ideas, we compute the cosine similarity between their embeddings, $\mathrm{sim}(v_i, v_j) = \frac{v_i \cdot v_j}{\lVert v_i \rVert\, \lVert v_j \rVert}$, for each pair of ideas $(id_i, id_j)$ in $I$. The distinctness $D_{ij}$ between two ideas $i$ and $j$ is then inversely related to their similarity: $D_{ij} = 1 - \mathrm{sim}(v_i, v_j)$. The overall distinctness of the set $I$ is calculated as the mean of all pairwise distinctness scores:

$$D_I = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} D_{ij} \tag{3}$$

This measure provides a single scalar value $D_I$ that quantifies the average diverseness of the ideas within a corpus of ideas, with higher values indicating a greater degree of diverseness among the ideas.

Subsequently, we aggregate the distinctness scores across all ideas in each paper to compute the mean distinctness for that paper. Let $P = \{p_1, p_2, \ldots, p_m\}$ represent the set of papers in a domain, where $m$ is the number of papers in the domain. Finally, for a comprehensive assessment of model performance within a domain, we average the mean distinctness scores of all papers for ideas generated by model $M$ as follows:

$$D_{\mathrm{domain},\,M} = \frac{1}{m} \sum_{p=1}^{m} D_I^{p,M} \tag{4}$$

The resultant metric, $D_{\mathrm{domain},\,M}$, represents the average idea distinctness for model $M$ in a given domain, indicating the model's ability to generate diverse ideas.

⁴bert-base-uncased
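A minimal sketch of Equations 3 and 4 follows. It assumes CLS-token pooling over bert-base-uncased outputs, since the paper does not specify how sentence embeddings are pooled; the function names are illustrative.

```python
import itertools
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(idea: str) -> torch.Tensor:
    """Encode one idea; CLS pooling is an assumption, not stated in the paper."""
    inputs = tokenizer(idea, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[0, 0]  # CLS vector

def distinctness_index(ideas: list[str]) -> float:
    """Equation 3: mean pairwise (1 - cosine similarity) over one set of ideas."""
    vecs = [embed(idea) for idea in ideas]
    pair_scores = [
        1 - torch.nn.functional.cosine_similarity(vecs[i], vecs[j], dim=0).item()
        for i, j in itertools.permutations(range(len(vecs)), 2)  # n(n-1) ordered pairs
    ]
    return sum(pair_scores) / len(pair_scores)

def domain_distinctness(papers_ideas: list[list[str]]) -> float:
    """Equation 4: average the per-paper index over all papers in a domain."""
    return sum(distinctness_index(ideas) for ideas in papers_ideas) / len(papers_ideas)
```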
4.4 Human Evaluation

The evaluation of generated future ideas necessitates familiarity with both previous work related to the subject and the work being evaluated; specifically, the evaluator must be an expert in the domain and topic. Given the complexity of human evaluation, we approached authors (as the authors have knowledge of their paper as well as of the literature) who have published papers in reputable venues, possess over 5 years of experience in scientific publishing, and have authored more than 5 scientific papers. We collected their accepted papers (published within 2023 and 2024), followed the dataset preparation discussed in Section 3, and generated FRIs. We modify the prompt slightly to specifically generate only the top five results (see Appendix B). We selected the outputs from the Claude and GPT-4⁵ models due to their better IAScore and Idea Distinctness Index. We adopted this approach to avoid author exhaustion and to obtain an accurate evaluation. We asked each human evaluator the following questions:

• Q1: Is the idea relevant to the research topic of the paper? (Relevant/Not relevant)

• Q2: Assess the originality/novelty of the research idea (5-point scale).

• Q3: Review the research idea for factual correctness and feasibility. Is the idea impractical or too vague to be actionable? (Not possible/Possible)

For Q2, we used Best-Worst Scaling (Louviere et al., 2015) on a 5-point scale. More details about the human evaluation are provided in Appendix B.

⁵We used gpt-4-turbo via the OpenAI API for the generation.
5 Results and Discussion

5.1 Alignment Results

[Figure 5: IAScore for each domain and model; a higher value indicates better alignment with the author]

Figure 5 provides a comparative overview of the IAScore for four language models⁶ (Claude-2, Gemini-1.0, GPT-3.5, and GPT-4) across five academic domains: Chemistry, Computer Science, Economics, Medicine, and Physics.

In the Chemistry and Economics domains, Claude has the highest IAScore, indicating strong alignment with the authors' future research ideas. Claude and GPT-4 have almost similar values in the Computer Science, Medicine, and Physics domains (with GPT-4 slightly higher). GPT-3.5 and Gemini have lower scores than both GPT-4 and Claude in every domain. GPT-3.5 has almost the same score as Gemini in the Chemistry and Economics domains, but scores higher than Gemini in the Computer Science, Medicine, and Physics domains. The results underscore the advancements in language model capabilities, with each model showcasing domain-specific strengths in idea generation. This alignment shows that LLMs are able to generate relevant and novel ideas to some extent. We also studied the effect of the length of future work on the IAScore (see Appendix D) and conducted a human analysis to understand the quality of the generated research ideas when the IAScore is low (see Appendix G).

⁶We set the maximum token length to 512 and the temperature to 0 for each model.

5.2 Distinctness Results

[Figure 6: Idea Distinctness Index analysis; here, "human" denotes the authors of the papers]

We show the comparative evaluation of idea distinctness scores in Figure 6. The line graph depicts the variation in distinctness between the generated ideas and the human-written ideas (AP-FRIs). GPT-3.5 shows the least distinctness among the generated ideas, except in the Computer Science domain, where it is slightly more distinct than Gemini. As shown in the graph, the distinctness of Gemini is also quite low; however, it is slightly better than GPT-3.5, except in the Computer Science domain.

The ideas generated by GPT-4 are more distinct than those of Gemini and GPT-3.5 (except in Economics, where the distinctness of GPT-4 is the same as Gemini's), but lower than both Claude and the human baseline. The Idea Distinctness Index values of the ideas generated by Claude are almost the same as those of humans in the Chemistry, Economics, and Medicine domains, and even higher than the human scores in the Computer Science and Physics domains, which shows that Claude generates very distinct FRIs.

5.3 Human Evaluation Results

We conducted a human evaluation of 460 generated ideas for 46 papers in the computer science domain. To validate the quality of the human annotation, we measured the inter-annotator agreement, with 20% of the generated ideas evaluated by two different authors of the same paper. Cohen's kappa coefficient (Cohen, 1960) was 0.83, confirming the high quality of the annotations of the generated research ideas.

Novelty: Figure 7 displays the results of the human evaluation. We observed that Claude generates 14.78% non-novel and 16.52% generic FRIs, along with 41.73% moderately novel, 20.86% very novel, and 16.52% extremely novel FRIs. GPT-4 generates 7.83% non-novel, 13.91% generic, 42.61% moderately novel, 28.70% very novel, and 6.96% extremely novel ideas. Claude generates more non-novel and generic ideas than GPT-4, while GPT-4 produces more very novel ideas and nearly the same number of excellent ideas. This demonstrates that although LLMs also generate generic or already explored ideas, they are capable of producing novel ideas that have either not been explored or have been minimally explored.

[Figure 7: Novelty human evaluation for the Computer Science domain; here, (B) denotes the setting with additional background knowledge]

Relevance and Feasibility: After the human evaluation, we found that 76.67% of the ideas generated by Claude and 93.34% of those generated by GPT-4 are relevant. Furthermore, 83.34% of Claude's generated ideas and 96.64% of GPT-4's ideas were judged to be practically feasible and factually correct. These results highlight that Claude and GPT-4 can generate relevant and feasible research ideas. The reason Claude generates more impractical and irrelevant research ideas may be that it attempts to generate more distinct research ideas than GPT-4, as we evaluated and discussed in Section 5.2.

5.4 Open-ended Generation

We tested whether LLMs could retain open-ended generation capabilities by providing only a title and abstract as input. Our findings showed that, overall, LLMs can still generate open-ended content owing to their prior knowledge. However, they may not produce many high-quality ideas, as they lack access to recent publications and methodological insights relevant to the current paper. We discuss this in detail in Appendix C.

5.5 Adding Additional Background Knowledge

We designed a framework based on the Retrieval-Augmented Generation (RAG) model (Lewis et al., 2020) to integrate background knowledge into LLMs, as illustrated in Figure 9. We collected the titles and abstracts of around 1.9 lakh (190,000) computer science research papers. Using BERT embeddings, we created vector representations of these titles and stored them in a vector database. From there, we retrieved the 20 papers most similar to our target paper's title, extracted contributions from these papers' abstracts to gather relevant data, and then generated ideas by prompting GPT-4 with the target paper and the retrieved background knowledge. We found that adding background knowledge reduced the generation of generic or non-novel ideas and improved relevance and factual accuracy. However, further research is needed to boost the novelty of generated ideas. We discuss this in detail in Appendix E.

6 Conclusion and Future Work

In conclusion, we present the first attempt to evaluate the potential of LLMs in generating future research ideas across five domains: Computer Science, Economics, Chemistry, Physics, and Medicine. Our results and analysis show that LLMs possess domain-specific strengths in idea generation. Furthermore, the results from the Idea Distinctness Index indicate that LLMs such as Claude and GPT-4 generate more distinct research ideas than Gemini and GPT-3.5, and GPT-4 and Claude align better with the authors' written future research ideas than Gemini and GPT-3.5. This alignment, together with our human evaluations of relevance, novelty, and feasibility, reveals that although LLMs often produce non-novel and generic ideas, they have the potential to generate relevant, novel, and diverse ideas to a significant extent. We hope that the findings and experiments of this work will unlock the potential of LLMs in idea generation and foster new advancements in automated scientific innovation.

In future work, we plan to investigate more effective ways of integrating knowledge from multiple papers to enhance the novelty of the generated ideas and prevent the generation of generic and existing ideas.
7 Limitations

7.1 Limitations of Data Collection

We extracted papers using the Semantic Scholar Academic Graph API from January 2023 to February 2024. The number of papers available is limited by the scope of our data extraction from the Semantic Scholar Academic Graph. We excluded papers that are not in English, as well as those whose abstracts could not be correctly parsed from the PDFs. Not all of these papers include sections on future work; therefore, we annotated only those that contained sections outlining future research directions. Due to these limitations, we collected 100 papers from each domain for analysis.

7.2 Memorization

Carlini et al. (2022) highlight that LLMs are prone to memorizing portions of their training data, a significant concern in the evaluation of contemporary LLMs. Despite this, the data used for pre-training and post-training includes only a small amount of more recent data. We therefore gathered recent papers from 2023 and 2024; by focusing our evaluation on papers published in these years, the likelihood of the test papers appearing in the models' pre-training corpora is substantially reduced. In addition, we conducted a manual review of these papers to assess memorization. This involved asking various questions about the papers, such as their titles, publishing venues, and author names, to see if the models could supply the missing information. Our findings showed no evidence of such memorization. A similar approach is followed by Wang et al. (2023) (discussed in their Section 6.4), and they likewise found no evidence of memorization.

Ethics Statement

We have utilized open-source datasets for our work. Our aim in this work is to assess the potential of language models in generating ideas. Our Institutional Review Board (IRB) evaluated and approved this study. We do not encourage the use of LLMs to generate AI-generated research papers (by generating new ideas) or their misuse for harmful idea generation. LLMs can process and synthesize vast amounts of literature faster than humans, potentially identifying new patterns or gaps in research that might not be obvious, thus accelerating scientific discovery. However, since LLMs can generate content that may be similar to existing materials, this raises concerns about intellectual property rights and the originality of ideas. LLMs utilized for generating ideas might also be misapplied to produce harmful material, such as designs for destructive or explosive devices or schemes for spamming. Notably, this is a common challenge among existing LLMs with strong creative and reasoning abilities. We therefore emphasize the responsible use of LLMs for idea generation and the need to broadly improve the safety of LLMs.
Frequently Asked Questions (FAQs)

• How does our work differ from SciMON?
⇒ Our paper is fundamentally different from the SciMON paper. We would like to highlight a few major differences. While the focus of SciMON is on developing a framework that generates novel scientific ideas, our focus is not on generating ideas but on evaluating the capability of LLMs to generate future research ideas/works. We propose the novel Idea Alignment Score (IAScore) and the Idea Distinctness Index. Unlike SciMON, we approached authors who are knowledgeable about their paper topics and the broader literature (see Section 4.4). SciMON used only GPT for comparison, while we use the GPT-4, GPT-3.5, Claude, and Gemini models. Unlike SciMON, we provide the full paper as input: SciMON used the proposed idea written in the abstract as the target, while we use the future work section of the full paper as our target. Additionally, they utilized a classifier for evaluation, whereas we employ human evaluators, resulting in fewer chances of error and better evaluation results. Our findings are completely different from those of SciMON. We created a novel annotated dataset for these experiments, and while SciMON only experimented with computer science papers from the ACL Anthology, we expanded our experiments to five different domains. SciMON generated specific ideas guided by seed terms; our goal, in contrast, is to assess the inherent ability of LLMs to generate future work independently. Introducing external aids or additional context would shift the focus from evaluating the LLM's standalone capabilities to assessing its performance under enhanced conditions. Such an approach would not align with our objective, which is to understand and measure the raw, unaided generative power of LLMs.

• Does incorporating extra contextual information alongside individual papers prove counterproductive?
⇒ A paper encompasses not only its contributions, findings, and methodology, but also its related work and introduction sections, which contain significant background information. It is likely that the major recent papers pertinent to the current work have already been mentioned. Additionally, LLMs possess general knowledge about many older papers, and the paper itself cites some of the most important related work. Nevertheless, we also conducted an experiment to understand the effect of adding additional information (using the RAG framework); we discuss the results and details in Appendix E. Overall, we observed that incorporating additional background knowledge can somewhat help prevent the generation of non-novel or generic ideas. However, further research is needed to enhance the ability of LLMs to generate more novel ideas.
A Dataset Annotation

A.1 Dataset Annotation Guidelines

Recognizing future research ideas in a paper involves analyzing the portions of text containing directions for future research. The following steps can be followed:

Step 1: Begin by reading the Title and Abstract of the paper to gain an understanding of its subject matter. It is important to read these sections multiple times to grasp the paper's main points, such as its motivation, contributions, and other relevant aspects. If necessary, refer to the paper itself or read related material to enhance your understanding.

Step 2: Identify Key Sections for Analysis. Focus primarily on the Discussion and Conclusion sections of the paper, as these areas often contain explicit mentions of future research directions. Scan the Methodology section as well, as suggestions for improving future studies or addressing current study limitations are sometimes mentioned there.

Step 3: Distinguish Future Research Ideas from General Statements. Differentiate explicit future research suggestions from general discussion. Future research directions usually involve specific recommendations, plans, or identified gaps that require further exploration, and are often phrased using terms like "future studies should," "further research is needed," or "additional work will." Avoid confusing these with broader statements of potential relevance or applicability, which do not provide direct guidance on future work.

We offer multiple examples of papers with their future research ideas to assist and direct the annotators. We found a few texts that look like future work but are, on the contrary, the motivation of the work. As an example, consider the following: "The goal of this work was to direct attention to emerging and novel research involving 'magnetogel nanohybrid materials' that might be relevant in future applications for the treatment of wastewater, as well as in other fields." Here, the sentence describes how novel research involving magnetogel nanohybrid materials might help in future applications; this is a future application, not future work. A second example is: "Our data could be useful for designing high-quality trials in the future to define the exact role of hemoadsorption in ARDS."

Step 4: Separate Future Research from Limitations. Carefully examine any limitations mentioned in the paper to determine if they are explicitly linked to future research. Only consider a limitation as future work if the authors clearly indicate a direct intention to address it in subsequent studies. This helps avoid assuming that all limitations naturally lead to future research directions.

There is a very thin line between a limitation and future work: a limitation may or may not constitute future work. There were a few cases where limitations were mentioned, such as "One limitation of this paper is the absence of a coordinated attention structure to capture cross-channel information." Since limitations may or may not be future work, we only take those limitations that the author explicitly marks as future work. Hence, we only considered the explicit mentions of future work by the authors in their papers.

A.1.1 Annotator Training

Given the complexity of the papers and their frequent use of technical terminology, we hired two doctoral students, each with over four years of experience in scientific research publishing. To facilitate their training, an expert with more than ten years of experience in scientific publishing annotated 20 random papers from each domain, adhering to our guidelines. After this initial round of annotation, we reviewed and corrected any misinterpretations with the annotators, further refining their training and enhancing the clarity of our annotation guidelines. To assess the effectiveness of the initial training, we compiled another 20 papers from each domain. From the second round onwards, the annotators demonstrated improved proficiency, accurately identifying at least 95% of the future research ideas on average.
A.1.2 Annotation Process

We regularly monitored the annotated data, placing emphasis on identifying and rectifying inconsistencies and cases of confusion, and implemented an iterative feedback system aimed at continuously refining and improving the annotation process. In cases of conflict or confusion, we removed those papers, as we wanted only a good-quality dataset. Following the annotation phase, we obtained an average inter-annotator agreement score of 0.94 using Cohen's kappa (Cohen, 1960), indicating a substantial consensus among the annotators.

A.1.3 Annotators' Pay

We compensated each annotator according to standard PhD salaries in India, based on the hours they worked; the appointments and salaries adhere to our university's established practices. Payment was made per paper, since the time required to read and extract future research ideas from a paper varies with its complexity, technical terminology, and the annotator's familiarity with the subject, and paying based on time spent could have compromised the quality of the annotations. To maintain accuracy and prevent fatigue, we imposed a daily limit of six hours for annotators.

B Human Annotation

We prepared a Google Form for each paper and provided the links to the annotators, specifying instructions for them at the beginning of the form. We include an example of the form for a paper in Figure 10, Figure 11, and Figure 12.

Here is the slightly modified prompt used for human evaluation, which generates only the top five research ideas:

System: You are a research scientist.
User: Imagine you are a research scientist. After reading the following paper, brainstorm to generate the potential top 5 future research ideas:

[paper text]

Potential top 5 future research ideas from the paper in bullet points are:

Here, '[paper text]' contains the full content of the paper after removal of the future work sections.

C Effect of Giving Only the Title and Abstract as Input

We examined cases where we provided only the title and abstract as input to see if LLMs could still retain open-ended generation capabilities. We discovered a few cases where GPT-4 still generated novel ideas; for example, for one paper (Kumar et al., 2023b) it generated: "Incorporate explainable AI methods to provide transparency into how the AI model makes its predictions, thereby making the outcomes more interpretable and acceptable to human editors." This kind of analysis has not been done yet and could be helpful. After providing the full paper content to the model, we found that the same idea was generated again.

There were also cases where GPT-4 generated the novel idea of solving a task using transformers (the task had mostly been solved using techniques like RNNs), which had not been done before. However, after being given the full paper content, the model recognized that a transformer had already been applied to this task, and so instead suggested adding more contextual information to boost the results (limited information had been given as input to the paper). Overall, we found that LLMs can still retain open-ended generation capabilities because of their prior knowledge, but they may not generate many good ideas, since they do not have access to recently published papers or other methodological findings related to the current paper.

D Effect of Length of Idea on IAScore

[Figure 8: Effect of length on IAScore]

In our analysis, we explore the relationship between the length of ideas and their corresponding Idea Alignment Score (IAScore), focusing specifically on computer science papers and outputs generated by GPT-4. This relationship is visually represented in the bar chart in Appendix Figure 8. The data reveal that shorter ideas, typically under 20 words, tend to receive lower IAScores, which could be attributed to their lack of the detailed information essential for a comprehensive understanding and assessment. Conversely, ideas spanning 40-60 words also tend to score lower, possibly as a result of their verbosity: excessive information can dilute the core message, making it difficult to discern the main points. Interestingly, ideas of moderate length, ranging from 20 to 40 words, achieve the highest IAScores. This length seems optimal, as it allows for sufficient detail without overwhelming the reader, striking a balance that facilitates clearer understanding.
E Effect of Adding Additional Background Knowledge

[Figure 9: RAG pipeline framework for infusing more background knowledge into the LLMs]

We designed our framework based on the Retrieval-Augmented Generation (RAG) model (Lewis et al., 2020) to integrate background knowledge into LLMs, as illustrated in Figure 9.

E.1 Vector Database

We utilized the Semantic Scholar API (Kinney et al., 2023) to collect the titles and abstracts of approximately 1.9 lakh (190,000) existing computer science research papers. We employed BERT embeddings to create vector representations of the titles of these papers, which were then stored in a vector database.

E.2 Retriever

To retrieve relevant papers, we created an embedding of the title of the paper for which we have to generate ideas. We computed the cosine similarity between this paper-title embedding and those in our vector database, and retrieved the top 20 research papers exhibiting the highest similarity to our target paper's title. Finally, we extracted the contributions from these papers' abstracts to gather relevant data.

We used the following prompt to instruct the LLM⁷ to extract useful information from the abstracts:

System: You are a helpful research agent that generates background knowledge or related works given abstracts of papers.
User: You are given abstracts of research papers and your task is to extract contributions or findings or methods proposed in the paper. You are not allowed to make any changes to the data given to you. Return the response as it is and return a response for all 20 papers in the passage. Return the title of each paper followed by its contributions or findings or methods in less than 100 words. If no contributions or findings or methods are found, return NONE.

PASSAGE: '{relevant_passage}'

We designed the above query prompt to ensure that the LLM understood its role in extracting relevant information without altering the provided information.

⁷We employed the Gemini-Pro model for this task.

E.3 Generator

Next, we produced the ideas using a prompt that includes the paper and the retrieved background knowledge. Specifically, we used the following prompt:

System: You are a research scientist.
User: Imagine you are a research scientist. After reading the following paper and background knowledge, brainstorm to generate the potential top 5 future research ideas:

[paper text] [background knowledge]

Make sure the future research ideas are very distinct from the background knowledge provided. Potential top 5 future research ideas from the paper in bullet points are:

Here, '[paper text]' contains the full content of the paper after removal of the future work sections, and '[background knowledge]' contains the retrieved background knowledge. An example of background knowledge is shown in Appendix Table 6.
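A minimal sketch of the E.1/E.2 retrieval step under stated assumptions: titles are embedded with some `embed_title` function (for example, the BERT encoder sketched in Section 4.3), and a plain NumPy matrix stands in for the vector database; all names here are illustrative.

```python
import numpy as np

def build_index(titles: list[str], embed_title) -> np.ndarray:
    """E.1: embed every stored paper title into one matrix (one row per paper)."""
    return np.vstack([embed_title(t) for t in titles])

def retrieve_top_k(target_title: str, index: np.ndarray,
                   embed_title, k: int = 20) -> list[int]:
    """E.2: rank stored titles by cosine similarity to the target paper's title."""
    q = embed_title(target_title)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k].tolist()  # indices of the k most similar papers
```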
We performed this experiment on the same set of papers and conducted human evaluations of novelty following the same methodology as discussed in Section 4.4. The results are shown in Figure 7. Initially, we observed that adding background knowledge affected the LLM's performance: it primarily generated ideas that already existed, merely creating new combinations from the background knowledge. We subsequently modified the prompt to instruct the model not to repeat ideas that were mentioned in the background knowledge.

We found that adding background knowledge slightly improved the task. The results show improvements of 50% for GPT-4 and 53.33% for Claude in reducing the generation of non-novel ideas, and improvements of 7.14% and 11.76%, respectively, in avoiding generic ideas. We also observed that GPT-4 and Claude generated 9.52% and 14.63% more moderately novel ideas, respectively. However, we noted only a very slight improvement in the generation of highly novel or extremely novel ideas.

The analysis revealed that 73.71% of the ideas generated by Claude and 93.34% of those generated by GPT-4 were relevant. We observed that the relevance score for Claude decreased by 2.96%, while that of GPT-4 increased slightly, by 0.77%. Furthermore, 83.14% of Claude's generated ideas and 96.98% of GPT-4's ideas were judged to be practically feasible and factually correct; the score for Claude decreased by 0.20%, and the score for GPT-4 increased by 0.34%. It appears that additional information negatively impacts Claude's performance, leading it to generate ideas that are irrelevant, non-novel, or infeasible. For GPT-4, however, we observed that incorporating additional background knowledge helps prevent the generation of non-novel or generic ideas and slightly improves the relevance and factual correctness of the generated ideas. Nevertheless, further research is needed to enhance the ability of LLMs to generate more novel ideas.

F BERTScore Implementation Details

The motivation for using BERT embeddings is that the generated ideas and the original ideas often do not use the same words, so we need to capture the contextual meanings of the ideas in order to compare them. We used the default setting of the BERTScore metric, which employs a 24-layer RoBERTa-large model and utilizes its 17th layer for embedding. We determined the threshold⁸ using the validation set: if the similarity exceeds the threshold, we classify a pair of ideas as similar, and otherwise as dissimilar.

⁸We set the threshold to 0.68 empirically.
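For reference, a sketch of the BERTScore-based matcher using the `bert-score` package, whose English default is the 24-layer roberta-large model with layer-17 embeddings as described above; the function name is illustrative, and 0.68 is the empirically chosen threshold from footnote 8.

```python
from bert_score import score  # pip install bert-score

def bertscore_match(generated_idea: str, author_idea: str,
                    threshold: float = 0.68) -> bool:
    """Classify an idea pair as matching when BERTScore F1 exceeds the threshold."""
    # For English, bert-score defaults to roberta-large with layer-17 embeddings.
    _, _, f1 = score([generated_idea], [author_idea], lang="en", verbose=False)
    return f1.item() > threshold
```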
G Error Analysis

We conducted a human evaluation using three expert annotators, each with over five years of experience in this field. They reviewed 15 papers, which we assigned based on each reviewer's familiarity with the subject matter. We identified two major reasons for low IAScores:

• Generic ideas: A few generated ideas, such as "Explore different explainability methods like LIME, SHAP to generate model explanations instead of just rationales. Compare their effectiveness." or "Building on the baseline model, future research could explore more advanced natural language processing (NLP) models and techniques for contradiction detection." These statements are true; however, they are very generic and common.

• Author miss: Due to page limits, or in favor of more novel ideas, authors fail to mention a few ideas in a paper. For example, for one paper (Kumar et al., 2023a), GPT-4 generated the idea: "Exploring the Impact of Contradictions on Review Outcomes: An interesting area for future research would be to study the impact of reviewer contradictions on the outcomes of the peer review process. This could involve analyzing the correlation between the presence and nature of contradictions and the final decisions made by editors (acceptance, rejection, major/minor revisions). Such studies could provide valuable insights into how contradictions influence the decision-making process and how they might be effectively managed to improve the fairness and quality of peer review." This represents a strong, novel research problem not mentioned by the authors, which warrants future investigation.
code-mixed languages and evaluate its perfor- Nicholas Carlini, Daphne Ippolito, Matthew Jagielski,
mance on more NLP tasks." Katherine Lee, Florian Tramèr, and Chiyuan Zhang.
2022. Quantifying memorization across neural lan-
• Mixed FRI: We found that sometimes re- guage models. ArXiv, abs/2202.07646.
search papers articulate future research ideas Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan,
along with other essential information of the Henrique Pondé de Oliveira Pinto, Jared Kaplan,
paper in a single sentence. For example in Harrison Edwards, Yuri Burda, Nicholas Joseph,
Figure 2, this sentence not only summarizes Greg Brockman, Alex Ray, Raul Puri, Gretchen
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas-
the current research findings but also clearly
try, Pamela Mishkin, Brooke Chan, Scott Gray,
outlines a direction for future work. Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
Kaiser, Mohammad Bavarian, Clemens Winter,
I Output Examples Philippe Tillet, Felipe Petroski Such, Dave Cum-
mings, Matthias Plappert, Fotios Chantzis, Eliza-
Our LLM generated future research output can be beth Barnes, Ariel Herbert-Voss, William Hebgen
found in Table 1, Table 2, Table 3, Table 4 and Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie
Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
References

Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul Ronald Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, et al. 2023. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805.

Anthropic. 2023. Model card for claude 2.

Mathias Benedek, Emanuel Jauk, Andreas Fink, Karl Koschutnig, Gernot Reishofer, Franz Ebner, and Aljoscha C. Neubauer. 2014. To create or to recall? neural mechanisms underlying the generation of creative new ideas. NeuroImage, 88:125–133.

Faeze Brahman, Chandra Bhagavatula, Valentina Pyatkin, Jena D Hwang, Xiang Lorraine Li, Hirona J Arai, Soumya Sanyal, Keisuke Sakaguchi, Xiang Ren, and Yejin Choi. 2023. PlaSma: Making small language models better procedural knowledge models for (counterfactual) planning. arXiv preprint arXiv:2305.19472.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. CoRR, abs/2303.12712.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Wenke Hao, Zhicheng Li, Yuchen Qian, Yuzhuo Wang, and Chengzhi Zhang. 2020. The ACL FWS-RC: A dataset for recognition and classification of sentence about future works. In JCDL '20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event, China, August 1-5, 2020, pages 261–269. ACM.

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. 2023. Benchmarking large language models as AI research agents. ArXiv, abs/2310.03302.

Rodney Michael Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David W. Graham, F.Q. Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey MacMillan, Tyler C. Murray, Christopher Newell, Smita R Rao, Shaurya Rohatgi, Paul Sayre, Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar Subramanian, A. Tanaka, Alex D Wade, Linda M. Wagner, Lucy Lu Wang, Christopher Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine van Zuylen, and Daniel S. Weld. 2023. The semantic scholar open data platform. ArXiv, abs/2301.10140.

Laura J. Kornish and Jeremy Hutchison-Krupat. 2017. Research on idea generation and selection: Implications for management of technology. Production and Operations Management, 26:633–651.

Thomas S. Kuhn and David Hawkins. 1963. The structure of scientific revolutions. American Journal of Physics, 31:554–555.

R. Kumar. 1996. Research Methodology: A Step-by-Step Guide for Beginners. Longman Australia.

Sandeep Kumar, Tirthankar Ghosal, and Asif Ekbal. 2023a. When reviewers lock horn: Finding disagreement in scientific peer reviews. arXiv preprint arXiv:2310.18685.

Sandeep Kumar, Tirthankar Ghosal, and Asif Ekbal. 2023b. When reviewers lock horns: Finding disagreements in scientific peer reviews. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16693–16704, Singapore. Association for Computational Linguistics.

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B. Dolan. 2015. A diversity-promoting objective function for neural conversation models. ArXiv, abs/1510.03055.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics.

J.J. Louviere, T.N. Flynn, and A.A.J. Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge books online. Cambridge University Press.

Rahul Nadkarni, David Wadden, Iz Beltagy, Noah A. Smith, Hannaneh Hajishirzi, and Tom Hope. 2021. Scientific language models for biomedical knowledge base completion: An empirical study. ArXiv, abs/2106.09700.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Plato, G.R.F. Ferrari, and T. Griffith. 2000. Plato: 'The Republic'. Cambridge Texts in the History of Political Thought. Cambridge University Press.

Biqing Qi, Kaiyan Zhang, Haoxiang Li, Kai Tian, Sihang Zeng, Zhang-Ren Chen, and Bowen Zhou. 2023. Large language models are zero shot hypothesis proposers. ArXiv, abs/2311.05965.

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, Alhussein Fawzi, Josh Grochow, Andrea Lodi, Jean-Baptiste Mouret, Talia Ringer, and Tao Yu. 2023. Mathematical discoveries from program search with large language models. Nature, 625:468–475.

Serhad Sarica, Binyang Song, Jianxi Luo, and Kristin L. Wood. 2021. Idea generation with technology semantic network. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 35:265–283.

Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In ICER 2022: ACM Conference on International Computing Education Research, Lugano and Virtual Event, Switzerland, August 7-11, 2022, Volume 1, pages 27–43. ACM.

W. Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter J. Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence A. Donehower, Ying Chen, and Olivier Lichtarge. 2014. Automated hypothesis generation based on mining scientific literature. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA - August 24 - 27, 2014, pages 1877–1886. ACM.

Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder, and Anubhav Jain. 2019. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571(7763):95–98.

Qingyun Wang, Doug Downey, Heng Ji, and Tom Hope. 2023. Scimon: Scientific inspiration machines optimized for novelty. arXiv preprint arXiv:2305.14259.

Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, and Yi Luan. 2019. PaperRobot: Incremental draft generation of scientific ideas. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1980–1991, Florence, Italy. Association for Computational Linguistics.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Yi Xu, Shuqian Sheng, Bo Xue, Luoyi Fu, Xinbing Wang, and Chenghu Zhou. 2023. Exploring and verbalizing academic ideas by concept co-occurrence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13001–13027, Toronto, Canada. Association for Computational Linguistics.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.

J.W. Young. 2019. A Technique for Producing Ideas. Independently Published.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Z. Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. ArXiv, abs/2303.18223.

Figure 10: Instructions on Human Feedback form
Figure 11: Human Feedback form for novelty

Figure 12: Human Feedback form for relevance, feasibility
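
The feedback forms referenced in Figures 10-12 collect categorical human judgments of novelty, relevance, and feasibility. Agreement between annotators on such judgments is conventionally quantified with Cohen's kappa (Cohen, 1960), which is cited above. The following is a minimal illustrative sketch, not the paper's actual evaluation code; the labels and ratings are invented for demonstration.

```python
# Minimal sketch: Cohen's kappa for two annotators rating the same items,
# e.g., judging whether each generated idea is novel. Data is illustrative.
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters (Cohen, 1960)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence of the two raters' label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Example: two annotators judging the novelty of ten generated ideas.
a = ["novel", "novel", "not", "novel", "not", "not", "novel", "not", "novel", "novel"]
b = ["novel", "not", "not", "novel", "not", "novel", "novel", "not", "novel", "novel"]
print(round(cohens_kappa(a, b), 3))  # -> 0.583
```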

Model Response
Claude-2

1. Investigate the antioxidant and oil-stabilizing effects of peptides from the soy protein hydrolysates. The paper showed the pepsin
hydrolysate had better activity, likely due to its peptide composition. Specific peptides could be isolated and tested for their antioxidant
effects in oil.

2. Study the effects of the soy protein hydrolysates on the quality and shelf life of the fried food. This paper focused only on the oil stability.
It would be useful to also assess how the hydrolysates affect lipid oxidation, texture, color etc. in the fried banana chips.

3. Evaluate the antioxidant effects of the soy protein hydrolysates in other vegetable oils such as sunflower, canola, corn oil etc. during
frying. Since oil composition affects stability, the effects may differ in other oils high in polyunsaturated fatty acids.

4. Examine the impacts of frying parameters such as temperature, food type, frying time etc. on the efficacy of the soy protein hydrolysates.
Varying these factors could provide insights into the optimal conditions for the hydrolysates to exert antioxidant effects.

5. Study the sensory qualities of fried foods with the soy protein hydrolysate additions. An analysis of taste, aroma, color acceptability by
consumers would provide important information for potential commercial use of the hydrolysates as natural antioxidants.

Gemini-1.0
1. Investigate the effects of different hydrolysis conditions on the antioxidant activity of soy protein hydrolysates. This could involve
varying the enzyme concentration, temperature, and pH of the hydrolysis reaction.

2. Identify the specific peptides in the pepsin hydrolysate that are responsible for its antioxidant activity. This could be done using mass
spectrometry and other analytical techniques.

3. Test the antioxidant activity of soy protein hydrolysates in other food systems. This could include using them to stabilize other types of
oils or to protect food products from oxidation.

4. Investigate the potential health benefits of soy protein hydrolysates. This could include studying their effects on oxidative stress,
inflammation, and other chronic diseases.

5. Develop new food products that incorporate soy protein hydrolysates as antioxidants. This could include developing new types of
oil-based products, such as salad dressings and mayonnaise, as well as new types of food products, such as fortified cereals and snacks.

GPT-4

1. Exploration of Peptide Sequences and Amino Acid Compositions for Enhanced Oxidative Stability: Investigate the specific peptide
sequences and amino acid compositions derived from different proteases besides trypsin and pepsin to understand their contribution to
the oxidative stability of oils. This could involve using a variety of proteolytic enzymes to hydrolyze soy proteins and other plant-based
proteins, followed by an analysis of their peptide profiles and antioxidant activities in oil and food systems.

2. Application of Soy Protein Hydrolysates in Various Food Systems: Extend the application of soy protein hydrolysates to other food
systems beyond frying oils, such as baked goods, dairy products, and meat alternatives, to assess their potential in improving oxidative
stability, shelf-life, and nutritional profiles. This could involve comparative studies of hydrolysate effectiveness in different food matrices
and processing conditions.

3. Mechanistic Studies on the Antioxidant Action of Hydrolysates: Conduct in-depth mechanistic studies to understand how soy protein
hydrolysates interact with lipid molecules and radical species at the molecular level. This could involve computational modeling and in
vitro experiments to explore the dynamics of peptide-lipid interactions and the role of specific amino acids in scavenging free radicals
and metal ions.

4. Development of Environmentally Friendly and Health-focused Antioxidants: Given the interest in replacing synthetic antioxidants with
natural alternatives, further research could focus on optimizing the production, purification, and application of soy protein hydrolysates
as natural antioxidants. This could include exploring sustainable extraction methods, evaluating the safety and efficacy of hydrolysates in
food systems, and assessing consumer acceptance.

5. Comparative Analysis of Hydrolysates from Different Oilseed Proteins: Since the study focused on soy proteins, future research could
explore the antioxidative potential of hydrolysates derived from other oilseed proteins, such as sunflower, canola, or hemp. This would
involve a comparative analysis of their antioxidative activities, peptide compositions, and effectiveness in enhancing the oxidative
stability of different types of oils and fats used in food processing and storage.

GPT-3.5

1. Investigate the impact of different proteases on the antioxidant properties of soy protein hydrolysates in stabilizing oils during frying
cycles.

2. Study the influence of peptide size and amino acid composition on the oxidative stability of oils when treated with protein hydrolysates.

3. Explore the potential of specific amino acids, such as tryptophan, histidine, and methionine, in enhancing the antioxidant activity of
hydrolyzed soy proteins in frying oils.

4. Examine the effects of pH adjustments during enzymatic hydrolysis on the antioxidant performance of soy protein hydrolysates.

5. Evaluate the long-term stability and effectiveness of pepsin and trypsin hydrolyzed soy protein in preventing oil oxidation in different
frying conditions over extended periods.

Table 1: Example of generated response for chemistry domain. Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10528889/pdf/foods-12-03498.pdf
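
Each response in Tables 1-5 was obtained by giving a model the text of a source paper and asking for five future research ideas. As a rough illustration of how such a query can be issued, the sketch below uses the OpenAI chat completions API; the prompt wording, model name, and decoding settings are assumptions for demonstration, not the authors' exact pipeline (Claude-2 and Gemini-1.0 would be queried analogously through their own APIs).

```python
# Illustrative sketch only: ask a chat model for five future research ideas
# grounded in one paper, mirroring the setup behind Tables 1-5. The prompt
# text and parameters are assumptions, not the paper's actual configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def suggest_future_ideas(paper_text: str, n_ideas: int = 5) -> str:
    """Ask a chat model for n_ideas future research ideas based on a paper."""
    prompt = (
        f"Read the following research paper and suggest {n_ideas} future "
        "research ideas that build on it. Number each idea and briefly "
        "justify it.\n\n" + paper_text
    )
    response = client.chat.completions.create(
        model="gpt-4",  # or gpt-3.5-turbo; other providers use their own SDKs
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Usage: print(suggest_future_ideas(open("paper.txt").read()))
```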
Model Response
Claude-2

1. Investigate other machine learning models like recurrent neural networks for multi-round transfer learning in lung cancer detection. The
paper focused on convolutional neural networks, but RNNs may capture temporal patterns in medical imaging data.

2. Explore different loss functions and regularization techniques to further minimize negative transfer. The paper proposed custom loss
functions, but there is room to experiment with other formulations.

3. Evaluate the multi-round transfer learning approach on diverse medical imaging datasets beyond lung cancer, such as brain MRI scans.
This could demonstrate the generalizability of the method.

4. Develop more advanced generative models like CycleGAN to synthesize intermediate domains between source and target. This could
create higher quality bridging distributions.

5. Perform multi-round transfer learning in a low-data regime by leveraging data augmentation and semi-supervised learning techniques.
This could show the approach is viable when target data is extremely scarce.

Gemini-1.0
1. Investigate the impact of different prioritization algorithms on the performance of MTL-MGAN. The paper uses a hybrid approach that
merges modified 2D dynamic warping (M2DW) and Silhouette coefficient to prioritize source domains. Future research could explore
other prioritization algorithms, such as mutual information, transfer entropy, or graph-based methods, to determine if they can further
improve the performance of MTL-MGAN.

2. Develop new loss functions to minimize negative transfer. The paper proposes loss functions that are designed to minimize negative
transfer in the aspects of domains, instances, and features. Future research could investigate other loss functions that can further reduce
negative transfer, such as adversarial loss, maximum mean discrepancy, or knowledge distillation.

3. Explore different architectures for the MGAN. The paper uses a modified GAN (MGAN) to create two intermediate domains as bridges
between the source domain and the target domain. Future research could explore other GAN architectures, such as deep convolutional
GAN (DCGAN), Wasserstein GAN (WGAN), or StyleGAN, to determine if they can further improve the performance of MTL-MGAN.

4. Apply MTL-MGAN to other medical diagnosis tasks. The paper evaluates the performance of MTL-MGAN on 10 benchmark datasets
related to lung cancer detection. Future research could apply MTL-MGAN to other medical diagnosis tasks, such as breast cancer
detection, skin cancer detection, or diabetic retinopathy detection, to determine its generalizability.

5. Investigate the use of MTL-MGAN for other types of data. The paper applies MTL-MGAN to medical images. Future research could
investigate the use of MTL-MGAN for other types of data, such as text, audio, or time series data, to determine its applicability to a
wider range of problems.

GPT-4

1. Extension to Other Forms of Cancer and Diseases: Investigating the applicability and effectiveness of the MTL-MGAN approach for
the detection and diagnosis of other types of cancer, such as breast, prostate, or skin cancer, and potentially other diseases that have
distinct imaging characteristics. This would involve adapting the transfer learning and MGAN components to handle different imaging
modalities and disease signatures.

2. Real-Time Diagnostic Systems Development: Developing real-time, AI-driven diagnostic systems using the MTL-MGAN framework
that can be integrated into clinical workflows. This could significantly reduce diagnosis times and improve patient outcomes by providing
immediate, accurate diagnostic information during patient examinations or surgeries.

3. Integration with Electronic Nose Technology for Early Detection: Combining the MTL-MGAN approach with emerging technologies
like electronic nose devices, which can detect volatile organic compounds associated with lung cancer from breath samples. Research
could focus on creating a hybrid detection system that leverages imaging data and breath analysis for earlier detection of lung cancer.

4. Exploration of Unsupervised and Semi-supervised Learning Techniques: Given the challenge of obtaining large, annotated medical
imaging datasets, exploring unsupervised and semi-supervised learning methods within the MTL-MGAN framework could be valuable.
This might involve developing algorithms that can learn from unlabelled data or very small labelled datasets, potentially increasing the
model’s applicability in low-resource settings.

5. Improvement of Transfer Learning Techniques for Heterogeneous Data Sources: Investigating advanced transfer learning techniques that
can effectively handle more significant variations between source and target domains, particularly when dealing with heterogeneous data
sources. This research could include the development of novel algorithms for better domain adaptation and intermediate domain creation
to facilitate the transfer of knowledge across diverse medical datasets and imaging modalities.

GPT-3.5

1. Exploring Multiround Transfer Learning in other Healthcare Applications: Investigate the applicability of multiround transfer learning in
various healthcare domains beyond lung cancer detection, such as cardiovascular disease diagnosis, diabetic retinopathy detection, or
infectious disease prediction.

2. Advanced Loss Function Design for Negative Transfer Avoidance: Develop novel loss functions to further minimize the impact of
negative transfer in transfer learning models by incorporating more sophisticated domain, instance, and feature-based penalties.

3. Enhanced Generative Adversarial Networks for Bridging Domain Gaps: Investigate the development of advanced generative adversarial
networks that can create more effective intermediate domains to bridge the gap between diverse source and target domains in transfer
learning scenarios.

4. Automated Prioritization Algorithm for Source Domains: Develop an automated algorithm that can dynamically prioritize source domains
based on their similarities to the target domain, thereby enhancing the efficiency and accuracy of the transfer learning process.

5. Evaluation of MTL-MGAN in Real-World Clinical Settings: Conduct a comprehensive evaluation of the proposed multiround transfer
learning and modified generative adversarial network algorithm in real-world clinical settings to assess its performance, scalability, and
generalizability in diverse healthcare environments.

Table 2: Example of generated response for computer domain. Paper: https://www.hindawi.com/journals/ijis/2023/6376275/
Model Response
Claude-2

1. Conduct qualitative research through interviews or focus groups to understand the specific barriers and facilitators to COVID-19
vaccination uptake in different socioeconomic groups within a country. This could provide insights to inform targeted interventions.

2. Examine the impact of COVID-19 vaccination policies and public health messaging on vaccination uptake across socioeconomic groups.
For example, study how vaccination requirements or incentives affect different populations.

3. Investigate strategies to improve global collaboration and coordination in vaccine distribution and access. Research equitable pricing
models, technology transfers, and policies to balance local needs with global supply.

4. Develop mathematical models to forecast COVID-19 vaccination coverage over time under different policy scenarios. Simulate the
impact on infection rates, healthcare burden, and emergence of new variants.

5. Conduct intervention studies to test the effectiveness of community-based participatory approaches in increasing COVID-19 vaccine
acceptance among disadvantaged populations. Evaluate culturally-tailored messaging and outreach.

Gemini-1.0
1. Investigate the impact of socioeconomic factors on vaccine hesitancy and refusal. This could be done through qualitative studies to
understand the underlying reasons for vaccine hesitancy and refusal, and quantitative studies to examine the association between
socioeconomic factors and vaccine uptake.

2. Examine the role of social networks and community engagement in promoting vaccination. This could involve studying the influence of
social networks on vaccination decisions, and exploring the effectiveness of community-based interventions in increasing vaccination
rates.

3. Assess the effectiveness of different strategies for addressing vaccine inequality. This could include evaluating the impact of financial
incentives, targeted outreach programs, and mobile vaccination clinics on vaccination rates among underserved populations.

4. Develop and evaluate interventions to improve the equitable distribution of vaccines. This could involve working with governments
and international organizations to develop policies and programs that ensure that vaccines are distributed fairly and equitably to all
populations.

5. Conduct longitudinal studies to track the long-term impact of socioeconomic factors on vaccination rates. This could help to identify
trends and patterns in vaccination uptake over time, and inform the development of policies and interventions to address persistent
disparities.

GPT-4

1. Impact of Vaccine Acceptance and Hesitancy on Socioeconomic Disparities: Investigate how vaccine acceptance and hesitancy, influenced
by cultural, historical, and information dissemination factors, contribute to the observed socioeconomic disparities in vaccination rates.
This research could further dissect how misinformation and trust in healthcare systems intersect with socioeconomic status to affect
vaccine uptake.

2. Longitudinal Studies on the Evolution of Vaccine Inequality: Conduct longitudinal studies to track changes in vaccine inequality over
time, particularly as new variants emerge and booster vaccinations become more common. This research would provide insights into how
socioeconomic disparities in vaccination rates evolve with the pandemic’s progression and the introduction of new vaccine technologies
or strategies.

3. Role of International and Public-Private Partnerships in Addressing Vaccine Inequity: Explore the effectiveness of international
collaborations and public-private partnerships in improving vaccine distribution equity. This study could examine case studies of
successful collaborations and identify best practices for leveraging these partnerships to reduce disparities in vaccine access.

4. Comparative Analysis of Vaccination Strategies and Socioeconomic Impact: Compare the vaccination strategies of different countries
or regions with similar socioeconomic profiles to identify what strategies are most effective in mitigating inequalities in vaccination
rates. This research could include the role of mobile vaccination units, community engagement initiatives, and targeted communication
strategies in improving vaccine uptake among disadvantaged populations.

5. Impact of Socioeconomic Factors on Post-Vaccination Outcomes: Investigate how socioeconomic factors influence outcomes after
vaccination, such as breakthrough infection rates, severity of breakthrough cases, and long-term immunity. This research could help
identify whether socioeconomic disparities persist even after vaccination and how these disparities can be addressed through booster
doses, tailored public health strategies, and improved healthcare access.

GPT-3.5

1. Investigating the Impact of Socioeconomic Factors on Access and Uptake of COVID-19 Booster Vaccines: A study could focus on
analyzing how socioeconomic indicators, such as GDP, human development index, and corruption perception index, influence access to
and acceptance of booster doses of COVID-19 vaccines. This research could help identify disparities in vaccine distribution and uptake
among different population groups.

2. Comparative Analysis of Inequality in COVID-19 Vaccination Across Continents: Future research could compare and contrast the
levels of inequality in COVID-19 vaccination across different continents, especially focusing on the factors contributing to variations in
vaccination rates and disparities in vaccine distribution. This comparative analysis could provide valuable insights for designing targeted
interventions to address global vaccine inequality.

3. Longitudinal Study on the Impact of Socioeconomic Factors on COVID-19 Vaccination: A longitudinal study could be conducted to track
changes in vaccination rates over time and assess how socioeconomic factors impact the progress of COVID-19 vaccination campaigns.
By analyzing trends and patterns, researchers could gain a deeper understanding of the evolving dynamics of vaccine distribution and
uptake.

4. Exploring Public Perception and Vaccine Hesitancy Among Different Socioeconomic Groups: Future research could delve into public
perceptions and attitudes towards COVID-19 vaccination among diverse socioeconomic groups, investigating factors such as education
level, income, and access to healthcare. This study could help identify barriers to vaccination and inform targeted communication
strategies to address vaccine hesitancy.

5. Evaluation of Intervention Strategies to Reduce Inequality in COVID-19 Vaccination: Researchers could conduct a study to evaluate
the effectiveness of various intervention strategies aimed at reducing inequality in COVID-19 vaccination. By implementing targeted
initiatives and measuring their impact on vaccine distribution and uptake, this research could provide evidence-based recommendations
for policy interventions to promote equitable access to vaccines.

Table 3: Example of generated response for economics domain. Paper: https://www.mdpi.com/2076-393X/11/
Model Response
Claude-2

1. Further optimize the minimal active H2 relaxin analog B9-31 through additional amino acid substitutions and modifications to improve
binding affinity and activity.

2. Explore alternative methods to induce helicity and stabilize the minimal H2 relaxin analogs, such as cyclization or incorporation of other
helix-inducing amino acids.

3. Develop longer-acting analogs by modifying B9-31 with polyethylene glycol, albumin binding moieties, or other half-life extending
strategies.

4. Test the in vivo efficacy of optimized minimal analogs in animal models of fibrosis, heart failure, and other conditions to assess therapeutic
potential.

5. Design biased analogs to selectively activate certain signaling pathways downstream of RXFP1 to optimize beneficial effects and reduce
side effects.

Gemini-1.0
1. Investigate the potential benefits of using ECMO in combination with other therapies for severe respiratory failure. For example, ECMO
could be used in combination with lung transplantation, mechanical ventilation, or extracorporeal membrane oxygenation (ECMO). This
could help to improve patient outcomes by providing more comprehensive support for the lungs and other organs.

2. Develop new strategies for managing complications associated with ECMO. Common complications of ECMO include bleeding,
infection, and thrombosis. New strategies for managing these complications could help to improve patient safety and outcomes.

3. Study the long-term effects of ECMO on patients. There is limited data on the long-term effects of ECMO on patients. Studies are needed
to investigate the potential risks and benefits of ECMO in the long term.

4. Develop new technologies to improve the performance of ECMO. New technologies could help to make ECMO more effective, safer, and
easier to use. For example, new oxygenators could be developed that are more efficient and less likely to cause complications.

5. Conduct research to better understand the mechanisms of respiratory failure. This could help to identify new targets for therapy and
improve the overall management of respiratory failure.

GPT-4

1. Comparative Effectiveness of ECMO Configurations: Conduct randomized controlled trials comparing different ECMO cannulation
strategies (e.g., dual-lumen cannula via the right internal jugular vein vs. traditional multi-cannula approaches) to assess their impact on
patient outcomes, including survival rates, complication rates, and the ability to mobilize and rehabilitate during ECMO support. This
research could also explore the use of image guidance for cannula placement and its impact on patient outcomes.

2. ECMO in the Management of COVID-19-Induced ARDS: Given the limited randomized controlled trial data on the efficacy of ECMO
for COVID-19-associated respiratory failure, further research is needed to define the role of ECMO in this context. Future studies could
focus on identifying which patients benefit most from ECMO, the timing of initiation, and how ECMO is best integrated with other
supportive therapies (e.g., antivirals, anti-inflammatory agents).

3. Long-term Outcomes and Quality of Life Post-ECMO: Investigate the long-term outcomes of patients who have undergone ECMO
support for severe respiratory failure, including physical, cognitive, and psychological recovery. This research could also assess the
quality-adjusted life years (QALYs) gained through ECMO support and identify factors that predict better long-term outcomes.

4. Optimization of Anticoagulation Strategies: Develop and test new anticoagulation protocols or agents that minimize the risks of
bleeding and thrombotic complications in ECMO patients. This could include comparative studies of direct thrombin inhibitors versus
unfractionated heparin, as well as the exploration of novel anticoagulants and biocompatible circuit coatings that may reduce the need for
systemic anticoagulation.

5. Integration of ECMO with Rehabilitation: Explore the feasibility, safety, and benefits of early rehabilitation protocols in ECMO-supported
patients, including those on "awake ECMO." This research could assess the impact of rehabilitation on outcomes such as muscle strength,
functional status at discharge, duration of hospital stay, and overall mortality. Studies could also investigate the best practices for physical
therapy, nutritional support, and psychological support during ECMO.

GPT-3.5

1. Investigating the long-term outcomes and quality of life of patients who have undergone VV-ECMO support for respiratory failure,
including factors influencing long-term recovery and rehabilitation post-ECMO.

2. Exploring new technologies and techniques to improve the efficiency and safety of VV-ECMO, such as developing novel anticoagulation
strategies or improving cannulation methods to reduce complications and enhance patient outcomes.

3. Conducting comparative studies to evaluate the effectiveness of different ventilation strategies in conjunction with VV-ECMO support,
aiming to optimize patient management and potentially improve survival rates in severe respiratory failure cases.

4. Examining the impact of ECMO utilization during disease outbreaks, such as COVID-19, on healthcare resource allocation, patient
outcomes, and healthcare system preparedness, to inform future pandemic response strategies.

5. Investigating the role of personalized medicine in VV-ECMO patient selection and management, including the potential use of biomarkers,
genetic profiling, or other advanced techniques to tailor ECMO therapy to individual patient characteristics and optimize outcomes.

Table 4: Example of generated response for medical domain. Paper: https://link.springer.com/article/10.1007/s41030-023-00214-2
Model Response
Claude-2

1. Investigate the antioxidant and oil-stabilizing effects of specific peptide sequences identified in the pepsin and trypsin hydrolysates. The
paper showed the pepsin hydrolysate had higher activity but did not identify the key active peptides. Bioinformatics could be used to
predict promising peptides.

2. Test the soy protein hydrolysates in stabilizing other vegetable oils like canola, sunflower, etc. during deep frying. The effects may differ
for oils with varying fatty acid compositions.

3. Evaluate the sensory qualities of foods fried in palm olein stabilized with the soy protein hydrolysates compared to unstabilized oil. This
could include assessing odor, taste, and textural changes.

4. Study the effects of the hydrolysates on the nutritional quality of the palm olein during repeated deep frying cycles. Parameters could
include vitamin E content, formation of trans fats, and retention of essential fatty acids.

5. Investigate combining the soy protein hydrolysates with other antioxidants like rosemary extract or green tea to evaluate synergistic
effects on stabilizing frying oils. Mixtures may provide greater protection than individual additives.

Gemini-1.0
1. Investigate the effect of different hydrolysis conditions on the antioxidant activity of soy protein hydrolysates. This could involve varying
the enzyme concentration, temperature, pH, and hydrolysis time to determine the optimal conditions for producing hydrolysates with the
highest antioxidant activity.

2. Evaluate the antioxidant activity of soy protein hydrolysates in different food systems. This could involve adding hydrolysates to edible
oils, fried foods, or other food products and measuring the oxidative stability of the products over time. This would help to determine the
potential applications of soy protein hydrolysates as natural antioxidants in food products.

3. Study the mechanism by which soy protein hydrolysates exert their antioxidant activity. This could involve using in vitro and in
vivo assays to investigate the ability of hydrolysates to scavenge free radicals, chelate metal ions, or inhibit lipid peroxidation. This
information would help to elucidate the molecular mechanisms underlying the antioxidant activity of soy protein hydrolysates.

4. Explore the potential health benefits of soy protein hydrolysates. This could involve conducting animal or human studies to investigate
the effects of hydrolysates on oxidative stress, inflammation, and other health markers. This would help to determine the potential role of
soy protein hydrolysates in promoting human health.

5. Develop novel food products containing soy protein hydrolysates. This could involve incorporating hydrolysates into functional foods,
such as beverages, snacks, or supplements, to improve their nutritional value and antioxidant content. This would help to increase the
consumption of soy protein hydrolysates and promote their potential health benefits.

GPT-4

1. Comparative Analysis of Different Plant-Based Proteins on Oil Stability: Extending the research to other plant-based proteins, such as
pea, hemp, and rice proteins, to compare their effectiveness in improving oil oxidative stability during frying cycles. This could provide a
broader understanding of the potential of plant-based proteins in enhancing the nutritional and safety profiles of fried foods.

2. Mechanism of Action at Molecular Level: Delving deeper into the molecular mechanisms behind the antioxidant activities of different
hydrolysates. This could involve investigating the interaction between specific peptide sequences and oxidative compounds within the
oil, using computational modeling and in vitro experiments to elucidate the pathways for their antioxidant actions.

3. Impact of Hydrolysate Concentration and Frying Duration: Exploring the effect of varying concentrations of the most effective
hydrolysates (e.g., from pepsin digestion) over extended frying cycles beyond 12, to determine the optimal concentration for maximum
oxidative stability without affecting the sensory qualities of the oil and the fried product.

4. Application in Industrial Frying Processes: Conducting pilot-scale studies in industrial settings to assess the feasibility and economic
viability of incorporating hydrolysates into commercial frying oils. This research could include the development of guidelines for the use
of hydrolysates in different types of frying operations and their impact on the quality and shelf-life of fried foods.

5. Development of Natural Antioxidant Blends: Formulating blends of hydrolysates with other natural antioxidants (e.g., extracts from
spices, herbs, or other food by-products rich in phenolic compounds) to synergistically enhance the oxidative stability of frying oils. This
research could also investigate the sensory impact of these blends on fried products and consumer acceptance.

GPT-3.5

1. Investigating the impact of different proteases on the generation of antioxidant peptides from soy protein hydrolysates and their
effectiveness in stabilizing oils during frying cycles.

2. Exploring the structural features of peptides derived from soy protein hydrolysates to determine the key amino acids responsible for their
antioxidant properties.

3. Comparing the oxidative stability of different types of oils (e.g., sesame oil, canola oil, corn oil) when stabilized with soy protein
hydrolysates under varied frying conditions.

4. Examining the influence of hydrolyzed protein residues on the formation of secondary oxidation products in oils during frying cycles and
their impact on food quality and safety.

5. Utilizing bioinformatics tools to predict and select specific peptide sequences from soy protein hydrolysates that exhibit the highest
antioxidant capacity and stability-enhancing properties in fried oils.

Table 5: Example of generated response for physics domain. Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10528889/pdf/foods-12-03498.pdf
1. Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback Adapts pre-trained discriminative models to each unlabelled
example in the test set using generative feedback from a diffusion model.
2. Adaptive Discriminative Generative Model for Object Tracking Formulates a novel discriminative generative framework that generalizes the
conventional Fisher Linear Discriminant algorithm with a generative model and renders a proper probabilistic interpretation.
3. Classification with Hybrid Generative/Discriminative Models Describes a hybrid model in which a high-dimensional subset of the parameters
are trained to maximize generative likelihood, and another, small, subset of parameters are discriminatively trained to maximize conditional
likelihood.
4. Discriminative Level Set for Contour Tracking Integrates discriminative methods into a level set framework when constructing the level set
energy function.
5. ManiFPT: Defining and Analyzing Fingerprints of Generative Models Formalizes the definition of artifact and fingerprint in generative models,
proposes an algorithm for computing them in practice, and finally study its effectiveness in distinguishing a large array of different generative
models.
6. Generative Models for 3D Point Clouds Experiments with transformer encoders, latent-space flow models, and autoregressive decoders to
improve the performance of point cloud latent-space generative models.
7. Models and Modeling
8. Do text-free diffusion models learn discriminative visual representations? Explores the possibility of a unified representation learner, a
diffusion model, which addresses both generative and discriminative tasks simultaneously.
9. Fine-Tuning Generative Models as an Inference Method for Robotic Tasks Investigates how to quickly adapt the sample generation of neural
network models to observations in robotic tasks.
10. Discriminative locally document embedding: Learning a smooth affine map by approximation of the probabilistic generative structure of
subspace
11. Working with Deep Generative Models and Tabular Data Imputation Provides a fair comparison of proposed methods for imputing missing
values in tabular data using deep generative models.
12. Robust Discriminative Principal Component Analysis
13. Generative Second Language Acquisition
14. Nonlinear Models
15. Understanding how Differentially Private Generative Models Spend their Privacy Budget Analyzes how DP generative models distribute
privacy budgets across rows and columns of tabular data.
16. Online multiple object tracking by hierarchical association of detection responses Presents a framework for multi-pedestrian tracking using a
hierarchical association of detection responses, learning both discriminative and generative appearance models online.
17. Two-Stage Generative Learning Objects
18. Generative design games activity
19. First vs second quantization
20. Non-discrimination Criteria for Generative Language Models Studies how to uncover and quantify the presence of gender biases in generative
language models, deriving generative AI analogues of three well-known non-discrimination criteria from classification.

Table 6: Example of background knowledge retrieved for https://ieeexplore.ieee.org/document/10191295
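
One plausible way to assemble background knowledge of the kind shown in Table 6 is to query a scholarly search service for related titles and short summaries; the paper cites the Semantic Scholar platform (Kinney et al., 2023; Lo et al., 2020). The sketch below uses the public Semantic Scholar Graph API; the query string, field choices, and snippet format are illustrative assumptions, not the authors' actual retrieval pipeline.

```python
# Illustrative sketch: collect "title + one-line summary" snippets, as in
# Table 6, from the public Semantic Scholar Graph API. Query and fields are
# assumptions for demonstration.
import requests

API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def fetch_background(query: str, limit: int = 20) -> list:
    """Return 'Title Summary' snippets for papers matching the query."""
    params = {"query": query, "limit": limit, "fields": "title,abstract"}
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    snippets = []
    for paper in resp.json().get("data", []):
        title = paper.get("title") or ""
        abstract = paper.get("abstract") or ""
        # Keep only the first sentence of the abstract as a short summary.
        first_sentence = abstract.split(". ")[0] if abstract else ""
        snippets.append(f"{title} {first_sentence}".strip())
    return snippets

# Usage: for s in fetch_background("discriminative and generative models"):
#     print("-", s)
```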

