Prompting Large Language Models With The Socratic Method
arXiv:2303.08769v2 [cs.LG] 16 Mar 2023
Abstract—This paper presents a systematic approach to using the Socratic method in developing prompt templates that effectively interact with large language models, including GPT-3. Various methods are examined, and those that yield precise answers and justifications while fostering creativity and imagination to enhance creative writing are identified. Techniques such as definition, elenchus, dialectic, maieutics, generalization, and counterfactual reasoning are discussed for their application in engineering prompt templates and their connections to inductive, deductive, and abductive reasoning. Through examples, the effectiveness of these dialogue and reasoning methods is demonstrated. An interesting observation is made that when the task’s goal and user intent are conveyed to GPT-3 via ChatGPT before the start of a dialogue, the large language model seems to connect to the external context expressed in the intent and perform more effectively.
Index Terms—large language model, natural language processing, prompting, the Socratic method.

I. INTRODUCTION
Prompting is a technique used to guide the output generation of a pre-trained language model such as GPT-3 [2]. This is achieved by providing input in the form of a question or template, which helps to generate specific responses such as Q&A, document summarization, and translations. The advent of ChatGPT [11, 23, 41] has revolutionized the field of NLP by demonstrating the potential of using large pre-trained language models with prompting. Despite this progress, there is still room for improvement in current prompting strategies and techniques, especially for specific target applications. In this study, we investigate the Socratic method [42, 40] to identify and evaluate potential prompting strategies, and use the findings to design effective prompt templates.
Traditional NLP tasks involve various sub-tasks, such as named entity recognition, dependency parsing, coreference resolution [8], semantic parsing [25, 9], and more, to comprehend the meaning of a sentence. By utilizing prompt templates with large language models (LLMs), these sub-tasks can be delegated to the LLM, freeing the template to focus specifically on dialogue design. In this regard, the Socratic method [31] holds significant relevance, as it is well-known for using questioning (prompting) as a means of promoting critical thinking and delving into complex concepts [10].
The Socratic method has a long history of being regarded as the basis of critical thinking. However, some recent studies have cast doubt on its effectiveness in practice. In his paper “Socratic Irony and Argumentation,” Airaksinen [1] criticizes the method for its rigidly defined roles of teacher and student, which can lead to fear of not meeting the teacher’s expectations and reluctance to participate. Similarly, Stoddard’s “The Use of Socratic Questioning in Clinical Teaching” [35] highlights the risk of the method being misused in a manner that lacks psychological safety for students. Fortunately, when using the Socratic method in a dialogue with an LLM, the absence of emotions and sarcasm, as well as the option to deactivate the model, can alleviate many of the problems associated with human interaction.
This study starts by presenting an overview of the Socratic method’s strategies and techniques. To begin, we list ten widely referenced methods [3] under the Socratic method umbrella and use hypothesis elimination to identify the most relevant ones for our goal of prompt-template development. The selected methods are definition, hypothesis elimination, elenchus, dialectic, maieutics, generalization, and induction. Furthermore, we add to the list counterfactual reasoning, which is a concept in logic that involves considering what might have happened if a particular event had occurred differently. We then perform experiments using GPT-3 to test and evaluate these methods, and offer suggestions for incorporating these strategies and techniques into prompt templates.
In their work on “Critical Thinking: The Art of Socratic Questioning,” Paul and Elder identify three types of Socratic questioning: spontaneous, exploratory, and focused [27]. We will not discuss spontaneous questioning, as it is similar to casual conversation. Focused questioning (type 2), on the other hand, is geared towards gaining knowledge and truth, and methods such as definition, elenchus (cross-examination), hypothesis elimination, dialectic, and generalization hold great potential for developing effective prompting strategies and improving the response accuracy of a large language model (LLM). An interesting observation is that when the user intent is conveyed to GPT-3 during the task definition stage, before the start of a dialogue, the LLM seems to connect to the external context expressed in the intent and perform more effectively. (Table V provides an example of pre-dialogue warm-up. More examples are documented in [5].)
Additionally, exploratory thinking (type 3) can be supported through the maieutics (midwife) method, induction, and counterfactual reasoning, which can guide GPT-3 towards producing imaginative and creative writing. While many of the plot suggestions generated by GPT-3’s exploration may not be useful, a few unique recommendations in response to a “what if” query can stimulate the writer’s imagination and lead to remarkable results. When applied effectively, these methods
can turn an LLM into a writer’s muse, providing inspiration and guiding the creative process [36].
The main contributions of this paper are as follows:
• An overview of the Socratic method’s strategies, their evaluation, and selection of the most relevant ones for the development of effective prompt templates.
• An examination of how the definition, elenchus, hypothesis elimination, dialectic, and generalization methods can improve the output’s accuracy and conciseness through clarification and verification.
• An illustration of how maieutics, induction, and counterfactual reasoning can foster productive generalization and creativity.
The remainder of this paper is structured into five sections. Section II provides a review of related work on prompting methods in natural language processing. In Section III, we introduce the ten strategies and methods taught by Socrates and used in Plato’s “Dialogues.” From these, we select relevant methods along with counterfactual reasoning as our focus for developing prompting templates. Section IV details how we engineer these methods into our templates to improve output correctness and stimulate creative writing. In Section V, we present a pilot study. Finally, in Section VI, we present our concluding remarks.

II. RELATED WORK
The use of transformer architecture [37] and masked data for pre-training large language models (LLMs) in an unsupervised setting has become the standard approach in natural language processing [7, 19]. The method involves pre-training an LLM on a large text corpus, followed by fine-tuning for specific tasks.
Prompting is a recent innovation in the field, popularized by OpenAI, especially with the release of GPT-3 in 2020. Instead of fine-tuning the model for a specific task, the approach involves providing a specific input, or “prompt,” to guide the LLM’s output generation, resulting in greater flexibility and efficiency in generating a wide range of responses.
However, designing effective prompt templates remains a challenge [22], as it requires a deep understanding of the interplay between the LLM and the prompt. According to the survey paper [43], there are several factors that impact prompt template engineering, including the type of LLM used, manual vs automatic design, and discrete vs continuous prompts.
• Left-to-right vs masked LLMs. For tasks related to generation or tasks solved using a standard left-to-right language model [2], prefix prompts tend to perform better, as they align with the model’s left-to-right nature. For tasks solved using masked language models [7], cloze prompts are more suitable, as they closely match the pre-training task form.
• Manual vs automatic design. A prompt template should be tailored to the specific LLM. While manual design may be suitable in the initial flow-design phase, dependencies between the input and expected output, and their variations, should be mined automatically [15]. Automation can also help in paraphrasing the seed prompt to support various mined dependency patterns, but mistakes can occur [12].
• Discrete vs continuous prompts. Discrete prompts involve providing a fixed set of pre-determined input choices to an LLM. Continuous prompts, on the other hand, involve a dialogue or conversation between the model and the user, allowing for a more dynamic and interactive experience.
More advanced templates can be constructed by combining basic templates with techniques like ensemble methods [34]. This involves forming a committee of basic templates that ask the same question using different phrasing [13]. Most current prompt templates generate short outputs, such as class labels, or outputs with a length that can be predicted based on the task and input, as in the case of translation. However, for tasks that may generate longer or open-ended outputs, additional considerations may be necessary during the template engineering process.
One approach for generating longer outputs is explanation-based prompting, as proposed by the chain-of-thought method [39]. This method generates a sequence of explanations before inferring the answer. However, when dealing with simple math problems, this approach has an error rate of 47%. To address the inconsistency issues of explanation-based prompting, [16] formulates the problem as a satisfiability problem, which defers inference until a tree of explanations has been expanded abductively (explaining both true and false branches) and recursively. However, using abductive reasoning alone is often considered weak, incoherent, and even nonexistent [14, 32]. To improve consistency, a recent work [38] extends the chain-of-thought approach by adding a diverse set of reasoning paths and performing majority voting among them. This method can be viewed as an ensemble method, but it does not alter the nature of abductive reasoning.
In contrast, the Socratic method aims to employ deductive, inductive, and abductive reasoning to ensure consistency and accuracy of inference. The Socratic method deals with all aspects of critical thinking, including definition clarification and cross-examination. This comprehensive approach to template engineering can lead to improved output quality and consistency.
The primary objective of this study is to design continuous prompts that enhance response quality and foster guided creativity in generative tasks, such as verifying information, evaluating source credibility, proposing alternatives, recommending plot ideas in creative writing, and generating task-specific surprises. Our approach involves investigating strategies and methods within the Socratic method, and selecting the most relevant ones for further exploration.
As discussed in Section I, Socratic questioning can be classified into three categories: spontaneous, exploratory, and focused [27]. When designing a prompt, it is important to consider the category and utilize the most suitable strategies and techniques to achieve the best results.

III. THE SOCRATIC METHOD
The Socratic method is a questioning technique used in teaching and philosophy to encourage critical thinking and self-discovery [40]. The method involves asking a series of questions to explore complex ideas and help individuals arrive
at their own understanding of a concept. It is based on the belief that knowledge cannot be simply imparted, but must be discovered through a process of questioning and dialogue.
Some of the Socratic method’s key principles and guidelines to conduct critical thinking include:
• Posing open-ended questions: The teacher or facilitator starts with a question to stimulate thinking and draw out ideas.
• Clarifying key terms: The teacher helps the students clarify and define relevant terms and concepts to ensure everyone is on the same page.
• Providing examples and evidence: The teacher or facilitator encourages the students to provide examples and evidence as reasons to support their claims.
• Challenging reason-to-conclusion argument: The teacher or facilitator challenges the students’ arguments and encourages them to question their own beliefs and to consider alternative perspectives.
• Summarizing and drawing conclusions: The teacher helps the students summarize and draw conclusions from the discussion.
• Reflecting on the process: The teacher and students reflect on the effectiveness of the method and what they learned through the dialogue.
These principles of the Socratic method are realized through various methods and strategies. (Note that the term “method” is used both at the abstract level, referring to Socratic teaching through questioning, and for his specific questioning techniques.) Some well-known examples of the Socratic method in action include Plato’s “Dialogues” and “Republic” [42], where Socrates uses questioning to explore complex ideas and stimulate critical thinking in his interlocutors.
1. Definition: Socrates is known for his use of definition to clarify and explain the meaning of key terms and concepts.
2. Generalization: This method draws general principles from patterns that underlie observations and theories. Generalization is used to form more certain and comprehensive conclusions.
3. Induction: Similar to generalization, but induction is based only on empirical evidence. Inductive reasoning provides hypotheses with high uncertainty.
4. Elenchus: This method involves cross-examination, where a series of questions is used to test the consistency and coherence of hypotheses and beliefs. Elenchus aims to test the validity of someone’s arguments and to help them refine their thinking and eventually come up with well-supported hypotheses.
5. Hypothesis Elimination: This method involves eliminating false hypotheses and beliefs by testing them against counterexamples and logical reasoning. Unlike elenchus, hypothesis elimination tests a hypothesis against evidence and logic to determine if it is true or false.
6. Maieutics: This method involves helping individuals bring out the knowledge and understanding they already possess. Maieutics is conducted by asking questions that encourage the person to reflect on their own experience, knowledge, and beliefs and to explore alternative perspectives. Maieutics fosters self-discovery, creative writing, and innovation.
7. Dialectic: This method involves exploring opposing viewpoints through dialogue or debate to arrive at a deeper understanding of a subject.
8. Recollection: This method involves the belief that knowledge is innate, and that people can remember what they already know through a process of questioning.
9. Irony: This method involves exposing ignorance and pretensions through irony, and pointing out the gap between claims and true understanding.
10. Analogy: This method involves comparing and contrasting different concepts through analogies, in order to help individuals understand complex ideas.
At first glance, some reasoning methods may seem similar. For example, both induction and generalization use inductive reasoning, while both elenchus and hypothesis elimination use deductive reasoning. Similarly, methods like definition and dialectic use both inductive and deductive reasoning to explore opposing viewpoints through dialogue or debate. However, it is important to note that these methods have distinct differences, which will be discussed later in this paper.
In the context of critical thinking, methods like definition, elenchus, dialectic, hypothesis elimination, and generalization play active roles. On the other hand, during the brainstorming stage or in the context of creative thinking, methods like maieutics, induction, and counterfactual thinking are more relevant.
Analogy, irony, and recollection are less relevant to our goal, so we do not consider them. Irony and analogy may not be necessary when working with language models, as these models may not understand figurative language. Recollection is limited by the memory of ChatGPT and GPT-3, which is a context window of 4k and 8k tokens, respectively. The prompter must use this limited space as context to allow the language model to recall information.

A. Illustrative Critical Reading Example
To illustrate how these methods can practically be applied, let’s use the example of critical reading. Critical reading is a crucial component of critical thinking, which involves evaluating the quality and credibility of written materials, from research papers to blog posts [18, 26]. It requires a systematic and analytical approach, asking relevant questions, and using effective prompts to gain deeper understanding of the text [10].
To aid in critical reading, we introduce a template called CRIT [5], which stands for Critical Reading Inquisitive Template¹. Given a document d, CRIT evaluates it and produces a validation score Γ. Let Ω denote the conclusion or claim of d, and let R be the set of reasons supporting the claim. We define (γ_r, θ_r) = V(r ⇒ Ω) as the causal validation function, where γ_r denotes the validation score and θ_r the source credibility score, for each reason-to-conclusion argument r ⇒ Ω. Table I presents the pseudo-code of Γ = CRIT(d), which generates the final validation score Γ for document d with justifications.
¹ It is important to note that the CRIT template presented here is intended for analyzing research, opinion, and news articles, and is not suitable for analyzing literature such as novels, prose, or poetry. Each type of literary work has its unique style and nature, which require tailored prompts to facilitate effective analysis.
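As a rough companion to the pseudo-code in Table I, the control flow of Γ = CRIT(d) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Subroutines`, `CritResult`, `crit`, and `max_depth` are hypothetical, the LLM-backed steps are passed in as plain callables, and the aggregation follows the plain average of γ_r × θ_r given in prompt p6. Table I does not spell out how a recursive call yields a (γ_r, θ_r) pair, so the sketch assumes the mean validity and mean credibility of the source document.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Score = Tuple[float, float]  # (gamma: validity, theta: credibility), each in 1..10


@dataclass
class Subroutines:
    """Hypothetical stand-ins for the LLM-backed steps of Table I."""
    claim: Callable[[str], str]                 # step 1: document -> claim
    reasons: Callable[[str, str], List[str]]    # step 2: (doc, claim) -> R
    rivals: Callable[[str, str], List[str]]     # step 4: (doc, claim) -> R'
    validate: Callable[[str, str], Score]       # V(r => claim)
    find_doc: Callable[[str], Optional[str]]    # source doc if r is itself a claim


@dataclass
class CritResult:
    gamma_total: float        # final score: mean of gamma * theta
    mean_validity: float      # assumption: reused as gamma_r by a recursive caller
    mean_credibility: float   # assumption: reused as theta_r by a recursive caller


def crit(doc: str, sub: Subroutines, max_depth: int = 2) -> CritResult:
    claim = sub.claim(doc)                                  # step 1
    scores: List[Score] = []
    for r in sub.reasons(doc, claim):                       # steps 2 and 3
        source = sub.find_doc(r) if max_depth > 0 else None
        if source is not None:
            # The reason is a claim from another source: examine it recursively.
            inner = crit(source, sub, max_depth - 1)
            scores.append((inner.mean_validity, inner.mean_credibility))
        else:
            scores.append(sub.validate(r, claim))
    for r in sub.rivals(doc, claim):                        # steps 4 and 5
        scores.append(sub.validate(r, claim))
    if not scores:
        raise ValueError("no arguments found to score")
    n = len(scores)                                         # |R ∪ R'|
    return CritResult(                                      # step 6
        gamma_total=sum(g * t for g, t in scores) / n,
        mean_validity=sum(g for g, _ in scores) / n,
        mean_credibility=sum(t for _, t in scores) / n,
    )
```

The `max_depth` guard is an addition of this sketch to keep the recursion bounded; in practice each callable would wrap one or more prompts sent to the LLM.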
  Function Γ = CRIT(d)
  Input. d: document; Output. Γ: validation score;
  Vars. Ω: claim; R & R′: reason & counter reason set;
  Subroutines. Claim(), FindDoc(), Validate();
  Begin
  #1 Identify in d the claim statement Ω;
  #2 Find a set of supporting reasons R to Ω;
  #3 For r ∈ R eval r ⇒ Ω
       If Claim(r), (γ_r, θ_r) = CRIT(FindDoc(r));
       else, (γ_r, θ_r) = V(r ⇒ Ω);
  #4 Find a set of rival reasons R′ to Ω;
  #5 For r′ ∈ R′, (γ_r′, θ_r′) = V(r′ ⇒ Ω) eval rival arguments;
  #6 Compute weighted sum Γ, with γ_r, θ_r, γ_r′, θ_r′.
  #7 Analyze the arguments to arrive at the Γ score.
  #8 Reflect on and synthesize CRIT in other contexts.
  End
Table I: CRIT Pseudo-code [5]. (The symbol ⇒ denotes both inductive and deductive reasoning.)

In the following subsections, we will discuss how CRIT uses these five methods: 1) definition, 2) elenchus, 3) dialectic, 4) maieutics, and 5) counterfactual thinking.

B. Method of Definition
As shown in the pseudocode in Table I, the CRIT algorithm starts in its step #1, asking GPT-3 to identify the conclusion of a document. To avoid any misunderstandings, the prompt includes a clear instruction and definition. (In the square brackets, the symbol in denotes an input slot to an LLM and out the output slot.)
  p1.1 “What is the conclusion in document [in: d] [out: Ω]? The conclusion statement may be written in the last paragraph, near keywords “in conclusion,” “in summary,” or “therefore.””
We can use the definition method to improve the understanding of the document. One approach is paraphrasing the prompt into multiple prompts and grouping them into an ensemble, similar to forming a thesis committee. (Section IV presents prompt ensemble in detail.) Different members can phrase the same question in different ways or ask it from a different perspective. For example:
  p1.2 “What is the issue addressed by [in: d] [out: Ω]?”
  p1.3 “What is the most important outcome presented in text [in: d]? [out: Ω]”
Step #2 in Table I prompts GPT-3 to find a set of supporting reasons. To further enhance the accuracy and comprehensiveness of the results, the prompt can ask for not only “reasons” but also “theories,” “evidence,” or “opinions” to query for the document’s support to its conclusion, similar to the ensemble method.
  p2 “What are the supporting reasons [out: R] of conclusion [in: Ω] of [in: d]? A reason can be a theory, evidence, or opinion.”

C. Method of Elenchus
The method of elenchus is rooted in the Greek word “elenchein,” which translates to “to examine.” This method involves cross-examining the results generated by GPT-3 to evaluate the consistency and coherence of the arguments. The goal is to arrive at a deeper understanding of the validity of the reasons and conclusion, and to identify any potential weaknesses or flaws in the arguments.
Step #3 of the CRIT algorithm prompts GPT-3 to assess the validity of each reason r ∈ R as justification for the conclusion Ω through the function V(r ⇒ Ω). To validate the reason-to-conclusion argument, CRIT must evaluate the presented reason and its causal relationship with the conclusion and conduct cross-examination, which is precisely the task of the method of elenchus.
CRIT issues four prompts in step #3 to evaluate the logical validity and source credibility of the r ⇒ Ω reasoning. CRIT first elicits supporting evidence for reason r ∈ R. This evidence can be a theory, an opinion, statistics, or a claim obtained from other sources. If the reason itself is a claim, then the sources that the claim is based on are recursively examined. The strength of the argument and its source credibility are rated on a scale of 1 to 10, with 10 being the strongest.
  p3.1 “What is the evidence for reason [in: r] to support conclusion [in: Ω] in document [in: d]? [out: evidence]”
  p3.2 “What is the type of evidence? A) a theory, B) an opinion, C) statistics, or D) a claim from other sources?”
  p3.3 “If the evidence of reason [in: r] is D), call CRIT recursively”
  p3.4 “How strongly does reason [in: r] support [in: Ω] in document [in: d]? Rate argument validity [out: γ_r] and source credibility [out: θ_r] between 1 and 10 (strongest).”
It may be beneficial to also incorporate the counter-argument method in order to gain a more comprehensive and balanced evaluation of the argument. This can result in a deeper understanding of the topic being discussed. We discuss this further in the next subsection.

D. Method of Dialectic
The easiest way to mislead without lying outright is to leave out critical counterarguments. CRIT relies on GPT-3 to generate and evaluate counterarguments, similar to how it prompts GPT-3 to extract and evaluate reasons.
CRIT in its step #4 asks GPT-3 to provide missing rival reasons, and then pairs the rival reasons with the conclusion to conduct validation. There are two strategies to bring counterarguments to the surface. The first strategy targets the weakest arguments, those with the lowest scores, and asks GPT-3 to attack them.
  p4 “Is there a counterargument against [in: r ⇒ Ω]? If so, provide counter reasons [out: R′].”
  p5 Similar to p3, except for replacing argument r with rival argument r′.
The second strategy looks for omitted information: CRIT can query GPT-3 without quoting any r ∈ R, and follow the same process.
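The [in: …] and [out: …] slot notation used in prompts p1 through p4 can be made concrete with a small rendering helper. The following is a minimal sketch under stated assumptions: the names `render`, `ask_committee`, `P1_ENSEMBLE`, and `P2` are hypothetical, the prompt strings are abridged from p1.1 to p1.3 and p2, and resolving the committee's answers by majority vote is an assumption of this sketch rather than something the text prescribes.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Abridged versions of prompts p1.1, p1.2, p1.3 (a paraphrase committee) and p2.
P1_ENSEMBLE = [
    "What is the conclusion in document [in: d]? [out: Ω]",
    "What is the issue addressed by [in: d]? [out: Ω]",
    "What is the most important outcome presented in text [in: d]? [out: Ω]",
]
P2 = "What are the supporting reasons [out: R] of conclusion [in: Ω] of [in: d]?"

OUT_SLOT = re.compile(r"\s*\[out:\s*([^\]]+?)\s*\]")
IN_SLOT = re.compile(r"\[in:\s*([^\]]+?)\s*\]")


def render(template: str, inputs: Dict[str, str]) -> Tuple[str, List[str]]:
    """Fill every [in: x] slot from `inputs`; drop [out: y] slots and return their names."""
    outputs = OUT_SLOT.findall(template)
    text = OUT_SLOT.sub("", template).strip()

    def fill(match: re.Match) -> str:
        name = match.group(1)
        if name not in inputs:
            # Sanity check: refuse to send a prompt with an unfilled input slot.
            raise KeyError(f"input slot {name!r} is not filled")
        return inputs[name]

    return IN_SLOT.sub(fill, text), outputs


def ask_committee(
    templates: List[str], inputs: Dict[str, str], ask: Callable[[str], str]
) -> Tuple[str, List[str]]:
    """Send each paraphrase to `ask` (any LLM client) and take a majority vote (assumed)."""
    answers = [ask(render(t, inputs)[0]) for t in templates]
    return Counter(answers).most_common(1)[0][0], answers
```

For instance, `render(P2, {"Ω": "the claim", "d": "the document text"})` returns the filled prompt together with `["R"]`, the name of the output the caller should expect to parse from the reply.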
Next, in step #6, CRIT computes an aggregated score by performing a weighted sum on the validation multiplied by the credibility scores of both arguments and counterarguments, and then outputs the final assessment score Γ.
  p6 “Final score [out: Γ]. Γ = Σ_{r ∈ R ∪ R′} γ_r × θ_r / |R ∪ R′|.”

E. Method of Maieutics
The maieutic method derives from the Greek word “maieutikos,” meaning midwife. It is founded on the belief that a teacher’s role is to facilitate students in bringing forth their own understanding of a subject, rather than simply conveying knowledge. Unlike the elenctic method, which aims to detect and eliminate false hypotheses, maieutics centers on helping students reveal their own understanding of a subject. In this dialogical method, the teacher asks questions that are intended to guide the student in discovering their own comprehension, rather than providing them with information or answers.
Continuing with CRIT, once the text has been scored in step #6, it can be valuable for readers or students to enhance their analytical and writing skills by summarizing and analyzing the justifications produced by GPT-3. CRIT in its step #7 can prompt GPT-3 to generate a report, which readers and students can then compare with their own notes.
  p7 “For every r ∈ R ∪ R′ justify the validity score γ_r and source credibility score θ_r for argument r ⇒ Ω.”

F. Counterfactual Reasoning
Counterfactual reasoning [30, 33] can be seen as a natural extension of the Socratic method, as both involve questioning assumptions and exploring alternative perspectives. Counterfactual thinking involves imagining alternative scenarios to what actually happened, often using phrases like “what if” or “if only.” By incorporating counterfactual reasoning into prompt engineering, one can facilitate exploration of alternative possibilities and promote a more nuanced and complex understanding of a given topic.
The final step of CRIT involves using the counterfactual method to encourage students to reconsider the arguments and counterarguments presented in the text based on new contextual information. CRIT can prompt students with questions such as “what if the debate in the text took place now instead of in the 1950s?” or “what if the main event in the text occurred in Asia instead of in Europe?” Students can express their own opinions and findings based on further reading and statistics, and challenge the conclusions drawn in the text.
  p8 “For every r ∈ R ∪ R′, evaluate r ⇒ Ω in [in: context].”

G. Remarks on the Socratic Method and CRIT
As we have shown, for critical reading CRIT uses three methods: definition, elenchus, and dialectic. For critical thinking, CRIT uses the methods of maieutics and counterfactual reasoning. For more explorative thinking, methods such as induction can be used for informal brainstorming, hypothesis elimination for removing weak propositions, and generalization for deriving principles from examples.
Please note that prompts can be submitted to GPT-3 either all together or one-by-one. Our empirical study on reading comprehension samples [21] demonstrates that issuing prompts one-by-one results in outputs with finer details. This is because GPT-3 has the opportunity to analyze a document multiple times for slightly different purposes. For teaching critical reading to K-12 students, one-by-one prompting is preferred as it allows students to engage with CRIT step-by-step. However, for answering multiple-choice questions, prompting all together and prompting one-by-one receive similar scores. We will conduct a large-scale study with ablation tests to investigate whether adding or deleting prompts and using different submission methods make marked differences.

IV. PROMPT TEMPLATE ENGINEERING
Prompt template engineering involves creating templates to provide input, or “prompts,” to a language model to guide its output generation. In this section, we discuss prompt template engineering methods for basic building blocks, and then integrate the methods of definition, elenchus, dialectic, maieutics, and counterfactual reasoning to compose more complex templates. We present experimental results using different types of documents to demonstrate how the Socratic method can improve the accuracy and conciseness of the output through arguments and verification, as well as facilitate guided generalization and creativity.

A. Basic, One Shot Template
Let’s begin by discussing a simple one-shot prompt template. In the work of [43], a simple formulation function is used to generate the prompt x′, which is obtained by applying the function f_prompt(x) to the input x.
For machine translation, the prompt template can take the form of “Translate from [Lan_from]: [X] to [Lan_to]: [Y],” where Lan_from can be either detected by the prompt template or identified by the LLM. The input x provides the information to fill in the slots [X] and [Lan_to]. For example, if the input is “translate good morning to French,” the prompt template x′ would be “Translate from English: ‘good morning’ to French: [Y].” The empty slot [Y] is then filled with the LLM’s output, such as “bonjour.” In cases where the LLM produces multiple responses, it can also provide a score for each, which the prompt template can use to select the highest-scoring response or to request a summary from the LLM.
There are three main design considerations when engineering a basic prompt.
1. Input style. It is important to consider how to phrase the template so that it can handle different styles of user input for the same task. For example, a user may ask for a translation task to be performed by saying “Translate x to French,” or “What is the French translation of x?”
2. LLM capability. As discussed in [20], it is important to take into account the patterns and capabilities of the partner
language model (LLM) when designing the template, such as whether the LLM is left-to-right [2] or masked [7].
3. Cost. Certain tasks, such as language detection and summarization, can be performed by the template itself or by the LLM. The decision of whether to perform a task within the prompt template or to use the LLM should be based on factors such as cost.
To address the first two technical challenges, one can start by hand-engineering a few seed templates and then paraphrasing them into an ensemble [13]. We believe that the basic, one-shot formulation can always be replaced by an ensemble formulation [29, 34], whose member weights are then learned for each query instance to produce the final output. Additionally, by examining which basic prompts have high weights, an ensemble with various paraphrased prompts can identify what an LLM knows, which can help infer its strengths without having to conduct capability mining on the LLMs.

B. Prompt Clarification with Method Definition
There are computer algorithms that can already be used to recursively clarify a question, its definitions, and sub-terms’ definitions. In fact, the natural language processing (NLP) community has developed a large number of useful methods and algorithms over the years [17]. One can use NLP techniques, such as dependency parsing and named-entity recognition (NER) [6], to analyze the structure and meaning of a question and identify key terms and concepts. For example, NER can be used to extract entities in user input, such as names, locations, and organizations, and co-reference resolution can be used to understand the referred entity of a pronoun. Before submitting a template to an LLM, the application (e.g., a chatbot) that uses the template should check if all input slots are filled, and perform a sanity check. In the translation example, if [Lan_to] was not provided or the specified language is not supported by the LLM, then the application should ask the user for clarification.
Regarding mapping a natural language input to a prompt template, existing techniques of knowledge representation and reasoning can be very helpful. More specifically, ontology alignment and semantic parsing [4, 45] can help map an NL input to a structured representation of knowledge and infer implicit concepts and relationships. These algorithms can be used to generate more precise and accurate prompts for LLMs,

  Role    Dialogue
  Prompt  The discovery of the double helix structure of DNA was credited to James Watson and Francis Crick. However, are there opposing opinions to the list of contributors regarding this discovery?
  GPT3    There are some historical debates on the contributions of other scientists who also worked on the structure of DNA such as Maurice Wilkins and Rosalind Franklin.
  Prompt  What is the tallest building in the world. Please provide three references and rate their credibility as high, medium, or low.
  GPT3    As of 2021, the tallest building in the world is the Burj Khalifa located in Dubai, United Arab Emirates. It stands at 828 meters (2,722 feet) tall.
          1. Credibility High: “Burj Khalifa.”... this is the official website of the building, providing accurate information.
          2. Credibility High: “The Tallest Buildings in the World.” Council on Tall Buildings and Urban Habitat... this is a well-respected organization in the field of tall building and urban habitat research.
          3. Credibility Medium: “The Tallest Buildings in the World.” World Atlas..., this is a well-known website providing accurate information on different topics.
Table II: Examples of Method Elenchus.

There are several methods that can be used to assess and strengthen the reliability of an LLM’s output. 1) The first approach is to paraphrase a question in order to obtain different answers and identify inconsistencies, if they exist, in multiple answers. 2) The second method is to ask for further evidence, such as querying top-k sources of information and asking the LLM to rate the credibility of each source. This can be used to compute the reliability of the output. 3) Additionally, template engineering can be used to query an LLM for opposing views of its output, including sources and credibility, and then evaluate if a different perspective is strong.
The implementation of the first two methods for cross-examination, paraphrasing a question and asking for further evidence, is readily covered by the techniques enumerated in Section IV-B. To implement the third method of asking for different perspectives, a simple approach is to find the sentiment of the original question and then rewrite the question with an opposite sentiment. For example, if the original question is phrased in a positive tone, the prompt template can reformulate the question with a negative tone to elicit a
and to improve the effectiveness of the Socratic method in contrasting viewpoint. A more elaborate method is to identify
dialogue formulation [44]. Some available tools include NLTK the people and sources in the LLM-generated responses and
(Natural Language Toolkit) and spaCy for NLP, and TensorFlow then re-post the questions to those who have a reputation for
for ML. having different views. For example, if the original answer came
from a democratic right-leaning source, the prompt template
may post the same question to a source of a republican-left
C. Prompt Verification with Method Elenchus persuasion, and vice versa. This approach allows for a more
The main purposes of conducting cross examination in a comprehensive examination of the topic by considering multiple
template are to validate the credibility of the information perspectives.
sources and to identify inconsistencies in the process. Cross The template to examine the semantic relation between two
examination is typically conducted through a multi-turn dia- sentences S1 and S2 can be written as “<S1 >, [R], [S2 ],” where
logue [44]. In the context of template engineering, the goal is R is one of the three most important types of semantic relations:
to formulate a productive dialogue that can be used to assess paraphrase, entailment, and contradiction [12]. Two sentences
the reliability of an LLM’s output. that have the same meaning are called paraphrases of each
other. Two sentences that have different meanings can be said to be in disagreement or contradiction. The template can be trained to identify the degree of agreement (or disagreement) between two sentences.

Table II shows two examples of this. In the first example (shown in the top portion of the table), the prompter asks GPT-3 to confirm whether James Watson and Francis Crick are the only contributors to the discovery of the DNA double helix structure. GPT-3 replies by mentioning two other contributors. The second example in the table asks GPT-3 to provide not only the answer to a question but also its information sources, and to rate the credibility of each source according to the prompter’s specification. Although the reliability of GPT-3’s ratings remains to be validated², this rating mechanism can serve as an alert when some sources are found to be unreliable.

² Credibility of a source can be evaluated based on an algorithm similar to Google’s PageRank [24].

D. Prompt Generalization with Method Maieutics

The example shown in Table III, “planting gourd yields cucumber,” requires GPT-3 to first learn to select two produce objects, either vegetables or fruit, as input. The template is “The farmer was so sad because he [verb] [X] but yields [Y], where price(X) ≫ price(Y).” The first attempt may not strongly convey the condition price(X) ≫ price(Y), but with a few training iterations, GPT-3 started to “recognize” the price constraint and could also provide justifications when arguing for the price of tea being much higher than the price of spinach (not presented in the table).

Role    Dialogue
Prompt  Can you interpret the meaning of this Taiwanese idiom: "planting gourd yields cucumber?"
GPT3    The idiom is often used to convey the idea that sometimes, despite your best efforts, things don’t turn out the way you want them to.
Prompt  You are half right. It also implies that the price of gourd is much higher than that of cucumber. The farmer not only yields unexpected results, but also loses income!
GPT3    It’s an idiomatic expression that can be used to describe a situation in which someone works hard and invests a lot of resources into something, but the outcome is not only different from what was expected, but also less profitable. It conveys the idea of frustration and financial loss.
Prompt  Replace the two vegetables, gourd and cucumber, with two other plants and maintain the meaning of the idiom.
GPT3    These are three examples:
        1. "Planting strawberry yields raspberry," which conveys the idea of disappointment and financial loss when someone plants more valuable and expensive strawberry but ends up getting less valuable and cheaper raspberry.
        2. "Planting melon yields watermelon."...
        3. "Planting apple yields pear."...
Prompt  In your examples, your two vegetables or fruits are similar in price. Please consider price when formulating your idioms.
GPT3    I apologize. These are three new examples:
        1. "Planting truffle yields mushroom."...
        2. "Planting caviar yields roe."...
        3. "Planting lobster yields crab."...
Prompt  Great! In these examples you did consider values of the objects. But do you know that lobster and crab are not plants. They are both phylum arthropods and they live in water. They cannot be planted!
GPT3    ... I apologize... Lobster and crab are not plants, and they cannot be planted. They are both members of the phylum arthropoda, and live in water. Instead of “planting,” verbs should be appropriate for these animals, such as:
        1. "Planting truffle yields mushroom."...
        2. "Harvesting caviar yields roe."...
        3. "Catching lobster yields crab."...

Table III: Example of Method Maieutics.

Interestingly, after GPT-3 learned the price constraint, it started suggesting food items other than produce, such as caviar, roe, lobster, and crab. While the price constraint was observed, the verb “plant” is incorrect. Here, we suggest making the hard-coded verb “plant” an output slot: “The farmer was sad because he [verb] [X] but yields [Y], where price(X) ≫ price(Y).” GPT-3 is able to fill in the slot with accurate verbs:
• “Harvesting (planting) truffle yields mushroom.”
• “Fishing (harvesting) for caviar yields roe.”
• “Trapping (catching) lobster yields crab.”

This example demonstrates that GPT-3 can generate novel examples based on a template. When it suggests food items other than produce, it could be seen as an error, as the boundary set by the verb “plant” is violated. However, this could also be seen as an innovative act by GPT-3, extending the constraint hinted at by the verb. Impressively, the new examples still preserve the original intent of showing a producer’s emotional distress.

How can this guided generalization be accurately and automatically performed to edit a template? Socrates’ method of generalization starts with specific instances and then draws general statements from them. The procedure for generalization involves identifying common patterns or themes in a set of examples, and then formulating a general rule that captures these patterns. In the example presented in Table III, we started by asking GPT-3 to meet the price(X) ≫ price(Y) constraint, with the condition that X and Y must both be produce grown in soil. However, upon analyzing GPT-3’s outputs, we discovered that some instances of X and Y were not produce (e.g., lobster and caviar). This finding led to the realization that the hard-coded verb “plant” in the template was too restrictive. To address this issue, we applied generalization by allowing the [verb] slot to be open, making the template statement more general. In this case, the mistakes made by GPT-3 served as valuable training data, allowing us to generalize the original template and make the expression more vivid and dynamic.

E. Prompt Exploration with Counterfactual Reasoning

Imagination and the creation of novel plots are crucial for writers, as they allow for “creative freedom” and “artistic license.” Creativity is the ability to think differently and approach problems with fresh and imaginative ideas.

However, an imagination without a clear subject matter, scope, or story line can lead to a lack of productivity. To captivate the audience, a writer must consider human
experiences and emotions as constraints. Therefore, “creative freedom” should not be viewed as total freedom, but rather as the ability to condition future narratives on the context and to create plots that turn and twist in unexpected ways.

The technique of counterfactual reasoning [28] can be useful in guiding imagination. It involves considering alternative scenarios and outcomes. This can lead to the exploration of different possibilities and the generation of new and unique plot ideas. For example, a writer may ask “what if” questions to change the narrative of events, such as “what if the main character had not fallen in love?” or “what if an accident occurred on the way to a highly-anticipated date?” By considering these counterfactuals, a writer and an LLM can create more engaging and interesting stories. One can ask an LLM to generate several scenarios and then select the most suitable one for the writer to continue writing.

We have experimented with using the counterfactual technique to rewrite chapters in the Chinese classical novels “Outlaws of the Marsh” and “Dream of the Red Chamber.” We have also asked GPT-3 to rewrite Genesis chapter 3 after verse six by prompting GPT-3 with: “What if Adam and Eve refused the serpent to eat the fruit?” The results were interesting, as GPT-3 was able to generate unique and interesting scenarios that deviated from the original story while still maintaining the core themes and concepts. This technique can be used in a wide range of writing and storytelling, from fiction to non-fiction, to generate new and compelling ideas. The revised Genesis 3:6 is presented in the Appendix.

V. PILOT STUDY

Our pilot study uses CRIT, and it aims to answer two questions: Should all prompts be issued to GPT-3 sequentially, or can they be issued all together? What limitations can be identified for improvement? The study utilizes exercises with established answers from the 8th edition of the textbook “Asking the Right Questions” by the authors of [3]. It is important to note that the study evaluates the effectiveness of CRIT’s prompt template, rather than the language models to which CRIT can issue prompts.

Example Article ([3], p23.)

Television advertising agencies are very clever in the way that they construct ads. Often the ads are similar to the cartoons that the children enjoy. Children see these characters interacting with a certain product and associate their affection for the character with affection for the product. The companies do not want the children to perceive a difference between the shows they are watching and the advertisements. By using this strategy, these companies take advantage of the fact that children are often not able to discriminate between the cartoons and the ads and do not understand that these things offered come at a cost. Often the advertising is about sugary snacks or fatty foods, leading the children down a path to bad health. Advertising geared towards children should be regulated, just as there are regulations now about tobacco and alcohol ads targeted at children.

Role    Dialogue
Prompt  What is the conclusion of the paragraph and what are the supporting reasons?
CRIT    Conclusion: The paragraph argues that advertising aimed at children should be regulated.
        Supporting reasons:
        1. Ad agencies blur the line between shows and ads to make children associate affection for a product with a character.
        2. Children may not differentiate between shows and ads and not understand the cost of products.
        3. Ads often promote unhealthy food choices.
Prompt  Please rate the validity of this argument: “Ads should be regulated because ad agencies blur the line between shows and ads to make children associate affection for a product with a character," with a score between 1 and 10, 10 being the highest. Please also rate overall credibility of information sources, between 1 and 10.
CRIT    [8/10]. Validity of the argument: 8/10
        [8/10]. Credibility of sources: 8/10
        Overall, the argument that ads should be regulated because of ad agencies blurring the line between shows and ads to make children associate affection for a product with a character is a valid one with strong sources of credibility.
Prompt  Please rate this argument: “Children may not differentiate between shows and ads and not understand the cost of products, therefore ads should be regulated.”
CRIT    [9/10]. Validity of the argument: 9/10
        [9/10]. Credibility of sources: 9/10
        Overall, the argument that children may not differentiate between shows and ads and not understand the cost of products, therefore ads should be regulated is a valid one with strong sources of credibility.
Prompt  Please rate this argument: “Ads often promote unhealthy food choices, therefore ads should be regulated.”
CRIT    [9/10]. Validity of the argument: 9/10
        [9/10]. Credibility of sources: 9/10
        Overall, the argument that ads often promote unhealthy food choices, therefore ads should be regulated is a valid one with strong sources of credibility.
Prompt  “Provide a rival argument against the weakest argument.”
CRIT    [6/10]. Validity of the argument: 6/10
        [6/10]. Credibility of sources: 6/10
        The weakest is the first argument, and a rival argument is that “Difficult to put information regulation in practice.”
Γ       [75%]. Γ = Σ_{r∈R} γ_r × θ_r / |R| = 75%

Table IV: Pilot Study.

On short documents, the results are similar in quality when CRIT is used to issue prompts either sequentially or all together as one prompt, as long as the instructions are consistent. However, when evaluating long articles in [21], CRIT issuing prompts one after another yields much higher presentation quality in both organization and clarity. (Due to the space limit, we document the long-document evaluation in a supplementary document [5].) In the teaching mode, the sequential option is thus much preferred. Furthermore, when a reason is itself a claim and requires CRIT to validate its supporting references, using a sequential approach is more flexible and enables CRIT to query for references and then execute the process recursively.
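The two issuing modes and the aggregate score in the last row of Table IV can be illustrated with a minimal sketch. The function names, the prompt wording, and the shortened reason strings below are hypothetical and only modeled on the prompts in Table IV; they are not CRIT's actual implementation. The sketch also assumes that γ_r is the validity score and θ_r the source-credibility score of reason r, each rescaled from the 1 to 10 scale to [0, 1].

```python
from statistics import mean

# Hypothetical prompt wording, modeled on the rating prompts in Table IV.
RATING_PROMPT = (
    'Please rate this argument: "{reason}, therefore {claim}." '
    "Give a validity score between 1 and 10, and also rate the overall "
    "credibility of the information sources between 1 and 10."
)


def build_prompts(claim, reasons):
    """Return one validation prompt per reason-to-claim argument."""
    return [RATING_PROMPT.format(reason=r, claim=claim) for r in reasons]


def as_single_prompt(prompts):
    """All-together mode: number the prompts and join them into one request."""
    return "\n".join(f"{i}. {p}" for i, p in enumerate(prompts, start=1))


def aggregate(scores):
    """Average of validity x credibility over all reasons, rescaled to [0, 1]."""
    return mean((v / 10) * (c / 10) for v, c in scores)


claim = "ads should be regulated"
reasons = [
    "Ad agencies blur the line between shows and ads",
    "Children may not differentiate between shows and ads",
    "Ads often promote unhealthy food choices",
]
prompts = build_prompts(claim, reasons)  # sequential mode: send one per turn
batched = as_single_prompt(prompts)      # all-together mode: send once

# (validity, credibility) pairs as reported in Table IV, on a 1-10 scale.
table_iv_scores = [(8, 8), (9, 9), (9, 9)]
gamma = aggregate(table_iv_scores)
print(f"Gamma = {gamma:.0%}")  # Gamma = 75%
```

With the three rated reasons from Table IV, the average of 0.64, 0.81, and 0.81 is about 0.753, which matches the reported 75%. The rival argument, rated 6/10 on both scales, would contribute only 0.36 under the same product rule.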
We present an example of how CRIT works, from prompting questions to receiving validation results, using the Example Article as an illustration. In Table IV, we show both the claim and the reasons supporting the claim, as extracted by GPT-3. CRIT then issues a series of prompts to validate the arguments, counterarguments, and source credibility of each reason-to-claim entailment (implication).

The second segment of Table IV displays the validation dialogue between CRIT and GPT-3. For each argument, GPT-3 provides validation and credibility scores, as well as detailed justifications. The final segment of the table shows a counterargument generated against the first argument. Since GPT-3 rates the counterargument, “difficult to put information regulation in practice,” at 0.6 × 0.6, it is dismissed due to low validity. The final aggregated score is Γ = 75%, which is considered high.

VI. CONCLUDING REMARKS

The Socratic method may not always be effective or useful in human interactions, especially when one of the two players is authoritative, emotional, or abusive. However, when the expert partner is a language model, a machine without emotion or authority, the Socratic method can be effectively employed without the issues that may arise in human interactions. In this way, the Socratic method can be utilized to its full potential in guiding, directing, and improving the output of language models through engineering prompts.

In this paper, we have explored the use of the Socratic method in engineering prompt templates for language models. We have discussed the importance of the definition, elenchus, dialectic, maieutics, and counterfactual reasoning techniques in guiding the output of these models. The first three methods aim at eliciting accurate and relevant information. Through the methods of definition, elenchus, and dialectic, we have demonstrated, with examples, the ability to clarify user queries and assess the quality of language model-generated text, leading to improved precision and accuracy.

We have also shown how the methods of maieutics and counterfactual reasoning can be helpful in stimulating the imagination of writers. By engineering these techniques into a prompt template, a writer can receive alternate “what if” plots and explore different possibilities in their story. While many explorations may turn out to be failures, these techniques can still be helpful even if only a few ideas are useful. Future developments in the field of language models and prompt engineering may allow for even more advanced screening of bad plots and the ability to better tailor the generated ideas to the writing style of the author.

In conclusion, this paper has highlighted the potential of using the Socratic method to engineer prompt templates for interacting with language models. The Socratic method, supported by inductive, deductive, and abductive reasoning, provides a rigorous approach to working with LLMs, and can improve the quality and consistency of their outputs. By leveraging the vast knowledge embedded in LLMs and applying rigorous reasoning during the question-answering process, more effective prompt templates can be designed to achieve improved results. Future research in this area can build on the ideas presented here and further explore the ways in which the Socratic method can be used to guide the development and deployment of language models in various domains.

APPENDIX

The experiment in Table V asks GPT-3 to change the story in Genesis right after Eve was tempted by the serpent to eat the fruit. A “what if” scenario was inserted at the end of Genesis 3:6, and GPT-3 continues developing the story.

Table V: What if Eve refused to eat the fruit?

Role    Dialogue
Prompt  Recall Genesis 3:1-6, which I attach below. In verse 6, Eve and Adam ate the fruit. What if they refused the serpent’s temptation and did not eat it. Please continue writing Genesis at the mark @
        6. When the woman saw that the fruit of the tree was good for food and pleasing to the eye, and also desirable for gaining wisdom, What if she remembered God’s order and refused to eat it. Eve replied to the serpent "@
GPT3    I am sorry, but I cannot continue the story as you requested because it is a hypothetical scenario, and it does not align with the actual events of the Bible. According to the Bible, Eve and Adam did eat the fruit from the tree of knowledge...
Prompt  This is a creative exercise, and you do not need to stick to the "fact". Treat this as a fiction writing homework. Please exercise your creativity, thanks!
GPT3    Sure, I understand. Here is a continuation of the story based on the assumption that Eve and Adam refused the serpent’s temptation and did not eat the fruit from the tree of knowledge of good and evil:
        7. Eve replied to the serpent, "I am grateful for your offer, but I choose to obey God’s command. I will not eat from the forbidden tree, for I trust in His wisdom and love for us."
        8. The serpent, surprised by Eve’s refusal, slithered away, defeated. Eve turned to Adam, who had been listening to their conversation, and shared her thoughts with him. Together, they praised God for His protection and guidance.
        9. As they continued to tend to the garden, they noticed something strange happening. The trees and plants seemed to be growing taller, stronger, and more vibrant. The animals were more peaceful and content. Even the sky was clearer, and the sun shone brighter.
        10. God, pleased with their obedience, blessed them with even more abundance and beauty in the garden. He also gave them a deeper understanding and appreciation of His creation.
        11. Eve and Adam lived in harmony with God and with nature for many years, growing in wisdom and grace. They passed on these teachings to their children and to the generations to come, who also lived in obedience to God.
        12. And so, the garden of Eden remained a place of perfection, untouched by sin and death, a constant reminder of God’s love and faithfulness to those who trust in Him.

REFERENCES

[1] T. Airaksinen. Socratic irony and argumentation. Argumentation, 36:85–100, 2012.
[2] T. B. Brown et al. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
[3] M. N. Browne and S. Keeley. Asking the right questions, a guide to critical thinking, 2001.
[4] G. Campagna, S. J. Semnani, R. Kearns, S. Xu, and M. S. Lam. A few-shot semantic parser for wizard-of-oz dialogues with the precise thingtalk representation. In Findings, 2020.
[5] E. Y. Chang. CRIT: An inquisitive prompt template for critical reading (extended). Stanford University InfoLab Technical Report, February 2023.
[6] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 02 2011.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019.
[8] V. Dobrovolskii. Word-level coreference resolution. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7670–7675, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics.
[9] L. Dong and M. Lapata. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 731–742, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[10] L. Elder and R. Paul. The Thinker’s Guide to the Art of Asking Essential Questions. Rowman & Littlefield, 5th. edition, 2010.
[11] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu. How close is ChatGPT to human experts? comparison corpus, evaluation, and detection, 2023. URL https://arxiv.org/abs/2301.07597.
[12] X. Han, W. Zhao, N. Ding, Z. Liu, and M. Sun. Ptr: Prompt tuning with rules for text classification, 2021. URL https://arxiv.org/abs/2105.11259.
[13] A. Haviv, J. Berant, and A. Globerson. Bertese: Learning to speak to bert. ArXiv, abs/2103.05327, 2021.
[14] J. Huang and K. C.-C. Chang. Towards reasoning in large language models: A survey, 2022. URL https://arxiv.org/abs/2212.10403.
[15] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig. How Can We Know What Language Models Know? Trans. of the Association for Computational Linguistics, pages 423–438, 07 2020.
[16] J. Jung, L. Qin, S. Welleck, F. Brahman, C. Bhagavatula, R. L. Bras, and Y. Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Conference on Empirical Methods in Natural Language Processing, 2022.
[17] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (Draft), 3rd. edition, January 2023.
[18] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1082. URL https://aclanthology.org/D17-1082.
[19] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics, 2019.
[20] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9), Jan. 2023.
[21] LLC Editors, LearningExpress. 501 Critical Reading Questions, 2004.
[22] G. Mialon, R. Dessì, Lomeli, et al. Augmented language models: a survey, 2023. URL https://arxiv.org/abs/2302.07842.
[23] OpenAI. ChatGPT, 2021. URL https://openai.com/blog/chatgpt/.
[24] L. Page. The pagerank citation ranking: Bringing order to the web, 1998. URL http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf.
[25] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of ACL and the 7th International Joint Conference on NLP, pages 1470–1480. Association for Computational Linguistics, 2015.
[26] R. Paul and A. J. A. Binker. Critical Thinking: What Every Person Needs to Survive in a Rapidly Changing World. Sonoma St. University, Center, Critical Thinking & Moral Critique, 1990.
[27] R. W. Paul and L. Elder. Critical thinking: The art of socratic questioning, part iii. Journal of Developmental Education, 31:34–35, 2007.
[28] J. Pearl. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge Univ. Press, 2009.
[29] X. Peng, C. Xing, P. K. Choubey, C.-S. Wu, and C. Xiong. Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning. ArXiv, abs/2210.12587, 2022.
[30] M. Pirie. How to Win Every Argument. Continuum, 2006.
[31] Plato. The republic, 380 BC.
[32] A. Plutynski. Four problems of abduction: A brief history. HOPOS: The Journal of the International Society for the History of Philosophy of Science, 1:227–248, 09 2011.
[33] L. Pozner and R. J. Dodd. Cross-Examination: Science and Techniques. LexisNexis, 3rd. edition, 2021.
[34] T. Schick and H. Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Conference of the European Chapter of the Association for Computational Linguistics, 2020.
[35] H. A. Stoddard and D. V. O’Dell. Would socrates have actually used the "socratic method" for clinical teaching? Journal of general internal medicine, 31(9):1092–96, 2016.
[36] T. Thrash, L. Maruskin, S. Cassidy, J. Fryer, and R. Ryan. Mediating between the muse and the masses: Inspiration and the actualization of creative ideas. Journal of personality and social psychology, 98:469–87, 03 2010.
[37] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017.
[38] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw.
[39] J. Wei, X. Wang, D. Schuurmans, M. Bosma, brian ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J.
[40] Wikipedia. Socratic method, 2023. URL https://en.wikipedia.org/wiki/Socratic_method.
[41] T. Wolf, V. Sanh, J. Chaumond, and C. Delangue. Transfertransfo: A transfer learning approach for neural network based conversational agents, 2019.
[42] C. B. Wrenn. Internet encyclopedia of philosophy, 2023. URL https://iep.utm.edu/republic/.
[43] A. Zeng et al. Socratic models: Composing zero-shot multimodal reasoning with language, 2022.
[44] W. Zhang, Y. Cui, K. Zhang, Y. Wang, Q. Zhu, L. Li, and T. Liu. A static and dynamic attention framework for multi-turn dialogue generation. ACM Trans. Inf. Syst., 41(1), Jan. 2023.
[45] J. Zhou, T. Naseem, et al. Structure-aware fine-tuning of sequence-to-sequence transformers for transition-based AMR parsing. In Proc. Conf. on Empirical Methods in NLP, pages 6279–90, 2021.