
2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS)

The Causal Reasoning Ability of Open Large Language Model: A Comprehensive and
Exemplary Functional Testing

Shun-Hang Li 1,2, Gang Zhou 1,2,*, Zhi-Bo Li 1,2, Ji-Cang Lu 1,2, and Ning-Bo Huang 1,2
1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
2 Information Engineering University, Zhengzhou, China
baikal [email protected], [email protected], [email protected], [email protected], rylynn [email protected]
* Corresponding author

DOI 10.1109/QRS60937.2023.00032

Abstract—As intelligent software, the development and application of large language models are extremely hot topics recently, bringing tremendous changes to general AI and the software industry. Nonetheless, large language models, especially open source ones, uncontrollably suffer from potential software quality issues such as instability, inaccuracy, and insecurity, making software testing necessary. In this paper, we propose the first solution for functional testing of open large language models to check full-scene availability and conclude empirical principles for better steering large language models, particularly considering their black-box and intelligence properties. Specifically, we focus on the model's causal reasoning ability, which is the core of artificial intelligence but almost ignored by most previous work. First, for comprehensive evaluation, we deconstruct the causal reasoning capability into five dimensions and summarize the forms of the causal reasoning task as causality identification and causality matching. Then, rich datasets are introduced and further modified to generate test cases along different ability dimensions and task forms to improve the testing integrity. Moreover, we explore the ability boundary of open large language models in two usage modes: prompting and lightweight fine-tuning. Our work conducts comprehensive functional testing on the causal reasoning ability of open large language models, establishes benchmarks, and derives empirical insights for practical usage. The proposed testing solution can be transferred to other similar evaluation tasks as a general framework for large language models or their derivations.

Keywords–open large language model; causal reasoning; black-box testing; prompt design; lightweight fine-tuning

[Figure 1 depicts language models of different parameter scales as a cargo ship (super LLMs, >20 B parameters, e.g. GPT-3, PaLM), a yacht (LLMs, 1~20 B, e.g. ChatGLM, LLaMA, where most open LLMs fall), and a canoe (traditional LMs, <1 B, e.g. BERT, RoBERTa).]
Figure 1: A metaphorical illustration of language models with different parameter scales. Typically, the parameter sizes of LLMs (i.e. GPT-3 and ChatGLM) are more massive than traditional LMs (i.e. BERT and RoBERTa), amounting to billions, especially those super LLMs.

1. INTRODUCTION

Large language models (LLMs), built on the skeleton of deep neural networks and trained with abundant data resources through efficient training strategies, have attracted increasing attention[1][2]. These models demonstrate remarkable performance across various natural language processing (NLP) tasks[3]. GPT-4 and Pangu models 3.0, for example, show powerful artificial general intelligence (AGI) and strong alignment with humans in the form of interactive dialogue[4]. Besides, due to the enormous number of parameters, and different from traditional language models, LLMs can store much more knowledge implicitly, as metaphorically illustrated in Figure 1, which facilitates them to work well on downstream tasks in low-data and knowledge-intensive settings[5]. More surprisingly, there is an obvious trend of coupling LLMs with traditional software to build extraordinary software for handling tasks beyond NLP, such as smart calculators[6] and AI search engines[7]. Therefore, large language models, being honored as the most advanced general-purpose intelligent software systems, are transforming the design, development, and testing of software.

Although LLMs have shown powerful performance in most scenarios, their inherent downsides still exist, such as failures on complex logic puzzles and inveracious content generation[8], directly hindering the application of such intelligent software in high-stakes scenarios (e.g., healthcare, education and law). To alleviate this hallucination and inaccuracy, correct logical reasoning ability is most vital, especially causal reasoning capability. Causality is a stable and predictable logical relation[9], and plays a key role in providing interpretability and robustness for many applications. Therefore, "Do LLMs excel at causal reasoning?" is the most significant question that needs to be answered.

While several studies[10][11] have already addressed this question and initially revealed the causal reasoning ability boundary of LLMs, several unanswered questions remain:

Q1: What about the causal reasoning ability of open LLMs? Despite the rapid development of LLMs, a considerable number of them are not fully open, mainly because of the huge and burdensome computational and storage requirements. Naturally, wider research and industrial applications are based on open LLMs from the open source community, providing open LLMs more opportunity to be the backbone in building intelligent software systems. However, existing research focuses on evaluating the causal reasoning abilities of super but closed LLMs like GPT-3.5 without exception. Most open LLMs fall into LLMs with medium parameter scales as shown in Figure 1, and thus blindly generalizing these evaluation schemes and conclusions from super LLMs to open LLMs is not proper.

[Figure 2 shows a timeline of large language models released by different organizations in 2022–2023 (e.g. InstructGPT, ChatGPT, GPT-4, GLM, ChatGLM, OPT, OPT-IML, LLaMA, Galactica, LaMDA, Minerva, Bard, PaLM, Flan PaLM, YaLM, LM v4, Claude), divided into closed and open models.]
Figure 2: Main large language models released by some organizations since 2022. Note that the timeline only shows the chronological order.

Q2: How to evaluate the causal reasoning ability more comprehensively and objectively? Our test subject exhibits a high level of complexity. On one hand, LLMs are black-box software with very limited available information about their specifications. On the other hand, there are various definitions and forms of causality in different research fields. Evaluation datasets vary in terms of languages and labeling strategies used. Most existing work chooses a single thin data set to support the testing, which greatly weakens the breadth and depth of the specific functional testing.

To answer the above problems, our study designs a proper testing solution, and carries out a comprehensive and exemplary evaluation on open LLMs. First, we select several commonly used and ancestral open LLMs as our testing objects to fill the research gap. Then, a comprehensive dimension-form-mode testing solution is proposed, mainly including three steps: (1) Ability Deconstruction. We subdivide the causal reasoning ability into five fine-grained dimensions, including simple causal reasoning, complex causal reasoning, domain-specific causal reasoning, multilingual causal reasoning and pure causal reasoning. Those ability dimensions can cover most LLMs' application scenarios; (2) Task Formalization. Different task forms check the ability of models under test from different perspectives. Here, we classify the causal reasoning task forms as causality identification and causality matching; (3) Model Usage Modes. There are two basic usage modes of LLMs: prompting and lightweight fine-tuning. We test and analyze the performances of different models in the above basic usage modes respectively.

For implementing full testing, rich datasets are introduced and partly modified to generate abundant test cases in accordance with the combined requirements of the above five ability dimensions, two task forms and two usage modes. To summarize, the main contributions of this paper are as follows:
• To the best of our knowledge, we are the first to take open LLMs' causal reasoning ability as a testing target. Our goal is to establish baselines, provide empirical guidance on open LLMs usage, and explore a framework for fully testing specific capabilities of general intelligent software.
• A meticulous causal reasoning ability decomposition solution is proposed, and a compound data set covering various task forms is collected from different public data resources for generating rich test cases (Section 3);
• Two usage modes of LLMs, i.e. prompting (Section 4) and lightweight fine-tuning (Section 5), are both considered. Moreover, empirical usage principles of open LLMs are discussed in Section 6.2;
• Our work provides a paradigm for open LLMs testing, which can be transferred to similar LLMs evaluations.

2. BACKGROUND

2.1. Large Language Model

Due to the development of computing power and neural network structures, pre-trained language models have developed rapidly and shown excellent performance. Specially, taking the roadmap of continuously enlarging model scale and deepening training, the GPT-n family models [12][13][14] have exhibited increasing capabilities in understanding human intentions and generating human-like responses, until ChatGPT reached a shocking level. Notably, in addition to responding in natural language, LLMs are also capable of programming, which enables a large extension of LLMs, empowering them to resolve almost all computable problems. Consequently, they are often referred to as "foundation models" to denote their versatility[15].

Following the GPT-n family models, plenty of LLMs have emerged in the recent two years. As illustrated in Figure 2, a significant portion of these models are not openly available, especially those with over 100 billion parameters called super LLMs. Driven by the collaborative efforts of the open source community, open LLMs, e.g. ChatGLM [16] and LLaMA[17], contribute largely to the advancement of intelligent software ecosystems as another vibrant option.

When applying LMs to downstream tasks, two usage methods, prompting and fine-tuning, are commonly used[2]. Prompting uses prompt words to steer LMs without training, while the fine-tuning method needs abundant downstream training data sets to re-train the LMs.
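As a minimal illustration of the prompting mode (a sketch assuming the Hugging Face transformers library and the small open bigscience/bloomz-560m checkpoint; fine-tuning, by contrast, updates model weights on task-specific data and is revisited in Section 5):

```python
# Minimal prompting sketch: steer an open LLM with a task instruction, no training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-560m"  # small open checkpoint, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Two vehicles collided near the Caoxi Interchange. "
    "Tell me whether there is a causal relation between 'Two vehicles collided' "
    "and 'cross the elevated guardrail' in the above sentence. Answer yes or no only."
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
# Decode only the newly generated tokens, i.e. the model's yes/no answer.
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```

The same instruction-plus-generate pattern underlies the prompting experiments of Section 4; only the prompt text and the backbone model change.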
2.2. Causal Reasoning
Causal reasoning is a complex logical task that aims to judge or rebuild the causal relation between two events described in natural language[18]. The concept scope of causality is very broad, covering, for example, document-level causality, implicit causality and statistical causality. Following [19], we divide the diverse causal reasoning task forms into two categories, as instantiated in Figure 3:
• Causality Identification (CI). Given the context C and event mentions e1 and e2, determine whether there is a causal relationship between the two events, or further judge the direction of the causal relation.
• Causality Matching (CM). Given an event e1 and a relationship rel ∈ {cause, effect}, select an event e2 from the candidate event set to form this certain relationship.

Causality Identification
INPUT:
- [context] Two vehicles collided near the Caoxi Interchange section of the Inner Ring Elevated Road, causing one of the engineering vehicles to cross the elevated guardrail without causing any casualties.
- [event 1] Two vehicles collided
- [event 2] cross the elevated guardrail
OUTPUT: Causality

Causality Matching
INPUT:
- [event 1] A shadow appeared behind me.
- [relationship] select the cause event of event 1
- [candidate events] [Option 1] The sun is rising. [Option 2] The lawn has been mowed.
OUTPUT: Option 1

Figure 3: Two task forms of causal reasoning: causality identification and causality matching.

Indeed, there is a lot of work [20][21] that has tested the causal reasoning ability of traditional LMs, yet some challenges remain. First, poor generalization. Due to the small scale and scarcity of labeled datasets and the large distribution differences between the training and testing sets in the real world, the models' performance is unsatisfactory. Second, difficulty in handling complex causal reasoning. There are various ways to express causality in natural language, such as cross-sentence causality and multi-hop causality. Due to the limitations of traditional language models, existing methods still struggle to perform well in complex causal reasoning.

Leveraging their rich training corpus, LLMs possess distinct advantages when it comes to data distribution transfer and task generalization. Additionally, LLMs excel in comprehending long-distance contexts. Hence, it is both feasible and worthwhile to leverage LLMs for causal reasoning tasks.
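To make the two task forms concrete, the following sketch (our own illustration, with hypothetical field names rather than anything prescribed by the datasets) encodes the two instances of Figure 3 as structured test cases.

```python
# Hypothetical test-case structures mirroring the two task forms in Figure 3.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CICase:  # Causality Identification: is there a causal relation between event1 and event2?
    context: str
    event1: str
    event2: str
    label: str  # gold answer, e.g. "Causality" or "No causality"

@dataclass
class CMCase:  # Causality Matching: pick the candidate that stands in relation `rel` to event1.
    event1: str
    rel: str                                       # "cause" or "effect"
    candidates: List[str] = field(default_factory=list)
    label: int = 0                                 # index of the correct candidate

ci_example = CICase(
    context=("Two vehicles collided near the Caoxi Interchange section of the Inner Ring "
             "Elevated Road, causing one of the engineering vehicles to cross the elevated "
             "guardrail without causing any casualties."),
    event1="Two vehicles collided",
    event2="cross the elevated guardrail",
    label="Causality",
)

cm_example = CMCase(
    event1="A shadow appeared behind me.",
    rel="cause",
    candidates=["The sun is rising.", "The lawn has been mowed."],
    label=0,  # Option 1
)
```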

Table 1. A list of open LLMs chosen for testing: ChatGLM, Alpaca, and Bloomz
Model    | Architecture | Parameter Scale Options   | Features
ChatGLM  | Transformer  | 6B                        | Better both NLG and NLU abilities with the auto-regressive blank infilling pre-training framework
Alpaca   | Decoder only | 7B                        | Better balance between model size and performance based on LLaMA
Bloomz   | Decoder only | 560M, 1.1B, 1.7B, 3B, 7B  | Training with 59 languages

2.3. Specific Functional Testing of LLMs

Built on deep neural networks, the intrinsic mechanisms of LLMs are considered black boxes. Thus, testing of LLMs almost falls into functional testing by nature. A common form of testing is universal functional testing, and BIG-bench[22] published by Google is a typical universal testing benchmark, which involves 204 subtasks. Many tasks in BIG-bench go beyond the scope of traditional NLP performance testing, and focus more on the security and practical application ability of LLMs, such as social bias and childhood education. Until now, the published performance of LLMs does not achieve satisfactory results on BIG-bench.

Specific functional testing is another testing target, which is crucial to domain-specific implementation and aims to understand the pros and cons of LLMs on specific businesses. For example, [23] tests the performance of GPT-3.5 on complex question answering tasks, and [24] explores GPT-3.5 and GPT-4 on knowledge graph construction and reasoning tasks.

Our study serves as another strong supplement to the specific functional testing of LLMs, which focuses on the causal reasoning ability of open LLMs with different task forms, usage modes and ability dimensions. Moreover, we provide an exemplary testing framework, which can be migrated, scaled and customized.

3. TESTING PREPARATION

In this section, we prepare the models under test, the testing solution and the test cases for the functional testing. At first, the large language models under test are selected following two vital principles. Then, the proposed dimension-form-mode testing solution is introduced in detail. Finally, to realize the testing solution, we collect multiple data sources to achieve data-driven test case generation.

3.1. Testing Objects

The popularity of open LLMs promotes the continuous emergence of new versions. As a result, selecting suitable test objects is not straightforward work. Here, three widely used and representative open LLMs are chosen by us: ChatGLM (Tsinghua University)[16], Alpaca (Stanford University)[17], and Bloomz (BigScience)[25]. The principles of selection are: (1) prioritizing precursor LLMs. Generally, an LLM family originates from a precursor model that determines the architecture and capability scope of the entire model family. Along with the evolution of open models, the software quality also turns much more uncontrollable. Hence, taking precursor models as testing objects is relatively objective; (2) widely used LLMs are often representative. The above three LLMs are widely used in the community and industry, making it meaningful to evaluate them for providing model understanding and usage guidance. We summarize the characteristics of the different models in Table 1.

Table 2. Different datasets are collected to generate test cases for different ability dimensions and task forms
Ability dimension    | CI            | CM
Simple CR            | Eval8         | COPA
Complex CR           | ESC, MECI_en  | HLC
Domain-specific CR   | BioCause      | —
Multilingual CR      | MECI_es       | —
Pure CR              | Corr2cause    | —

[Figure 4 sketches the testing solution: causal reasoning (CR) is split into five dimensions (simple, complex, domain-specific, multilingual and pure CR), combined with two task forms (causality identification and causality matching) and two usage modes (prompting and lightweight fine-tuning); data splitting yields a test set for prompting and an additional train set for lightweight fine-tuning.]
Figure 4: An illustration of the dimension-form-mode testing solution proposed by us for comprehensive causal reasoning ability evaluations.

3.2. Testing Solution

The purpose of our study is to explore the causal reasoning ability of open LLMs, which is natural for people but difficult for computers. Different from existing research, we hope to evaluate the causal reasoning ability systematically. Thus, as shown in Figure 4 (top), the causal reasoning ability is deconstructed into five dimensions in our study, which cover most causal reasoning ability demands of intelligent software.
• Simple causal reasoning refers to the causal reasoning task in scenes where the linguistic features of causality expressions are obvious, the event labeling and semantics are complete, and the context length is not too long.
• Complex causal reasoning means the process of analyzing causality in situations that do not meet the criteria of simple causal reasoning.
• Domain-specific causal reasoning aims to explore causality within a specific domain, requiring open LLMs equipped with domain-specific knowledge.
• Multilingual causal reasoning is to identify causal relationships in different languages, especially non-English languages.
• Pure causal reasoning refers to determining causal relationships between abstract variables from correlational statements. It involves less empirical knowledge but more statistical knowledge.

Moreover, the two model usage modes introduced in Section 2.1 and the two task forms for causal reasoning described in Section 2.2 are all considered along with the five ability dimensions.

At the data level, we argue that the multi-dimensional decomposition of abilities necessitates adequate test case coverage, which means the ability dimensions decide the data resource selection criteria. The task forms determine the format of the test cases, while the model usage modes dictate how to split the datasets, as shown in Figure 4 (bottom).
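One way to operationalize this dimension-form-mode decomposition is as a plain test-plan configuration; the sketch below is our own illustrative rendering of Table 2 with hypothetical names, not code from the paper.

```python
# Hypothetical test-plan configuration mirroring Table 2:
# ability dimension -> task form -> data sources used to generate test cases.
TEST_PLAN = {
    "simple_cr":          {"CI": ["Eval8"],          "CM": ["COPA"]},
    "complex_cr":         {"CI": ["ESC", "MECI_en"], "CM": ["HLC"]},
    "domain_specific_cr": {"CI": ["BioCause"],       "CM": []},
    "multilingual_cr":    {"CI": ["MECI_es"],        "CM": []},
    "pure_cr":            {"CI": ["Corr2cause"],     "CM": []},
}

# Prompting only needs a test split; lightweight fine-tuning additionally needs a train split.
USAGE_MODES = ("prompting", "lightweight_fine_tuning")

def datasets_for(dimension: str, form: str) -> list:
    """Return the data sources that supply test cases for one dimension/form cell."""
    return TEST_PLAN[dimension][form]

if __name__ == "__main__":
    print(datasets_for("complex_cr", "CI"))  # ['ESC', 'MECI_en']
```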
3.3. Data-driven Test Case Generation

Test case generation is the core of software testing, and it is also the step where LLMs testing differs from traditional software testing. First, unlike traditional software, which has a clear scope of function, the internal structure of LLMs is not interpretable and thus the function also cannot be fully defined. Second, in the traditional testing process, some methods, such as equivalence partitioning and boundary value analysis, can be used to decrease the test case scale required for testing. For the testing of intelligent software like LLMs, we can only check whether the model output is correct or not, and it is difficult to understand why the model gives such a response. To better predict model behavior and improve the testing integrality, we advocate that data-driven testing is a feasible solution, which means inputting large-scale test cases and conducting statistical analysis.

We argue that the richer the test cases for LLMs testing are, the higher the testing integrality is. Therefore, a single data set can hardly meet the requirements of different causal reasoning dimensions and forms for test case generation. Here, we introduce seven public datasets and modify parts of them to meet the different ability dimensions and input formats, as shown in Table 2. These datasets are summarized as follows:
• Eval8[26] is a human-labeled dataset for relation extraction released by SemEval 2010. It contains 10 types of relations, of which causality samples account for more than 12%. This dataset annotates nominal phrases as event mentions.
• COPA[27] is a typical commonsense reasoning dataset, which belongs to SuperGLUE, an NLU benchmark. This dataset is organized in the form of causality matching.
• ESC[28] is a complex dataset similar to Eval8 in format. The differences are as follows: (1) ESC usually labels verbs as event mentions, and the semantics expressed by verbs are more diverse; (2) the ESC dataset has a greater density of event annotations and thus more complex causal structures. Up to now, effective solutions for this dataset are still lacking.
• HLC[29] is a dataset describing the causality between news titles, while containing a large number of implicit causal relations. Formally, an event in HLC is described by a sentence, so we can easily transfer HLC into the format of causality matching.
• BioCause[30] is built for the causality identification task in the biomedical field, primarily describing the causality between genes, proteins, and other entities.
• MECI[31] is a multilingual dataset for document-level causality identification in Spanish, English, Danish, Turkish, and Urdu. Considering the theoretical ability of the language models under test, the Spanish part (marked as MECI_es) is used for multilingual causal reasoning evaluation, and the English part (marked as MECI_en) for complex causal reasoning.
• Corr2cause[32] is a newly proposed statistical causal inference dataset described in natural language, centered on testing when it is valid or invalid to infer causation from correlation. It builds a bridge between statistical causality inference and natural language causal reasoning, and is used to effectively investigate the statistical knowledge and pure causal reasoning ability of open LLMs.

We obtain vast test cases from the above datasets by generating question-answer pairs, and take F1 score (F1) and Accuracy (A) as the statistical metrics to evaluate the open LLMs' responses. We release our datasets and the corresponding processing scripts at https://github.com/Ewillingfly/CausalityTesting.

[Figure 5 reports, for each data source used in prompting mode (BioInfer (all), COPA (all), Corr2cause (4 variables), ESC (4 topics), Eval8 (test), HLC_en (test), MECI_en (test), MECI_es (test)), the number of test cases and the proportions of causal and non-causal samples.]
Figure 5: The data set sizes and the proportions of positive and negative samples used for model capability testing based on prompt engineering.

4. TESTING IN PROMPTING MODE

In this section and the next, we conduct testing on the causal reasoning ability of open LLMs in two usage modes: prompting and lightweight fine-tuning, respectively.

4.1. Prompt Design

Prompting (also prompt learning) is a newfangled paradigm to transfer a pre-trained model to downstream tasks by adding elaborately designed prompts[2]. Generally, prompt learning can obtain a better p(y|x) between input x and output y by indirectly estimating p(y|x, prompt) without training.

The core of prompt learning is the prompt text (usually the task description), which can be viewed as a "regulator" for model state control. Here, we explore the performance of two prompt design methods, called simple prompt (SP) and in-context prompt (ICP). SP only uses the task description to instruct open LLMs, while ICP additionally adds real samples as demonstrations to provide more information. Figure 6 summarizes the two different prompt forms used in this paper.

Simple Prompt (SP)
- CI: "{context C}. Tell me whether there is a causal relation between pairs of nominals (or words): {e1} and {e2} in above sentence. Answer yes or no only."
- CM: "Given the event {e1}, which choice is more likely to be the {rel} of this event? 1. {candidate 1} 2. {candidate 2}. Only answer 1 or 2 without any other words."
In-Context Prompt (ICP)
- CI: "{context C}. There is (not) a causal relation between pairs of nominals: {e1} and {e2} in above sentence;" + ... + SP
- CM: "Given the event {e1}, the {rel} of this event is likely to be {e2};" + ... + SP
Figure 6: An overview of different prompt texts for different task forms. In-Context Prompt (ICP) consists of k input-label pairs and a simple prompt (SP).

For test case generation in prompting mode, we take the test sets of Eval8, HLC, MECI_en and MECI_es, the first 4 topics of ESC, all of COPA and BioInfer, and the subset with 4 variables in Corr2cause as the data sources to facilitate model comparisons. The statistical details are shown in Figure 5.
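To make the template usage concrete, the sketch below (our own illustration; the function names are hypothetical and not taken from the released scripts) renders a causality-identification case into the SP and ICP texts of Figure 6.

```python
# Render a causality-identification test case into SP and ICP prompt texts (Figure 6 style).
def render_sp_ci(context: str, e1: str, e2: str) -> str:
    """Simple prompt: task description only."""
    return (f"{context}. Tell me whether there is a causal relation between pairs of "
            f"nominals (or words): {e1} and {e2} in above sentence. Answer yes or no only.")

def render_icp_ci(demonstrations, context: str, e1: str, e2: str) -> str:
    """In-context prompt: k labeled demonstrations followed by the simple prompt."""
    demo_lines = []
    for d_context, d_e1, d_e2, d_is_causal in demonstrations:
        polarity = "" if d_is_causal else "not "
        demo_lines.append(f"{d_context}. There is {polarity}a causal relation between pairs "
                          f"of nominals: {d_e1} and {d_e2} in above sentence;")
    return "\n".join(demo_lines + [render_sp_ci(context, e1, e2)])

demos = [("The heavy rain flooded the street", "heavy rain", "flooded", True)]
print(render_icp_ci(demos, "Two vehicles collided near the Caoxi Interchange",
                    "Two vehicles collided", "cross the elevated guardrail"))
```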

Table 3. Results of different models under test with SP on various causal reasoning tasks. Each experiment is repeated 5 times and the average value is taken to ensure the reliability. Bold indicates the optimal result, same below. (Simple CR: Eval8, COPA; Domain-specific CR: BioInfer; Complex CR: ESC, MECI_en, HLC; Multilingual CR: MECI_es; Pure CR: Corr2cause-4.)
             Eval8          COPA    BioInfer       ESC            MECI_en        HLC     MECI_es        Corr2cause-4
             F1      A      A       F1      A      F1      A      F1      A      A       F1      A      F1      A
Random       17.72   50.00  50.00   45.59   50.00  13.89   50.00  27.14   50.00  50.00   7.52    50.00  16.28   50.00
SOTA         84.55   -      99.4    60.70   -      67.90   -      58.10   -      89.40   52.80   -      94.74   -
ChatGLM-6B   22.02   15.79  76.43   61.14   60.54  15.36   16.36  17.36   18.36  19.36   -       -      1.38    0.69
Bloomz-7B    45.77   78.98  76.93   43.51   61.00  19.32   80.34  21.59   80.72  51.90   5.40    95.64  16.67   86.11
Alpaca-7B    21.54   12.07  52.26   59.06   41.90  14.93   8.07   31.41   18.63  47.47   -       -      17.72   9.72

Table 4. The performance of ChatGLM and Bloomz with in-context prompts in both simple and complex causal reasoning tasks. (Simple CR: COPA, Eval8; Complex CR: ESC, HLC.)
                          COPA     Eval8            ESC              HLC
Model                     A        F1       A       F1       A       A
Random                    50.00    19.46    50.00   12.98    50.00   50.00
ChatGLM-6B   ICP          67.30    21.66    14.43   9.42     5.14    55.06
             ∆            -9.13    -0.36    -1.36   -5.94    -9.29   -12.66
Bloomz-7B    ICP          75.13    38.31    83.93   13.96    89.50   51.26
             ∆            -1.80    -7.46    +4.95   -5.36    +9.16   -0.64

4.2. Main Results and Analysis

Table 3 presents the causal reasoning ability of the target models with simple prompts across various test case groups. We observe that:
(1) Open LLMs demonstrate a certain degree of causal reasoning ability with simple prompts (superior to random methods in most cases).
(2) Bloomz demonstrates superior performance compared to the other models on average. More than 75% of the test cases generated from Eval8 and COPA (i.e. Simple CR) can be passed by Bloomz.
(3) ChatGLM and Alpaca exhibit a tendency to classify all test cases as positive samples, resulting in a higher F1 value but a lower accuracy value. Moreover, Alpaca even faces challenges in comprehending questions and cannot generate answers following our instructions.
(4) In domain-specific causal reasoning, ChatGLM demonstrates the most outstanding performance. This may be attributed to ChatGLM's exposure to a more relevant corpus during pre-training.
(5) Most models perform poorly on complex and pure causal reasoning tasks, which indicates that open LLMs do not possess causal reasoning abilities comparable to humans.
(6) Despite Bloom being exposed to approximately 10.8% Spanish corpus during pre-training[25], it seems to have lost its Spanish language understanding ability after downsizing and retraining. This implies that the ability to support complex tasks in multiple languages is not compatible with smaller model sizes.

Table 4 shows the causal reasoning ability of the models under test with in-context prompts on simple and complex reasoning tasks. Most notably, we find that in-context prompting (ICP) does not help causal reasoning and even hurts the performance, which is consistent with the conclusions reached in [1] on ChatGPT. Among them, Bloomz's performance with ICP shows the least decline.

For the subsequent experiments of this section, we use Bloomz as an example to study the influence of model parameters and in-context prompting hyperparameters, because Bloomz shows better performance in the above experiments.

4.3. Impact of Model Size

As shown in Figure 7, we examine the performance of Bloomz with different model sizes in the simple prompting setting on Eval8 and COPA. Generally, larger models tend to exhibit greater versatility, but we find an anomalous phenomenon. On the Eval8 dataset, as the model size increases, its accuracy value shows an N-shape (initially rises, then declines, and finally rises again), while its F1 score presents a V-shape (initially decreases and then increases). Before the model parameter size reaches 3 billion, the trends of F1 score and accuracy are opposite, but after that point, they both increase simultaneously. Similar observations have also been found on the COPA dataset. This phenomenon is named emergence, which is a unique characteristic of complex systems. As the size of the model increases, it demonstrates a progression from random guesses to hallucinations and eventually to a better comprehension of tasks.

Figure 7: Results with different model sizes on Eval8 and COPA. The backbone model is Bloomz.

4.4. Analyses of Hyperparameters in In-Context Prompting

In this section, we analyze the setting of hyperparameters. We carry out the experiment on Eval8 based on Bloomz with SP, and independently repeat the experiment 10 times for each configuration.
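The statistical evaluation behind this analysis amounts to repeating each configuration and aggregating a score; the following sketch shows such a loop under the assumption of a placeholder model call (run_model is a stand-in, not the paper's inference code).

```python
# Hypothetical sweep over the number of in-context demonstrations k:
# repeat each configuration several times and aggregate the scores.
import random
import statistics

def run_model(prompt: str) -> str:
    # Placeholder for the LLM under test; a random yes/no answer stands in
    # so the sketch runs end to end.
    return random.choice(["yes", "no"])

def accuracy(test_cases, k: int) -> float:
    correct = 0
    for case in test_cases:
        # In the real setting the prompt would prepend k labeled demonstrations (ICP-k).
        prompt = f"[{k} demonstrations omitted] {case['question']}"
        correct += int(run_model(prompt) == case["answer"])
    return correct / len(test_cases)

def sweep(test_cases, ks=(1, 5, 10, 25), repeats=10):
    results = {}
    for k in ks:
        runs = [accuracy(test_cases, k) for _ in range(repeats)]
        results[k] = (statistics.mean(runs), statistics.stdev(runs))
    return results

cases = [{"question": "Is there a causal relation? Answer yes or no.", "answer": "yes"}]
print(sweep(cases))
```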

From the box diagram shown in Figure 8, we find that as the number of demonstration samples increases, the model performance generally improves and becomes more stable. But when it increases to 25, the model performance shows a slight decline. This can be attributed to incomplete inputs resulting from exceeding the model's input limit due to lengthy prompt texts.

Figure 8: Performance with different demonstration numbers (ICP-k) on Eval8. The backbone model is Bloomz.

Then, what is the impact of disrupting the balance of samples in the prompt? We change the proportion of positive samples from 0 to 1, and find that higher positive proportions yield a better F1-score, as shown in Figure 9. This phenomenon shows that the simple label space does not need to be fully present in the prompt, which is not consistent with the conclusion of [33].

Figure 9: Results with varying proportions of positive samples in the demonstrations.

5. TESTING IN LIGHTWEIGHT FINE-TUNING MODE

Pre-training + fine-tuning is the classic paradigm for NLP. Fine-tuning is actually the process of learning the specific data distribution of downstream tasks. For LLMs, full fine-tuning methods are not applicable due to the high cost. Instead, many lightweight fine-tuning techniques have been proposed in recent years. In this section, we explore the causal reasoning ability of open LLMs after lightweight fine-tuning.

5.1. Lightweight Fine-tuning Methods

The lightweight fine-tuning methods for large models typically involve fixing all or a majority of the parameters and then learning a small subset of parameters internally, or introducing additional trainable modules to achieve transfer to downstream tasks. In summary, they can be divided into the following categories:
• Freeze Tuning. This approach freezes most of the model parameters and only allows the gradient of the loss function to back-propagate to optimize certain layers, which is relatively naive.
• Adapter Tuning. Some studies insert new layers into the original model structure, fix the model parameters, and fine-tune the parameters of the newly added layers. The recently proposed LoRA [34] is the most popular method guided by this idea, and its theoretical assumption is that the learned parameters reside on a low intrinsic dimension.
• Prompt Tuning (P*-tuning). As mentioned above, prompts provide context to LLMs, helping them adapt to downstream tasks. However, manually designed prompts do not couple well with LLMs, and inducing the correct model capabilities is difficult. Therefore, some scholars have proposed to learn virtual prompts, which are mapped into a continuous vector space, such as Prefix-tuning[35], P-tuning[36], and P-tuning v2[37]. P-tuning v2 is the latest proposed method, which combines the advantages of both P-tuning and Prefix-tuning.

LoRA, Freeze-tuning and P-tuning v2 are chosen as our lightweight fine-tuning methods under test. These three methods are applied to the ChatGLM and Bloomz models to observe the model performance. As shown in Figure 10, a portion of each dataset needs to be used as a training set for lightweight fine-tuning.

[Figure 10 reports, for each training source (Eval8 train, Corr2cause train, ESC train, MECI_en train, MECI_es train), the train data set size and the proportion of positive samples.]
Figure 10: The train data set sizes and the proportions of positive samples used for model capability testing based on lightweight fine-tuning.
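For orientation, the following is a minimal LoRA setup sketch, assuming the Hugging Face peft library and the bigscience/bloomz-560m checkpoint; the target module name query_key_value matches the BLOOM architecture and would differ for other backbones.

```python
# Minimal LoRA sketch: freeze the base model and train small low-rank adapter matrices.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank of the low-rank update matrices
    lora_alpha=16,      # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# The adapted model would then be trained on the causal-reasoning train split
# (e.g. question-answer pairs generated from Eval8) with a standard
# causal-language-modeling objective, for example via transformers.Trainer.
```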

5.2. Main Results

Table 5. The performance of ChatGLM and Bloomz with different lightweight fine-tuning methods on various datasets
                                     Eval8          ESC            MECI_en        MECI_es        Corr2cause-4
                                     F1      A      F1      A      F1      A      F1      A      F1      A
Traditional LMs (Full fine-tuning)   82.43   -      45.3    -      43.9    -      39.0    -      69.29   92.92
ChatGLM   LoRA                       92.59   98.23  25.05   91.16  62.46   86.91  -       -      38.46   88.89
          P-tuning v2                64.39   92.71  14.61   75.7   18.94   79.22  -       -      8.33    84.72
          Freeze tuning              89.88   97.5   13.81   91.56  38.8    81.05  -       -      55.00   87.50
Bloomz    LoRA                       90.65   97.83  15.22   91.91  63.54   87.33  44.72   96.36  46.15   90.28
          P-tuning v2                58.32   89.73  14.64   90.09  19.57   80.03  37.92   96.14  9.86    89.26
          Freeze tuning              93.09   98.34  19.69   92.23  40.67   81.3   35.26   95.65  60.00   91.67

Table 5 presents the causal reasoning ability of the target models across various test datasets after lightweight fine-tuning. Based on the results, we can draw the following initial conclusions:
(1) In summary, lightweight fine-tuning methods can improve the performance of LLMs, even over traditional LMs.
(2) LoRA and Freeze tuning show stronger performance in most scenarios than P-tuning v2.
(3) Training with P-tuning v2 shows slow convergence, especially when there are only small data sets to support training.
(4) The tested lightweight fine-tuning methods hardly introduce, in a fundamental way, abilities that the LLMs do not originally possess, even when increasing the training dataset scale.

5.3. Out-of-Distribution (O.O.D.) Performance after Fine-tuning

To further investigate the generalizability of the target models after lightweight fine-tuning, we run the fine-tuned models on out-of-distribution (o.o.d.) test sets. Figure 11 shows the o.o.d. testing results using Bloomz with LoRA and Freeze-tuning respectively, where the horizontal axis represents the test dataset and the vertical axis shows the training set. The diagonal is the fine-tuning performance under the same distribution of the training and test datasets, set to unit 1. For each column, the remaining off-diagonal elements are the ratio of o.o.d. performance to independent and identically distributed (i.i.d.) performance. We observe that models fine-tuned with LoRA have slightly better generalizability to o.o.d. data sets than those with Freeze-tuning. Moreover, this testing indirectly reflects the differences between these data sets' distributions. Specially, training on meci_es and meci_en helps on the ESC test set more than training on ESC itself.

[Figure 11 consists of two heat maps (Bloomz using LoRA and Bloomz using Freeze tuning) whose rows are the training sets and whose columns are the test sets (eval8, esc, meci_en, meci_es, corr2cause), with each cell giving the ratio of o.o.d. to i.i.d. performance.]
Figure 11: The out-of-distribution performance after lightweight fine-tuning. Take Bloomz fine-tuned with LoRA and Freeze-tuning as an example.

6. DISCUSSION

In this paper, the performances of open LLMs in different causal reasoning dimensions, task forms and usage modes are systematically tested. Here, we discuss the theoretical and practical implications, present the empirical insights on LLMs usage, and describe the limitations of our study and future work directions.

6.1. Theoretical and Practical Implications

This study is the first to evaluate open LLMs' causal reasoning abilities, filling the blank of specific functional testing on open LLMs. Our aim is to understand open LLMs' performances in different task forms, model usages and ability dimensions. Therefore, a "dimensions-forms-modes" testing solution is designed. Following this testing solution, we effectively improve the comprehensiveness of the testing and make it generalizable to other tasks. That is, we can break down a complex functionality into several sub-functional dimensions, which can be parallel, sequential or connected by other logical relations.

Besides, our study focuses on the causal reasoning abilities of LLMs, which are ignored by most previous work. By introducing massive labeled data sets, we statistically study this ability's boundaries, and verify the effects of different model parameter sizes, prompt engineering, and lightweight fine-tuning methods. Hence, our study provides a benchmark and a guideline for further open LLMs development.

6.2. Empirical Insights on LLMs Usage

Based on the experiments and analysis above, we present some empirical principles for using large language models in causal reasoning.
• LLMs have a certain level of causal reasoning ability, but they still cannot reach a satisfactory level. And blindly increasing the model size does not necessarily lead to improved performance. In comparative terms, employing meticulous design in the utilization methods and training data of models represents a more cost-effective approach to enhancing the specialized capabilities of large-scale models;
• LoRA is an excellent lightweight fine-tuning method, particularly in scenarios with limited training data. Moreover, LoRA demonstrates stronger modularity and exhibits ease of scalability and transferability;
• Enhancing a specific capability of large models through continued training often results in a decline in other aspects of performance;
• For knowledge-intensive tasks, using LLMs with in-context prompting often leads to a decline in performance;

• Large models are highly sensitive to prompts, and appropriate prompts should be human-readable and avoid ambiguity in wording;
• Appropriate task formulations help elicit stronger performance from LLMs. Generally, smaller label spaces are more conducive to the functioning of large models.

6.3. Limitations and Further Directions

Although our work provides a comprehensive and exemplary framework for data-driven specific functional testing, there are still some limitations. First, the orthogonalization and unbiasedness of test cases are difficult to achieve. For example, tests of multilingual causal reasoning ability may also involve test cases complying with the complex causal reasoning task, which may compromise the independence. The reasons are that test cases are described in flexible natural language and the deconstruction of functional dimensions cannot be completely independent. We claim that this can only be mitigated rather than completely avoided. Second, larger open LLMs have yet to be explored. The parameter scale of the test objects selected in our study is about 7 billion, because we argue that the applicability of open LLMs with excessively large parameter sizes remains limited in most practical scenarios.

The trend of combining large models with traditional software is irreversible. In this context, we believe that future research directions include the following: (1) reasonably deconstructing capability demands and mapping them onto the corresponding LLM capabilities for testing; (2) richer and more independent test data sets for better test case generation; (3) test-free performance prediction for specific capabilities of LLMs; (4) lighter test case generation approaches; (5) highly integrated testing tools and platforms.

7. CONCLUSION

In this paper, the causal reasoning ability of open large language models is comprehensively evaluated via a general testing framework, which fully considers different capability dimensions, task forms and model usage modes. Experiment results show that: (1) open LLMs can fulfill simple causal reasoning tasks, but have great shortcomings in complex causal reasoning, multilingual causal reasoning, domain-specific causal reasoning and pure causal reasoning. (2) Open LLMs are sensitive to input prompts, and the sensitivity is more obvious at smaller model scales. (3) In-context prompting is not beneficial to the performance, and aggravates the model illusion. (4) Lightweight fine-tuning of large models has positive effects, especially LoRA, which can rapidly improve the performance with small training data sets. (5) It is important to choose the proper model base for different downstream applications.

In addition, the "dimension-form-mode" testing solution proposed in this paper can effectively support specific competency testing in theory and practice, and has the possibility of generalizing to other similar fields. We hope that this work will inspire future work about "big model + traditional software", and the intelligent software testing in high-risk social domains.

ACKNOWLEDGMENT

The authors would like to thank all open source communities for providing such great open large language models. They do a good job!

REFERENCES

[1] W. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, et al., "A survey of large language models", arXiv preprint arXiv:2303.18223, 2023.
[2] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing", ACM Comput. Surv., vol. 55, no. 9, pp. 1–35, January 2023.
[3] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, et al., "Pre-trained models: Past, present and future," AI Open, vol. 2, pp. 225–250, 2021.
[4] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, et al., "Training language models to follow instructions with human feedback", arXiv preprint arXiv:2203.02155, 2022.
[5] K. Zhang, B. J. Gutierrez, and Y. Su, "Aligning instruction tasks unlocks large language models as zero-shot relation extractors", in Findings of the Association for Computational Linguistics: ACL 2023, pp. 794–812, 2023.
[6] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, et al., "Toolformer: Language models can teach themselves to use tools", arXiv preprint arXiv:2302.04761, 2023.
[7] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, et al., "WebGPT: Browser-assisted question-answering with human feedback", arXiv preprint arXiv:2112.09332, 2021.
[8] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, et al., "Survey of hallucination in natural language generation", ACM Comput. Surv., vol. 55, no. 12, pp. 1–38, 2023.
[9] J. Yang, S. C. Han, and J. Poon, "A survey on extraction of causal relations from natural language text", Knowl Inf Syst, vol. 64, no. 5, pp. 1161–1186, 2022.
[10] M. Hobbhahn, T. Lieberum, and D. Seiler, "Investigating causal understanding in LLMs," in NeurIPS 2022 Workshop on Causality for Real-world Impact, 2022.
[11] X. Liu, D. Yin, C. Zhang, Y. Feng, and D. Zhao, "The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code," in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, pp. 9009–9022, 2023.
[12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., "Language models are unsupervised multitask learners," OpenAI blog, p. 9, 2019.
[13] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training", 2018.
[14] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, et al., "Language models are few-shot learners", arXiv preprint arXiv:2005.14165, 2020.

[15] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, et al., "On the opportunities and risks of foundation models", arXiv preprint arXiv:2108.07258, 2022.
[16] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, et al., "GLM-130B: An open bilingual pre-trained model," in The Eleventh International Conference on Learning Representations (ICLR), 2023.
[17] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, et al., "LLaMA: Open and efficient foundation language models", arXiv preprint arXiv:2302.13971, 2023.
[18] B. Drury, H. Gonçalo Oliveira, and A. de Andrade Lopes, "A survey of the extraction and applications of causal relations," Natural Language Engineering, vol. 28, no. 3, pp. 361–400, 2022.
[19] J. Gao, X. Ding, B. Qin, and T. Liu, "Is ChatGPT a good causal reasoner? A comprehensive evaluation", arXiv preprint arXiv:2305.07375, 2023.
[20] X. Zuo, P. Cao, Y. Chen, K. Liu, J. Zhao, W. Peng, et al., "LearnDA: Learnable Knowledge-Guided Data Augmentation for Event Causality Identification," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Online: Association for Computational Linguistics, 2021, pp. 3558–3571.
[21] S. Shen, H. Zhou, T. Wu, and G. Qi, "Event Causality Identification via Derivative Prompt Joint Learning," in Proceedings of the 29th International Conference on Computational Linguistics, COLING, 2022, pp. 2288–2299.
[22] BIG-bench authors, "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models," Transactions on Machine Learning Research, 2023. [Online]. Available: https://openreview.net/forum?id=uyTL5Bvosj
[23] Y. Tan, D. Min, Y. Li, W. Li, N. Hu, Y. Chen, et al., "Evaluation of ChatGPT as a question answering system for answering complex questions", arXiv preprint arXiv:2303.07992, 2023.
[24] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, et al., "LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities", arXiv preprint arXiv:2305.13168, 2023.
[25] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, et al., "BLOOM: A 176B-parameter open-access multilingual language model", arXiv preprint arXiv:2211.05100, 2023.
[26] I. Hendrickx, S. N. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, et al., "SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals," in Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), Boulder, Colorado: Association for Computational Linguistics, pp. 33–38, 2009.
[27] A. Gordon, Z. Kozareva, and M. Roemmele, "SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning," in *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, Montréal, Canada: Association for Computational Linguistics, Jul. 2012, pp. 394–398.
[28] T. Caselli and P. Vossen, "The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction," in Proceedings of the Events and Stories in the News Workshop, Vancouver, Canada: Association for Computational Linguistics, 2017, pp. 77–86.
[29] I. Gusev and A. Tikhonov, "HeadlineCause: A dataset of news headlines for detecting causalities," in International Conference on Language Resources and Evaluation, 2021.
[30] C. Mihăilă, T. Ohta, S. Pyysalo, and S. Ananiadou, "BioCause: Annotating and analysing causality in the biomedical domain," BMC Bioinformatics, vol. 14, no. 1, p. 2, Jan. 2013, doi: 10.1186/1471-2105-14-2.
[31] V. D. Lai, A. P. B. Veyseh, M. L. Nguyen, F. Dernoncourt, and T. H. Nguyen, "MECI: A multilingual dataset for event causality identification," in International Conference on Computational Linguistics, 2022.
[32] Z. Jin, J. Liu, Z. Lyu, S. Poff, M. Sachan, R. Mihalcea, et al., "Can large language models infer causation from correlation?", arXiv preprint arXiv:2306.05836, 2023.
[33] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, et al., "Rethinking the role of demonstrations: What makes in-context learning work?," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 11048–11064.
[34] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, et al., "LoRA: Low-Rank Adaptation of Large Language Models", arXiv preprint arXiv:2106.09685, 2021.
[35] X. L. Li and P. Liang, "Prefix-Tuning: Optimizing Continuous Prompts for Generation," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Online: Association for Computational Linguistics, 2021, pp. 4582–4597.
[36] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, et al., "GPT Understands, Too", arXiv preprint arXiv:2103.10385, 2021.
[37] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, et al., "P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks", arXiv preprint arXiv:2110.07602, 2021.
