The Causal Reasoning Ability of Open Large Language Model: A Comprehensive and Exemplary Functional Testing
Shun-Hang Li1,2, Gang Zhou1,2,*, Zhi-Bo Li1,2, Ji-Cang Lu1,2, and Ning-Bo Huang1,2
1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
2 Information Engineering University, Zhengzhou, China
baikal [email protected], [email protected], [email protected], [email protected], rylynn [email protected]
*Corresponding author
… LLMs? Despite the rapid development of LLMs, a considerable number of them are not fully open, mainly because of the huge and burdensome computational and storage requirements. Naturally, wider research and industrial applications are built on open LLMs from the open source community, giving open LLMs more opportunity to be the backbone of intelligent software systems. However, existing research focuses on evaluating the causal reasoning abilities of super but closed LLMs like GPT-3.5 without exception. Most open LLMs fall into the medium parameter scale range shown in Figure 1, and thus blindly generalizing these evaluation schemes and conclusions from super LLMs to open LLMs is not proper.

Q2: How to evaluate the causal reasoning ability more comprehensively and objectively? Our test subject exhibits a high level of complexity. On one hand, LLMs are black-box software with very limited available information about their specifications. On the other hand, there are various definitions and forms of causality in different research fields, and evaluation datasets vary in terms of languages and labeling strategies. Most existing work chooses a single thin data set to support the testing, which greatly weakens the breadth and depth of the specific functional testing.

To answer the above problems, our study designs a proper testing solution and carries out a comprehensive and exemplary evaluation on open LLMs. First, we select several commonly used and ancestral open LLMs as our testing objects to fill the research gap. Then, a comprehensive dimension-form-mode testing solution is proposed, mainly including three steps: (1) Ability Deconstruction. We subdivide the causal reasoning ability into five fine-grained dimensions: simple causal reasoning, complex causal reasoning, domain-specific causal reasoning, multilingual causal reasoning and pure causal reasoning. These ability dimensions cover most LLMs' application scenarios; (2) Task Formalization. Different task forms check the ability of models under test …

The main contributions of this paper are as follows:
• To the best of our knowledge, we are the first to take open LLMs' causal reasoning ability as the testing target. Our goal is to establish baselines, provide empirical guidance on open LLMs usage, and explore a framework for fully testing some specific capabilities of general intelligent software.
• A meticulous causal reasoning ability decomposition solution is proposed, and a compound data set covering various task forms is collected from different public data resources for generating rich test cases (Section 3);
• Two usage modes of LLMs, i.e. prompting (Section 4) and lightweight fine-tuning (Section 5), are both considered. Moreover, empirical usage principles of open LLMs are discussed in Section 6.2;
• Our work provides a paradigm for open LLMs testing, which can be transferred to similar LLMs evaluations.

2. BACKGROUND

2.1. Large Language Model

Due to the development of computing power and neural network structures, pre-trained language models have developed rapidly and shown excellent performance. Specifically, by taking the roadmap of continuously enlarging model scale and deepening training, GPT-n family models [12][13][14] have exhibited increasing capabilities in understanding human intentions and generating human-like responses, until ChatGPT reached a striking level. Notably, in addition to responding in natural language, LLMs are also capable of programming, which greatly extends their scope and empowers them to resolve almost all computable problems. Consequently, they are often referred to as "foundation models" to denote their versatility[15].

Following the GPT-n family, plenty of LLMs have emerged in the recent two years. As illustrated in Figure 2, a significant portion of these models are not openly available, especially those with over 100 billion parameters, called super LLMs. Driven by the collaborative efforts of the open source community, open LLMs such as ChatGLM [16] and LLaMA [17] contribute largely to the advancement of intelligent software ecosystems as another vibrant option.

Figure 2: Main large language models released by some organizations since 2022. Note that the timeline only shows the chronological order.

Applying LMs to downstream tasks, two usage methods, prompting and fine-tuning, are commonly used[2]. Prompting uses prompt words to steer LMs without training, while fine-tuning needs abundant downstream training data sets to re-train the LMs.
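To make the two usage methods concrete, here is a minimal sketch, assuming a small open checkpoint and an illustrative prompt (neither is the exact setup used in this paper): prompting queries the frozen model directly, while fine-tuning would re-train it on labeled downstream data.

```python
# Minimal sketch contrasting prompting and fine-tuning.
# Model name and prompt wording are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-560m"  # placeholder open LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# (1) Prompting: steer the frozen model with a task description only.
prompt = ("Two vehicles collided, causing one of them to cross the guardrail. "
          "Is there a causal relation between 'collided' and 'cross the guardrail'? "
          "Answer yes or no only.")
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))

# (2) Fine-tuning: re-train (part of) the model on abundant labeled
# downstream data, e.g. with transformers.Trainer or a PEFT method,
# before issuing the same kind of query.
```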
2.2. Causal Reasoning

Causal reasoning is a complex logical task that aims to judge or rebuild the causal relation between two events described in natural language[18]. The conceptual scope of causality is very broad, covering, for example, document-level causality, implicit causality and statistical causality. Following [19], we divide the diverse causal reasoning task forms into two categories, causality identification (CI) and causality matching (CM), as instantiated in Figure 3.

Figure 3 (excerpt), causality identification instance:
INPUT:
- [context] Two vehicles collided near the Caoxi Interchange section of the Inner Ring Elevated Road, causing one of the engineering vehicles to cross the elevated guardrail without causing any casualties.
- [event 1] Two vehicles collided
- [event 2] cross the elevated guardrail
OUTPUT: Causality

Table 1. A list of open LLMs chosen for testing: ChatGLM, Alpaca, and Bloomz
Model   | Architecture | Parameter Scale Options  | Features
ChatGLM | Transformer  | 6B                       | Better in both NLG and NLU abilities thanks to the auto-regressive blank infilling pre-training framework
Alpaca  | Decoder only | 7B                       | Better balance between model size and performance, based on LLaMA
Bloomz  | Decoder only | 560M, 1.1B, 1.7B, 3B, 7B | Trained with 59 languages
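To make the two task forms introduced in Section 2.2 concrete, below is a minimal sketch of how causality identification (CI) and causality matching (CM) test cases could be represented before being rendered into prompts; the class and field names are our own illustrative choices, not the released data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CITestCase:
    """Causality identification: decide whether two marked events are causally related."""
    context: str
    event_1: str
    event_2: str
    label: bool  # True -> "Causality", False -> "No causality"

@dataclass
class CMTestCase:
    """Causality matching: pick the choice that is the cause/effect of the given event."""
    event: str
    relation: str       # "cause" or "effect"
    choices: List[str]  # candidate events
    answer_index: int   # index of the correct choice

# Example mirroring the causality identification instance in Figure 3.
ci_example = CITestCase(
    context=("Two vehicles collided near the Caoxi Interchange section of the "
             "Inner Ring Elevated Road, causing one of the engineering vehicles "
             "to cross the elevated guardrail without causing any casualties."),
    event_1="Two vehicles collided",
    event_2="cross the elevated guardrail",
    label=True,
)
```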
Table 2. Different datasets are collected to generate test cases for different ability dimensions and task forms
Dimension          | CI           | CM
Simple CR          | Eval8        | COPA
Complex CR         | ESC, MECI en | HLC
Domain-specific CR | BioCause     | —
Multilingual CR    | MECI es      | —
Pure CR            | Corr2cause   | —

[Figure (overview of the testing solution): Causality Reasoning (CR) of open LLMs; five dimensions: Simple CR, Complex CR, Domain-specific CR, Multilingual CR, Pure CR; two forms: CI, CM; two modes.]

… are complete, and the context length is not too long.

Figure 5: The data set sizes and the proportions of positive and negative samples used for model capability testing based on prompt engineering.
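Table 2 can be read as a small configuration driving test case generation. The sketch below is a hypothetical rendering of that mapping; the dictionary and function names are ours, not the released processing scripts.

```python
# Hypothetical mapping from (ability dimension, task form) to source datasets,
# mirroring Table 2; "CI" = causality identification, "CM" = causality matching.
TEST_SOURCES = {
    ("Simple CR", "CI"): ["Eval8"],
    ("Simple CR", "CM"): ["COPA"],
    ("Complex CR", "CI"): ["ESC", "MECI en"],
    ("Complex CR", "CM"): ["HLC"],
    ("Domain-specific CR", "CI"): ["BioCause"],
    ("Multilingual CR", "CI"): ["MECI es"],
    ("Pure CR", "CI"): ["Corr2cause"],
}

def datasets_for(dimension: str, form: str):
    """Return the dataset names that would feed test case generation for one cell of Table 2."""
    return TEST_SOURCES.get((dimension, form), [])

print(datasets_for("Complex CR", "CI"))  # ['ESC', 'MECI en']
```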
software testing. First, unlike traditional software, which has a clear scope of function, the internal structure of LLMs is not interpretable, and thus their function also cannot be fully defined. Second, in the traditional testing process, some methods, such as equivalence partitioning and boundary value analysis, can be used to decrease the test case scale required for testing. For the testing of intelligent software like LLMs, we can only check whether the model output is correct or not, and it is difficult to understand why the model gives such a response. To better predict model behavior and improve testing completeness, we advocate that data-driven testing is a feasible solution, which means inputting large-scale test cases and conducting statistical analysis.

Figure 6: An overview of different prompt texts for different task forms. In-context prompt (ICP) consists of k input-label pairs and a simple prompt (SP).
• CI, SP: "{context}. Tell me whether there is a causal relation between pairs of nominals (or words): {event 1} and {event 2} in above sentence. Answer yes or no only."
• CI, ICP: "{context}. There is (not) a causal relation between pairs of nominals: {event 1} and {event 2} in above sentence;" + … + SP
• CM, SP: "Given the event {event}, which choice is more likely to be the {relation} of this event? 1. {choice 1} 2. {choice 2}. Only answer 1 or 2 without any other words."
• CM, ICP: "Given the event {event}, the {relation} of this event is likely to be {choice};" + … + SP
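As one concrete reading of the templates in Figure 6, the sketch below assembles a simple prompt and an in-context prompt for the CI form; the function names, placeholder fields, and the demonstration sentence are our own illustrative assumptions, and the exact wording of the released prompts may differ.

```python
def simple_prompt_ci(context: str, event_1: str, event_2: str) -> str:
    """Simple prompt (SP) for causality identification, following Figure 6."""
    return (f"{context}. Tell me whether there is a causal relation between "
            f"pairs of nominals (or words): {event_1} and {event_2} in above "
            f"sentence. Answer yes or no only.")

def demo_ci(context: str, event_1: str, event_2: str, is_causal: bool) -> str:
    """One labeled demonstration used inside an in-context prompt (ICP)."""
    polarity = "" if is_causal else "not "
    return (f"{context}. There is {polarity}a causal relation between pairs of "
            f"nominals: {event_1} and {event_2} in above sentence;")

def in_context_prompt_ci(demos, query) -> str:
    """ICP = k input-label demonstrations followed by the simple prompt."""
    demo_text = " ".join(demo_ci(*d) for d in demos)
    return f"{demo_text} {simple_prompt_ci(*query)}"

# Usage: one demonstration (k = 1) plus the query from Figure 3.
demos = [("The heavy rain flooded the street", "heavy rain", "flooded", True)]
query = ("Two vehicles collided near the Caoxi Interchange section, causing one "
         "of the engineering vehicles to cross the elevated guardrail",
         "Two vehicles collided", "cross the elevated guardrail")
print(in_context_prompt_ci(demos, query))
```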
We argue that the richer the test cases for LLMs testing are, the higher the testing completeness is. Therefore, a single data set can hardly meet the requirements of different causal reasoning dimensions and forms for test case generation. Here, we introduce seven public datasets and modify some of them to fit the different ability dimensions and input formats, as shown in Table 2. These datasets are summarized as follows:
• Eval8[26] is a human-labeled dataset for relation extraction released by SemEval 2010. It contains 10 types of relations, of which causality samples account for more than 12%. This dataset annotates nominal phrases as event mentions.
• COPA[27] is a typical commonsense reasoning dataset, which belongs to SuperGLUE, an NLU benchmark. This dataset is organized in the form of causality matching.
• ESC[28] is a complex dataset similar to Eval8 in format. The differences are as follows: (1) ESC usually labels verbs as event mentions, and the semantics expressed by verbs are more diverse; (2) the ESC dataset has a greater density of event annotations and thus more complex causal structures. Up to now, effective solutions for this dataset are still lacking.
• HLC[29] is a dataset describing the causality between news titles, and it contains a large number of implicit causal relations. Formally, an event in HLC is described by a sentence, so we can easily transfer HLC into the format of causality matching.
• BioCause[30] is built for the causality identification task in the biomedical field, primarily describing the causality between genes, proteins, and other entities.
• MECI[31] is a multilingual dataset for document-level causality identification in Spanish, English, Danish, Turkish, and Urdu. Considering the theoretical ability of the language models under test, the Spanish part (marked as MECI es) is used for multilingual causal reasoning evaluation, and the English part (marked as MECI en) for complex causal reasoning.
• Corr2cause[32] is a newly proposed statistical causal inference dataset described in natural language, centered on testing when it is valid or invalid to infer causation from correlation. It builds a bridge between statistical causal inference and natural language causal reasoning, and is used to effectively investigate the statistical knowledge and pure causal reasoning ability of open LLMs.

We obtain vast test cases from the above datasets by generating question-answer pairs, and take F1 score (F1) and Accuracy (A) as the statistical metrics to evaluate open LLMs' responses. We release our datasets and corresponding processing scripts at https://github.com/Ewillingfly/CausalityTesting.

4. TESTING IN PROMPTING MODE

In this and the next section, we conduct testing on the causal reasoning ability of open LLMs in two usage modes: prompting and lightweight fine-tuning, respectively.

4.1. Prompt Design

Prompting (also prompt learning) is a newly emerged paradigm that transfers a pre-trained model to downstream tasks by adding elaborately designed prompts[2]. Generally, prompt learning can obtain a better estimate of p(y|x) between input x and output y by indirectly estimating p(y|x, prompt) without training.

The core of prompt learning is the prompt text (usually the task description), which can be viewed as a "regulator" for model state control. Here, we explore the performance of two prompt design methods, called simple prompt (SP) and in-context prompt (ICP). SP only uses the task description to instruct open LLMs, while ICP additionally adds real samples as demonstrations to provide more information. Figure 6 summarizes the two different prompt forms used in this paper.

For test case generation in prompting mode, we take the test sets of Eval8, HLC, MECI en and MECI es, the first 4 topics of ESC, all of COPA and BioInfer, and the subset with 4 variables in Corr2cause as the data sources to facilitate model comparisons. The statistical details are shown in Figure 5.

4.2. Main Results and Analysis

Table 3 presents the causal reasoning ability of the target models with the simple prompt across various test case groups. We observe that:
(1) Open LLMs demonstrate a certain degree of causal reasoning ability with simple prompts (superior to random methods in most cases).
(2) Bloomz demonstrates superior performance compared to other models on average. More than 75% of the test cases generated from Eval8 and COPA (i.e. Simple CR) can be passed by Bloomz.
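The scoring step described in Section 3 (F1 and Accuracy over generated question-answer pairs) could be sketched as follows; the answer-normalization rule is our assumption, and the released processing scripts may differ.

```python
from sklearn.metrics import accuracy_score, f1_score

def parse_yes_no(response: str) -> int:
    """Map a free-form model response for the CI form to a binary prediction.

    Assumption: any response containing 'yes' counts as predicting causality;
    everything else counts as predicting no causality.
    """
    return 1 if "yes" in response.strip().lower() else 0

# Gold labels (1 = causality) and raw model responses for a small batch of test cases.
gold = [1, 0, 1, 1, 0]
responses = ["Yes.", "No", "yes, there is", "No causal relation", "Yes"]
preds = [parse_yes_no(r) for r in responses]

print("A :", accuracy_score(gold, preds))  # Accuracy
print("F1:", f1_score(gold, preds))        # F1 on the positive (causal) class
```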
Table 3. Results of different models under test with SP on various causal reasoning tasks. Each experiment is repeated 5 times and the average value is taken to ensure reliability. Bold indicates the optimal result, same below. Dataset groups: Simple CR (Eval8, COPA), Domain-specific CR (BioInfer), Complex CR (ESC, MECI en, HLC), Multilingual CR (MECI es), Pure CR (Corr2cause-4).
Model      | Eval8 F1 | Eval8 A | COPA A | BioInfer F1 | BioInfer A | ESC F1 | ESC A | MECI en F1 | MECI en A | HLC A | MECI es F1 | MECI es A | Corr2cause-4 F1 | Corr2cause-4 A
Random     | 17.72    | 50.00   | 50.00  | 45.59       | 50.00      | 13.89  | 50.00 | 27.14      | 50.00     | 50.00 | 7.52       | 50.00     | 16.28           | 50.00
SOTA       | 84.55    | -       | 99.4   | 60.70       | -          | 67.90  | -     | 58.10      | -         | 89.40 | 52.80      | -         | 94.74           | -
ChatGLM-6B | 22.02    | 15.79   | 76.43  | 61.14       | 60.54      | 15.36  | 16.36 | 17.36      | 18.36     | 19.36 | -          | -         | 1.38            | 0.69
Bloomz-7B  | 45.77    | 78.98   | 76.93  | 43.51       | 61.00      | 19.32  | 80.34 | 21.59      | 80.72     | 51.90 | 5.40       | 95.64     | 16.67           | 86.11
Alpaca-7B  | 21.54    | 12.07   | 52.26  | 59.06       | 41.90      | 14.93  | 8.07  | 31.41      | 18.63     | 47.47 | -          | -         | 17.72           | 9.72
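As a cross-check against Table 3, observation (2) is consistent with Bloomz-7B's simple-prompt accuracies of 78.98 on Eval8 and 76.93 on COPA, both above 75%.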
Table 4. The performance of ChatGLM and Bloomz with in-context prompts in both simple and complex causal reasoning tasks. Dataset groups: Simple CR (COPA, Eval8), Complex CR (ESC, HLC).
Model            | COPA A | Eval8 F1 | Eval8 A | ESC F1 | ESC A | HLC A
Random           | 50.00  | 19.46    | 50.00   | 12.98  | 50.00 | 50.00
ChatGLM-6B (ICP) | 67.30  | 21.66    | 14.43   | 9.42   | 5.14  | 55.06
ChatGLM-6B (Δ)   | -9.13  | -0.36    | -1.36   | -5.94  | -9.29 | -12.66
Bloomz-7B (ICP)  | 75.13  | 38.31    | 83.93   | 13.96  | 89.50 | 51.26
Bloomz-7B (Δ)    | -1.80  | -7.46    | +4.95   | -5.36  | +9.16 | -0.64
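Read together with Table 3, the Δ rows appear to be the ICP result minus the corresponding simple-prompt result: for example, ChatGLM-6B's COPA accuracy moves from 76.43 (SP) to 67.30 (ICP), giving Δ = 67.30 - 76.43 = -9.13, and Bloomz-7B's Eval8 accuracy moves from 78.98 to 83.93, giving Δ = +4.95.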
Figure 8: Performance with different demonstration numbers (ICP-k) on Eval8. The backbone model is Bloomz.