Bugs in Large Language Models Generated Code
Abstract Large Language Models (LLMs) for code have gained significant at-
tention recently. They can generate code in different programming languages
based on provided prompts, fulfilling a long-standing dream in Software Engineering (SE), i.e., automatic code generation. Similar to human-written code,
LLM-generated code is prone to bugs, and these bugs have not yet been thor-
oughly examined by the community. Given the increasing adoption of LLM-
based code generation tools (e.g., GitHub Copilot) in SE activities, it is crit-
ical to understand the characteristics of bugs contained in code generated by
LLMs. This paper examines a sample of 333 bugs collected from code gener-
ated using three leading LLMs (i.e., CodeGen, PanGu-Coder, and Codex) and
identifies the following 10 distinctive bug patterns: Misinterpretations, Syntax
Error, Silly Mistake, Prompt-biased code, Missing Corner Case, Wrong In-
put Type, Hallucinated Object, Wrong Attribute, Incomplete Generation, and
Non-Prompted Consideration. The bug patterns are presented in the form of
a taxonomy. The identified bug patterns are validated using an online survey
with 34 LLM practitioners and researchers. The surveyed participants gener-
ally asserted the significance and prevalence of the bug patterns. Researchers
and practitioners can leverage these findings to develop effective quality assurance techniques for LLM-generated code. This study sheds light on the distinctive characteristics of LLM-generated code.
This work was supported by: Fonds de Recherche du Québec (FRQ), the Canadian Institute for Advanced Research (CIFAR), as well as the DEEL project CRDPJ 537462-18 funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Consortium for Research and Innovation in Aerospace in Québec (CRIAQ), together with its industrial partners Thales Canada inc, Bell Textron Canada Limited, CAE inc and Bombardier inc.
Florian Tambon · Arghavan Moradi Dakhel · Amin Nikanjam · Foutse Khomh · Michel C. Desmarais · Giuliano Antoniol
Polytechnique Montréal, Montréal, Canada
E-mail: {[Link], [Link]-dakhel, [Link], [Link], [Link], [Link]}@[Link]
“*” authors contributed equally
Keywords Large Language Models · Bugs · Software Testing · Empirical
Study
1 Introduction
The recent surge in the development of Large Language Models (LLMs) tai-
lored for code generation has garnered significant attention [1, 2, 3]. Transformer-
based models such as Codex [4] and Llama-2 [5], trained on large amounts of
open-source code repositories, have demonstrated success in producing code
across various programming languages, such as Python, Java, and C [6, 7, 8].
These models approach code generation as a transformation process, convert-
ing natural language descriptions (prompts) into executable programming lan-
guage statements.
Positioned as potential AI pair programmers [2, 9, 10] in software projects,
LLM-based code generators are poised to play a crucial role in the quality of
the overall software project. However, similar to human-written code, LLM-
generated code is prone to errors [9]. Nonetheless, Asare et al. [11], who recently examined vulnerabilities in LLM-generated code (using Copilot) compared to those in human-written code, reported that LLMs do not generate the same vulnerabilities as humans. In parallel, several studies
[12, 13] reported that LLMs and human developers may not focus on the same
part of the prompt to solve a coding task. This raises an important question:
Do LLMs generate faults similar to those produced by human developers?
This question is crucial because the effectiveness of popular quality assurance
techniques like mutation testing depends on a precise characterization of faults
occurring in the code under test [14].
To illustrate the potential differences between bug patterns in LLM-generated code and those in human-written code, let us consider the example in Listing 1, which presents the code generated by Codex [4], a well-known code-generation LLM, for the task of “returning the flags present in the Python ArgumentParser object” (shown as a docstring in the listing).
While the code seems reasonably correct at first glance, the attentive reader will notice in the last line (Line 14) that the LLM suddenly decided to sort the flags it extracted using the ‘sorted’ function. This step, not requested in the prompt, makes the code faulty for the requested use case. This kind of snippet is interesting as it highlights two key points: 1) LLMs might generate seemingly correct code at first glance, which could deceive non-experienced users, particularly if no (or not enough) testing is performed, and 2) LLMs may add a non-prompted feature to the code, leading to an error, something that a human developer would rarely do, which shows that LLM bugs might not be exactly similar to human-made bugs.
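To make this pattern concrete, the following minimal sketch (a hypothetical reconstruction for illustration, not the actual Codex output of Listing 1) shows a function that fulfills the documented task but silently sorts its result:

from argparse import ArgumentParser

def parser_flags(parser: ArgumentParser) -> list:
    """Return the flags present in the given ArgumentParser object."""
    flags = []
    for action in parser._actions:
        # Collect option strings such as '-v' or '--verbose'.
        flags.extend(action.option_strings)
    # Non-prompted consideration: sorting was never requested in the prompt,
    # yet it silently changes the order in which the flags were declared.
    return sorted(flags)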
While there exist studies that investigated bugs in LLM-generated code [15, 16, 17], to the best of our knowledge, none of them thoroughly examined the bug patterns and characteristics of such bugs.
2 Background
In this section, we introduce the three LLMs used in our study and describe
the open coding methodology followed in this paper.
Different LLMs have been introduced for automatic code generation in Soft-
ware Engineering (SE). One highly potent LLM is OpenAI’s Codex [4], which
has been extensively employed in diverse code-related SE tasks [19, 20, 21].
This closed-source GPT-based auto-regressive LLM [22] has from 12 million to 12 billion parameters and was fine-tuned on 54 million public repositories from GitHub. Notably, Codex is the model behind GitHub
Copilot1 , an in-IDE developer coding assistant, proficient at generating code
based on user-provided context with the maximum input length of 4,096 to-
kens. A recent addition to GitHub Copilot is “Copilot Chat”2 , which is tuned
with human feedback for dialog use cases. Copilot Chat proves versatile for
a broad range of code-related tasks. It excels at generating code fragments,
describing code snippets in natural language, generating unit tests, and fixing
buggy code, all tailored to the specific context of the task at hand.
PanGu-Coder [23] is an open-source pre-trained LLM for generating code
from text. This model is based on PanGu-α architecture [24], a unidirectional
decoder-only transformer with an additional layer for querying added on top of
it. PanGu-Coder was trained in two stages using Python programs extracted
from GitHub. In the first stage (unsupervised), the model is pre-trained on raw
programming language data containing docstrings/inline comments written in
1 [Link]
2 [Link]
natural language. Next, this model is fine-tuned to generate code from text in
a second stage (supervised). Two versions of this model were published and
leveraged for code generation recently [25]. These versions have respectively
317 million and 2.6 billion parameters, with a maximum input length of 1,024
tokens.
CodeGen [26] is a family of LLMs trained for text-to-code generation.
Based on a decoder-only architecture, this family of models is designed for
multi-turn program synthesis, in which a user engages with the LLM by grad-
ually feeding specifications in natural language to obtain a corresponding pro-
gram. This collaborative interaction allows the user, in conjunction with the
LLM, to iteratively build and refine a program in multiple steps. CodeGen
models were trained on a natural language corpus and code extracted from
GitHub (i.e., a multilingual code dataset, and a dataset of Python programs).
Various CodeGen models come with 350 million to 16.1 billion parameters and
a maximum input length of 2,048 tokens [26]. CodeGen open-source models
have been employed for code generation tasks in many recent studies [25, 27].
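As an illustration of how such a checkpoint is typically queried, the sketch below uses the Hugging Face transformers library with one of the publicly released CodeGen checkpoints; the checkpoint name, prompt, and generation settings are illustrative assumptions rather than the exact configuration used to build CoderEval.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Smallest Python-only ("mono") CodeGen checkpoint; larger variants reach 16.1B parameters.
checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# A natural-language specification followed by the function signature to complete.
prompt = "# Check whether the given host is the localhost\ndef is_local(host):"
inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding of at most 64 new tokens, purely for demonstration.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))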
Fig. 1: Overview of the study design: data collection from CoderEval (6,900 generated Python samples filtered by runnable level and by buggy code fragment down to 1,997, then sampled to 333 snippets), labeling via a pilot study on 10% of the sample and iterative codebook updates with disagreement discussions, and validation through a survey with 34 participants.
3 Study Design
RQ1: What are the characteristics of bugs occurring in code generated by LLMs
for real-world project tasks?
RQ2: To what extent are the identified bug patterns in LLM-generated code
relevant for software practitioners and researchers working with LLMs?
In case of disagreement between the two coders, the third author would act as
a tie-breaker (highlighted as ② in Figure 1).
In the second phase, the remaining buggy samples were divided into six
parts, each containing 15% of the remaining buggy codes, and we repeated
the previous process: the two coders manually investigated all samples in each
part independently, and then the entire team met to cross-check the labels and
address conflicts while resolving discrepancies. If a new category emerged (i.e.,
a bug could not be classified under the existing categories) or a previous label
was refined, the entire team gathered to discuss and incorporate the change
into the codebook. We also made sure to re-label the affected samples from the
previous rounds, if any. If a sample contained multiple types of bugs, it was
classified into all identified bug patterns (i.e., using multiple labels). We had
no constraint on the number of labels that could be assigned to a particular
buggy snippet. The entire process required approximately 108 person-hours
and resulted in a taxonomy comprising 10 bug patterns (highlighted as ③ in
Figure 1).
Given that we did not have any pre-defined labels and did not fix the number of labels for each buggy code snippet in our dataset, it is not possible to compute inter-rater agreement levels, which is typical for open coding. However, after finalizing the categories, we found that 78.2% of the bugs were labeled similarly during the independent labeling by the two reviewers, before disagreements were resolved. Any conflicts were then thoroughly discussed until a 100% agreement was reached [30]. We retained all the initial labels and comments in our shared document to facilitate any subsequent discussions. Table 1 summarizes each step of our process. These labels and comments
are shared in our replication package available at [18].
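As a side note, one simple way to obtain such a raw agreement percentage for multi-label annotations is sketched below; treating two annotations as agreeing only when their label sets are identical is our assumption, not necessarily the exact criterion applied in the study.

# Toy example with three buggy snippets labeled independently by two coders.
labels_coder_a = [{"Misinterpretation"}, {"Syntax Error"}, {"Wrong Attribute", "Wrong Input Type"}]
labels_coder_b = [{"Misinterpretation"}, {"Silly Mistake"}, {"Wrong Attribute", "Wrong Input Type"}]

# Count snippets whose label sets match exactly, then report the percentage.
agreements = sum(a == b for a, b in zip(labels_coder_a, labels_coder_b))
print(f"Raw agreement: {100 * agreements / len(labels_coder_a):.1f}%")  # 66.7% in this toy example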
To identify participants for our survey, we collected the contacts (email ad-
dresses) of GitHub users who collaborated on repositories that contain code
generated using various LLMs. We followed the methodology proposed by
Yujia et al. [34] to identify GitHub repositories containing code generated
using LLMs. Yujia et al. [34] employed various search keywords to collect GitHub repositories containing code generated by GitHub Copilot. We leveraged their shared dataset and broadened our list of participants by adopting their methodology to collect additional repositories containing code generated using Codex, Llama, Llama-2, CodeLlama, and ChatGPT. This process al-
lowed us to collect a total of 113 repositories. Next, we employed the Py-
Driller [35] library to extract the email addresses of the collaborators on each
repository. In total, we obtained 200 unique email addresses of practitioners
who collaborated on GitHub repositories containing LLM-generated code.
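A minimal sketch of this extraction step is shown below, assuming PyDriller 2.x's Repository API; the repository URL is a placeholder and not one of the repositories we actually mined.

from pydriller import Repository

# Placeholder list standing in for the 113 collected repository URLs.
repo_urls = ["https://github.com/example/llm-generated-project"]

emails = set()
for url in repo_urls:
    # Traverse every commit of the repository and collect author email addresses.
    for commit in Repository(url).traverse_commits():
        if commit.author.email:
            emails.add(commit.author.email)

print(f"Collected {len(emails)} unique contributor email addresses")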
We complemented this list of practitioners with a list of researchers who
have published on code generation using LLMs. To obtain this list of re-
searchers, we searched relevant LLM-based code generation papers over Google
Scholar6 using two keywords: “LLMs” and “code generation”. We collected
the first 75 papers returned by Google Scholar and manually inspected each
of them to ensure their relevance, i.e., that they leveraged LLMs for automatic
code generation. This inspection process allowed us to identify 56 relevant
papers that focused on code generation using LLMs. Next, we extracted the
email addresses of the authors of each of these papers (who are assumed to be
researchers working on generating code with LLMs). We obtained a total of
182 emails after removing duplicates. Overall, we successfully sent the survey
questionnaire to 382 unique email addresses. We discuss the details about the
participants and the response rate in Section 4.2.
We also posted our survey questionnaire on two relevant Reddit channels:
LocalLLaMA and MachineLearning. Figure 5 in Appendix A shows our post
on Reddit.
The questionnaire was dispatched along with a detailed message outlining
the purpose, scope, and estimated completion time of the survey (which is
around 10 minutes). We made it clear to the participants that the survey was
anonymous, though participants had the option to provide their emails for
additional communication and to receive a summary of the study.
The survey was open for two weeks, during which we sent a reminder one week
after the initial email. Both the survey form and the anonymized responses that
we collected are available in our replication package [18]. The survey is divided
into two parts and is inspired by similar empirical studies of bug patterns [29,
33, 36]. In the first part, we asked demographic questions to participants.
6 [Link]
We then asked more specific questions about bugs in LLM-generated code, regarding their frequency and the complexity associated with detecting or fixing them (see Figure 6 in Appendix B.1 for more details).
In the second part, we wanted to gather the participants’ feedback on
several aspects of each pattern of bug identified. To do so, we defined the
questions similarly for each pattern of bugs in the taxonomy. Specifically, for
each pattern of bug, we present participants with a description of the bugs
belonging to the pattern and provide an example code snippet; we use the
same examples that will be presented in Section 4.1. Next, we presented the
participants with several questions to be answered using a 5-level Likert scale.
The full questionnaire is presented in Appendix B.1 in Figure 7.
The questions start by asking about the frequency at which the participants
encountered the bug pattern (ranging from 1-Never to 5-Always). Next, we
defined the following concepts to the participants:
i. Diagnosing: How hard would it be for the respondents to diagnose such an
error in an LLM-generated code? Could they distinguish it easily or would
it require running the code or using a debugger rendering the diagnostic
hard? (1-Easy to 5-Hard)
ii. Complexity: Is the type of error rather trivial (i.e. not something a human
developer with decent programming knowledge would do) or, on the con-
trary, does it denote some degree of complexity? (1-Trivial to 5-Complex)
iii. Fixing: If the respondents had to fix such a mistake, would it be hard?
Would it just require adjusting some part of the code (i.e. easy), or would
it require extensive refactoring (i.e. hard)? (1-Easy to 5-Hard)
Participants were then asked to answer from their own perception/experience
of each bug pattern.
The survey also contained optional comment fields at the end of each sec-
tion and the end of the survey. Participants had the opportunity to provide
suggestions such as additional information related to the bug patterns in the
study or bug patterns not described in the study.
Once the results were collected for each bug pattern by the respondents, we
aggregated the results (see Section 4.2). To do so, we followed a methodol-
ogy used in similar studies [33] leveraging a weighted average. For the frequency/diagnosing/complexity/fixing of each bug pattern, we multiplied each Likert scale value by the percentage of respondents who selected it and summed the results. For instance, we would obtain a score of 2.79 for a bug pattern where respondents replied as follows: 1 (17.6%), 2 (26.5%), 3 (23.5%), 4 (23.5%), and 5 (8.8%). We also complemented the
explanation of the results with comments from respondents relevant to the
description of the results. The goal was to capture the perceived difficulty in
fixing/diagnosing and the complexity of the different bug patterns according to
practitioners. Additionally, we performed an analysis comparing the frequency
of reported bug patterns among respondents with those in our sample set. We
wanted to assess whether the bug patterns within our sample set (extracted
from CoderEval) matched LLM practitioners’ experience when dealing with
bugs. The idea is to provide further evidence that our labeled categories are
encountered in similar proportions by the respondents who used LLMs, not
on a dataset such as CoderEval but as a tool in their development activities.
Specifically, we evaluated if there was a correlation between the weighted av-
erage of the frequency of bug patterns reported by the participants and the
proportion of each bug pattern found in the studied sample. To do so, we cal-
culated the Spearman Rho between those scores and the distribution of bug
patterns. The results are reported in Table 2 (Section 4.1).
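A minimal sketch of this aggregation is given below, assuming the Likert distributions are available as lists of percentages; the first call reproduces the 2.79 worked example above, while the per-pattern values passed to scipy's spearmanr are purely hypothetical.

from scipy.stats import spearmanr

def weighted_likert_score(percentages):
    """Weighted average of a 5-level Likert distribution given in percent."""
    return sum(level * pct / 100 for level, pct in enumerate(percentages, start=1))

# Worked example from the text: yields roughly 2.79.
print(round(weighted_likert_score([17.6, 26.5, 23.5, 23.5, 8.8]), 2))

# Hypothetical per-pattern values: survey frequency scores vs. proportions in our sample.
survey_frequency = [2.79, 3.47, 2.91, 2.35]
sample_proportion = [9.5, 15.3, 8.1, 6.2]
rho, p_value = spearmanr(survey_frequency, sample_proportion)
print(rho, p_value)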
4 Empirical Results
In this section, we delve into the bug patterns obtained from our analysis as
well as the respondents’ answers from our survey. We examine the nature,
frequency, and distribution of these bugs across the three different LLMs. All
collected artifacts and data generated during our study are accessible in our
replication package [18].
4.1.1 Taxonomy
Fig. 2: Final taxonomy of bug patterns in code generated by LLMs. The number following each category represents the percentage of code samples assigned to that category during manual labeling. The ten patterns are characterized as follows:
– Misinterpretation: the generated code deviates from the intention of the prompt.
– Syntax Error: the generated code contains syntax errors, such as a missing parenthesis or semicolon.
– Silly Mistake: the generated code contains issues such as redundant conditions or unnecessary casting.
– Prompt-biased code: the generated code is biased towards provided examples or particular terms in the prompt.
– Missing Corner Case: the generated code operates correctly, except for overlooking certain corner cases.
– Wrong Input Type: the generated code uses an incorrect input type in a correct function call.
– Hallucinated Object: the generated code utilizes an object that neither exists nor has been defined.
– Wrong Attribute: the generated code uses an incorrect/non-existent attribute for an object or module.
– Incomplete Generation: the model generates no code or produces an empty function, such as a ‘pass’ statement.
– Non-Prompted Consideration: the generated code contains statements that are unrelated to the task specification.
Listing 2: Reference solution for the task “int_to_string”, and code generated by PanGu-Coder labeled as “Misinterpretation”.

def int_to_string(number: int, alphabet: List[str], padding: Optional[int] = None) -> str:
    """
    Convert a number to a string, using the given alphabet.

    The output has the most significant digit first.
    """
    output = ""
    alpha_len = len(alphabet)
    while number:
        number, digit = divmod(number, alpha_len)
        output += alphabet[digit]
    if padding:
        remainder = max(padding - len(output), 0)
        output = output + alphabet[0] * remainder
    return output[::-1]

--------------------------------------------------

def int_to_string(number: int, alphabet: str) -> str:
    return alphabet[number]
to clarify the task provided in the prompt. The code in Listing 5, generated by Codex, is accurate only for the specific example provided in the prompt (when the length of the vertex is 4) and will fail in other cases. This code is labeled as “Prompt-biased code”. This type of pattern is less common across the models; however, CodeGen and Codex generated more buggy code biased towards the prompt compared to PanGu-Coder. We discuss this further in Section 4.1.2.
5. Missing Corner Case: This bug occurs when the generated code op-
erates correctly, except for overlooking certain corner cases. Listing 6 provides
an example of this bug pattern along with its corresponding reference solu-
tion (top). Lines 17 to 24 represent the prompt and line 25 shows the code
generated by Codex. The task involves implementing a function that checks if
a given host is the localhost. Although the Codex-generated code checks for
various representations of localhost, the reference solution is expected to verify
additional potential options, such as [Link](), which were present
in the oracle given for the task described in the prompt. Listing 6 is labeled as
Missing Corner Case. This bug pattern is the second most common pattern
with 15.27% of buggy code samples categorized as Missing Corner Case.
6. Wrong Input Type: We use this label when the LLM uses an incorrect
input type in a correct function call. For example in Listing 7, the task de-
scribes implementing a function to reduce a given list of classes to its ordered
minimum equivalent. Lines 5 to 8 represent the code generated by CodeGen
to address this task. In line 8, the Python built-in function min is called on
a List of ClassType inputs which raises an error. We labeled this buggy code
example as Wrong Input Type. It is worth noticing that this is not the only
label for this buggy code example in Listing 7. Wrong Input Type is the least
frequent bug pattern across the buggy code generated by the three models.
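To make this failure mode concrete, the standalone sketch below (with hypothetical class names, not the CoderEval task of Listing 7) shows why calling Python's built-in min on a list of class objects raises an error, and how an explicit key would be needed for any meaningful ordering.

class Base:
    pass

class Derived(Base):
    pass

classes = [Derived, Base]
try:
    # Class objects define no ordering, so comparing them with '<' fails.
    min(classes)
except TypeError as exc:
    print(exc)  # "'<' not supported between instances of 'type' and 'type'"

# An explicit key restores a well-defined (if arbitrary) ordering, e.g. by
# the length of each class's method resolution order.
print(min(classes, key=lambda cls: len(cls.__mro__)))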
Listing 6: Reference solution for the task “is_local” and code generated by Codex, labeled as “Missing Corner Case”.

 1 import socket
 2 def is_local(host):
 3     """
 4     Checks if the host is the localhost
 5
 6     :param host: The hostname or ip
 7     :return: True if the host is the localhost
 8     """
 9     return host in ["[Link]",
10                     "localhost",
11                     [Link](),
12                     # just in case [Link]() does not work, we also try the following:
13                     [Link](),
14                     [Link]([Link]())[0]
15                     ]
16 --------------------------------------------------
17 import socket
18 def is_local(host):
19     """
20     Checks if the host is the localhost
21
22     :param host: The hostname or ip
23     :return: True if the host is the localhost
24     """
25     return host in ['localhost', '[Link]', '::1']

7. Hallucinated Object: We apply this label when the LLM hallucinates by attempting to utilize an object that neither exists nor has been defined. For instance, Listing 8 represents the code generated by Codex for the
task “make_find_paths” that involves transforming all the paths given in the find_paths variable into glob patterns. The generated function calls a function named “find_path_to_glob” (line 14), which is not defined. Hallucinated Object is the third most common bug pattern across the models.
8. Wrong Attribute: This label is applied when the LLM utilizes an
incorrect/non-existent attribute for an object or module. For example, in List-
ing 9, there is code generated by PanGu-Coder for the same task presented
in Listing 1 in the introduction, named “parser_flags”. The prompt describes the task as “Given an [Link] instance, returns its argument
The bug patterns have been defined based on the bug patterns observed in our
studied samples. As mentioned in Section 3, some of the buggy code samples
are labeled in multiple categories due to the presence of different types of bugs
or because a bug could be labeled as an overlap of multiple categories: 62.16%
of the samples are labeled in a single category, 29.13% with two, 8.41% with
three categories, and only one instance of buggy code is assigned four labels.
Table 2 presents the distribution of buggy code across different categories
and by various LLMs. Some categories are more prevalent in specific models.
For instance, in the case of a robust model like Codex, the most common
bug pattern is Missing Corner Case (23.53% of the buggy samples). Missing
Corner Case occurs when the model generates code that addresses the task
description but fails on one or a few exceptional test inputs due to a minor
oversight, such as forgetting some particular cases. Listing 6 presents an exam-
ple of this type. Conversely, in the case of relatively weaker and open-source
models like PanGu-Coder or CodeGen, the most common category is Misin-
terpretation. This bug pattern occurs when the code generated by the model
does not align with the prompt and deviates from the intended purpose of the
task.
Among all models, CodeGen has the highest number of buggy samples in
the Silly Mistake and Syntax Error categories. Codex, as a stronger model,
exhibits the highest number of samples in the Non-Prompted Consideration
category. This category pertains to situations where the model considers addi-
tional conditions or statements that were not requested in the prompt, leading
to bugs. For example, the model might convert the final output of a function
into an integer, which was not required in the prompt, or check if the output is a string and then apply further functionality (see Listing 1).

Table 2: The distribution of bug patterns in the generated code by three different LLMs. We put in bold the top categories per LLM.
Fig. 3: Distribution of bug patterns across the individual CoderEval tasks in our sample (tasks on one axis, bug patterns on the other, with a color scale from 0.0 to 1.0 indicating the proportion of a task’s buggy samples falling into each pattern).
The survey was open for two weeks, resulting in 34 answers, i.e., we achieved a response rate of 8.9%, which is in line with many SE surveys conducted outside
specific companies [36, 37]. We first give results of the questions concerning
the demographics and experience of the participants. We then detail results
regarding the frequency of reported bug patterns, as well as the analysis of
the reported difficulty of diagnosing/fixing and the perceived complexity of
the bug patterns (see Table 4).
All the participants answered the demographic questions. We obtained the following answers:
– For the academic field: 12 Ph.D. students, 4 Researchers, 8 Undergradu-
ate/Graduate students and 1 Lecturer. For the industry field: 6 Developers,
2 Data scientists and a CTO.
– ChatGPT was used by 31 participants. 67% of the participants reported using only non-open-source LLMs such as ChatGPT, Bard [38], or Claude [39]. 10 participants cited open-source LLMs such as Llama or CodeGen.
– In terms of programming language used for LLMs, besides Python, the
participants responded: 47% for JavaScript, 27% for C++ and Java and
24% for C. These languages are among the most used programming lan-
guages according to different programming language indexes [40, 41]. Other
programming languages were reported in smaller proportion: Rust (6%),
HTML/CSS (6%), SQL (9%), C# (6%), or Go (3%). Three respondents (all Ph.D. students) reported using LLMs exclusively to program in Python.
Note that the sum of responses for LLMs and programming languages exceeds the number of respondents, as one could provide multiple answers.
Before asking participants specific questions about the categories of bugs,
we collected information about their general experience dealing with bugs in
LLM-generated code. 80% of the participants reported that they often or
always experience bugs when using LLMs to generate code. No respondent
said they never faced bugs in LLM-generated code. We asked the participants
about the complexity of the bugs they encountered in their LLM code generation activities, and 68% of them mentioned that the bugs were of medium difficulty to fix, with only 21% of participants reporting that their encountered bugs were easy to fix. We also asked the participants how they proceeded to
fix the encountered bugs and 30% of them reported that they fixed the bugs
manually by reading the stack traces and googling the error messages. The re-
maining participants reported using a combination of manual inspection with
re-prompting the LLM to correct the mistake. No participant declared only re-prompting the LLM to fix the issues.
Fig. 4: Aggregated results of the validation survey. Questions related to the frequency of encounter of bug patterns, the difficulty
to diagnose and fix them as well as the complexity of the bug. We highlight in bold the highest number in each category for
each bug pattern. 1 represents never/easy/trivial/low and 5 represents always/hard/complex/high.
developers use an IDE, Syntax Errors or similar mistakes are pretty easy to catch. Similarly, Silly Mistakes, such as the example presented in Listing 4, are not bugs that developers would likely make, and, even if made, they are easy to detect. On the contrary, Misinterpretation and Missing Corner Case bug patterns are more complex and closer to the type of error a developer would make. This is highlighted by the respondents, as both categories have the highest aggregated scores. One survey participant commented: “For example,
I would very much forget about IPv6 (don’t we all?:)) and miss that corner
case myself. So it is both my lack of knowledge and implicitly trusting the LLM
output that can make this such an insidious bug to diagnose and solve”. Thus,
the complexity of such bugs could arise from both the nature of the error
and developers’ excessive reliance on LLMs, which amplifies the impact and
intricacy of the issue.
Regarding the difficulty of fixing the bugs from the different categories,
the survey participants consider the bugs to be moderately difficult to fix;
mirroring their responses in the first part of the survey, when we asked about
the difficulty of fixing LLM-generated bugs. Only Syntax Errors and Silly Mistakes are considered easy to fix, with an aggregated score below 2. Bugs
from the Misinterpretation category are reported to be more difficult to fix
than the other categories by the survey’s participants. This result is consistent
with the observation we made during the labeling phase. Misinterpretation bugs generally lead to code that deviates substantially from the prompt specification. In that case, re-prompting or extensive manual effort is needed
to fix the generated code fragment. For some bug categories, like Hallucinated
Objects, participants suggested that providing additional information in the
prompt can often help the LLMs to correct the mistake: “Most often time
after providing a header or function definition, the model can correctly create
the method or function”. However, in several cases, especially for bugs in the
Misinterpretation category, it is not enough to provide additional information:
“It depends on whether you know what you are expecting the code to look like.
If so, this is easy to spot, and probably you’re better off manually writing
the code. I have tried to prompt-engineer, but I find that it’s a waste of my
time. Sometimes, you will get code generated that misinterprets the prompt,
but maybe you are not aware of what the generated solution looks like so you
have to read the generated code intently or figure out the library calls that are
generated to understand that it misunderstood the prompt”.
Finally, as we reported in Section 4.1, some LLMs like Codex (and thus Copilot) tend to generate additional code that is not explicitly requested in the
task’s description. This may be linked to the different settings of the LLM:
an IDE plugin for Copilot/Codex or a web browser/standalone for CodeGen
and PanGu-Coder. This was echoed by one of the respondents of our valida-
tion survey who also observed this kind of behavior being more prevalent in
Copilot-based models compared to the others: “There is a difference in LLM-
generated code quality between using an in-IDE plugin such as copilot, and
using a browser-based tool such as chat gpt or google bard. I find that copilot
has more of a tendency to recommend code that I don’t need, or code that is
redundant; [...] ”.
5 Lessons Learned
This study sheds light on the nature of bugs occurring in code generated using LLMs. We observed that bugs contained in LLM-generated code differ from bugs in human-written code. Our analysis and the feedback received from survey participants also reveal that LLM-generated code can appear correct at first glance and deceive users, depending on the complexity of the code and the experience of said users. In light of our manual labeling and validation survey, we provide in the following several lessons learned for both LLM users and researchers.
systematic and some samples were correctly using the expected (more recent)
function. Hence, having an older version in the generated code can not only be
explained by deprecated training data. Similarly, on tasks requiring a different
coding between Python 2 and 3, the LLMs would often only consider one of
them (e.g., using the API for Python 3 for both the Python 2 and Python 3
cases).
More testing, less trusting: Even when given enough information, LLMs might generate faulty code. In that case, having access to test cases to properly test the generated code is important. This, however, can require substantial manual effort. In a regression test setting, using automated test case generation tools
such as EvoSuite [47] can help. Leveraging LLMs themselves was also shown
to be a potential way to generate unit tests [8, 48]. Nonetheless, one should
be cautious even when test cases are used. For instance, in the case of Mis-
interpretation or Missing Corner Case, the error might not be obvious, and
potential existing test cases might not catch the error in the LLMs’ generated
code. For instance, Liu et al. [8] showed that existing test cases in HumanEval
dataset [4], which encompasses simpler tasks compared to CoderEval, would
not catch all existing mistakes in LLMs’ generated code. Worse, if no tests
are available, users either have to trust the LLMs’ generated code or manually
read and analyze the obtained code to make sure there are no errors. The
first option introduces a high risk depending on the context in which the code
should be used while the second one adds a tedious and time-consuming task
for the developers, proportional to the code complexity. With the previous
points in mind, it is clear that LLM-generated code should not be blindly trusted, as it can easily be error-prone. If more experienced developers are less
likely to fall into this trap, novice users or people unfamiliar with the particular
task requested in the prompt can easily trust generated code, especially if it
“looks” correct, as highlighted in one comment from our survey participants:
“It is a risk to become complacent to LLM code generation, because sometimes
code looks reasonable (in this case), but you might be unaware of all the ways
it could fail ”. Some previous studies also pointed out that more novice users
tend to over-rely on the code [9, 49, 50] which can have serious negative effects.
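As a minimal sketch of this advice, the pytest-style tests below exercise the is_local example of Listing 6; the embedded implementation is a stand-in for the LLM-generated code under test, and the expected aliases are our assumptions.

import socket
import pytest

def is_local(host):
    # Stand-in for the LLM-generated implementation of Listing 6.
    return host in ["localhost", "127.0.0.1", "::1"]

@pytest.mark.parametrize("host", ["localhost", "127.0.0.1", "::1", socket.gethostname()])
def test_local_aliases_are_recognized(host):
    # The last parameter exposes the missing corner case: the generated code
    # never checks the machine's own hostname, so this test fails.
    assert is_local(host)

def test_remote_host_is_rejected():
    assert not is_local("example.com")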
Knowing how and when to use LLMs: In some cases, if too many
details are needed for the LLMs to generate a proper code, it might be more
straightforward to write the function directly. For instance, Vaithilingam et al.
[54] showed that users leveraging Copilot for coding tasks were not necessarily
faster than users without Copilot for the reasons we observed in our study:
the code generated might be buggy, oftentimes not in a straightforward way
(Misinterpretation/Missing Corner Case), or the users might have a hard
time understanding the logic of the generated code. This is best put by one
of our survey’s respondents: “It depends on whether you know what you are
expecting the code to look like. If so, this is easy to spot, and probably you’re
better off manually writing the code. I have tried to prompt-engineer, but I
find that it’s a waste of my time. Sometimes, you will get code generated that
misinterprets the prompt, but maybe you are not aware of what the generated
solution looks like so you have to read the generated code intently or figure
out the library calls that are generated to understand that it misunderstood the
prompt”. Moreover, straight code generation might not be recommended in all
cases, such as those we analyzed in our study. An alternative would be to relax the
problem to instead do code completion as commented by a survey’s participant:
“In most of my usage scenarios, I generate code with LLM based on existing
code. It could be different from letting LLMs generate code from scratch.”. In
that case, the user can start coding the functionality and then use the LLM to
complete some lines to lower the workload. The additional context would help
the model in making more accurate and less error-prone generations. Even in
the case of the generated code containing an error, it will likely be easier for a
developer to debug as part of the logic was implemented by said developer and
the generated part should also follow a similar logic. This, however, requires
some context to work with, which might hamper the usefulness of the method.
Another way to mitigate full code generation would be to use LLMs to kick-
start coding, saving time on more redundant or simple parts. This has the
advantage of not necessarily needing code context and letting the users have
more control over the code. This was another point mentioned in the study
of Vaithilingam et al. [54]: even when they did not observe a significant time
advantage while using LLMs, developers still preferred them because they
could kick-start the coding process and avoid searching for solutions on the
Internet.
Following our results and analysis, we provide additional lessons learned that could constitute potential future research avenues for the community regarding bugs in the code generated by LLMs:
Repairing the code generated by LLMs: One way to mitigate bugs in
LLM-generated code would be to directly repair them, ideally before the code
is returned to the user. Automatic Program Repair (APR) tools [55] have
been used to fix traditional software programs and several studies [56, 57]
started applying them to code generated by LLMs. These studies, however,
made use of simpler datasets (e.g., HumanEval [31] or LeetCode [58], which
are not based on programming tasks extracted from real-world projects) and
still failed to find several mistakes. Hence, we recommend analyzing existing
bugs in practical tasks similarly to what we did, which could be beneficial
to guide such repair. This could help improve existing APR tools for LLM-generated code or complement prompts for LLMs to self-check generated code against identified bug patterns.
Proposing code-feature-related benchmarks for testing code LLMs: We observed that LLM-generated code can contain different bug patterns and that, depending on the code dependency level, LLMs might be more prone to certain errors. Thus, we recommend creating standardized benchmark datasets and evaluation metrics specifically tailored for assessing code-related features and the performance of bug detection and triaging methods for LLM-generated code. This would enable a fair comparison and evaluation of different testing approaches. Existing benchmarks focusing on code repair, such as CodeXGlue [59] or HumanEvalPack [60], on dependency level, such as CoderEval [6] or ClassEval [61], or even on code efficiency, such as EffiBench [62], are important to consider. Hence, promoting new benchmarks or complementing the above benchmarks with additional features, such as different bug patterns, dependency levels, or other code-related features, could help better assess LLMs’ shortcomings.
6 Related Works
In this section, we review related works and highlight their findings relevant
to buggy code generated by LLMs. While various studies have incorporated
LLMs for diverse programming tasks, our focus is specifically on studies that
investigated bugs in LLM-generated code. We also discuss studies proposing
LLM-based programming assistant tools and studies that examine the quality
of code generated by LLMs.
Vaithilingam et al. [54] conducted a human study involving 24 participants
to investigate the user experience of using Copilot for completing programming
tasks. Their study revealed that participants using Copilot had a lower success
rate in accurately completing these tasks compared to those using Intellisense7 .
This occurred primarily because the participants struggled to detect and cor-
rect the errors contained in the code generated by Copilot. Participants faced
difficulties correcting the bugs generated by Copilot, to the point that they
preferred writing the code from scratch instead of repairing the bugs contained
in Copilot’s code.
Moradi et al. [9] assessed the quality of code generated by Copilot for solv-
ing different fundamental algorithmic problems such as searching and sorting.
Moradi et al. also compared the quality of code generated by Copilot as an
AI pair programming assistant for certain programming tasks with the qual-
ity of human-written code. They also examined the effort required to repair
bugs in code generated by Copilot using an APR tool, comparing it with the effort required to fix bugs in human-written code. Their results highlight some differ-
ences in the cost of repairing buggy Copilot-generated code compared to those
generated by humans. Their results also suggest that this difference could be
due to Copilot occasionally overlooking specific details in the prompt.
Mastropaolo et al. [65] studied the robustness of LLM-based programming
assistant tools and investigated the extent to which the input provided to
7 [Link]
Copilot as a prompt affects the generated output. To conduct this study, they
employed a Deep Learning based paraphrasing technique to rephrase task
descriptions and assessed the correctness of the rephrased descriptions through
manual evaluation by the authors. Their results indicate that rephrasing the
task description influences the distribution of solutions generated by Copilot
across categories of failing test, passing test, syntax error, and no solution.
Thus, there is a potential loss in accuracy (moving generated code snippets
from passing test category to failing test or syntax error ) associated with the
description of the prompt.
While the findings of these studies shed light on the bug-proneness of code
generated by LLMs, none of them have thoroughly investigated the bug pat-
terns and/or the characteristics of such bugs. Only a handful of studies have
delved into the types of bugs observed in the code generated by LLMs, and
we will discuss them in the following.
Honarvar et al. [66] conducted a study to assess the robustness of four
different LLMs in addressing what they termed as instances or neighborhood
questions. To create a set of neighborhoods for a template question, they re-
placed variables in 60 different programming tasks with input tests collected
from their test oracle and generated different instances for a single program-
ming question. In the evaluation process, rather than just labeling a code as
buggy, they assigned a correctness score to indicate the number of instances of
a template question successfully handled by the code generated by an LLM.
Their findings revealed that there are buggy code snippets generated by LLMs
that failed only in a few instances. Subsequently, they categorized the reasons
behind failures based on observed error types into classes such as syntax error,
runtime error, assertion error, and wrong number of arguments.
Fan et al. [15] aimed to enhance the reliability of code generated by Codex
by fixing its generated buggy code. They employed both an APR tool and also
leveraged Codex to repair the buggy code. To assess the feasibility of applying
APR to repair buggy code from Codex, they initially analyzed common mis-
takes found in solutions generated by Codex. Two authors manually attempted
to fix the bugs in the Codex-generated code, creating fixing patches by refer-
encing other human-provided solutions for the same task on Leetcode [58]. The
authors categorized buggy snippets using a predefined set of defect categories,
derived from a benchmark study on human buggy code for programming com-
petition [67]. This classification was inspired by the type of fix required for the
buggy code, such as operator mutation or variable mutation. They concluded
that the buggy code generated by Codex shared similar mutant operators for
fixing as those found in human buggy code. However, their results on repair-
ing buggy code revealed that Codex outperformed APR tools (which are also
inspired by human bugs) in fixing its own buggy code, and they recommended
enhancements to APR tools to overcome this limitation. Their conclusions
are not confirmed by our study. Indeed, while Misinterpretation and Missing
Corner Case bug patterns, which relate more to something human developers
would do (as validated by the survey in Section 4.2), are still present, multiple
non-human-like mistakes occur even on a more advanced model like Codex.
of this study was on Java, not all equivalent SStuBs in Python were observed
in our sample set. However, we categorized bugs with similar characteristics as
Missing Corner Cases, where a small change or a single statement modification
can transform the buggy code into correct code.
The study by Pan et al. [17] focuses on analyzing the effectiveness of LLMs in code translation and is the only study that, as part of its results, developed a systematic bug taxonomy of buggy code generated by LLMs when employed for code translation tasks. The characteristics outlined in their tax-
onomy are particularly relevant to the challenges associated with translating
code from one language to another such as Mismatch of API behavior between
source and target, Removal of logic from the source code, and Mismatch of
behavior after replacing API call.
To the best of our knowledge, our study is the first to systematically an-
alyze the characteristics of LLM-generated buggy code and construct a bug
taxonomy based on the observed bug patterns. Furthermore, we are the first
to evaluate the appropriateness of our proposed taxonomy on LLM-generated
buggy code through a survey study targeting users of LLMs. In contrast to
prior studies that based their analyses on programming tasks collected from platforms like LeetCode, our analysis is based on the buggy code generated by three different LLMs for real-world programming tasks collected from the
CoderEval benchmark dataset. We assert that our findings contribute to a
more comprehensive understanding of bugs in LLM-generated code.
7 Threats to Validity
Construct validity: Our methodology may pose a potential threat as the pro-
cess of collecting and labeling buggy samples could influence our results. While
our methodology is similar to that of existing works that categorized bugs [29,
33, 36], we addressed this threat by meticulously describing our approach, to
allow for external validations. Another limitation could come from the dataset
we used: Python functions and the three examined LLMs. Python is a widely
used programming language in various domains, and the examined LLMs have
been used in previous studies for code generation tasks [19, 20, 21, 25, 27]. In
the absence of a pre-existing taxonomy for categorizing bugs, we recognized
the risk of introducing bias into our classification. To mitigate this risk, we
followed an open coding procedure. Each rater independently evaluated buggy
samples, and conflicts were discussed until a consensus was reached. Addition-
ally, to validate our categories, we surveyed practitioners who employed LLMs
for code generation. Finally, removing bugs classified at a runnable level above plib_runnable could introduce a loss of information regarding the bug patterns identified. Nonetheless, any bugs in the code generated from the prompts of those categories would not necessarily reflect a weakness of LLMs in generating code but rather a lack of proper information. As such, we preferred removing them, and potentially losing some information, rather than adding noise to our study.
Internal Validity: One source of internal threats to validity is the potential bias
in manual inspection and labeling of buggy samples. To alleviate this threat,
the first two authors (two senior Ph.D. candidates who have experience in SE
research and programming) labeled buggy code samples independently. After
a series of meetings with the third author (a senior researcher with 10 years
of research experience in SE and AI), a consensus was reached on the label-
ing process and the criteria for categorizing different bug patterns. Another
threat arises from the sampling method used to select buggy snippets from
CoderEval. To address potential sampling bias, we followed similar statistical
approaches [69, 70] utilizing 95% confidence levels and 5% confidence inter-
vals. Furthermore, we validated through a survey, following existing studies
[29, 33, 36], to verify that our obtained categories are representative of bug
patterns faced by practitioners. The response rate, 8.9%, was in line with recent
survey studies in SE that received response rates of 8.4% to 13.3% [36, 37, 71].
In particular, we showed in Section 4.2 that there is a correlation between the
bug patterns encountered by respondents and the distribution of bug patterns
sampled for Codex, which further reassures us that our sampling is representative. Finally, no new bug patterns were reported by survey respondents.
External Validity: One external threat to the validity is the LLMs used to
generate the code in the dataset. In this study, three LLMs were used, Code-
Gen, PanGu-Coder, and Codex: the first two are open-source and have been
used in previous studies that harness the power of LLMs for code genera-
tion tasks [21, 25, 27, 72], and the third one, Codex, is the model behind
Copilot. Another threat is that the tasks for which the LLMs generated code
may not be representative of real tasks, which could limit the relevance of the
bug patterns found. The choice of the CoderEval dataset mitigates this issue,
as it is based on real tasks mined from GitHub. Nonetheless, future works
should consider expanding our study to cover a more diverse set of LLMs and
functions (problem and prompt) to generate code. Another threat to external
validity stems from our focus on Python code. Since Python is one of the most
popular languages for development, we believe that our findings are relevant
to the majority of programming languages. However, some language-specific
bugs cannot be detected in Python because of its properties. For instance, typed variable declarations or memory management issues would not be something we could easily observe in a Python program. Thus, future works
should consider expanding our study to include other programming languages.
The last threat concerns the rapid development of LLMs and the long-term
relevance of the proposed taxonomy. While future studies could expand this
taxonomy based on the evolving landscape of LLM advancements, we aimed
to draw attention to the distinctive nature of bugs in code generated by LLMs
compared to human bugs, even in common types like syntax errors. Moreover,
we believe the bug patterns remain broad and will not be overshadowed by
the advancements in LLMs over time. Categories such as Prompt-biased code,
Hallucinated Objects, or Wrong Input Type are linked to the limitations aris-
ing from the generative nature of LLMs. While they may see a reduction in
frequency with improved model performance, they are likely to persist.
Conclusion Validity: The conclusions drawn in this study have certain limitations, notably the potential for wrong or missing bug patterns. We conducted manual inspections of 333 buggy snippets from a pool of over 1,997 in the dataset, and these snippets were generated by three different LLMs to strengthen the reliability of our findings. In both respects, the sample sizes were deemed substantial enough to be representative and to avoid leading to misguided
conclusions. Additionally, to minimize potential errors, bug labeling was in-
dependently conducted by two raters and subsequently discussed. Moreover,
a validation survey was used to compare our results to users’ experience and
feedback. Lastly, we have provided a replication package [18] to facilitate the
reproducibility of our findings, enabling other researchers to expand upon our
study.
patterns of bugs in LLM-generated code are different from the bug patterns of human developers, even in categories such as Syntax Error.
Our results serve as an initial taxonomy of bug patterns observed in LLM-
generated code. While further investigation is required to compare their ap-
pearance in human-written code and other LLMs, the findings in this study
combined with the validation survey represent the first steps in characterizing
bugs in LLM-generated code. This opens up avenues for future studies on en-
hancing the quality of code generated by LLM-based programming assistant
tools. Our findings can also be leveraged to improve different SE tools that rely on characteristics of human buggy code, such as mutant operators in mutation testing (MT) or repair rules in APR tools, as well as potential developments in Computer Science (CS) education studies.
Acknowledgment
We would like to thank the participants who contributed to the survey ques-
tionnaire.
Data Availability
The dataset generated during the current study is available in the replication package, which is accessible via [18].
References
Appendix
A Reddit Post
We posted our survey questionnaire on two relevant Reddit channels: LocalLLaMA and
MachineLearning. Figure 5 shows the posted survey on Reddit.
B Survey
To prepare the survey, we used Google Forms [73], a well-known online tool for creating
and sharing surveys and quizzes. The survey has two parts inspired by previous empirical
studies on bug patterns [29, 33, 36]. In the first part, we asked participants demographic questions, i.e., their current job title, which LLMs they have used, and for which programming languages, other than Python (the programming language of the given code snippets), they have used LLMs to generate code, as presented in Figure 6. Then, we asked more specific questions about bugs in LLM-generated code; more precisely, we asked:
– How often do you encounter mistakes in LLM-generated code you use? Those mistakes can be anything, from code that will not compile (variable not defined, function that does not exist) to sillier choices from the LLMs (multiple if conditions checking the same thing, or casting a variable back and forth for no reason).
– When you encounter mistakes in LLM-generated code, how complex is fixing those issues?
– To fix those issues, how do you proceed?
For the first two questions, we used a 5-level Likert scale [74] (Never to Always, or Trivial to Very Hard). For the third one, the respondents could choose between “Manually”, “Re-prompting the LLM”, “Combination of both” or “Other” (and they were given the possibility to write down how).
In the second part of the survey, we wanted to consolidate our bug patterns and gather
the participants’ feedback on several aspects of each pattern of bug in our study. An example
Fig. 7: Example of questions for a bug pattern in the survey. The description
of the task is collected from the CoderEval dataset [6].
of questions in the second part is shown in Figure 7. The structure of the questions is similar
for each pattern of bugs in the taxonomy.
Table 4: Results of the validation survey for the bug patterns. Questions related to the frequency of encounter of bug patterns, the difficulty to diagnose and fix them, as well as the complexity of the bug. We highlight in bold the highest number in each category for each bug pattern. 1 represents never/easy/trivial/low and 5 represents always/hard/complex/high.

Types of Bug | Frequency (1—2—3—4—5) | Diagnosing (1—2—3—4—5) | Complexity (1—2—3—4—5) | Fixing (1—2—3—4—5)
Hallucinated Object | 17.6% - 26.5% - 23.5% - 23.5% - 8.8% | 79.4% - 17.6% - 0% - 0% - 2.94% | 47.1% - 32.4% - 14.7% - 5.9% - 0% | 17.6% - 26.5% - 29.4% - 20.6% - 5.9%
Incomplete Generation | 20.6% - 11.8% - 29.4% - 32.4% - 5.9% | 73.5% - 14.7% - 2.9% - 8.8% - 0% | 41.2% - 38.2% - 8.8% - 11.8% - 0% | 26.5% - 20.6% - 11.8% - 29.4% - 11.8%
Misinterpretation | 5.9% - 35.3% - 26.5% - 26.5% - 5.9% | 29.4% - 17.6% - 32.4% - 17.6% - 2.9% | 14.7% - 23.5% - 26.5% - 32.4% - 2.9% | 5.9% - 11.8% - 38.2% - 35.3% - 8.8%
NPC | 17.6% - 23.5% - 35.3% - 20.6% - 2.9% | 23.5% - 23.5% - 26.5% - 23.5% - 2.9% | 17.6% - 32.4% - 29.4% - 14.7% - 5.9% | 29.4% - 29.4% - 23.5% - 11.8% - 5.9%
Silly Mistake | 29.4% - 29.4% - 20.6% - 17.6% - 2.9% | 58.8% - 26.5% - 14.7% - 0% - 0% | 67.6% - 29.4% - 2.9% - 0% - 0% | 55.9% - 26.5% - 8.8% - 8.8% - 0%
Wrong Input Type | 17.6% - 38.2% - 14.7% - 20.6% - 8.8% | 35.3% - 23.5% - 20.6% - 17.6% - 2.9% | 17.6% - 50% - 14.7% - 17.6% - 0% | 26.5% - 38.2% - 17.6% - 14.7% - 2.9%
Missing Corner Cases | 11.8% - 17.6% - 20.6% - 35.3% - 14.7% | 5.9% - 14.7% - 26.5% - 44.1% - 8.8% | 14.7% - 14.7% - 38.2% - 23.5% - 8.8% | 17.6% - 29.4% - 17.6% - 29.4% - 5.9%
Wrong Attribute | 0% - 17.6% - 26.5% - 47.1% - 8.8% | 20.6% - 44.1% - 26.5% - 5.9% - 2.9% | 17.6% - 50% - 17.6% - 11.8% - 2.9% | 8.8% - 38.2% - 35.3% - 11.8% - 5.9%
Syntax Error | 47.1% - 20.6% - 20.6% - 5.9% - 5.9% | 85.3% - 8.8% - 5.9% - 0% - 0% | 82.4% - 8.8% - 8.8% - 0% - 0% | 70.6% - 14.7% - 2.9% - 11.8% - 0%
Prompt-biased Code | 11.8% - 23.5% - 11.8% - 44.1% - 8.8% | 50.0% - 26.5% - 17.6% - 5.9% - 0% | 11.8% - 47.1% - 26.5% - 11.8% - 2.9% | 5.9% - 23.5% - 26.5% - 29.4% - 14.7%