2024 39th IEEE/ACM International Conference on Automated Software Engineering (ASE)

Test-Driven Development and LLM-based Code Generation


Noble Saji Mathews, University of Waterloo, Canada
Meiyappan Nagappan, University of Waterloo, Canada

ABSTRACT
Recent Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements. This increasingly automated process mirrors traditional human-led software development, where code is often written in response to a requirement. Historically, Test-Driven Development (TDD) has proven its merit, requiring developers to write tests before the functional code, ensuring alignment with the initial problem statements. Applying TDD principles to LLM-based code generation offers one distinct benefit: it enables developers to verify the correctness of generated code against predefined tests. This paper investigates if and how TDD can be incorporated into AI-assisted code-generation processes. We experimentally evaluate our hypothesis that providing LLMs like GPT-4 and Llama 3 with tests in addition to the problem statements enhances code generation outcomes. We experimented with established function-level code generation benchmarks such as MBPP and HumanEval. Our results consistently demonstrate that including test cases leads to higher success in solving programming challenges. We assert that TDD is a promising paradigm for helping ensure that the code generated by LLMs effectively captures the requirements.

CCS CONCEPTS
• Software and its engineering → Software development techniques; • Computing methodologies → Artificial intelligence.

KEYWORDS
Code Generation, LLM, TDD, Testing, Software Engineering

ACM Reference Format:
Noble Saji Mathews and Meiyappan Nagappan. 2024. Test-Driven Development and LLM-based Code Generation. In 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24), October 27-November 1, 2024, Sacramento, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3691620.3695527

1 INTRODUCTION
Large Language Models (LLMs) are transforming software development. However, the importance of correctness in this domain remains a challenge. Despite their advanced capabilities, LLMs often exude deceptive confidence in their outputs, which can be misleadingly erroneous [19, 25, 26]. This issue becomes especially critical when the cost of failure is high. The potential pitfalls of LLMs, such as generating syntactically correct but logically flawed code or failing to adhere to specific requirements, underscore the need for stringent validation mechanisms. Ensuring correctness is not just about preventing errors. It is about building trust in machine-generated code and ensuring it meets the rigorous standards required in real-world applications. Programmers will need to act as quality assurance specialists, spending more time checking the validity of autogenerated code, utilizing both traditional testing tools and advanced automated program repair techniques to ensure code quality [20].

Generating tests has gained additional momentum with the emergence of LLMs, and recent techniques boast testing performance similar to human-written tests. Past work, however, has questioned how coverage as a goal may not be an effective target to set [12]. If LLMs fail to handle edge cases regarding code, generated tests derived from the implementation can be as wrong as the generated code itself. While LLMs might be useful in helping developers formalize intent, this paper does not seek to answer the question "When and how should AI write tests?". This work rather seeks to explore whether there might be a better paradigm for utilizing tests in LLM-based code generation.

Test-driven development (TDD) [2] offers a structured approach to mitigate the risks associated with code generation. This software development process revolves around the repetitive cycle of writing a test for a specific function or feature, developing code to pass that test, and subsequently refactoring the code to meet the desired standards of quality and efficiency. Hence, TDD could serve as a critical framework for validating the correctness and functionality of LLM-generated code.
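To make the cycle concrete, here is a minimal, hypothetical illustration in Python (it is not drawn from the benchmarks used later): the test is written first and pins down the requirement, the simplest implementation follows, and refactoring happens only once the test passes.

```python
# Step 1 (red): the tests are written first and pin down the requirement.
def test_median_of_odd_length_list():
    assert median([7, 1, 3]) == 3

def test_median_of_even_length_list():
    assert median([1, 2, 3, 4]) == 2.5

# Step 2 (green): the simplest implementation that makes the tests pass.
def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Step 3 (refactor): clean up while keeping the tests green, for example by
# delegating to statistics.median from the standard library.
```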
There is extensive research into generating test cases from code, whether human or machine-authored [16, 23]. However, the reciprocal process has not been systematically evaluated, particularly regarding how human-written tests (created as part of the TDD process) can improve LLM-generated code. By examining how LLMs respond to the additional context and constraints of test cases, we can gain insights into their potential to enhance code correctness through a more informed code generation process. While it may seem obvious to assume that including tests alongside problem statements will inherently enhance the correctness of machine-generated code, this assumption requires careful examination. The actual impact of human-written tests on improving the quality of LLM-generated code remains underexplored. Specifically, the extent to which LLMs can interpret and utilize these tests to refine their code outputs has not been thoroughly investigated. This gap highlights a critical area for empirical study: measuring the effectiveness of test cases in guiding LLMs toward generating more accurate code that captures requirements which may not be explicitly stated (much like the goals of TDD).

In our empirical study, we create a simple framework, TGen, to evaluate our research questions. TGen is rooted in TDD principles, requiring developers to precisely articulate their coding intents through tests. TGen operates within a continuous self-improving loop. Developers begin by detailing their requirements and intended functionalities through a combination of problem statements and test cases. These test cases then serve as a definitive guide for the expected behaviour of the code, allowing TGen to generate code that adheres to these requirements. Integral to the framework is an LLM-based feedback mechanism designed to iteratively refine and correct the generated code, ensuring it passes all provided unit tests. By leveraging test cases as a benchmark for validation, TGen attempts to generate code satisfying these requirements. In cases where TGen cannot create correct code, by design, developers know that the code generated does not pass all the test cases and, therefore, can correct the code before including it in their projects.

We investigate whether incorporating tests in the code generation process improves the model's ability to generate correct code and also discuss the impacts and challenges of this approach. While we use TGen in our study for evaluation, it is important to note that our contribution is not the framework itself but the empirical study we conduct. We systematically explore whether the TDD paradigm works in AI-driven code generation, under what conditions it is effective, and how tests can be useful. Our findings can support or challenge the use of TDD in this context.

By integrating TDD, developers can continuously test the outputs of LLMs against predefined specifications and use cases. This helps identify and correct errors early in the development process and aligns the model's outputs with the specific requirements of the task at hand. Thus, the adoption of TDD for code generation could stand as a promising strategy to enhance the practical utility of these advanced AI systems in software development.

2 RESEARCH QUESTIONS
Our goal is to examine the baseline capabilities of neural code generation and the incremental benefits introduced by incorporating tests. Specifically, we explore if and how tests are helpful through the following four research questions:

RQ1: How good is basic code generation? Basic neural code generation, particularly with LLMs, has shown promising results in various tasks. However, its efficacy in complex coding scenarios without any test information remains a subject of exploration. We aim to evaluate the fundamental capabilities of these models in generating syntactically and logically correct code, especially in standard programming tasks. This investigation will provide insights into the baseline performance of LLMs in code generation, setting a benchmark for further discussions.

RQ2: Does providing test cases improve code generation? The foundational principle of TDD is that software should be developed primarily by writing tests before any functional code is written. This method has historically led to robust, well-documented software that is closely aligned with user requirements. By requiring developers to define explicit tests upfront, TDD ensures that all new code functions as intended and adheres strictly to the specified requirements. This rigorous approach to software development could be transformative when applied to the code-generation capabilities of LLMs. We hypothesize that by providing LLMs like GPT-4 with both problem statements and corresponding tests, the generated code will achieve the required functionality and exhibit an enhanced adherence to the intended specifications and requirements.

RQ3: Can failed tests help LLMs fix their mistakes? We investigate the ability of LLMs to learn from their errors through remediation loops. This involves assessing whether LLMs can iteratively refine and correct their code outputs based on feedback or error detection. The remediation loop in our study closely mirrors the iterative cycle at the heart of TDD. In traditional TDD, the process involves initially writing tests, which the new code then fails, leading developers to refine the function until it passes the tests. We aim to explore the potential of LLMs to self-correct as reported in past literature [29] and their ability to better align the code they generate with the provided specifications.

RQ4: Do the generated solutions only satisfy the supplied tests or even unseen tests? This research question explores a critical aspect of AI-generated code: its robustness and generalizability beyond the explicitly defined test cases. In traditional software development, the ability of code to perform well under a variety of scenarios, especially those not explicitly anticipated during testing, is a hallmark of high-quality software. TDD is grounded in the principle that tests are pivotal in improving design and functionality. However, there remain questions about the extent to which this principle applies when code is generated by LLMs.

In addition to these research questions, we delve deeper into the complexities of incorporating tests, discussing our findings further through the following additional explorations:

EXP1: How does problem difficulty affect these results? We hypothesize that a problem statement's difficulty could significantly impact an LLM's performance in code generation. We aim to explore how varying levels of problem complexity influence the model's ability to generate correct code that satisfies the requirements expressed by the statement and supplied test cases. We curate our own dataset consisting of competitive programming problems of different difficulty levels and attempt to understand how increasing difficulty impacts our approach.

EXP2: How many tests is good enough? Determining the optimal number of tests necessary for effective code generation by LLMs is a key concern. In traditional TDD, the number and quality of tests are crucial for ensuring robust software. We aim to investigate how the quantity of provided test cases influences the correctness of the code generated by LLMs. Intuitively, adding tests should only help clarify the requirements; however, when working with LLMs, we are limited by the context length of the model, which controls how much information we can feed the model. By experimenting with varying numbers of tests, we wish to identify if there exists a threshold that balances thoroughness and efficiency, guiding best practices for using tests to enhance LLM-based code generation.
EXP3: Do these results hold with an open model like Llama 3? While our approach is LLM agnostic, the LLM itself plays a major role in our experiments, requiring the ability to write correct code and to review and fix incorrect code. While there may exist an ideal combination of different LLMs specialized for each of these tasks, we wish to explore if this structure would work with other models outside OpenAI's GPT family of models as well. By evaluating Meta's Llama 3 under the same conditions, we can assess if similar patterns emerge, providing insights into the consistency and reliability of our findings across different LLMs.

3 METHODOLOGY
We design our experiments to help evaluate the effectiveness of integrating TDD with LLM-based code generation. We aim to determine if tests affect LLMs and when they might be useful.

3.1 Datasets Used
Evaluating code generation typically employs benchmark datasets such as MBPP [1], HumanEval [6], CodeContests [17], or APPS [13]. Recently, newer datasets have explored class-level [9] and repository-level code completion [30], as well as automated software engineering [14]. However, we focus on function-level generation datasets like MBPP and HumanEval for two reasons.

(1) MBPP and HumanEval are standard benchmarks widely used to evaluate LLMs on code generation leaderboards [9, 19].
(2) Repository-level code generation requires context about the codebase, making it complex and less controlled. Class-level benchmarks can be evaluated in various generation configurations, introducing many variables complicating the analysis. Function-level generation datasets like MBPP and HumanEval are relatively more straightforward. These involve generating the body of a function with a given signature, providing a clear and manageable scope for evaluation. By focusing on MBPP and HumanEval, we can systematically explore our research questions in a controlled environment where we can focus on TDD and its effects.

By establishing the effectiveness of TDD in LLM-based code generation using these datasets, we lay the groundwork for exploring more complex datasets focused on real-world TDD data in future studies. Conversely, if TDD proves ineffective with these benchmarks, it may indicate limited applicability in this context. We utilized refined versions of the HumanEval and MBPP datasets curated by EvalPlus [19], which adds additional tests to allow for a more rigorous evaluation of the generated code.

• HumanEval: Contains 164 handwritten Python programming problems. Each problem includes a function signature, docstring, body, and several unit tests. The EvalPlus variant we use includes additional unit tests (80 times the original) for each of the 164 problems, providing more extensive coverage and thorough testing.
• MBPP: Comprises around 1,000 crowd-sourced Python programming problems designed for entry-level programmers. Each problem includes a task description, a code solution, and three automated test cases. The version we use includes 35 times the original number of unit tests for each problem. It consists of 399 tasks, which are a subset of the original MBPP dataset. This subset is selected by the authors of EvalPlus from the sanitized MBPP (a set of 427 tasks manually filtered by the original MBPP authors), with further removal of low-quality and ill-formed tasks for benchmark quality control.

We never supply the LLM with any of the additional test cases that EvalPlus added using dynamic program testing techniques; we only use them to test correctness. However, existing work in the literature has questioned the accuracy of function-level benchmarks for evaluating LLMs [4, 22]. We also look at file-level generation to address this concern and validate our findings. For this, we curate our own dataset of 1100 problems from a popular competitive programming website, CodeChef, for the following reasons:

• We also wish to explore how difficulty impacts our results, and existing datasets do not have a balanced set of problems spanning various difficulty levels.
• CodeContests v2 is not public, and the older dataset is outdated and filled primarily with Python 2 solutions and problems that no longer accept solutions on the competitive programming platforms they were extracted from.
• The APPS dataset, which is a more recent dataset of coding challenge problems, does not give a clear difficulty measure but rather categorizes problems into abstract buckets, which does not work well for our purposes.
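For illustration, a function-level task of this kind pairs a short problem statement with a reference solution and a handful of unit tests. The example below is hypothetical (it is not an actual MBPP or HumanEval entry) but mirrors their format:

```python
# Problem statement: "Write a function that returns the second-largest
# distinct value in a list of integers containing at least two distinct values."
def second_largest(nums):
    distinct = sorted(set(nums))
    return distinct[-2]

# Supplied tests: the kind of public unit tests TGen would receive as input.
assert second_largest([1, 4, 2, 4]) == 2
assert second_largest([5, 5, 3, 1]) == 3
assert second_largest([-1, -2, -3]) == -2
```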
3.2 TGen Framework
Figure 1 provides an overview of the proposed TGen framework's pipeline, which we use for evaluation. Each part of the pipeline is described below.

[Figure 1: Overview of the evaluation pipeline. The problem statement and supplied tests feed a Coder Agent; a Verifier runs the generated code and passes crash logs to a Remediation Agent, which loops back to the Coder Agent for "N" iterations or until "M" repeated test failures. Agent configurations correspond to RQ1 (baseline: problem statement alone), RQ2 (prompt + tests), RQ3 (remediation loop), and RQ4 (private tests). Chart not reproduced.]

(1) Input Phase: The evaluation pipeline starts by accepting two main inputs: the problem statement describing the task and the tests containing the public unit tests and the required signature, if any, to be enforced in the desired solution. These tests serve both as specifications for the code to meet and as a verification tool for the output. Tests provided to the LLM during generation are labelled as supplied tests, while the rest are private tests. For competitive programming problems, the challenge description is the problem statement, and available public tests are supplied tests.
(2) LLM Engine: The core reasoning engine, powered by state-of-the-art LLMs (e.g., Meta Llama 3, GPT-3.5 & GPT-4 Turbo), processes the structured input. The foundational model leverages its own knowledge and any supplied context to generate an output response. This study uses GPT-4 Turbo v1106 for all experiments, with results validated using Meta Llama 3.
(3) LLM Agents: In the context of Large Language Models (LLMs), "agents" are systems that utilize LLMs as their underlying reasoning engine [3]. These agents are designed with predefined prompts and thought patterns, enabling them to interact intelligently with users or other systems.
Agents in our pipeline utilize the intent and available context from the input phase to guide the LLM engine in generating code, reviewing snippets, and suggesting fixes.
(a) Coder Agent: This agent takes the supplied information and generates a code snippet from it. The system prompt used when tests are supplied for function-level generation cases is shown in Figure 2. The first part provides a rough thought process to instruct the LLM on how to utilize the tests, and the second part contains instructions specific to the structure of tasks in a dataset. Though the overall structure of the prompt remains the same, variants of this are used in different cases, namely, when no tests are supplied or when a file-level generation task is performed.
(b) Remediation Agent: This agent looks at failures reported by the verifier and attempts to suggest how to fix the code, which is then taken up by the Coder Agent in the next iteration. The structure of the prompts is shown in Figure 3.
(4) Output Phase: The generated code is extracted from the LLM's output, which is then subject to validation. The output is expected to satisfy both the functional requirements and the constraints imposed by the tests, if any.
(5) Verification: The output of the coder agent is always run through the verifier, which uses PyTest, a Python unit-testing framework, to verify the generated code against all available tests. The verifier collects crashes or test failures and creates a comprehensive list of failures for the LLM to inspect and address. The verifier identifies any discrepancies and ensures only correct solutions are accepted and returned to the user.
(6) Remediation Loop: The system enters a remediation loop if the generated code fails the tests. The remediation agent uses failure information (e.g., stack trace, detected issues, executed lines) to refine feedback for the coder agent. This loop runs for a fixed number of iterations or until the same tests fail repeatedly, showing no progress. We limit the loop to five iterations and three repeated failures for our experiments. We observed that the remediation advice often becomes repetitive beyond three to four iterations, yielding diminishing returns on further attempts to fix the code.

To conduct our experiments in a cost- and compute-efficient manner, our pipeline executes in the following fashion:

• Solved without tests: The first solution is attempted with the problem statement alone and verified for correctness.
• Needs tests: If the problem statement alone fails to generate a valid solution, the pipeline is rerun, supplying all public tests along with the problem statement. The newly generated code is then verified against all tests.
• Needs remediation: If the previous step results in an invalid solution, the failure data is collected from the verifier and the remediation loop is executed until a correct solution is generated.
• Unsolved: If the remediation loop exits without generating a valid solution, the problem is considered unsolvable by TGen.

We provide a replication package (https://osf.io/e3jy6/?view_only=bc67e33bebd3435abf5537d56767401d), which includes output and runtime details from TGen for all cases we experimented with, as well as code scripts for the experiments. Output from LLMs is often unpredictable. We use a seed value of "1106" and a temperature of 0 in our experiments to improve the replicability of our results.
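The flow above can be summarized in a short sketch. This is not the TGen implementation (which is available in the replication package); it is a simplified illustration in which a hypothetical llm_complete callable stands in for the GPT-4/Llama 3 API and plain exec-based checking stands in for the PyTest-based verifier:

```python
import traceback

MAX_ITERATIONS = 5          # remediation budget used in our experiments
MAX_REPEATED_FAILURES = 3   # stop early if the same tests keep failing

def run_tests(code: str, tests: list[str]) -> list[str]:
    """Run each assert-style test against the candidate code and collect failures."""
    failures = []
    for test in tests:
        namespace: dict = {}
        try:
            exec(code, namespace)        # load the candidate solution
            exec(test, namespace)        # run one supplied test
        except Exception:
            failures.append(f"{test}\n{traceback.format_exc(limit=1)}")
    return failures

def tgen_like_loop(problem: str, tests: list[str], llm_complete) -> tuple[str, bool]:
    """Generate code, verify it, and feed failures back until tests pass or the budget runs out."""
    prompt = f"Prompt:\n{problem}\n\nTests:\n" + "\n".join(tests)
    code = llm_complete(prompt)                      # Coder Agent
    previous_failures, repeats = None, 0
    for _ in range(MAX_ITERATIONS):
        failures = run_tests(code, tests)            # Verifier
        if not failures:
            return code, True                        # solved
        repeats = repeats + 1 if failures == previous_failures else 0
        if repeats >= MAX_REPEATED_FAILURES:
            break                                    # no progress, give up
        previous_failures = failures
        advice = llm_complete("Why does this code fail?\n" + code + "\n" + "\n".join(failures))
        code = llm_complete(prompt + "\n\nIssues to fix:\n" + advice)   # Remediation -> Coder
    return code, False                               # unsolved
```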
4 IMPACT OF TDD IN CODE GENERATION
In this section, we seek to answer the "IF" question: do tests help the code generation process? We also manually explore the data to assess the impact of tests and determine which types of problems benefit from tests and which do not. The fraction of correct solutions obtained following the aforementioned workflow, with GPT-4 Turbo version 1106 as the LLM, for the two function-level benchmark datasets MBPP and HumanEval can be seen in Figure 4. RQs 1 through 4 break down these results further. Satisfying public tests is considered the correctness criterion in RQs 1, 2 and 3. RQ4 dives into the notion of private tests and uncaptured requirements. We also manually investigate the data to highlight key details and insights. Please note that we refer to specific cases within the datasets using the notation "Dataset/Key", where the key is a unique identifier for the problem within the corresponding dataset.

4.1 How good is basic code generation?
GPT-4, being one of the most powerful LLMs available today, does reasonably well, solving 80.5% of MBPP problems and an even higher 82.3% of HumanEval tasks, as shown in Figure 4, both of which are function-level benchmarks. Section 5.1 discusses file-level generation, which is inherently more challenging, and also how using another LLM affects these results. These numbers are slightly lower (by about 5%) than those reported for GPT-4 Turbo v1106 on popular LLM coding benchmark leaderboards [19]. The drop in performance might be attributed to the removal of sample test cases, if present in the problem statement, for this experiment and to differences in the prompt and model parameters like seed and temperature.

Figure 2: Prompt used by the Coder Agent

    Your task is only to write a Python function to satisfy requirements specified by the users Prompt.
    Only generate a single complete Python code snippet. Start generating after the [PYTHON] tag. Your python solution should end with a [/PYTHON] tag.

    | Multi-step prompting for case with tests supplied
    1. Look at the "Prompt" and "Tests" provided to understand the users requirements.
    2. Generate code based on requirements in the prompt and ensure the logic is such that it would pass the corresponding tests provided.
    3. Do not attempt to programmatically verify the tests. The test should not be part of the generated code in any way.
    4. Do not write code in order to satisfy the tests, primarily focus on satisfying the prompt.
    5. Include necessary imports in the generated code.
    6. Your code is expected to read input from stdin and write output to stdout as per the requirements in the prompt.
    7. Define additional helper functions if needed
    8. Your output should be as shown in the sample tests. Ensure you do not return or print any additional information that can cause the tests to fail.

    | Dataset-specific guidelines for function-level case
    If starting code is provided look at the issues mentioned and attempt to fix them
    Ensure that the signature of the function is the same as the one provided by the user
    And format your response as follows:
    [PYTHON]
    def function_name(parameters_as_provided_in_signature):
    {
    Ensure the function signature is the same as the one provided in the signature
    Infer meaning of parameters from the prompt and tests
    Valid python code with necessary imports if any that satisfies the prompt
    Ensure the entry method is a function with the same name as the function name in the tests
    }
    [/PYTHON]

Figure 3: Configuration of the Remediation Agent

    System: You are an expert Python programmer, your job is to look at code and reason why it doesn't work as intended. Once you reason why it doesn't work, you will provide a prompt that highlights the key points that need to be fixed in the code. You will not write the code yourself. IMPORTANT: Tests cannot be modified. Never make a suggestion outside the scope of modifying the solution code snippet
    Assistant: <Code Snippet under remediation>
    User: This code is not correct as it led to the following issues:
    <List of issues following test execution extracted by log-analyzer>

[Figure 4: GPT-4: Problems resolved and strategies used for function-level code generation benchmarks. Chart not reproduced.]
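As part of the Output Phase described in Section 3.2, the code between the [PYTHON] and [/PYTHON] tags requested in Figure 2 has to be pulled out of the model's response. A minimal sketch of such an extraction step (not the exact TGen code) could look like this:

```python
import re

def extract_python_block(llm_response: str) -> str:
    """Pull out the code between the [PYTHON] and [/PYTHON] tags requested in the coder prompt."""
    match = re.search(r"\[PYTHON\](.*?)\[/PYTHON\]", llm_response, flags=re.DOTALL)
    if match is None:
        # Fall back to the whole response if the model ignored the tag format.
        return llm_response.strip()
    return match.group(1).strip()

# Example:
response = "Some reasoning...\n[PYTHON]\ndef add(a, b):\n    return a + b\n[/PYTHON]"
print(extract_python_block(response))   # -> "def add(a, b):\n    return a + b"
```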
4.2 Does providing test cases improve code generation?
For MBPP and HumanEval, including tests contributes to solving an additional 12.0% and 8.5% of problems, respectively, showcasing that tests as part of the input help the LLM reach the correct solution. These percentages correspond to an additional 51 problems in the MBPP dataset and 15 additional problems in the HumanEval dataset. Manually inspecting the results of the evaluation with and without tests, several key insights emerged:

• Problems where the initial implementation had correct logic but incorrect function signatures benefited from tests, as they helped identify mismatches and ensure adherence to the required function signatures (e.g., MBPP/6, MBPP/90, MBPP/233).
• Problems involving mathematical formulas or geometric calculations saw improvements, as tests helped catch errors in formula application and ensured the correct implementation of mathematical logic (e.g., MBPP/14, MBPP/430).
• Problems requiring precise string manipulations or regular expressions benefited from tests highlighting edge cases and incorrect pattern matching, leading to more robust solutions (e.g., MBPP/16, HumanEval/48).
• Problems involving complex data structures such as nested dictionaries or tuples: tests helped ensure that the functions correctly handled various input formats and edge cases (e.g., MBPP/106, MBPP/299).
• Edge case handling: many problems initially failed due to unhandled edge cases. Tests were instrumental in identifying these edge cases, prompting the system to refine its logic to handle all possible scenarios (e.g., MBPP/170, HumanEval/5).
• Problems requiring specific algorithmic approaches, such as finding the maximum difference or sorting criteria, benefited from tests that guided the system to correct logical flaws and align with the problem requirements (e.g., MBPP/63, HumanEval/127).

LLMs seem to be able to improve their understanding of the requirements presented in the problem statement using the provided tests. The tests also seem to have played a significant role in validating mathematical logic, ensuring the correct application of formulas to produce the results indicated by the tests. Tests also seem to prompt the system to refine its logic to accommodate a broader range of inputs and edge cases. Consistency in data types, especially in output formats, was another area where tests highlighted discrepancies. In some cases, adding tests led to simplifying logic, replacing initial complex approaches, presumably derived from misunderstanding the statement, with more straightforward solutions.

4.3 Can failed tests help LLMs fix their mistakes?
The remediation loop's role as a corrective mechanism can also be observed, with further improvements noted in problem-solving. For MBPP and HumanEval, the loop seems to effectively refine the code, with an additional 2.8% and 3% of problems solved, respectively. This corresponds to correct code for 21 additional problems for MBPP, out of which 11 were remediated on the first attempt, seven on the second, two on the third and one on the fourth. For the HumanEval dataset, nine additional problems were solved correctly, of which seven were remediated on the first attempt and one each was remediated on the second and third attempts. It should be noted that in our evaluation pipeline, we allowed for up to five remediation attempts. Although a few cases utilized all five attempts, none of those were solved. Manually inspecting the data, we find the following cases that benefited from remediation:

• Edge case handling: many problems initially failed due to inadequate handling of edge cases. Remediation loops allowed the system to incorporate specific checks for these cases, leading to correct code generation (e.g., MBPP/92, MBPP/113, MBPP/137).
• Issues related to data type handling and logical errors were effectively addressed through remediation loops (e.g., MBPP/115, HumanEval/22).
• Problems requiring precise mathematical computations saw significant improvements through remediation. Initial attempts often failed due to issues like integer vs. floating-point division, but remediation advice helped correct these errors (e.g., MBPP/300, HumanEval/77).
• Problems involving text manipulation, such as punctuation handling, benefited greatly from remediation loops. Initial attempts often failed due to improper handling of punctuation, but iterative refinements led to progressively better solutions (e.g., MBPP/7).
• Problems requiring multiple remediation attempts highlight the importance of iterative refinement. Errors with boundary checks, simple arithmetic corrections, or basic input validation were often resolved through iterative remediation. Each attempt built on the previous one, incorporating new insights and corrections until the correct solution was achieved (e.g., MBPP/160, HumanEval/57).

The general trajectory from initial to final remediation attempts involved a progressive refinement process. In most cases, the first remediation led to basic fixes and the handling of straightforward errors highlighted by the crashes. This often addressed the most obvious errors, such as missing imports or missing type conversions. When a second remediation attempt was needed, more detailed adjustments were involved, addressing deeper logical issues and additional edge cases. A few examples of issues that remediation attempts fixed are adjusting modulo operations to map to the correct character range, correcting base cases and loop logic in recursive functions, and handling edge cases like empty arrays or specific input formats.
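As a simplified, hypothetical illustration of the integer- versus floating-point-division failures mentioned above (the function name and test are invented for this example), a first attempt and its remediated counterpart might look as follows:

```python
def mean_of_pair_initial(a, b):
    # First attempt generated by the model: floor division truncates the result.
    return (a + b) // 2

def mean_of_pair_remediated(a, b):
    # Regenerated after the verifier reported the failing test below: the
    # remediation advice called out the truncation, and true division fixes it.
    return (a + b) / 2

# Supplied test: fails for the initial attempt (which returns 3), passes after remediation.
print(mean_of_pair_initial(3, 4))            # 3 -> failure fed back to the remediation agent
assert mean_of_pair_remediated(3, 4) == 3.5
```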
The remediation agent did not have access to the history, to prevent the context length of the models from being exceeded during the evaluation, which means the LLM could not review past fixes. There were cases where the remediation advice misled the generator, and the generator often failed to fully implement the remediation advice, leading to repeated errors. We ignore cases like MBPP/462 and HumanEval/145, which occurred due to verifier and environment-related issues, as these are limitations of the evaluator as implemented presently. An interesting question is what kinds of problems remain unsolved; manually inspecting these, we find the following:

• Misunderstanding core logic or requirements often led to persistent errors across multiple remediation attempts (e.g., MBPP/83, HumanEval/38, MBPP/138, HumanEval/83, MBPP/765).
• Handling specific data structures like empty arrays, negative numbers, nested lists, and tuples remained challenging despite multiple iterations (e.g., MBPP/119, MBPP/559, MBPP/431, MBPP/630).
• Logical errors and inadequate adaptation of remediation advice were common. Despite following remediation advice, the system consistently produced incorrect logic in cases such as rotating binary strings, renaming functions, and counting and pairing differing bits, which often led to repeated mistakes (e.g., MBPP/109, MBPP/468, MBPP/126, MBPP/595).
• Misinterpreting input/output formats and complex cases involving multiple conditions or constraints frequently caused failures. These included handling nested list inputs and incorrectly handling the input format for date validation (e.g., MBPP/278, MBPP/427, MBPP/102, MBPP/581).
• Performance bottlenecks and inefficiencies were encountered, particularly with large inputs and complex logic, leading to timeouts and overly complex solutions (e.g., MBPP/765, HumanEval/63).

4.4 Do the generated solutions only satisfy the supplied tests or even unseen tests?
An interesting question in our study is whether the generated solutions are tailored merely to pass the provided tests or if they are robust enough to succeed against unseen tests as well. This question becomes particularly pertinent when considering the potential for omitted requirements in the supplied test cases. While TDD advocates for the use of tests as the primary specifications, the risk of overlooking certain criteria due to suboptimal test quality cannot be ignored.

For the MBPP and HumanEval datasets, we employed their EvalPlus [19] enriched variants with 35x and 80x more tests, respectively, offering a rigorous assessment platform. EvalPlus seeds its test generator with high-quality inputs and extends them through type-aware mutation to establish a robust framework for evaluating the true correctness of LLM-synthesized code.

Our extended evaluation of the generated code utilizing private tests, summarized in the first two columns of Table 1, indicates that including public tests still results in improved performance across benchmarks. Even though there is a reduction in the number of problems solved in each category once verified, the addition of public tests yielded a performance increase of 12.78% for MBPP and 9.15% for HumanEval. Further application of remediation loops led to additional improvements: 5.26% for MBPP and 5.49% for HumanEval. This demonstrates that even without the LLM having access to the comprehensive test suites provided by EvalPlus, including some tests aids in enhancing the correctness of the generated solutions over not using tests.

The occurrence of failures in some solutions when subjected to private tests highlights a crucial point: the more exhaustive the test suites are, the better the results tend to be. Our findings reveal that solutions developed with test information fare well against private tests, underscoring the benefits of integrating tests in the code generation process to bolster robustness. This effectiveness points to a nuanced balance: while incorporating tests, even partially comprehensive ones, bolsters solution robustness, the ultimate success hinges on the quality of these tests. Having a wide variety of tests ensures solutions are not just tailored to meet the explicit criteria in the problem statement but are also able to capture implicit requirements.
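To make the distinction between supplied and private tests concrete, the sketch below shows one way a candidate solution could be checked against the tests it saw and against held-out tests. It is illustrative only and is neither the EvalPlus nor the TGen tooling; the candidate, tests, and helper names are hypothetical:

```python
def passes(code: str, tests: list[str]) -> bool:
    """True if the candidate code satisfies every assert-style test."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
        return True
    except Exception:
        return False

def evaluate(code: str, supplied: list[str], private: list[str]) -> dict:
    # A solution can pass the tests it saw (supplied) yet still miss implicit
    # requirements that only the held-out private tests exercise (RQ4).
    return {
        "passes_supplied": passes(code, supplied),
        "passes_private": passes(code, supplied + private),
    }

candidate = "def first_char(s):\n    return s[0]"
supplied = ["assert first_char('abc') == 'a'"]
private = ["assert first_char('') == ''"]          # unseen edge case
print(evaluate(candidate, supplied, private))       # {'passes_supplied': True, 'passes_private': False}
```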
5 PARAMETERS AFFECTING TDD
In this section, we explore the "HOW" question: how do different parameters impact the effectiveness of tests in code generation? We aim to understand the influence of problem difficulty, the number of tests, and the LLM itself. RQs 5 through 7 address these aspects in detail. RQ5 examines how problem difficulty affects code generation results. RQ6 investigates the optimal number of tests needed for effective code generation. RQ7 evaluates whether the results hold when using an open model like Llama 3 as the reasoning engine. This section builds on the findings from Section 4, providing a deeper understanding of the factors that influence the success of integrating TDD with LLM-based code generation.

5.1 How does problem difficulty affect these results?
5.1.1 Data collection. We curated a problem set of coding problems from CodeChef and their corresponding human solutions. CodeChef assigns each problem on its platform a difficulty score and classifies ranges of difficulties into 11 buckets, as depicted in Table 2. To ensure that our study's problem set has a good distribution with respect to difficulty, we fetched the 100 most popular problems from each difficulty level. Here, popularity refers to the number of accepted solution attempts that existed when the data was scraped (November 2023). The curated multi-level dataset of 1100 problems represents a spectrum of real-world coding challenges of varied difficulties that require full program synthesis.

Table 2: Difficulty Levels of Selected CodeChef Problems

    Level         Range        Count
    Beginner      0 - 999      100
    1* Beginner   1000 - 1199  100
    1* Advanced   1200 - 1399  100
    2* Beginner   1400 - 1499  100
    2* Advanced   1500 - 1599  100
    3* Beginner   1600 - 1699  100
    3* Advanced   1700 - 1799  100
    4*            1800 - 1999  100
    5*            2000 - 2199  100
    6*            2200 - 2499  100
    7*            2500 - 5000  100
    Total                      1100

5.1.2 Results. A breakup of the problems and solution strategies is shown in Figure 5. The data can be summarized as follows: 35.72% (393) of the problems were solved without providing tests, an additional 7.72% (85) were solved with public tests being supplied to the LLM, and another 9.36% (103) were solved using the LLM to look into validation issues and remediate the code. In contrast to function-level generation, where we can expect the function signature and the input and output schemes to be straightforward, file-level generation, as required for the curated CodeChef dataset, allows for more flexibility in problem-solving approaches. This flexibility can make CodeChef problems harder to solve, even without considering task complexity. Performance on the CodeChef dataset is markedly lower, with success rates sharply decreasing as problem difficulty escalates. Despite this, 35.72% of these problems were solved correctly with just the problem statement. This highlights the challenges associated with these problems as compared to the function-level generation datasets we have explored so far. We need to parse the input correctly and provide the output to the console in the expected format, apart from handling the logical complexity of the tasks.

[Figure 5: Problems resolved and strategies used compared across difficulty levels using GPT-4. Chart not reproduced.]

Providing test cases led to a consistent increase in the number of problems solved across difficulty levels, increasing the total number of problems solved by 7.72%. This underscores the importance of test cases in aiding code generation across all difficulty levels. Remediation loops further enhance performance in the CodeChef dataset, with an improvement of 9.36%. This indicates the LLM's capacity for self-improvement and its ability to tackle increasingly sophisticated challenges, particularly for medium-difficulty problems. The complexities associated with correctly handling input and output operations likely contribute to the higher impact of remediation on the full program synthesis required for CodeChef tasks. Most of the CodeChef problems (47.18%, or 519 problems) remained unsolved in our testing configurations. Allowing more remediation attempts could potentially improve these numbers, but our experiments show diminishing returns with increased iterations.

We consider private tests for additional validation to gauge the impact of supplied tests. These tests, not disclosed beforehand, are used by grading systems on platforms like CodeChef to determine correctness. We leveraged these private test suites by evaluating the generated solutions on the platform itself. The results can be seen in the last column of Table 1. We note that we still observe a 7.27% improvement in the correctness of solutions across the dataset comprising 1100 problems.

Table 1: Impact of private tests on GPT-4 based TGen results, with the improvement offered by tests after validation highlighted

    Category                  MBPP (#399)        HumanEval (#164)   CodeChef (#1100)
    Solved without tests      69.67%             78.66%             23.00%
    Needs tests               82.45% (+12.78%)   87.81% (+9.15%)    26.09% (+3.09%)
    Needs remediation         87.71% (+5.26%)    93.30% (+5.49%)    30.27% (+4.18%)
    Unsolved                  12.28%             6.71%              69.73%
    Improvement using tests   18.04%             14.64%             7.27%

Our analysis, as illustrated in Figure 6, delves into the relationship between problem difficulty and the success rate of generated solutions. The data reveals a clear trend: simpler problems tend to be solved effectively using tests alone, while more complex problems often require additional remediation for successful resolution. Notably, the most challenging problems remained unsolved, indicating a direct correlation between problem difficulty and the likelihood of resolution through our current methodological framework. We see that the much less expensive operation of supplying tests quite significantly impacts the fraction of solved problems.

[Figure 6: Distribution of difficulty levels associated with problems solved by each strategy. Chart not reproduced.]

Lastly, the discrepancy observed between the success of solutions on public tests versus their failure on CodeChef's private tests raises important concerns. Our analysis of difficulty scores for problems that failed only on private tests yielded an average score of 1627.75, situating these problems within the medium difficulty range. The performance of GPT-4 generated solutions on these tests (16 were partially accepted, two caused judge errors, six resulted in runtime errors, and 25 exceeded time limits) underscores the necessity of comprehensive testing. These findings highlight the importance of developing more robust tests that can anticipate and mitigate such issues, despite efficiency not being the primary focus of this investigation.
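Because the CodeChef tasks require full program synthesis that reads from stdin and writes to stdout, verifying a candidate differs from the PyTest-based function-level checks. A minimal sketch of how one public test (an input/expected-output pair) could be checked is shown below; it is an illustration under simplifying assumptions, not the actual TGen harness:

```python
import subprocess

def passes_io_test(solution_path: str, stdin_text: str, expected_stdout: str,
                   timeout_s: float = 2.0) -> bool:
    """Run a generated full program on one public test case and compare its output."""
    try:
        result = subprocess.run(
            ["python", solution_path],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,          # competitive-programming style time limit
        )
    except subprocess.TimeoutExpired:
        return False                    # counts as "time limit exceeded"
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()

# Example usage with a hypothetical generated program that prints the sum of two numbers:
# passes_io_test("solution.py", "3 4\n", "7\n")
```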
[Figure 7: Correctness with increasing number of tests supplied. Chart not reproduced.]

5.2 How many tests is good enough?
The number of test cases available per problem varies widely across both function-level generation benchmarks. Considering the scope of our study, we choose to include only the human-written tests in these datasets and not those generated by EvalPlus. To explore the impact of the number of these tests supplied to the LLM, we work with a subset of both datasets so that, as we vary the number of tests, we are still commenting on the correctness observed for the same set of problems. The cutoffs are picked so that we have enough cases to analyze while not dropping too many problems from our analysis. For MBPP, we look at the subset of problems with at least three tests, resulting in 398 problems; only one case is removed. For HumanEval, we look for cases with at least four tests, leaving us with 143 problems out of the initial 164.

We run the evaluation pipeline without remediation for each case in these subsets and repeat the experiment while increasing the number of tests supplied. The variation in the fraction of correct solutions generated in each case is shown in Figure 7. We see that the impact varies across the datasets, but we see an upward trend in both cases as more tests are added. However, we do not observe a plateau in the case of HumanEval, which might be attributed to the simpler nature of problems in MBPP, which seem to suffer more from vagueness than complexity. Also, we note from our experiments that as we continue increasing the number of tests, in some cases we end up with incorrect code being generated for problems that had been correctly solved before. This could be because of the "lost in the middle" problem reported in the literature, where the LLM starts ignoring context in the middle as more and more information is supplied to the model.

Additionally, we look at the sample solution to each of the problems included in the datasets and use it to compute the line coverage contributed by each test case. We note that 75% of the cases in HumanEval and 92.4% of the cases in MBPP achieve 100% coverage with the addition of the first case itself. While high coverage is generally desirable, it is important to recognize that coverage alone does not guarantee test quality or comprehensiveness. Tests achieving full coverage might still miss important edge cases or fail to capture the full functionality of the code. This could be another reason why the curve for MBPP starts to decline with just three tests. We need a better dataset with more human-written test cases for more rigorous analysis, but we hypothesize, based on our results, that adding more diverse test cases aids in the code generation process. It is important to note that the "quality" of tests plays a crucial role in evaluation. While we selected benchmarks containing human-written tests to include a wide variety of tests, good and bad, this approach has limitations. For instance, EvalPlus [19] has highlighted test insufficiency in these datasets, leading to a reduction in pass@k by 19.3-28.9% when using the refined test suite. This raises an interesting question about the potential impact of using lower-quality tests, such as those automatically generated. However, there is currently no clear consensus on what constitutes 'low quality' tests or a standardized dataset to evaluate them. Future work could explore the comparison between generated tests and human-written tests, as well as the impact of test quality on code generation performance.

5.3 Do these results hold with an open model like Llama 3?
Apart from GPT-4 Turbo, we also experimented with GPT-3.5 Turbo v1106 and observed similar trends. However, in our discussion so far, we presented only results from GPT-4, as it represents one of the most powerful LLMs available for public use. An interesting question is whether the benefits of TDD can be seen outside OpenAI's GPT family of models. To address this question, we pick Meta Llama 3 70B Instruct, another popular model that shows promise for code generation, ranking among the best models on code [9, 19] and non-code LLM benchmark leaderboards [7]. The output from TGen for GPT-3.5 Turbo is also included in our replication package.

The results observed by replacing GPT-4 with Llama 3 in the evaluation pipeline can be seen in Figure 8. We see that the trends observed with proprietary models extend to open models. Code generation using Llama 3 benefits from the additional tests and remediation setup provided by TGen, and we even see a greater impact than we observed with GPT-4. As compared to GPT-4, which had over 80% of the problems solved in the baseline, we see a lower baseline performance of 52.6% and 67.7% on the MBPP and HumanEval benchmarks, respectively. Despite the worse code generation performance out of the box, we note that the addition of tests and remediation both significantly improve the results.

[Figure 8: Llama 3: Problems resolved and strategies used for function-level code generation benchmarks. Chart not reproduced.]

The final measure of improvements, observed after evaluating the correctness of generated solutions using EvalPlus, can be seen in Table 3. After validating the solutions on private tests, we see a 38.6% improvement in MBPP and a 21.95% improvement in HumanEval as compared to their respective baselines, which is almost double the improvement we saw in the case of GPT-4, shown in Table 1.

Table 3: Impact of private tests on Llama 3 based TGen results, with the improvement offered by tests after validation highlighted

    Category              MBPP (#399)        HumanEval (#164)
    Solved without tests  46.37%             62.20%
    Needs tests           75.94% (+29.57%)   75.61% (+13.41%)
    Needs remediation     84.96% (+9.02%)    84.15% (+8.54%)
    Unsolved              15.04%             15.85%
    Improvement           38.60%             21.95%
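The per-test line-coverage measurement described in Section 5.2 can be approximated with the coverage package; the sketch below is illustrative (the file layout and helper name are assumptions, and the study's exact bookkeeping may differ):

```python
import coverage

def line_coverage_of_test(solution_file: str, test_snippet: str) -> float:
    """Fraction of lines in the sample solution executed by a single test case."""
    cov = coverage.Coverage()
    cov.start()
    namespace: dict = {}
    with open(solution_file) as f:
        exec(compile(f.read(), solution_file, "exec"), namespace)  # load the sample solution
    exec(test_snippet, namespace)                                  # run one test against it
    cov.stop()
    # analysis2 returns (filename, statements, excluded, missing, missing_str)
    _, statements, _, missing, _ = cov.analysis2(solution_file)
    return 1 - len(missing) / len(statements) if statements else 1.0

# e.g. line_coverage_of_test("sample_solution.py", "assert solve([1, 2, 3]) == 6")
```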
1591
ASE ’24, October 27-November 1, 2024, Sacramento, CA, USA Noble Saji Mathews and Meiyappan Nagappan

Generating tests is another way tests for code generation are


being explored recently [23]. Lever [21] attempts to rank genera-
tions based on trained verifiers derived from the prompt. A compre-
hensive exploration is presented by Codet, where they assess the
quality of generated test cases and their impact [5]. Lahiri et al. [15]
explore the use of test cases for user intent formalization, but they
rely on developer feedback for mutating and ranking test and code
suggestions effectively using generated tests. Fakhoury et al. build
upon this by leveraging user feedback through LLM-generated
tests to improve the correctness of generated code presented as
Ticoder [11]. While Lahiri et al. and Fakhoury et al. present ‘test-
driven’ methodologies, they focus on intent formalization during
which both tests and code are generated iteratively, adapting to
user feedback. In contrast, our approach aligns more closely with
traditional TDD principles by starting with user-provided tests that
Figure 8: Llama 3: Problems resolved and strategies used for formalize initial requirements. Their user study [11] highlights the
Function-level code generation benchmarks correct evaluation of AI-generated code for entry-level tasks but,
also shows participants failing to distinguish wrong tests surfaced
by the recommender, which is concerning. We thus refrain from
Table 3: Impact of private tests on Llama 3 based TGen re- leveraging user feedback during generation or commenting on
sults with improvement offered by tests after validation high- the effort and accuracy tradeoffs involved. Similarly, 3DGen [10]
lighted uses LLMs to formalize user intent into test cases, but emphasizes
symbolic techniques for exhaustive test generation and formal ver-
Category MBPP(#399) HumanEval(#164) ification. However, both methods still focus on generating tests
Solved without tests 46.37% 62.20% rather than starting from user-written test cases. A key gap in the
Needs tests 75.94% (+29.57%) 75.61% (+13.41%) current literature is a lack of user studies evaluating the impact and
Needs remediation 84.96% (+9.02%) 84.15% (+8.54%) effectiveness of such approaches in the long run.
Unsolved 15.04% 15.85% The aforementioned works showcase the potential for using
Improvement 38.60% 21.95% verification and validation techniques to improve the performance
of LLMs across different complexity levels and tasks. Further, they
also hint at the efficacy of using LLMs for program repair through
feedback-based techniques. None of these studies, however, answers
benchmark enhancements for evaluating LLM-generated code’s key questions surrounding the impact of including human-written
functional correctness, highlighting the insufficiency of current test information and how these mechanisms can be employed in
benchmarks in depth and breadth. Ding et al. present CrossCodeE- a test-driven workflow. We wish to bridge this gap and structure
val [8], which challenges LLMs with multilingual, cross-file context our experiments to explore how to use tests efficiently and how
code completion. Jimenez et al.’s SWE-bench [14] uses real-world test-driven development principles can be used to improve code
GitHub issues to test LLMs, with results showing current models generation.
solve only a fraction of problems.
Improving code generation using tests has emerged as a
popular trend among other works that have explored other methods, 7 THREATS TO VALIDITY
such as improving the reasoning ability of LLMs by using diverse Prompt engineering, while a powerful tool to guide LLMs, is also
prompts, voting, and other reasoning verification strategies apart subject to limitations. Poorly designed prompts can lead to sub-
from tests [18]. Wang et al. employ test execution feedback while optimal results, skewing the model’s results and potentially mislead-
training the model to enable the model itself to discern incorrect ing it. To minimize this, we keep the prompt structure consistent
code [27]. Chen et al.’s Codex [6] demonstrates the limitations of across our experiments and vary only the parameters under con-
single-sample code generation and suggests that one could similarly sideration. Using popular benchmark datasets, such as MBPP and
exploit multiple test iterations to refine code quality. There have HumanEval, has inherent limitations. These benchmarks do not
also been attempts at making compilable code by incorporating capture the full complexity and variability of real-world program-
feedback from compilers [28]. Shinn et al.’s Reflexion [24] method ming problems but help shed light on the impact of tests. We also
allows agents to learn from past mistakes, emphasizing the need use our own dataset of problems with a clear notion of difficulty,
for approaches that can iterate over code generation and testing but this might not be representative of the difficulty of tasks in
7 THREATS TO VALIDITY
Prompt engineering, while a powerful tool to guide LLMs, is also subject to limitations. Poorly designed prompts can lead to suboptimal results, skewing the model's outputs and potentially misleading it. To minimize this, we keep the prompt structure consistent across our experiments and vary only the parameters under consideration. Using popular benchmark datasets, such as MBPP and HumanEval, has inherent limitations. These benchmarks do not capture the full complexity and variability of real-world programming problems but help shed light on the impact of tests. We also use our own dataset of problems with a clear notion of difficulty, but this might not be representative of the difficulty of tasks in software engineering.

A significant threat to validity in evaluating LLMs on benchmark datasets is the potential for data leakage. Given the closed nature of training data for models like GPT-4, we cannot guarantee the absence of overlap between the benchmarks and the
model's training data. This is a common concern for all evaluation benchmarks used with LLMs. To partially mitigate this, we curated an additional dataset from the competitive programming website CodeChef. While this doesn't completely eliminate the possibility of leakage, it provides a different source of problems that may be less likely to have been included in the model's training data. It's important to note that our study focuses on the incremental improvements achieved by incorporating tests, which we believe provides valuable insights into LLM capabilities regardless of the exact composition of their training data.

This study focused on human-written tests from established benchmarks or competitive programming websites to include a wide variety of tests. However, these tests might not be representative of those found in real-world projects. Future work should investigate the quality of the test suite and its impact on code generation. Further, our assumption that test runs serve as ground truths for the corresponding problems may not always hold, particularly for non-deterministic problems. We excluded such problems from our curated dataset and filtered out errors through manual inspection. Recent literature also suggests that the tests used in widely adopted benchmarks may not be robust enough for evaluating code generation. To address this, we used the refined versions of the HumanEval and MBPP datasets curated by EvalPlus [19], which include additional tests for more thorough evaluation. The performance of TDD with LLMs might vary with different models, architectures, or training data. Further, the inherent variability in LLM outputs can introduce inconsistencies. We employ a fixed seed value and temperature setting to improve the reproducibility of our results, but some degree of unpredictability remains.
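As an illustration of pinning decoding parameters per request (not the exact configuration used in this study), the sketch below assumes the OpenAI Python client (v1 or later); the model name, prompt, and seed value are placeholders.

```python
# Minimal sketch of fixing temperature and seed for more reproducible generations,
# assuming the OpenAI Python client (>=1.0); model, prompt, and seed are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",                      # placeholder model identifier
    messages=[{"role": "user", "content": "Write a Python function ..."}],
    temperature=0,                      # low-variance decoding
    seed=42,                            # best-effort determinism; not guaranteed
)
print(response.choices[0].message.content)
```

Even with a fixed seed the API offers only best-effort determinism, which is consistent with the residual unpredictability noted above.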
8 CONCLUSION
In the rapidly evolving landscape of LLMs for code, where these models are being increasingly used for code generation and remediation, embracing TDD emerges as a strategic paradigm shift. TDD enables us to use tests in the generation phase and for validation. By incorporating test cases and employing remediation loops, we are able to solve complex problems that the LLM cannot solve normally. Using GPT-4, we observe a significant increase in the correctness of the code generated. We find that this improvement is more pronounced for less performant models that can only solve a much lower fraction of problems initially. We also highlight the cases in which these approaches work well and the kind of tasks that present-day LLMs struggle with. Our experiments with GPT-4 and Llama 3 show that merely providing the LLM with tests during generation improves correctness by 9.15 to 29.57%. Adding remediation loops results in an additional gain of 5.26 to 9.02% across popular function-level benchmark datasets after rigorous evaluation with private tests. We also find that the proposed approach improves correctness by 7.27% on our dataset of 1100 file-level generation tasks where GPT-4 performs poorly and struggles with more difficult problems. Our findings validate the impact of tests, and we therefore advocate for the widespread adoption of test-driven methodologies to maximize the benefits of LLMs in code generation. While our results demonstrate the benefits of TDD in LLM-based code generation, future work should focus on quantifying the tradeoffs between the effort required to write tests and the resulting increase in accuracy. A user study could provide valuable insights into how this approach affects overall development time and code quality in real-world scenarios.

9 ACKNOWLEDGMENTS
This work was supported in part by the NSERC Discovery Grant. We would also like to thank the reviewers for their time and valuable feedback.

REFERENCES
[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
[2] Kent Beck. 2022. Test driven development: By example. Addison-Wesley Professional.
[3] Vasili Braga. 2023. Decentralised autonomous society through large language models' based agents: a pathway to empower small communities. Journal of Engineering Sciences 3 (2023), 88–120.
[4] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. 2023. MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering (2023).
[5] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397 (2022).
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[7] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132 (2024).
[8] Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. arXiv preprint arXiv:2310.11248 (2023).
[9] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
[10] Sarah Fakhoury, Markus Kuppe, Shuvendu K Lahiri, Tahina Ramananandro, and Nikhil Swamy. 2024. 3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers. arXiv preprint arXiv:2404.10362 (2024).
[11] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K Lahiri. 2024. LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation. arXiv preprint arXiv:2404.10100 (2024).
[12] Gordon Fraser, Matt Staats, Phil McMinn, Andrea Arcuri, and Frank Padberg. 2015. Does automated unit test generation really help software testers? a controlled empirical study. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 4 (2015), 1–49.
[13] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. NeurIPS (2021).
[14] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770 (2023).
[15] Shuvendu K Lahiri, Aaditya Naik, Georgios Sakkas, Piali Choudhury, Curtis von Veh, Madanlal Musuvathi, Jeevana Priya Inala, Chenglong Wang, and Jianfeng Gao. 2022. Interactive code generation via test-driven user-intent formalization. arXiv preprint arXiv:2208.05950 (2022).
[16] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 919–931.
[17] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097.
[18] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5315–5333.
[19] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
[20] Michael R Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, and Patanamon Thongtanunam. 2024. Automatic Programming: Large Language Models and Beyond. arXiv preprint arXiv:2405.02213 (2024).
[21] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning. PMLR, 26106–26128.
[22] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
[23] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering (2023).
[24] Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366 (2023).
[25] Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Susmit Jha, Prem Devanbu, and Toufique Ahmed. 2024. Quality and Trust in LLM-generated Code. arXiv preprint arXiv:2402.02047 (2024).
[26] Yuvraj Virk, Premkumar Devanbu, and Toufique Ahmed. 2024. Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores. arXiv preprint arXiv:2404.19318 (2024).
[27] Xin Wang, Xiao Liu, Pingyi Zhou, Qixia Liu, Jin Liu, Hao Wu, and Xiaohui Cui. 2022. Test-Driven Multi-Task Learning with Functionally Equivalent Code Transformation for Neural Code Generation. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–6.
[28] Xin Wang, Yasheng Wang, Yao Wan, Fei Mi, Yitong Li, Pingyi Zhou, Jin Liu, Hao Wu, Xin Jiang, and Qun Liu. 2022. Compilable neural code generation with compiler feedback. arXiv preprint arXiv:2203.05132 (2022).
[29] Yixuan Weng, Minjun Zhu, Shizhu He, Kang Liu, and Jun Zhao. 2022. Large language models are reasoners with self-verification. arXiv preprint arXiv:2212.09561 (2022).
[30] Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. Repocoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570 (2023).
[31] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. arXiv preprint arXiv:2404.05427 (2024).