Test-Driven Development and LLM-based Code Generation

Noble Saji Mathews and Meiyappan Nagappan

ABSTRACT
Recent Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements. This increasingly automated process mirrors traditional human-led software development, where code is often written in response to a requirement. Historically, Test-Driven Development (TDD) has proven its merit, requiring developers to write tests before the functional code, ensuring alignment with the initial problem statements. Applying TDD principles to LLM-based code generation offers one distinct benefit: it enables developers to verify the correctness of generated code against predefined tests. This paper investigates if and how TDD can be incorporated into AI-assisted code-generation processes. We experimentally evaluate our hypothesis that providing LLMs like GPT-4 and Llama 3 with tests in addition to the problem statements enhances code-generation outcomes. We experimented with established function-level code generation benchmarks such as MBPP and HumanEval. Our results consistently demonstrate that including test cases leads to higher success in solving programming challenges. We assert that TDD is a promising paradigm for helping ensure that the code generated by LLMs effectively captures the requirements.

CCS CONCEPTS
• Software and its engineering → Software development techniques; • Computing methodologies → Artificial intelligence.

KEYWORDS
Code Generation, LLM, TDD, Testing, Software Engineering

ACM Reference Format:
Noble Saji Mathews and Meiyappan Nagappan. 2024. Test-Driven Development and LLM-based Code Generation. In 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24), October 27-November 1, 2024, Sacramento, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3691620.3695527

1 INTRODUCTION
Large Language Models (LLMs) are transforming software development. However, ensuring correctness in this domain remains a challenge. Despite their advanced capabilities, LLMs often exude deceptive confidence in their outputs, which can be misleadingly erroneous [19, 25, 26]. This issue becomes especially critical when the cost of failure is high. The potential pitfalls of LLMs, such as generating syntactically correct but logically flawed code or failing to adhere to specific requirements, underscore the need for stringent validation mechanisms. Ensuring correctness is not just about preventing errors; it is about building trust in machine-generated code and ensuring it meets the rigorous standards required in real-world applications. Programmers will need to act as quality assurance specialists, spending more time checking the validity of autogenerated code, utilizing both traditional testing tools and advanced automated program repair techniques to ensure code quality [20].

Generating tests has gained additional momentum with the emergence of LLMs, and recent techniques boast testing performance similar to human-written tests. Past work, however, has questioned how coverage as a goal may not be an effective target to set [12]. If LLMs fail to handle edge cases in code, tests derived from the implementation can be as wrong as the generated code itself. While LLMs might be useful in helping developers formalize intent, this paper does not seek to answer the question "When and how should AI write tests?". This work rather seeks to explore whether there might be a better paradigm for utilizing tests in LLM-based code generation.

Test-driven development (TDD) [2] offers a structured approach to mitigating the risks associated with code generation. This software development process revolves around the repetitive cycle of writing a test for a specific function or feature, developing code to pass that test, and subsequently refactoring the code to meet the desired standards of quality and efficiency. Hence, TDD could serve as a critical framework for validating the correctness and functionality of LLM-generated code.
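To make this cycle concrete, the following minimal Python sketch shows the test-first step; the function name and behaviour are illustrative choices of ours, runnable with pytest, and are not taken from the paper's benchmarks.

# TDD's red-green-refactor cycle in miniature (illustrative, runnable with pytest).
# Step 1 (red): the test is written first and initially fails, because slugify does not exist yet.
def test_slugify_replaces_spaces_and_lowercases():
    assert slugify("Hello World") == "hello-world"

# Step 2 (green): the minimal implementation added only to make the failing test pass.
def slugify(text: str) -> str:
    return "-".join(text.lower().split())

# Step 3 (refactor): the implementation can now be cleaned up while the test stays green.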
There is extensive research into generating test cases from code, whether human or machine-authored [16, 23]. However, the reciprocal process has not been systematically evaluated, particularly regarding how human-written tests (created as part of the TDD process) can improve LLM-generated code. By examining how LLMs respond to the additional context and constraints of test cases, we can gain insights into their potential to enhance code correctness through a more informed code generation process. While it may seem obvious to assume that including tests alongside problem statements will inherently enhance the correctness of machine-generated code, this assumption requires careful examination. The actual impact of human-written tests on improving the quality of LLM-generated code remains underexplored. Specifically, the extent to which LLMs can interpret and utilize these tests to refine their code outputs has not been thoroughly investigated. This gap highlights a critical area for empirical study: measuring the effectiveness of test cases in guiding LLMs toward generating more accurate code that captures requirements which may not be explicitly stated (much like the goals of TDD).
In our empirical study, we create a simple framework, TGen, to evaluate our research questions. TGen is rooted in TDD principles, requiring developers to precisely articulate their coding intents through tests. TGen operates within a continuous self-improving loop. Developers begin by detailing their requirements and intended functionalities through a combination of problem statements and test cases. These test cases then serve as a definitive guide for the expected behaviour of the code, allowing TGen to generate code that adheres to these requirements. Integral to the framework is an LLM-based feedback mechanism designed to iteratively refine and correct the generated code, ensuring it passes all provided unit tests. By leveraging test cases as a benchmark for validation, TGen attempts to generate code satisfying these requirements. In cases where TGen cannot create correct code, by design, developers know that the generated code does not pass all the test cases and can therefore correct the code before including it in their projects.

We investigate whether incorporating tests in the code generation process improves the model's ability to generate correct code, and we also discuss the impacts and challenges of this approach. While we use TGen in our study for evaluation, it is important to note that our contribution is not the framework itself but the empirical study we conduct. We systematically explore whether the TDD paradigm works in AI-driven code generation, under what conditions it is effective, and how tests can be useful. Our findings can support or challenge the use of TDD in this context.

By integrating TDD, developers can continuously test the outputs of LLMs against predefined specifications and use cases. This helps identify and correct errors early in the development process and aligns the model's outputs with the specific requirements of the task at hand. Thus, the adoption of TDD for code generation could stand as a promising strategy to enhance the practical utility of these advanced AI systems in software development.

2 RESEARCH QUESTIONS
Our goal is to examine the baseline capabilities of neural code generation and the incremental benefits introduced by incorporating tests. Specifically, we explore if and how tests are helpful through the following four research questions:

RQ1: How good is basic code generation? Basic neural code generation, particularly with LLMs, has shown promising results in various tasks. However, its efficacy in complex coding scenarios without any test information remains a subject of exploration. We aim to evaluate the fundamental capabilities of these models in generating syntactically and logically correct code, especially in standard programming tasks. This investigation will provide insights into the baseline performance of LLMs in code generation, setting a benchmark for further discussions.

RQ2: Does providing test cases improve code generation? The foundational principle of TDD is that software should be developed primarily by writing tests before any functional code is written. This method has historically led to robust, well-documented software that is closely aligned with user requirements. By requiring developers to define explicit tests upfront, TDD ensures that all new code functions as intended and adheres strictly to the specified requirements. This rigorous approach to software development could be transformative when applied to the code-generation capabilities of LLMs. We hypothesize that by providing LLMs like GPT-4 with both problem statements and corresponding tests, the generated code will achieve the required functionality and exhibit an enhanced adherence to the intended specifications and requirements.

RQ3: Can failed tests help LLMs fix their mistakes? We investigate the ability of LLMs to learn from their errors through remediation loops. This involves assessing whether LLMs can iteratively refine and correct their code outputs based on feedback or error detection. The remediation loop in our study closely mirrors the iterative cycle at the heart of TDD. In traditional TDD, the process involves initially writing tests, which the new code then fails, leading developers to refine the function until it passes the tests. We aim to explore the potential of LLMs to self-correct, as reported in past literature [29], and their ability to better align the code they generate with the provided specifications.

RQ4: Do the generated solutions satisfy only the supplied tests, or also unseen tests? This research question explores a critical aspect of AI-generated code: its robustness and generalizability beyond the explicitly defined test cases. In traditional software development, the ability of code to perform well under a variety of scenarios—especially those not explicitly anticipated during testing—is a hallmark of high-quality software. TDD is grounded in the principle that tests are pivotal in improving design and functionality. However, there remain questions about the extent to which this principle applies when code is generated by LLMs.

In addition to these research questions, we delve deeper into the complexities of incorporating tests, discussing our findings further through the following additional explorations:

EXP1: How does problem difficulty affect these results? We hypothesize that a problem statement's difficulty could significantly impact an LLM's performance in code generation. We aim to explore how varying levels of problem complexity influence the model's ability to generate correct code that satisfies the requirements expressed by the statement and supplied test cases. We curate our own dataset consisting of competitive programming problems of different difficulty levels and attempt to understand how increasing difficulty impacts our approach.

EXP2: How many tests are good enough? Determining the optimal number of tests necessary for effective code generation by LLMs is a key concern. In traditional TDD, the number and quality of tests are crucial for ensuring robust software. We aim to investigate how the quantity of provided test cases influences the correctness of the code generated by LLMs. Intuitively, adding tests should only help clarify the requirements; however, when working with LLMs, we are limited by the context length of the model, which controls how much information we can feed the model. By experimenting with varying numbers of tests, we wish to identify
if there exists a threshold that balances thoroughness and efficiency, guiding best practices for using tests to enhance LLM-based code generation.

EXP3: Do these results hold with an open model like Llama 3? While our approach is LLM agnostic, the LLM itself plays a major role in our experiments, requiring the ability to write correct code and to review and fix incorrect code. While there may exist an ideal combination of different LLMs specialized for each of these tasks, we wish to explore whether this structure would work with models outside OpenAI's GPT family as well. By evaluating Meta's Llama 3 under the same conditions, we can assess if similar patterns emerge, providing insights into the consistency and reliability of our findings across different LLMs.

3 METHODOLOGY
We design our experiments to help evaluate the effectiveness of integrating TDD with LLM-based code generation. We aim to determine if tests affect LLMs and when they might be useful.

3.1 Datasets Used
Evaluating code generation typically employs benchmark datasets such as MBPP [1], HumanEval [6], CodeContests [17], or APPS [13]. Recently, newer datasets have explored class-level [9] and repository-level code completion [30], as well as automated software engineering [14]. However, we focus on function-level generation datasets like MBPP and HumanEval for two reasons.

(1) MBPP and HumanEval are standard benchmarks widely used to evaluate LLMs on code generation leaderboards [9, 19].
(2) Repository-level code generation requires context about the codebase, making it complex and less controlled. Class-level benchmarks can be evaluated in various generation configurations, introducing many variables that complicate the analysis. Function-level generation datasets like MBPP and HumanEval are relatively more straightforward. These involve generating the body of a function with a given signature, providing a clear and manageable scope for evaluation. By focusing on MBPP and HumanEval, we can systematically explore our research questions in a controlled environment where we can focus on TDD and its effects.

By establishing the effectiveness of TDD in LLM-based code generation using these datasets, we lay the groundwork for exploring more complex datasets focused on real-world TDD data in future studies. Conversely, if TDD proves ineffective with these benchmarks, it may indicate limited applicability in this context. We utilized refined versions of the HumanEval and MBPP datasets curated by EvalPlus [19], which adds additional tests to allow for a more rigorous evaluation of the generated code.

• HumanEval: Contains 164 handwritten Python programming problems. Each problem includes a function signature, docstring, body, and several unit tests. The EvalPlus variant we use includes additional unit tests (80 times the original) for each of the 164 problems, providing more extensive coverage and thorough testing.
• MBPP: Comprises around 1,000 crowd-sourced Python programming problems designed for entry-level programmers. Each problem includes a task description, a code solution, and three automated test cases (an illustrative task of this form is sketched after this list). The version we use includes 35 times the original number of unit tests for each problem. It consists of 399 tasks, which are a subset of the original MBPP dataset. This subset is selected by the authors of EvalPlus from the sanitized MBPP (a set of 427 tasks manually filtered by the original MBPP authors), with further removal of low-quality and ill-formed tasks for benchmark quality control.
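The snippet below is a small illustration of what such a task looks like; the problem wording, function name, and asserts are constructed by us to convey the format and are not a verbatim entry from MBPP or HumanEval.

# Illustrative MBPP-style task (constructed example, shown only to convey the dataset format).
# Problem statement: "Write a python function to find the shared elements of two sequences."
def similar_elements(seq1, seq2):
    return set(seq1) & set(seq2)

# Each task ships with a handful of assert-style unit tests such as the following.
assert similar_elements((3, 4, 5, 6), (5, 7, 4, 10)) == {4, 5}
assert similar_elements((1, 2, 3, 4), (5, 4, 3, 7)) == {3, 4}
assert similar_elements((11, 12, 14, 13), (17, 15, 14, 13)) == {13, 14}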
We never supply the LLM with any of the additional test cases that EvalPlus added using dynamic program testing techniques; we use them only to test correctness. However, existing work in the literature has questioned the accuracy of function-level benchmarks for evaluating LLMs [4, 22]. We also look at file-level generation to address this concern and validate our findings. For this, we curate our own dataset of 1100 problems from a popular competitive programming website, CodeChef, for the following reasons:

• We also wish to explore how difficulty impacts our results, and existing datasets do not have a balanced set of problems spanning various difficulty levels.
• CodeContests v2 is not public, and the older dataset is outdated and filled primarily with Python 2 solutions and problems that no longer accept solutions on the competitive programming platforms they were extracted from.
• The APPS dataset, which is a more recent dataset of coding challenge problems, does not give a clear difficulty measure but rather categorizes problems into abstract buckets, which does not work well for our purposes.

3.2 TGen Framework
[Figure 1 (diagram): Problem Statement and Supplied Tests → Coder Agent (LLM) → Generated Code → Verifier → Crash Logs → Remediation Agent, forming the remediation loop (iteration limit: loop "N" times or until "M" repeated test failures, for RQ3); Private Tests are used for validation (RQ4). Agent configurations: RQ1 (Baseline): code generation with the problem statement alone; RQ2 (Prompt + Test): code generation with the problem statement and test cases.]

Figure 1 provides an overview of the proposed TGen framework's pipeline, which we use for evaluation. Each part of the pipeline is described below:
(1) Input Phase: The evaluation pipeline starts by accepting two main inputs: the problem statement describing the task, and the tests containing the public unit tests and the required signature, if any, to be enforced in the desired solution. These tests serve both as specifications for the code to meet and as a verification tool for the output. Tests provided to the LLM during generation are labelled supplied tests, while the rest are private tests. For competitive programming problems, the challenge description is the problem statement, and the available public tests are the supplied tests.
(2) LLM Engine: The core reasoning engine, powered by state-of-the-art LLMs (e.g., Meta Llama 3, GPT-3.5, and GPT-4 Turbo), processes the structured input. The foundational model leverages its own knowledge and any supplied context to generate an output response. This study uses GPT-4 Turbo v1106 for all experiments, with results validated using Meta Llama 3.
(3) LLM Agents: In the context of Large Language Models (LLMs), "agents" are systems that utilize LLMs as their underlying reasoning engine [3]. These agents are designed with predefined prompts and thought patterns, enabling them to interact intelligently with users or other systems. Agents
in our pipeline utilize the intent and available context from the input phase to guide the LLM engine in generating code, reviewing snippets, and suggesting fixes.
(a) Coder Agent: This agent takes the supplied information and generates a code snippet from it. The system prompt used when tests are supplied for function-level generation cases is shown in Figure 2. The first part provides a rough thought process to instruct the LLM on how to utilize the tests, and the second part contains instructions specific to the structure of tasks in a dataset. Though the overall structure of the prompt remains the same, variants of it are used in different cases, namely when no tests are supplied or when a file-level generation task is performed.
(b) Remediation Agent: This agent looks at the failures reported by the verifier and attempts to suggest how to fix the code, which is then taken up by the "Coder Agent" in the next iteration. The structure of the prompts is shown in Figure 3.
(4) Output Phase: The generated code is extracted from the LLM's output, which is then subject to validation. The output is expected to satisfy both the functional requirements and the constraints imposed by the tests, if any.
(5) Verification: The output of the coder agent is always run through the verifier, which uses PyTest, a Python unit-testing framework, to verify the generated code against all available tests. The verifier collects crashes or test failures and creates a comprehensive list of failures for the LLM to inspect and address. The verifier identifies any discrepancies and ensures only correct solutions are accepted and returned to the user (a minimal sketch of such a verification step is given after this list).
(6) Remediation Loop: The system enters a remediation loop if the generated code fails the tests. The remediation agent uses failure information (e.g., stack trace, detected issues, executed lines) to refine feedback for the coder agent. This loop runs for a fixed number of iterations or until the same tests fail repeatedly, showing no progress. We limit the loop to five iterations and three repeated failures for our experiments. We observed that the remediation advice often becomes repetitive beyond three to four iterations, yielding diminishing returns on further attempts to fix the code.
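A minimal sketch of such a verification step is shown below. The file names and the choice of shelling out to pytest through a subprocess are our own illustrative assumptions, not details of TGen's actual implementation.

import subprocess
import sys
from pathlib import Path

def verify(candidate_code: str, test_code: str, workdir: Path) -> tuple[bool, str]:
    # Write the generated snippet and its tests to disk, then run pytest on them.
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "solution.py").write_text(candidate_code)
    (workdir / "test_solution.py").write_text(test_code)
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", str(workdir)],
        capture_output=True,
        text=True,
    )
    # pytest's textual report (assertion diffs, tracebacks) doubles as the failure log
    # handed to the remediation agent; a zero exit code means every test passed.
    return result.returncode == 0, result.stdout + result.stderr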
To conduct our experiments in a cost- and compute-efficient manner, our pipeline executes in the following fashion (a sketch of this control flow is given after the list):

• Solved without tests: The first solution is attempted with the problem statement alone and verified for correctness.
• Needs tests: If the problem statement alone fails to generate a valid solution, the pipeline is rerun, supplying all public tests along with the problem statement. The newly generated code is then verified against all tests.
• Needs remediation: If the previous step results in an invalid solution, the failure data is collected from the verifier and the remediation loop is executed until a correct solution is generated.
• Unsolved: If the remediation loop exits without generating a valid solution, the problem is considered unsolvable by TGen.
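The sketch below captures this tiered control flow under the iteration and repeated-failure limits described earlier. Here generate_code, remediate, and verify are placeholder callables standing in for the coder agent, remediation agent, and verifier (passed in so the sketch stays self-contained), and comparing failure logs as a proxy for "the same tests fail repeatedly" is a simplification of ours.

def solve(problem, supplied_tests, generate_code, remediate, verify,
          n_iters=5, max_repeats=3):
    # Tier 1 ("solved without tests"): problem statement alone.
    code = generate_code(problem)
    ok, log = verify(code, supplied_tests)
    if ok:
        return code, "solved without tests"

    # Tier 2 ("needs tests"): rerun with the public tests included in the prompt.
    code = generate_code(problem, supplied_tests)
    ok, log = verify(code, supplied_tests)
    if ok:
        return code, "needs tests"

    # Tier 3 ("needs remediation"): loop until the tests pass, the iteration budget
    # runs out, or the same failures repeat with no progress.
    repeats, last_log = 0, None
    for _ in range(n_iters):
        advice = remediate(code, log)
        code = generate_code(problem, supplied_tests, advice)
        ok, log = verify(code, supplied_tests)
        if ok:
            return code, "needs remediation"
        repeats = repeats + 1 if log == last_log else 0
        last_log = log
        if repeats >= max_repeats:
            break
    return code, "unsolved"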
We provide a replication package (https://osf.io/e3jy6/?view_only=bc67e33bebd3435abf5537d56767401d), which includes output and runtime details from TGen for all cases we experimented with, as well as the code scripts for the experiments. Output from LLMs is often unpredictable. We use a seed value of "1106" and a temperature of 0 in our experiments to improve the replicability of our results.
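As an illustration, these settings map directly onto parameters of the OpenAI Python client; the prompt content below is a placeholder, and OpenAI documents the seed parameter as best-effort rather than a strict determinism guarantee.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-1106-preview",   # GPT-4 Turbo v1106
    temperature=0,                # greedy decoding to reduce run-to-run variance
    seed=1106,                    # best-effort reproducible sampling
    messages=[
        {"role": "system", "content": "You are an expert Python programmer."},
        {"role": "user", "content": "<problem statement and supplied tests>"},  # placeholder
    ],
)
print(response.choices[0].message.content)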
4 IMPACT OF TDD IN CODE GENERATION
In this section, we seek to answer the "IF" question: Do tests help the code generation process? We also manually explore the data to assess the impact of tests and determine which types of problems benefit from tests and which do not. The fraction of correct solutions
Your task is only to write a Python function to satisfy requirements specified by the users Prompt. Only generate a single complete Python code snippet. Start generating after the [PYTHON] tag. Your python solution should end with a [/PYTHON] tag.

Multi-step prompting for the case with tests supplied:
1. Look at the "Prompt" and "Tests" provided to understand the users requirements.
2. Generate code based on requirements in the prompt and ensure the logic is such that it would pass the corresponding tests provided.
3. Do not attempt to programmatically verify the tests. The test should not be part of the generated code in any way.
4. Do not write code in order to satisfy the tests, primarily focus on satisfying the prompt.
5. Include necessary imports in the generated code.
6. Your code is expected to read input from stdin and write output to stdout as per the requirements in the prompt.
7. Define additional helper functions if needed.
8. Your output should be as shown in the sample tests. Ensure you do not return or print any additional information that can cause the tests to fail.

Figure 2: System prompt used by the Coder Agent when tests are supplied

System: You are an expert Python programmer, your job is to look at code and reason why it doesn't work as intended. Once you reason why it doesn't work, you will provide a prompt that highlights the key points that need to be fixed in the code. You will not write the code yourself. IMPORTANT: Tests cannot be modified. Never make a suggestion outside the scope of modifying the solution code snippet
Assistant: <Code Snippet under remediation>
User: This code is not correct as it led to the following issues: <List of issues following test execution extracted by log-analyzer>

Figure 3: Configuration of the Remediation Agent
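A sketch of how the three turns of Figure 3 could be assembled into chat messages is shown below, assuming an OpenAI-style role/content schema; the helper name and message format are illustrative, not the paper's actual implementation.

REMEDIATION_SYSTEM_PROMPT = (
    "You are an expert Python programmer, your job is to look at code and reason why it "
    "doesn't work as intended. Once you reason why it doesn't work, you will provide a "
    "prompt that highlights the key points that need to be fixed in the code. You will "
    "not write the code yourself. IMPORTANT: Tests cannot be modified. Never make a "
    "suggestion outside the scope of modifying the solution code snippet"
)

def build_remediation_messages(code_under_remediation, failure_log):
    # Mirror the three turns of Figure 3: the system prompt, the candidate code as an
    # assistant turn, and the verifier's extracted issues as the user turn.
    return [
        {"role": "system", "content": REMEDIATION_SYSTEM_PROMPT},
        {"role": "assistant", "content": code_under_remediation},
        {"role": "user", "content": "This code is not correct as it led to the following issues:\n"
                                    + failure_log},
    ]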
MBPP dataset and 15 additional problems in the HumanEval dataset. Manually inspecting the results of the evaluation with and without tests, several key insights emerged:

• Problems where the initial implementation had correct logic but incorrect function signatures benefited from tests, as they helped identify mismatches and ensure adherence to the required function signatures (e.g., MBPP/6, MBPP/90, MBPP/233).
• Problems involving mathematical formulas or geometric calculations saw improvements, as tests helped catch errors in formula application and ensured the correct implementation of mathematical logic (e.g., MBPP/14, MBPP/430).
• Problems requiring precise string manipulations or regular expressions benefited from tests highlighting edge cases and incorrect pattern matching, leading to more robust solutions (e.g., MBPP/16, HumanEval/48).
• Problems involving complex data structures, such as nested dictionaries or tuples, benefited as tests helped ensure that the functions correctly handled various input formats and edge cases (e.g., MBPP/106, MBPP/299).
• Edge Case Handling: Many problems initially failed due to unhandled edge cases. Tests were instrumental in identifying these edge cases, prompting the system to refine its logic to handle all possible scenarios (e.g., MBPP/170, HumanEval/5).
• Problems requiring specific algorithmic approaches, such as finding the maximum difference or sorting criteria, benefited from tests that guided the system to correct logical flaws and align with the problem requirements (e.g., MBPP/63, HumanEval/127).

LLMs seem to be able to improve their understanding of the requirements presented in the problem statement using the provided tests. The tests also seem to have played a significant role in validating mathematical logic, ensuring the correct application of formulas to produce the results indicated by the tests. Tests also seem to prompt the system to refine its logic to accommodate a broader range of inputs and edge cases. Consistency in data types, especially in output formats, was another area where tests highlighted discrepancies. In some cases, adding tests led to simplified logic, replacing initial complex approaches presumably derived from misunderstanding the statement with more straightforward solutions.
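As a constructed illustration of the first category (not an actual benchmark task), a supplied test pins down the function name and return type that the prose statement leaves open:

def word_counts(text: str) -> dict:
    # Implementation written to match the name and return type fixed by the test below.
    counts: dict = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical supplied test: the prose statement "count how often each word occurs"
# does not say whether to return a dict, a list of pairs, or print the counts;
# the assert removes that ambiguity before any code is generated.
assert word_counts("a b a") == {"a": 2, "b": 1}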
4.3 Can failed tests help LLMs fix their mistakes?
The remediation loop's role as a corrective mechanism can also be observed, with further improvements noted in problem-solving. For MBPP and HumanEval, the loop seems to effectively refine the code, with an additional 2.8% and 3% of problems solved, respectively. This corresponds to correct code for 21 additional problems for MBPP, of which 11 were remediated on the first attempt, seven on the second, two on the third, and one on the fourth. For the HumanEval dataset, nine additional problems were solved correctly, of which seven were remediated on the first attempt and one each was remediated on the second and third attempts. It should be noted that in our evaluation pipeline, we allowed for up to five remediation attempts. Although a few cases utilized all five attempts, none of these were solved. Manually inspecting the data, we find the following kinds of cases that benefited from remediation:

• Edge Case Handling: Many problems initially failed due to inadequate handling of edge cases. Remediation loops allowed the system to incorporate specific checks for these cases, leading to correct code generation (e.g., MBPP/92, MBPP/113, MBPP/137).
• Issues related to data type handling and logical errors were effectively addressed through remediation loops (e.g., MBPP/115, HumanEval/22).
• Problems requiring precise mathematical computations saw significant improvements through remediation. Initial attempts often failed due to issues like integer vs. floating-point division, but remediation advice helped correct these errors (e.g., MBPP/300, HumanEval/77).
• Problems involving text manipulation, such as punctuation handling, benefited greatly from remediation loops. Initial attempts often failed due to improper handling of punctuation, but iterative refinements led to progressively better solutions (e.g., MBPP/7).
• Problems requiring multiple remediation attempts highlight the importance of iterative refinement. Errors with boundary checks, simple arithmetic corrections, or basic input validation were often resolved through iterative remediation. Each attempt built on the previous one, incorporating new insights and corrections until the correct solution was achieved (e.g., MBPP/160, HumanEval/57).

The general trajectory from initial to final remediation attempts involved a progressive refinement process. In most cases, the first remediation led to basic fixes and the handling of straightforward errors highlighted by the crashes. This often addressed the most obvious errors, such as missing imports or missing type conversions. When a second remediation attempt occurred, more detailed adjustments were involved, addressing deeper logical issues and additional edge cases. A few examples of issues that remediation attempts fixed are adjusting modulo operations to map to the correct character range, correcting base cases and loop logic in recursive functions, and handling edge cases like empty arrays or specific input formats.
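As a constructed illustration of the integer-versus-floating-point pattern mentioned above (not an actual benchmark solution), a first attempt and its remediated version might look like this:

# Before remediation: floor division silently truncates, so a supplied test such as
# assert average(1, 2) == 1.5 fails and its traceback is fed back to the agents.
def average(a, b):
    return (a + b) // 2

# After remediation: the advice points out the truncation, and the coder agent
# switches to true division, which satisfies the failing test.
def average(a, b):  # redefinition shown only to contrast the two attempts
    return (a + b) / 2

assert average(1, 2) == 1.5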
The remediation agent did not have access to the conversation history (to prevent the models' context length from being exceeded during the evaluation), which means the LLM could not review past fixes. There were cases where the remediation advice misled the generator, and the generator often failed to fully implement the advice, leading to repeated errors. We ignore cases like MBPP/462 and HumanEval/145, which occurred due to verifier- and environment-related issues, as these are limitations of the evaluator as currently implemented. An interesting question is what kinds of problems remain unsolved; manually inspecting them, we find the following:

• Misunderstanding core logic or requirements often led to persistent errors across multiple remediation attempts (e.g., MBPP/83, HumanEval/38, MBPP/138, HumanEval/83, MBPP/765).
Table 1: Impact of private tests on GPT-4-based TGen results, with the improvement offered by tests after validation highlighted
model's training data. This is a common concern for all evaluation benchmarks used with LLMs. To partially mitigate this, we curated an additional dataset from the competitive programming website CodeChef. While this does not completely eliminate the possibility of leakage, it provides a different source of problems that may be less likely to have been included in the model's training data. It is important to note that our study focuses on the incremental improvements achieved by incorporating tests, which we believe provides valuable insights into LLM capabilities regardless of the exact composition of their training data.

This study focused on human-written tests from established benchmarks or competitive programming websites to include a wide variety of tests. However, these tests might not be representative of those found in real-world projects. Future work should investigate the quality of the test suite and its impact on code generation. Further, our assumption that test runs serve as ground truths for the corresponding problems may not always hold, particularly for non-deterministic problems. We excluded such problems from our curated dataset and filtered out errors through manual inspection. Recent literature also suggests that the tests used in widely adopted benchmarks may not be robust enough for evaluating code generation. To address this, we used the refined versions of the HumanEval and MBPP datasets curated by EvalPlus [19], which include additional tests for more thorough evaluation. The performance of TDD with LLMs might vary with different models, architectures, or training data. Further, the inherent variability in LLM outputs can introduce inconsistencies. We employ a fixed seed value and temperature setting to improve the reproducibility of our results, but some degree of unpredictability remains.

8 CONCLUSION
In the rapidly evolving landscape of LLMs for code, where these models are being increasingly used for code generation and remediation, embracing TDD emerges as a strategic paradigm shift. TDD enables us to use tests both in the generation phase and for validation. By incorporating test cases and employing remediation loops, we are able to solve complex problems that the LLM cannot solve normally. Using GPT-4, we observe a significant increase in the correctness of the generated code. We find that this improvement is more pronounced for less performant models that can only solve a much lower fraction of problems initially. We also highlight the cases in which these approaches work well and the kinds of tasks that present-day LLMs struggle with. Our experiments with GPT-4 and Llama 3 show that merely providing the LLM with tests during generation improves correctness by 9.15 to 29.57%. Adding remediation loops results in an additional gain of 5.26 to 9.02% across popular function-level benchmark datasets after rigorous evaluation with private tests. We also find that the proposed approach improves correctness by 7.27% on our dataset of 1100 file-level generation tasks, where GPT-4 performs poorly and struggles with more difficult problems. Our findings validate the impact of tests, and we therefore advocate for the widespread adoption of test-driven methodologies to maximize the benefits of LLMs in code generation. While our results demonstrate the benefits of TDD in LLM-based code generation, future work should focus on quantifying the tradeoffs between the effort required to write tests and the resulting increase in accuracy. A user study could provide valuable insights into how this approach affects overall development time and code quality in real-world scenarios.

9 ACKNOWLEDGMENTS
This work was supported in part by the NSERC Discovery Grant. We would also like to thank the reviewers for their time and valuable feedback.

REFERENCES
[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
[2] Kent Beck. 2022. Test Driven Development: By Example. Addison-Wesley Professional.
[3] Vasili Braga. 2023. Decentralised autonomous society through large language models' based agents: a pathway to empower small communities. Journal of Engineering Sciences 3 (2023), 88–120.
[4] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. 2023. MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering (2023).
[5] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code generation with generated tests. arXiv preprint arXiv:2207.10397 (2022).
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[7] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132 (2024).
[8] Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. arXiv preprint arXiv:2310.11248 (2023).
[9] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
[10] Sarah Fakhoury, Markus Kuppe, Shuvendu K Lahiri, Tahina Ramananandro, and Nikhil Swamy. 2024. 3DGen: AI-assisted generation of provably correct binary format parsers. arXiv preprint arXiv:2404.10362 (2024).
[11] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K Lahiri. 2024. LLM-based test-driven interactive code generation: User study and empirical evaluation. arXiv preprint arXiv:2404.10100 (2024).
[12] Gordon Fraser, Matt Staats, Phil McMinn, Andrea Arcuri, and Frank Padberg. 2015. Does automated unit test generation really help software testers? A controlled empirical study. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 4 (2015), 1–49.
[13] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. NeurIPS (2021).
[14] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770 (2023).
[15] Shuvendu K Lahiri, Aaditya Naik, Georgios Sakkas, Piali Choudhury, Curtis von Veh, Madanlal Musuvathi, Jeevana Priya Inala, Chenglong Wang, and Jianfeng Gao. 2022. Interactive code generation via test-driven user-intent formalization. arXiv preprint arXiv:2208.05950 (2022).
[16] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 919–931.
[17] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097.
[18] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5315–5333.
[19] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2024).
[20] Michael R Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, and Patanamon Thongtanunam. 2024. Automatic programming: Large language models and beyond. arXiv preprint arXiv:2405.02213 (2024).
[21] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023. LEVER: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning. PMLR, 26106–26128.
[22] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
[23] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering (2023).
[24] Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: An autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366 (2023).
[25] Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Susmit Jha, Prem Devanbu, and Toufique Ahmed. 2024. Quality and trust in LLM-generated code. arXiv preprint arXiv:2402.02047 (2024).
[26] Yuvraj Virk, Premkumar Devanbu, and Toufique Ahmed. 2024. Enhancing trust in LLM-generated code summaries with calibrated confidence scores. arXiv preprint arXiv:2404.19318 (2024).
[27] Xin Wang, Xiao Liu, Pingyi Zhou, Qixia Liu, Jin Liu, Hao Wu, and Xiaohui Cui. 2022. Test-driven multi-task learning with functionally equivalent code transformation for neural code generation. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–6.
[28] Xin Wang, Yasheng Wang, Yao Wan, Fei Mi, Yitong Li, Pingyi Zhou, Jin Liu, Hao Wu, Xin Jiang, and Qun Liu. 2022. Compilable neural code generation with compiler feedback. arXiv preprint arXiv:2203.05132 (2022).
[29] Yixuan Weng, Minjun Zhu, Shizhu He, Kang Liu, and Jun Zhao. 2022. Large language models are reasoners with self-verification. arXiv preprint arXiv:2212.09561 (2022).
[30] Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570 (2023).
[31] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous program improvement. arXiv preprint arXiv:2404.05427 (2024).