
Large Language Models for Software Engineering:
Survey and Open Problems

arXiv:2310.03533v4 [cs.SE] 11 Nov 2023

Angela Fan, Generative AI Team, Meta Platforms Inc., New York, NY, USA
Beliz Gokkaya, PyTorch Team, Meta Platforms Inc., Menlo Park, CA, USA
Mark Harman, Instagram Product Foundation, Meta Platforms Inc., London, UK
Mitya Lyubarskiy, Developer Infrastructure, Meta Platforms Inc., London, UK
Shubho Sengupta, FAIR, Meta Platforms Inc., Menlo Park, CA, USA
Shin Yoo, School of Computing, KAIST, Daejeon, Korea
Jie M. Zhang, Department of Informatics, King's College London, London, UK

Abstract—This paper provides a survey of the emerging area of Large Language Models (LLMs) for Software Engineering (SE). It also sets out open research challenges for the application of LLMs to technical problems faced by software engineers. LLMs' emergent properties bring novelty and creativity with applications right across the spectrum of Software Engineering activities including coding, design, requirements, repair, refactoring, performance improvement, documentation and analytics. However, these very same emergent properties also pose significant technical challenges; we need techniques that can reliably weed out incorrect solutions, such as hallucinations. Our survey reveals the pivotal role that hybrid techniques (traditional SE plus LLMs) have to play in the development and deployment of reliable, efficient and effective LLM-based SE.

Index Terms—Automated Program Repair, Documentation generation, Generative AI, Genetic Improvement, Human-Computer Interaction, Large Language Models, Refactoring, Requirements engineering, Search Based Software Engineering (SBSE), Software Analytics, Software Engineering Education, Software Processes, Software Maintenance and Evolution, Software Testing.

I. INTRODUCTION

This paper surveys the recent developments, advances and empirical results on LLM-based SE; the application of Large Language Models (LLMs) to Software Engineering (SE) applications. We use the survey to highlight gaps in this rapidly developing, but as yet embryonic, research literature. Based on gaps in the literature and technical opportunities, we also identify open problems and challenges for the software engineering research community.

While any survey of such a rapidly expanding area can neither aspire nor claim to be comprehensive, we hope that this survey will provide a useful and relatively complete snapshot of the early universe of this exciting new subdiscipline of Software Engineering: LLM-based Software Engineering. Although the scientific and technical structure of the field is still emerging, it is already possible to identify trends, productive avenues for future research, and important technical challenges that need to be addressed.

In particular, we are already able to discern important connections to (and resonance with) existing trends and well-established approaches and subdisciplines within Software Engineering. Furthermore, although we find considerable grounds for optimism, there remain important technical challenges, which are likely to inform the research agenda for several years. Many authors have highlighted, both scientifically and anecdotally, that hallucination is a pervasive problem for LLMs [1] and also that it poses specific problems for LLM-based SE [2]. As with human intelligence, hallucination means that the LLM can create fictitious output. In the context of software engineering, it means that the engineering artefacts created could be incorrect, yet appear plausible; LLMs may introduce bugs.

However, unlike many other applications of LLMs, software engineers are typically blessed with automatable ground truth (software execution), against which most software engineering artefacts can be evaluated. Also, the software engineering research community has already devoted a great deal of time to producing automated and semi-automated techniques for checking the potentially incorrect results produced by humans. This means that, for the discipline and the research community, there is a great deal of experience and expertise on which to draw when tackling the challenges posed by issues like hallucination.

Clearly, automated testing techniques [3]–[5] will have a central role to play in ensuring correctness, just as they already do for human-engineered artefacts. When generating entirely new features and systems, automated test data generation suffers from the lack of an automatable oracle [6] (an automated technique for determining whether output behaviour is correct for a given input stimulus). Given LLMs' propensity to hallucinate, the Oracle Problem will remain highly relevant, and solutions to it will become all the more impactful [7]. However, some SE applications concern adaption, improvement and development of existing software systems, for which there is a readily-available automatable oracle: the functional behaviour of the original system.
In this paper, we call this the 'Automated Regression Oracle', an approach that has already proved advantageous in the field of Genetic Improvement [8]. The Automated Regression Oracle simply uses the existing version of the software system as a reference against which to benchmark output from any subsequent adaptions and changes.

Of course, there is a risk of 'baking in' functional incorrectness, since the Automated Regression Oracle cannot detect what the system should do, but only capture what it currently does. Therefore, the Automated Regression Oracle can test only for functional regressions, so it is best suited to use cases where the existing functionality is to be maintained; for example, non-functional improvements such as performance optimisation, and semantics-preserving refactoring.

The input provided to an LLM will be a natural focus of growing research, and we can expect a rapid development of the literature on prompt engineering and prompt optimisation [9]. In this survey, we highlight existing work and open challenges for prompt engineering with regard to several specific aspects of software engineering.

The output from an LLM need not be confined purely to code, but can also include other software engineering artefacts, such as requirements, test cases, design diagrams, and documentation. In general, the language-based nature of an LLM allows it to generate any linguistically-defined software engineering artefact.

We typically think of the software engineering artefact as the primary output of the LLM, but it is not the only output. The explanation provided with the primary output is also an important output of any LLM. Our survey highlights the need for much more research, not only into optimising prompt engineering (which focuses on the input to the LLM), but also into the optimisation of the explanations provided with the primary output.

LLMs are inherently nondeterministic: the same prompt produces different answers on different inference executions (unless the temperature is set to zero, which has often been found to be suboptimal over multiple executions) [10]. Furthermore, irrespective of the temperature setting, subtle changes in the prompt can lead to very different outputs [10]. As well as motivating 'prompt engineering' and output processing, this nondeterministic behaviour raises challenges for the scientific evaluation of LLM-based Software Engineering:

If results can vary each time we run the process, how can we determine whether a proposed technique achieves an advance over the state of the art?

This is a problem that has already been well studied in the context of Empirical Software Engineering [11] and Search Based Software Engineering (SBSE) [12]. In particular, SBSE bears many similarities to LLM-based Software Engineering, sharing with it the need to achieve robust scientific evaluation in the presence of noisy, non-deterministic, and incomplete results [13], [14]. There is, therefore, already a mature software engineering literature on just the kind of robust scientific evaluation techniques needed to cater for LLM-based scientific evaluation. For example, well-studied techniques, such as parametric and non-parametric inferential statistics, are now routinely used to provide robust scientific conclusions in the presence of highly non-deterministic algorithms in the SBSE discipline.

TABLE I
A (All) denotes all preprints that are categorised under cs (Computer Science). L (LLM) denotes preprints whose title or abstract includes "LLM", "Large Language Model", or "GPT". L ∩ S denotes preprints in the cs.SE or cs.PL category whose title or abstract includes the same keywords. Note that the year 2023 only includes data up to 27 July 2023.

Year   |A|      |L|    |L ∩ S|   |L|/|A| (%)   |L ∩ S|/|L| (%)
2007   2,238    0      0         0.00          0.00
2008   3,645    0      0         0.00          0.00
2009   4,873    0      0         0.00          0.00
2010   7,543    0      0         0.00          0.00
2011   9,114    0      0         0.00          0.00
2012   12,316   0      0         0.00          0.00
2013   14,933   0      0         0.00          0.00
2014   16,320   0      0         0.00          0.00
2015   18,818   0      0         0.00          0.00
2016   23,707   0      0         0.00          0.00
2017   30,746   0      0         0.00          0.00
2018   41,927   0      0         0.00          0.00
2019   55,325   36     0         0.00          0.00
2020   71,431   99     5         0.00          5.05
2021   77,520   192    13        0.25          6.77
2022   81,964   434    45        0.53          10.36
2023   52,547   1,665  181       3.17          10.87

In order to understand the growth trends within LLM-based Software Engineering, we performed a manual analysis of data on the number of publications on specific topics from arXiv. Table I contains the raw data (see footnote 1), which was manually extracted from the arXiv metadata dump made publicly available via Kaggle (https://www.kaggle.com/datasets/Cornell-University/arxiv), accessed on 27th July 2023. We first filtered out publications for which the classification code does not start with the cs prefix (i.e., Computer Science), resulting in column A.

To identify Computer Science papers that are relevant to LLMs, we filtered the publications into subcategories on artificial intelligence (cs.AI), machine learning (cs.LG), neural and evolutionary computation (cs.NE), software engineering (cs.SE), and programming language (cs.PL) using the queries "Large Language Model", "LLM", and "GPT" in either the title or the abstract (we manually excluded instances of overloaded acronyms such as GPT for General Planning Tool), resulting in column L. Finally, we used the same queries to identify LLM-based Software Engineering papers in software engineering (cs.SE) and programming language (cs.PL). These queries are inherently approximate, so we confine ourselves only to conclusions based on overall trends for which there is strong evidence, rather than specific details of the numbers observed. Nevertheless, we report the raw numbers observed to support replication by others.

Footnote 1: The numbers for 2023 are underestimated since the data was accessed in July 2023.
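The counting procedure just described can be reproduced, at least approximately, with a short script. The sketch below is illustrative only: it assumes the Kaggle snapshot has been downloaded locally as arxiv-metadata-oai-snapshot.json (one JSON record per line, with title, abstract, categories, and update_date fields), attributes each preprint to the year of its update_date, and does not reproduce the authors' manual exclusion of overloaded acronyms, so its counts will not exactly match Table I.

```python
import json
import re

# Keyword query used for columns L and L ∩ S; manual exclusion of overloaded
# acronyms (e.g., GPT = General Planning Tool) is not reproduced here.
KEYWORDS = re.compile(r"\blarge language models?\b|\bllm\b|\bgpt\b", re.IGNORECASE)
L_CATEGORIES = {"cs.AI", "cs.LG", "cs.NE", "cs.SE", "cs.PL"}
S_CATEGORIES = {"cs.SE", "cs.PL"}

counts = {}  # year -> [|A|, |L|, |L ∩ S|]

# Assumed local path to the Kaggle arXiv metadata dump (JSON lines).
with open("arxiv-metadata-oai-snapshot.json") as f:
    for line in f:
        paper = json.loads(line)
        cats = set(paper["categories"].split())
        if not any(c.startswith("cs.") for c in cats):
            continue                              # column A: Computer Science only
        year = int(paper["update_date"][:4])
        row = counts.setdefault(year, [0, 0, 0])
        row[0] += 1
        text = paper["title"] + " " + paper["abstract"]
        if cats & L_CATEGORIES and KEYWORDS.search(text):
            row[1] += 1                           # column L
            if cats & S_CATEGORIES:
                row[2] += 1                       # column L ∩ S

for year in sorted(counts):
    a, l, ls = counts[year]
    print(f"{year}: |A|={a:>7,}  |L|={l:>5,}  |L∩S|={ls:>4,}")
```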
[Figure 1: software development activities (requirement engineering, design & planning, code implementation, testing, maintenance, deployment) and research domains (code generation models, prompt engineering, hybrids of LLMs and other techniques, scientific evaluation, test generation, test adequacy, test minimisation, test output prediction, test flakiness, debugging and repair, performance improvement, clone detection and re-use, refactoring) mapped to the paper's sections: III Requirement Engineering & Design; IV Code Generation & Completion; V Software Testing; VI Maintenance, Evolution, & Deployment; VII Document Generation; VIII Software Analytics and Repository Mining; IX Human Computer Interaction; X Software Engineering Process; XI Software Engineering Education; plus crosscutting open research topics.]
Fig. 1. A mapping between software development activities, research domains, and the paper structure.

[Figure 2: log-scale plot of the number of arXiv preprints per year.]
Fig. 2. Trends in number of arXiv preprints. The blue line denotes the number of preprints categorised under "CS". The orange line denotes the number of preprints in AI (cs.AI), Machine Learning (cs.LG), Neural and Evolutionary Computing (cs.NE), Software Engineering (cs.SE), and Programming Language (cs.PL) whose title or abstract contains either "Large Language Model", "LLM", or "GPT". The green line denotes the number of preprints in the SE and PL categories whose title or abstract contains either "Large Language Model", "LLM", or "GPT".

[Figure 3: percentage of preprints per year.]
Fig. 3. Proportions of LLM papers and SE papers about LLMs. By "about LLMs", we mean that either the title or the abstract of a preprint contains "LLM", "Large Language Model", or "GPT". The blue line denotes the percentage of preprints about LLMs out of all preprints in the CS category. The orange line denotes the percentage of preprints about LLMs in the cs.SE and cs.PL categories out of all preprints about LLMs.

Figure 2 shows the growth in the number of arXiv-published papers on Computer Science (|A|, in blue), and on LLMs (|L|, in orange). Those papers specifically on Software Engineering and LLMs are depicted in green (|L ∩ S|). Given the rapid rise in overall publication volumes, we use a logarithmic scale for the vertical axis. Unsurprisingly, we see an overall rise in the number of CS publications. Also, given the recent upsurge in attention for LLMs, the exponential rise in the number of papers on LLMs is relatively unsurprising.

Perhaps more interesting is the rapid uptake of Software Engineering applications of LLMs, as revealed by the growth trend pictured in green on this figure. In order to examine this trend in more detail, we plot the proportion of LLM publications (L) to all CS publications (A) in blue, as well as the proportion of LLM-based software engineering publications (L ∩ S) to all LLM publications in orange, in Figure 3. As can be seen, the proportion of LLM papers on LLM-based Software Engineering has been rising dramatically since 2019. Already, more than 10% of all papers on LLMs are concerned with LLM-based Software Engineering.

As a result of this growth, we can expect many other surveys of LLM-based SE. The rapid expansion of the literature makes it unlikely that further comprehensive SE-wide studies will fit the space constraints of a single paper, but we can expect many specific comprehensive surveys of sub-areas of interest, and also Systematic Literature Reviews (SLRs) that tackle SE-wide crosscutting issues by asking specific research questions of the primary literature in the systematic review. Already, such SLRs are appearing. For example, Hou et al. [15] provided an excellent recent SLR covering 229 research papers from 2017 to 2023, reporting the SE tasks tackled, data collection and preprocessing techniques, and strategies for optimising LLM performance (such as prompt engineering).

The remainder of this paper is organised to follow the top-level software development activities and research domains as depicted in Figure 1.

II. PRELIMINARIES

A. Large Language Models

A Large Language Model (LLM) refers to an Artificial Intelligence (AI) model that has been trained on large amounts of data and is able to generate text in a human-like fashion. Table III provides a glossary of LLM terminology to make the paper self-contained.
LLMs are typically based on deep learning techniques, such as transformers, and have the capability to generate useful language output. As a result, they have been found capable of performing a wide range of language-related tasks, including text generation [16], answering questions [17], translation [18], summarization [19], and sentiment analysis [20].

Rumelhart et al. [21] introduced the concept of the Recurrent Neural Network (RNN), opening up the possibility of processing sequential data. The Long Short Term Memory (LSTM) architecture, an extension of the RNN architecture introduced by Hochreiter and Schmidhuber [22], significantly improved performance in many applications.

In 2017, Vaswani et al. [23] introduced the Transformer architecture, which captures word relationships with the self-attention mechanism. The Transformer architecture had a profound impact on language modelling and triggered an explosion of activity on LLMs.

In 2018, OpenAI released the Generative Pre-trained Transformer (GPT) model, followed by subsequent iterations (GPT-2, GPT-3, GPT-3.5, and GPT-4). With GPT-3 and 3.5, many observers noticed a significant step change in generative performance, thereby attracting a great deal of interest in GPT (and ChatGPT) in particular, and also in LLMs more generally. LLMs achieve this performance, in part, due to the large corpora on which they are trained: for example, GPT-3 is trained on 45TB of text data and has 175 billion parameters. Meta launched the open-sourced LLaMA in February 2023, which is trained on 1.4 trillion tokens, with a variety of model sizes ranging from 7 billion to 65 billion parameters [24].

B. Categories of Large Language Models

There are three categories of large language models:

1) Encoder-only model: also known as an autoencoder, consists of an encoder network but does not have a separate decoder network. It takes an input sequence and maps it to a lower-dimensional representation. The purpose of an autoencoder is to learn an encoding of the input data. Examples of encoder-only LLMs are BERT from Google, RoBERTa from Meta, and DeBERTa from Microsoft [1].

2) Encoder-decoder model: in addition to the encoder network, there is a decoder network that generates an output sequence by iteratively generating tokens or symbols based on the context vector and previously generated tokens. It can be adopted for tasks like machine translation or text generation. Examples of encoder-decoder LLMs are T5 from Google and BART from Meta [1].

3) Decoder-only model: unlike the previous two types of LLMs, decoder-only LLMs do not have an encoder component to process the input data, but only a decoder component that directly generates an output sequence based on a given context or input. Decoder-only models are often based on architectures such as autoregressive models, where the output is generated token-by-token. Each token generated by the decoder is conditioned on the previously generated tokens and the context. Popular examples of decoder-only models are the GPT (Generative Pre-trained Transformer) series developed by OpenAI, LLaMA from Meta, Claude from Anthropic, and PaLM from Google [1].

C. Large Language Models for Software Engineering

While LLMs have been widely applied to tasks involving natural languages, their application to software development tasks, involving programming languages, has also gained significant recent attention.

In 2021, OpenAI introduced CodeX, a fine-tuned descendant of GPT-3. CodeX is used by GitHub's Copilot, which provides users of Visual Studio Code, Visual Studio, Neovim, and JetBrains with code completion. The new version of Copilot, GitHub Copilot X (see footnote 2), is based on GPT-4. In February 2023, GitHub reported that, on average, 46% (see footnote 3) of the developers' code was written by Copilot [25]. For Java only, that number is 62%. Thomas Dohmke, CEO of GitHub, said in June 2023 that Copilot will write 80% of code "sooner than later" [26].

In 2022, DeepMind introduced AlphaCode [27], trained with 40B parameters on selected public GitHub repositories. In simulated evaluations, it achieved, on average, a ranking in the top 54% in competitions with more than 5,000 participants.

The most recent GPT model, GPT-4, also performs code generation. According to GPT-4's technical report [28], the zero-shot pass@1 accuracy of GPT-4 is 67% on HumanEval, an open-source dataset from OpenAI consisting of 164 programming problems.

On a benchmark of 100 LeetCode problems, GPT-4 has comparable performance with human developers [29]. On 24th August 2023, Meta released the open-sourced Code Llama [30], a state-of-the-art publicly available LLM for coding tasks.

Table II lists representative LLMs that are designed for code generation/completion based on natural language descriptions.

III. REQUIREMENTS ENGINEERING AND DESIGN

Requirements engineering is an important discipline in software engineering. It forms the fundamental link between the technical attributes of the systems software engineers build, and the purpose for which those systems are built. There is a mature literature, and a large research community, concerned specifically with problems associated with requirements engineering [31].

There has also been previous work on artificial intelligence approaches to support requirements engineering, notably in the form of computational search for requirements engineering [32]. However, hitherto, the discipline of requirements engineering has received less attention from the emerging literature on LLM-based software engineering.

Footnote 2: GitHub Copilot X is under technical preview at the time we accessed it, on July 17th 2023.
Footnote 3: In this paper, all percentages are reported with a precision of 2 significant digits.
TABLE II
Existing Large Language Models for Code Generation

Name | Release date | Produced by | Parameters | Open-sourced | Price | Supported languages | Type
CodeBERT | February 2020 | Microsoft | 125M | YES | free | 6 | Encoder-decoder
InCoder | April 2022 | Meta | 6.7B, 1.3B | YES | free | 30 | Decoder-only
AlphaCode | February 2022 | DeepMind | 300M, 1B, 3B, 9B, and 41B | NO | free | Python or C++ | Encoder-decoder
CodeX | August 2021 | OpenAI | 12B | NO | free | >11 | Decoder-only
Copilot | October 2021 | GitHub and OpenAI | 12B | NO | free for individual developers and organisations | >11 | Decoder-only
CodeT5 | Nov 2021 | Salesforce Research | 60M, 220M, 770M | YES | free | 6 | Encoder-decoder
CodeT5+ | May 2023 | Salesforce Research | 2B, 6B, 16B | YES | free | 9 | Encoder-decoder
PolyCoder | Oct 2022 | Carnegie Mellon University | 160M, 400M, 2.7B | YES | free | >11 | Decoder-only
CodeWhisperer | April 2023 | Amazon | unknown | NO | free for individual developers | 15 | unknown
WizardCoder | June 2023 | Microsoft | 15B | YES | free | unknown | Encoder-only
CodeGeeX | Sep 2022 | Tsinghua University et al. | 13B | YES | free | 23 | Decoder-only
CodeGen | March 2022 | Salesforce Research | 350M, 1B, 3B, 7B, 16B | YES | free | Python | Decoder-only
StarCoder | May 2023 | BigCode | 15B | YES | free | >80 | Encoder-only
phi-1 | June 2023 | Microsoft | 1.3B | NOT YET | free | Python | Decoder-only
Code Llama | August 2023 | Meta | 7B, 13B, 34B | YES | free | >7 | Decoder-only
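The 'Type' column above corresponds to the three architectural categories described in Section II-B. As a purely illustrative sketch, the snippet below loads one small, publicly available checkpoint of each kind via the Hugging Face transformers library; the specific model identifiers are assumptions standing in for the much larger models named in the text.

```python
# Illustrative only: small public checkpoints stand in for the larger models.
from transformers import pipeline

# Encoder-only (BERT-style): maps text into representations; here exercised
# through masked-token prediction.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Unit tests help developers catch [MASK] early.")[0]["token_str"])

# Encoder-decoder (T5-style): maps an input sequence to an output sequence.
seq2seq = pipeline("text2text-generation", model="t5-small")
print(seq2seq("translate English to German: The build is broken.")[0]["generated_text"])

# Decoder-only (GPT-style): autoregressively continues a prompt, token by token.
causal = pipeline("text-generation", model="gpt2")
print(causal("def fibonacci(n):", max_new_tokens=20)[0]["generated_text"])
```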

TABLE III
Key Terminology Related to Large Language Models

Chain of Thought (CoT): In the context of LLMs, chain of thought represents the logical flow of ideas and reasoning within the text generated by LLMs.
Encoder & Decoder: Encoders are components of LLMs that map any given input of a specific type (such as image, audio, text) into a latent vector space. Decoders perform the reversal: they can take an input from a latent vector space, and (re)construct data of the original type.
Few-shot learning: A machine learning technique that aims to train models to perform well on new tasks or classes with only a few new items of labelled training data. It is also known as in-context learning. With LLMs, few-shot examples are typically included in the prompt.
Fine-tuning: A process by which a model, trained on a large dataset or a related task, is further trained on a smaller or more specific dataset to improve its performance on the target task or domain.
Generative AI: A category of artificial intelligence that focuses on generating or creating new content, such as images, text, music, and videos.
Parameters: Parameters are the numerical values inside LLMs that are adjusted during training, and primarily include weights and biases. Weights dictate the strength of connections between neurons and serve as coefficients to the input values or activation thresholds for preceding neurons. Biases shift the weighted sum of inputs before this sum is fed into the activation function. The number of parameters is often used as a measure of the size of an LLM.
Prompt: The input provided to the LLM to stimulate the generation of a response.
Prompt engineering: The process of designing and optimising prompts to achieve desired outcomes.
ReAct: The ReAct (Reasoning and Acting) prompting framework allows LLMs to generate reasoning traces as well as actions specific to the given task. Once an LLM generates an action, it can be carried out externally, and the observation of the output of the action can be included in the next prompt, providing information to the LLM. This enables LLMs to use external tools.
Temperature: A parameter that affects the randomness of the generated content. A higher value (e.g., 1.0) yields more diverse and creative content, while a lower value (e.g., 0.2) yields more deterministic content.
Token: A token is the atomic unit with which an LLM represents its input and output. Tokens are enumerations, and can represent words, characters, subwords, or other segments of text and/or code.
Top-N, Pass@N: For many applications, LLMs will typically generate a number of candidate solutions in a ranking. Top-N metrics count the number of problems correctly solved by an LLM with an answer among its top N candidates. Similarly, Pass@N counts the number of programming questions for which a candidate program within the top N rank has passed the test case.
Zero-shot learning: A machine learning technique that enables models to generalize and make predictions on classes or tasks that were not seen during the training phase. There is no labelled data available for these new classes.
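The Pass@N / pass@k figures quoted throughout this survey are usually computed with the unbiased estimator popularised by the CodeX/HumanEval evaluation methodology: sample n candidates per problem, count the c candidates that pass all tests, and estimate pass@k = 1 − C(n−c, k)/C(n, k), averaged over problems. A minimal sketch follows; the sample numbers are made up for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem, where n candidates
    were sampled and c of them passed all of the problem's tests."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a passing candidate
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Made-up example: 200 samples for one problem, 14 of which pass the tests.
print(round(pass_at_k(200, 14, 1), 3))    # pass@1  -> 0.07
print(round(pass_at_k(200, 14, 10), 3))   # pass@10 -> roughly 0.52
```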

Zhang et al. [33] conducted a preliminary evaluation of A. Open Problems in LLMs for Requirement Engineering
ChatGPT’s zero-shot requirement retrieval performance on Unlike other software development activities, we did not
two requirements analysis tasks over four data sets. Although find much work on LLM-based requirements engineering or
these results are only preliminary, they provide optimism that on LLM-based design. Indeed, there was even evidence that
LLMs can be used as a support for efficient and effective practising engineers are reluctant to rely on LLMs for higher-
requirements engineering. Luo et al. [34] conducted prompt level design goals [36]. There is thus a great opportunity to
engineering with BERT for automatic requirement classifica- expand on this open field of research.
tion. Luitel et al. [35] focused on requirements completeness The majority of LLM applications are focused on tasks such
and used BERT to generate predictions for filling masked slots as code generation, testing, and repair. These tasks benefit
in requirements. from LLM’s capability to generate code. Nevertheless, LLMs
also have significant potential to support requirements engi-
neering activities, thanks to their natural language processing
capabilities.
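As a small, hypothetical illustration of that potential (not a technique taken from the studies above), the sketch below sends a single requirement to a chat LLM for zero-shot classification. It assumes the OpenAI Python client (v1 interface) with an API key in the environment; any chat-capable model could be substituted, and the prompt wording is an assumption.

```python
# Zero-shot requirement classification sketch; prompt and model are illustrative.
from openai import OpenAI

client = OpenAI()

REQUIREMENT = ("The system shall respond to any search query within two "
               "seconds under a load of 1,000 concurrent users.")

prompt = (
    "Classify the following software requirement as FUNCTIONAL or "
    "NON-FUNCTIONAL, and name the quality attribute it concerns, if any.\n\n"
    f"Requirement: {REQUIREMENT}\n"
    "Answer with the label first, then a one-sentence justification."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # favour repeatable labels over creative variation
)
print(response.choices[0].message.content)
```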

For example, traceability is a long-standing, cross-cutting concern in software engineering. In particular, identifying traceability links between requirements and other engineering artefacts, such as code and tests, is especially challenging because requirements are often written in natural language; a natural fit for LLMs.

IV. CODE GENERATION AND COMPLETION

Of all the Software Engineering application areas for LLMs, code completion is the area that has been most thoroughly explored hitherto. Even prior to the advent of LLMs, it was suggested that learning from existing code repositories is the key to successful and intelligent code completion [37]: pre-trained LLMs deliver on these early aspirations for code completion. While hallucination has been pointed out as a weakness of LLMs more generally, the specific task of code completion sidesteps hallucination problems by acting as a recommender system to the developer. The developer thus bears the responsibility to weed out any hallucinated LLM output before it leaks into the code base.

Of course, a high degree of hallucination would have made code completion recommender systems ineffective. The widespread and rapid adoption, and the positive results already reported for code completion, provide early indications that this has not happened. For example, Murali et al. [38] reported the experience of deploying CodeCompose, a code completion tool based on the InCoder LLM [39], at Meta. During 15 days, 4.5 million code completion suggestions were made by CodeCompose, with an acceptance rate from developers of 22%; the qualitative feedback was also highly favourable, with 92% of responses positive. Similarly, Peng et al. [40] reported that programmers could complete a non-trivial task (implementing an HTTP server in JavaScript) 56% faster when paired with GitHub Copilot, compared to a control group that did not receive any such support.

Many software engineers already appear to have decided that the benefits outweigh any necessary human filtration effort, with enthusiastic levels and rates of adoption already being reported. Once LLM-based code completion is fully adopted, there are expectations that programmers will spend more time reviewing rather than writing code [41].

A. Code Generation Models

Automated code generation has a long history, tracing its origins back to early visions of automated program synthesis [42], which have continued to develop and have generated impressive results [43].

From the pioneering work of Hindle et al. on the naturalness of software [44], we know that programmers write code (and languages enforce code writing styles) in ways that make code highly predictable. Furthermore, Barr et al. [45] found that 43% of commits to a large repository of Java projects could be reconstituted from existing code. They called this 'The Plastic Surgery Hypothesis', because of the way automated repair proceeds by scavenging for existing code to patch up issues elsewhere [46]. Their empirical study provided evidence for the efficacy of this scavenging approach, but also underlined the repetitive and predictable nature of software. In a larger repository (SourceForge), Gabel and Su [47] found that a programmer would have to write more than six lines of code in order to create a novel code fragment.

These research findings on code naturalness, reusability and predictability make it unsurprising that LLMs have been able to exploit that same predictable reusability to produce effective recommendations for code generation. These observations have underpinned the growth of generate-and-test approaches to repair and genetic improvement [8], [46]. The generate-and-test approach offers greater code transformation freedom (compared to more traditional correct-by-construction approaches [48]), precisely because the generated code may not preserve strict, mathematically-defined (and not always appropriate, nor useful) interpretations of correctness.

This freedom to explore a wider space of "semantic near neighbours" allows Genetic Improvement to find dramatic optimisations (see Section VI-C). The Genetic Improvement approach, nomenclature, and evaluation methodologies also provide a scientific framework within which to understand and evaluate LLM-based code generation. Both technologies share the 'generate-and-test' approach to program transformation and code generation, potentially making much of the existing work on genetic improvement directly applicable to LLM-based code generation.

In 2021, Chen et al. [49] introduced CodeX, a GPT language model fine-tuned on publicly available code from GitHub, and evaluated its Python code-writing capabilities. They released a new evaluation set called 'HumanEval' to measure functional correctness for synthesizing programs from docstrings, and found that CodeX outperformed GPT-3 and GPT-J when tackling these problems. Since then, there has been an explosion in research on LLM-based code generation, and the HumanEval dataset has been used in many subsequent studies.

In 2022, Li et al. [27] introduced AlphaCode, a system for code generation that creates novel solutions to competitive programming problems. They found that three key components were critical to achieving reliable performance:
1) An extensive programming dataset for training and evaluation.
2) Large and efficient-to-sample transformer-based architectures.
3) Large-scale model sampling to explore the search space, followed by behaviour-based filtering.
In simulated evaluations on programming competitions on the Codeforces platform, AlphaCode achieved, on average, a ranking in the top 54% in competitions with more than 5,000 participants.

Several papers also introduced code synthesis LLMs [50]–[53], based on large data sets with little pre-filtering of the training data. However, in 2023, Gunasekar et al. [54] reported that, by training with only a textbook-quality code corpus, LLMs with lower parameter counts could achieve performance comparable to much larger models.

They classified an existing Python code corpus with the GPT-4 model, by prompting it to determine the educational value of the given code for a student who wants to learn programming. Second, they used GPT-3.5 to create synthetic textbooks about Python. Specific code generation use cases have also been tackled, such as numerical algorithm code generation [55], and generation of code from behavioural descriptions [56]. More examples of existing LLMs for code generation, and the code generation leaderboard, can be found in Table II and Figure 4.

B. Prompt Engineering for Improved Code Generation

Prompt engineering has been extensively used as a way to improve code generation. For example, Li et al. [57] reported pass@1 improvements of between approximately 50% and 80% on CodeX, CodeGeeX, CodeGen, and InCoder on several benchmarks (MBPP for Python, MBJP for Java, and MBJSP for JavaScript). Döderlein et al. [58] reported prompt-engineered improvement of Copilot and CodeX success rates from approximately 1/4 to 3/4 on HumanEval and LeetCode. He and Vechev [59] used prompt engineering to improve the security of LLM-generated code, reporting an improvement in security from 59% (of cases considered) to 92%. White et al. [60] provided a catalogue of prompt engineering design patterns for various software engineering tasks, including code generation. Denny et al. [61] argued that prompt engineering is a useful learning activity that fosters software engineering students' computational thinking.

Other authors have considered ways to decompose prompt engineering into iterative and multiphase conversations with the LLM, moving it closer to Chain of Thought reasoning. For example, Li et al. [62], [63] reported an 18% increase in ChatGPT Pass@1 using a two-stage sketch-based approach, SkCoder, in which the LLM first creates a sketch and then subsequently implements it. Jiang et al. [64] and Zhang et al. [65] also sought to deploy Chain-of-Thought-style reasoning by prompting LLMs to reflect and self-edit.

Existing software engineering analysis techniques can also provide additional information for fine-tuning and prompt engineering. For example, Ahmed et al. [66] show how simple static analysis can be used in the prompt to improve the performance of code generation with few-shot learning.

Shin et al. [67] compared prompt engineering and fine-tuning with GPT-4 for code generation tasks, demonstrating that fine-tuning works better than prompt engineering.

C. Hybrids of LLMs and other Techniques

Throughout our survey of the literature, we found strong evidence that some of the most promising results can be achieved by hybridising: combining LLMs with other existing software engineering techniques. This section surveys work on hybrid LLMs for code generation.

Several authors have developed hybrids of LLMs combined with planning and search. For example, Zhang et al. [68], [69] reported improvements over baselines of between approximately 11% and 27%, while Zhang et al. [70] hybridised code generation with API search techniques.

Hybrid approaches have also used existing software engineering and/or AI techniques to select the best candidate from an LLM's top-n outputs. For example, Chen et al. [71] use test generation to choose candidates and reported an improvement of approximately 20% on five pre-trained LLMs; Inala et al. [72] use a neural-network-based ranker to predict code correctness and potential faults. Jain et al. [73] proposed Jigsaw, which post-processes the generated code based on program analysis and synthesis techniques.

Dong et al. [74] treated LLMs as agents, letting multiple LLMs play distinct roles in addressing code generation tasks collaboratively and interactively. They reported improvements of approximately 30%-47%.

D. Scientific Evaluation of LLM-based Code Generation

There is a pressing need for more thorough scientific evaluation. Many authors have anecdotally reported on cases where LLMs failed to generate correct, secure, and reliable code. Poldrack et al. [75] also highlight the need for substantial human validation. In this section, we survey the literature on the empirical evaluation of LLM-based code generation in terms of correctness, robustness, explainability, determinism, and security.

1) Correctness Evaluation: The GPT-4 Technical Report [28] evaluated the correctness of GPT-4's code generation on the HumanEval dataset, reporting a zero-shot accuracy of 67%, a modest improvement on the (earlier ChatGPT) results reported by Yetistiren et al. [76].

Borji [77] presented a rigorous, categorised and systematic analysis of LLM code generation failures for ChatGPT. Eleven categories of failures, including reasoning, factual errors, mathematics, coding, and bias, are presented and discussed in their work.

Figure 4 shows the leaderboard of code generation correctness in terms of pass@1 (i.e., the test pass rate for the top-1 code candidate) on the HumanEval dataset according to Papers With Code, a platform that highlights trending AI research and the code behind the methods and models (see footnote 4). The LLM models behind each method are shown in brackets. At the time of writing, the best code generation model, Reflexion [78], can generate correct code for over 90% of the generation tasks. However, these numbers and the relative rankings of different language models are inherently subject to change in such a rapidly developing field. For example, the figure given for correct code on HumanEval in the original GPT-4 Report [28] was only 67%, so the updated figure of 80% (at the time of writing, five months later) retrieved from the Papers With Code website presumably represents the evolution of GPT-4 since then.

Despite the promising results in the literature on code generation and completion, Din et al. [79] reported that the performance of code completion dropped by more than 50% on HumanEval when the context contains bugs.

Footnote 4: The actual leaderboard can be found at https://paperswithcode.com/sota/code-generation-on-humaneval/; the results in Figure 4 were accessed on 24th August 2023.
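Returning to the hybrid candidate-selection idea of Section IV-C (e.g., Chen et al. [71]), the following self-contained sketch shows the generate-and-test pattern: run several candidate implementations against a small test suite and keep the best-scoring one. The candidate strings, the `solution` entry-point name, and the toy tests are all illustrative assumptions, and real pipelines sandbox the execution of untrusted LLM output.

```python
# Toy generate-and-test selection: keep the candidate that passes the most tests.
# exec() on untrusted LLM output is unsafe outside a sandbox.

def run_tests(candidate_src, tests):
    """Return how many (args, expected) pairs the candidate passes."""
    namespace = {}
    try:
        exec(candidate_src, namespace)      # hallucinated code may not even compile
        func = namespace["solution"]        # assumed entry-point name
    except Exception:
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass                            # a crashing candidate scores nothing
    return passed

def select_best(candidates, tests):
    return max(candidates, key=lambda src: run_tests(src, tests))

# Two hand-written "LLM candidates" for an absolute-value task.
tests = [((3,), 3), ((-4,), 4), ((0,), 0)]
candidates = [
    "def solution(x):\n    return x if x > 0 else -x",
    "def solution(x):\n    return -x",      # plausible-looking but wrong
]
best = select_best(candidates, tests)
print(f"Selected candidate passes {run_tests(best, tests)}/{len(tests)} tests")
```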

[Figure 4: bar chart of pass@1 for code generation on HumanEval (0–100%), comparing LLM-based code generation methods: Reflexion (GPT-4), Parsel (GPT-4 + CodeT), MetaGPT, GPT-4, CODE-T (code-davinci-002), Code Llama (unnatural), WizardCoder, phi-1 1.3B, GPT-3.5 zero-shot, InstructCodeT5+ 16B, CodeGen, CodeT5+ 16B, Codex 12B, LLaMA 65B, and LaMDA 137B.]
Fig. 4. Code generation leaderboard for the HumanEval benchmark. These methods are either based on LLMs or are LLMs themselves.

2) Robustness Evaluation: LLM code generation robustness is the degree to which similar prompts elicit semantically and syntactically similar code generation. Treude [80] introduced GPTCOMPARE, a prototype tool for visually highlighting similarities and differences between LLM code outputs. Yan et al. [81] introduced COCO to test the robustness and consistency of LLM-based code generation systems.

3) Explainability Evaluation: One considerable advantage of LLMs, over previous machine learning techniques, is the way in which the code generation artefacts are accompanied by explanations. Such explanations have the potential to increase adoption, by providing additional confidence and faster understanding. More work is needed to evaluate and optimise the explanations that accompany generated code and other software engineering artefacts.

Initial evaluation by MacNeil et al. [82], on their interactive Web development e-book, suggested that a majority of students perceived LLM-generated code explanations to be helpful. Noever and Williams [83] also showed the potential for explanations to help human engineers, particularly where code is obfuscated or lacks sufficient existing documentation. In this way, the ability to produce insight and explanation may go beyond simply justifying the code generated by the LLM itself, and may become a valuable source of education and documentation (see Section XI).

Sun et al. [84] focus on users' explainability needs for generative AI in three software engineering use cases: code generation based on natural language description (with Copilot), translation between different programming languages (with Transcoder), and code autocompletion (with Copilot). Their investigation was conducted as 9 workshops with 43 software engineers and identified 11 categories of explainability needs in the context of Generative AI (GenAI) for code. It also proposed 4 types of features for generative AI: AI documentation, model uncertainty, attention distribution, and social transparency (i.e., making visible the socio-organizational factors that govern the use of AI).

Mohammadkhani et al. [85] used the attention mechanism to study CodeBERT and GraphCodeBERT on tasks including code documentation generation, code refinement, and code translation.

4) Determinism Evaluation: LLMs are nondeterministic. Ouyang et al. [10] empirically studied the non-determinism of ChatGPT in code generation, finding that over 60% of tasks had zero equal test output across different requests. Nevertheless, their study of the literature on LLM-based code generation demonstrates that only 21.1% of these papers consider the non-determinism threat in their experiments.

5) Security Evaluation: Hajipour et al. [86] proposed a few-shot prompting approach to detecting security vulnerabilities, reporting that their approach automatically finds thousands of security vulnerabilities in several models. Khoury et al. [87] found that the code generated by ChatGPT often fell well below even minimal standards of secure coding. Risse and Böhme [88] reported results indicating that vulnerability detection accuracy may be over-reported, due to the model overfitting to unrelated training set features.

In addition, Yetistiren et al. [76] presented a comprehensive evaluation of the performance of Copilot, CodeWhisperer, and ChatGPT, covering different aspects including code validity, code correctness, code security, and code reliability. Their results show a wide degree of divergence in performance, motivating the need for further research and investigation. For example, they found that 65%, 46%, and 31% of the programs generated by ChatGPT, Copilot, and CodeWhisperer (respectively) were correct.

6) Benchmarks: As with other scientific evaluations, software engineering evaluation relies on publicly available and representative benchmark suites. A number of these have already emerged and can support software engineering evaluation of LLM-based applications. The Papers With Code platform (see footnote 5) provides a summary of 15 benchmarks for evaluating code generation.

Footnote 5: https://paperswithcode.com/task/code-generation

Evaluations have often relied on small programming problems from programming courses [89], synthetically generated problem sets [90], and online judging platforms such as LeetCode [29], [65], [91]. Although the results reported naturally vary by LLM and training set, the overall conclusions of these evaluations indicate success rates of between 20% and 80%.

Nevertheless, existing code generation benchmarks tend to rely on test suites to automatically judge code correctness, which can be inadequate, leading to false judgements [92]. This highlights the need for more work on evaluation benchmarks that are specifically tailored to LLM-based code generation evaluation. Liu et al. [93] draw attention to the problem, showing how existing test suites can lead to high degrees of false positive conclusions (also a serious problem for online judge platforms [92]). To alleviate this problem, they propose EvalPlus, a code synthesis benchmarking framework that automatically generates test inputs and rigorously evaluates the functional correctness of LLM-generated code. Their evaluation of 14 popular LLMs (including GPT-4 and ChatGPT) demonstrated that, with the newly generated tests for HumanEval, the assessment of pass@k drops by up to 15%, averaged over the problems considered.

Jimenez et al. [94] introduced SWE-bench with the aim of evaluating LLMs on code generation problems in a realistic software engineering setting. SWE-bench contains 2,294 software engineering problems, drawn from real GitHub issues. The results suggest that Claude 2 and GPT-4 solve only 4.8% and 1.7% of the coding tasks, respectively.

E. Open Problems in Code Generation and Completion

Assessing the generated code remains a critical problem for LLM-based code generation and completion: while much work has already started applying existing software testing knowledge to this problem, we expect closer integration of automated testing techniques with code generation and completion techniques.

Fortunately, there is a large body of existing work on automated test data generation [3]–[5], much of which will have an important role to play in ensuring the correctness of the engineering artefacts generated by LLMs. A recurring theme of the challenges covered in this paper is that code execution provides precisely the 'ground truth' needed to filter hallucinated responses. It can also provide guidance as part of interactive reasoning/action ('ReAct') dialogue [95], both with and within LLMs.

Automated test data generation allows the software engineer to target the exploration of the most relevant regions of this run-time ground truth. This test-based targeting can help to filter, fine-tune, and optimise prompts, thereby minimising the problems posed by hallucination. LLMs also have considerable potential for automating the process of constructing effective and efficient software test suites.

Another important problem is how to efficiently fine-tune pre-trained LLMs so that they perform better for a specific programming language, codebase, or domain: this is especially important because training an LLM from scratch requires significant computational resources. For example, transfer learning has been proposed as a way to improve code completion performance when the volume of training examples for a specific programming language is inadequate [96].

The current focus of research is on the code produced by LLMs. However, the explanations produced by LLMs may turn out to be at least as important. One could imagine many scenarios in which an engineer would prefer to accept a (possibly) suboptimal software engineering artefact that comes with a compelling explanation, over a potentially more performant solution with a less compelling explanation. After all, engineers regularly make the same judgement call for human-designed engineering artefacts, so why would we expect it to be any different for those produced by machines? As with prompt engineering, which focuses on optimising the input to the LLM, explanation engineering is also likely to become an area of study in its own right.

V. SOFTWARE TESTING

Software testing is a well-established research discipline, the origins of which can be traced back to Turing's pioneering work in the late 1940s [97]. Much of the focus of this research has been on the automated generation of test suites, able to achieve high fault revelation potential at low computational cost [3]–[5]. This provides us with not only techniques able to weed out incorrect LLM-generated code, but also a mature baseline against which to compare novel LLM-based and hybrid techniques for test suite generation.

There is already a sufficiently large body of work to warrant a survey specifically on LLM-based Software Testing: Wang et al. [98] presented a survey of papers primarily on testing, but also including debugging and repair. They reported on 52 papers (33 published since 2022), of which approximately one-third concerned test-based LLM fine-tuning, while the remainder relied upon prompt engineering.

A. Generating New Tests Using LLMs

In this section, we review existing work on LLMs for test data generation, before highlighting open problems and challenges for the development of this emerging field. The tests generated may not be executable, because the LLM is not guaranteed to generate compilable code. Nie et al. [99] report that 29% of tests generated using TeCo are executable, while Yuan et al. [100] found that approximately one-quarter of the tests generated by ChatGPT were executable, rising to one-third with suitable prompt engineering.

Of those tests that do compile, several authors have reported on the code coverage achieved. For example, Bareiß et al. [101] reported an increase from the 10% achieved using Randoop [102] to 14% with CodeX. Hashtroudi et al. [103] reported a 50% increase in line coverage for the tests they generated by fine-tuning CodeT5. Siddiq et al. [104] reported 80% coverage on the HumanEval dataset using CodeX, but also found that neither of the studied LLMs could achieve more than 2% coverage on the EvoSuite SF110 dataset.

9
Hybrid approaches that combine existing test generation and Feng and Chen [114] demonstrated a replicability rate of
evaluation techniques, such as fuzz-based testing and search- 80% on bug reports with natural-language-defined steps to
based testing, with LLMs have already demonstrated promis- reproduce, using an LLM out of the box (ChatGPT) with
ing results. For example, Lemieux et al. [105] introduced Chain of Thought prompt engineering alone.
CODAMOSA, an algorithm that combines Search-Based Soft- Several authors have considered prompt engineering to im-
ware Testing (SBST) [5] and CodeX to generate high-coverage prove the results of test generation [115], [116]. For example,
test cases for programs under test. When SBST’s coverage Schafer et al. [116] proposed TESTPILOT, which re-prompts
improvements stall, CODAMOSA asks CodeX to provide with failing tests and associated error messages, achieving
example test cases for under-covered functions. This helps reported average statement coverage of 68%. Xie et al. [117]
SBST redirect its search to more useful areas of the search create prompts for test generation by parsing the project and
space. In an evaluation of 486 benchmarks, CODAMOSA creating an adaptive focal context that includes the focal
achieved significantly higher coverage compared to SBST method and its dependencies. They further used rule-based
and LLM-only baselines. Hu et al. [106] introduced Chat- repair to fix syntactic and simple compile errors in the tests.
Fuzz, which augments the widely studied fuzzer, AFL, with Although the outcomes of LLM-based testing may be un-
ChatGPT, in order to get more format-conforming mutants. In an evaluation of 12 target programs chosen from three benchmarks, ChatFuzz achieved higher branch coverage than AFL by 13%. Dakhel et al. [107] used mutation testing to help LLMs to generate tests. In particular, they augmented prompts for Codex and Llama-2-chat with surviving mutants. They report that their approach detects 28% more human-written faults. Xia et al. [108] recently demonstrated that LLMs can serve as a universal fuzzer for systems across different application domains and programming languages, including C/C++ compilers, JDK, SMT solvers, and even quantum computing systems.

Deng et al. [109] propose TitanFuzz, which uses LLMs (i.e., Codex) to generate valid input DL programs to test DL libraries. The results on PyTorch and TensorFlow reveal that TitanFuzz can achieve 30%/51% higher code coverage than state-of-the-art fuzzers. Later on, they further introduced FuzzGPT [110], which synthesizes unusual programs for fuzzing DL libraries. Their results indicated that CodeX and CodeGen could outperform TitanFuzz on PyTorch and TensorFlow when re-targeted for fuzz-based testing.

Li et al. [111] used a hybrid of differential testing and ChatGPT to elevate the latter's ability to generate failure-inducing test cases for buggy programs. They report a test effectiveness improvement from 29% to 78%.

A promising area for LLM-based test generation is GUI testing, because the manipulation of the application state via the GUI often requires a semantic understanding of both the user interface and the application domain. Sun et al. [112] described the user interface via text, asked ChatGPT which action it would like to perform next based on that text, and then converted the answer into an actual GUI interaction. This resulted in 32% higher activity coverage compared to the state-of-the-art.

One particularly important problem that is challenging for classical techniques is the construction of test cases from user reports. User reports are written in natural language. This has presented considerable challenges for existing techniques, but is ideally suited to LLMs. Kang et al. [113] introduced Libro, a few-shot learning failure reproduction technique that automatically generates tests from general bug reports, based on CodeX. Libro successfully reproduced approximately one third of the failures.

Since whether an LLM-generated test is correct cannot be known for certain, researchers have explored cross-reference or majority-vote methods [118], [119] to estimate the confidence of LLMs, based on the notion of ‘self-consistency’ [120]. For example, Libro, introduced by Kang et al. [113], uses CodeX to generate tests from bug reports that can reproduce failures. If multiple tests show similar failure behaviour, Libro estimates that the LLM is “confident” in its predictions. Furthermore, where there is partial oracle information, this can also be used to augment confidence estimates. Such partial oracle information is often available when the goal of the overall process is to improve on existing code. For example, when improving the efficiency of an existing test, automated partial oracle information can be gathered by observing whether the new test behaves similarly to the original (passing and failing in the same situations), and is also faster to execute.
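To make the self-consistency idea concrete, the following is a minimal sketch of majority-vote confidence estimation over several independently sampled LLM-generated reproduction tests. It is our own illustration, not the Libro implementation; `generate_test` and `run_test` are hypothetical stand-ins for an LLM call and a test executor.

```python
from collections import Counter

def estimate_confidence(bug_report, generate_test, run_test, n_samples=5):
    """Sample several candidate tests and use agreement of their observed
    failure signatures (self-consistency) as a proxy for confidence."""
    signatures = []
    for _ in range(n_samples):
        test_code = generate_test(bug_report)   # hypothetical LLM call
        outcome = run_test(test_code)           # e.g., exception type plus message prefix
        signatures.append(outcome.failure_signature)
    most_common, votes = Counter(signatures).most_common(1)[0]
    confidence = votes / n_samples              # fraction of samples that agree
    return most_common, confidence

# A confidence threshold can then gate which reproductions are shown to developers:
# signature, conf = estimate_confidence(report, generate_test, run_test)
# if conf >= 0.6: surface(signature)
```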
B. Test Adequacy Evaluation

Test effectiveness is typically measured in terms of ‘adequacy criteria’ [121], [122]. Since testing cannot exhaustively explore every possibility, adequacy criteria provide a form of lower bound on the effectiveness achieved by a suite of tests. Mutation testing is a widely-studied technique for assessing the adequacy of software test suites [123], [124], in which synthetic faults (called ‘mutants’) are deliberately injected in order to assess test adequacy. Mutation testing has been shown to provide more stringent adequacy criteria than other structural coverage-based criteria such as statement and branch coverage [125].

One of the challenging open problems for mutation testing is to generate mutants that faithfully model important classes of real-world faults. Khanfir et al. [126] used CodeBERT to generate developer-like mutants and found that their approach has better fault revelation ability than PiTest. Garg et al. [127] applied CodeBERT to generate mutants that faithfully capture vulnerabilities. Their evaluation found that 17% of the mutants fail the tests that are failed by 89% of the respective vulnerabilities. Brownlee [128] used GPT-3.5 to generate mutants for genetic improvement and observed that randomly sampled LLM-based edits compiled and passed unit tests more often compared to standard GI edits.
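As a reminder of the underlying idea, the self-contained toy example below illustrates standard mutation analysis (it is not tied to any of the tools above): a single hand-written mutant survives a weak test suite but is killed once a boundary test is added.

```python
def price_with_discount(total):
    return total * 0.9 if total > 100 else total      # original

def mutant(total):
    return total * 0.9 if total >= 100 else total     # '>' mutated to '>='

tests = [(50, 50), (200, 180.0)]                       # no boundary test
boundary_tests = tests + [(100, 100)]                  # adds the boundary case

def kills(suite):
    # The suite kills the mutant if some test passes on the original but fails on the mutant.
    return any(price_with_discount(x) == exp and mutant(x) != exp for x, exp in suite)

print(kills(tests))           # False: the mutant survives, exposing a weak suite
print(kills(boundary_tests))  # True: the boundary test kills the mutant
```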
C. Test Minimisation

Test minimisation improves the efficiency of software testing by removing redundant test cases. Pan et al. [129] applied CodeBERT, GraphCodeBERT, and UniXcoder to extract embeddings of test code to conduct test minimisation. Their approach achieves a 0.84 fault detection rate and runs much faster (26.73 minutes on average) than the baseline.

D. Test Output Prediction

Liu et al. [130] proposed CodeExecutor, a pre-trained Transformer model, to predict the program's whole execution trace. The purpose is to imitate real-world arbitrary program execution behaviour. Their evaluation compares CodeExecutor with CodeX, and shows that CodeExecutor significantly outperforms Codex in execution trace prediction (e.g., 76% vs. 13% output accuracy for the Tutorial dataset).

E. Test Flakiness

A test is flaky if it can pass on some occasions and fail on others without any apparent (tester-controllable) change in the execution context. Test flakiness is one of the most pressing and impactful problems that inhibit test effectiveness in industry [131]. LLMs have been used to predict flakiness with high accuracy (with a 73% F1 score [132], [133] and 97% accuracy [134] reported).

F. Open Problems in LLMs for Software Testing

There are many open problems in LLM-based software test data generation, most of which lie well within the grasp of existing software testing techniques. We can thus expect an exciting explosion in LLM-based software test generation in the coming years. This section outlines some directions for this research agenda.

1) Prompt Engineering: There are many aspects of a good software test that could be favoured by suitable prompt engineering. For example, we need to understand how to engineer prompts that (a sketch of one such prompt follows the list below):
• Predict and reduce generated test flakiness;
• Reveal likely faults, for example via training on historic fault data;
• Optimise the balance between mocking and integration testing;
• Make realistic data builders, mock objects, parameters and inputs;
• Elicit tests that are most likely to cover corner cases;
• Tailor test generation to focus on behaviour that is prevalent in production.
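The template below is purely illustrative (the field names and the specific guidance are our assumptions, not an established recipe); it shows how several of the concerns listed above could be encoded directly in a prompt.

```python
TEST_GEN_PROMPT = """You are generating a unit test for the function below.
Requirements:
- Use deterministic inputs and fixed seeds so the test is not flaky.
- Prefer realistic data builders over ad hoc literals; mock only external services.
- Include at least one corner case (empty input, boundary value, or historical fault pattern).
- Assert on observable behaviour that matters in production, not incidental implementation detail.

Function under test:
{focal_function}

Existing tests in this module (match their style):
{example_tests}

Write one pytest test function.
"""

def build_prompt(focal_function: str, example_tests: str) -> str:
    # Fill the template with the focal function and a few in-project examples.
    return TEST_GEN_PROMPT.format(focal_function=focal_function,
                                  example_tests=example_tests)
```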
2) Augmenting Existing Tests: Work on LLM-based test generation has focused on the automated generation of novel test suites. However, given the array of existing test generation techniques, there remains an important (and comparatively less well-studied) open problem of augmentation and regeneration based on existing test suites [135], [136]. Test augmentation and regeneration can exploit few-shot learning and/or can fine-tune (on an existing suite of test data and historical faults) to generate augmented test suites. More work is needed on LLMs for generating additional test assertions that capture corner cases, historical faults, and likely programmer errors, drawing on the training data available. Hybridisation between LLMs and existing automated test generation techniques is also a productive theme [105].

3) Test Correctness: Traditional software test generation has suffered from the Oracle Problem [6], i.e., it is inhibited by the lack of an automated oracle that determines whether a test outcome is correct. Two cases pertain to AI-generated tests:
1) The generated test passes on the current release: We might assume that the functionality is correctly tested and that the generated test thus acts as a regression test, against which future changes can be checked.
2) The generated test fails on the current release: We need to know whether the assertion is wrong or whether the generated test has found a bug.

Both cases can have pernicious consequences when they are not imbued with self-regulation. A test case that passes may merely reflect coincidental correctness [137], [138]. Worse, it might be the case that the code is, in fact, incorrect (and that the test is equally incorrect yet captures, and thereby enforces, the incorrect behaviour). In such cases, the generation of the test will tend to inhibit fault remediation, by failing on future fixes. This problem also affects LLM-generated test cases, and may be more pernicious in cases where such tests hallucinate oracle properties, baking these incorrect oracle assertions into the generated tests.

On the other hand, when a generated test case fails, this may indicate a bug. This bug revelation would denote a ‘win’ for LLM-based testing. However, should it turn out that the ratio of false positives to true positives is high, then the cost (e.g., in human assessment) may make the technique impractical, even when it does reveal true positive bugs [131]. More work is needed on self-assessment of confidence, and on self-checking for correctness, consistency, and robustness of generated tests. We need to develop techniques for automatically assessing, augmenting and filtering raw outcomes from the execution of LLM-based tests, before presenting the ‘test signal’ to the developer.
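A minimal sketch of such filtering, under the assumption that tests can be rerun and a confidence estimate queried, might classify raw outcomes before anything reaches the developer (the helper names here are hypothetical):

```python
def triage(generated_test, run, rerun_count=3, confidence=None):
    """Classify an LLM-generated test before surfacing it to developers."""
    outcomes = [run(generated_test) for _ in range(rerun_count)]
    if len(set(outcomes)) > 1:
        return "discard: flaky"                    # inconsistent verdicts across reruns
    if outcomes[0] == "pass":
        return "candidate regression test"         # may still be coincidentally correct
    if confidence is not None and confidence < 0.5:
        return "hold: likely hallucinated oracle"  # low self-consistency, needs review
    return "bug candidate: route to triage"
```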
The interaction between LLM hallucination and test correctness is an important topic in its own right. Since LLM-based code generation is generally driven by what is most likely, rather than what is most correct, hallucination poses threats to any questions of correctness. However, interestingly, Feldt et al. [139] reported a case of hallucination being helpful for software testing, because it may reveal discrepancies between the actual program semantics and the programmer's perception of the semantics. They suggested a form of conversational testing agents (i.e., any generated tests are filtered by the programmer via the conversation) to harness this capability without posing any threats to overall test correctness.
More work is also required on the scientific foundations on which evaluations of LLM-based software testing rest. More care and attention are clearly needed to heed the ‘best practice’ advice for scientific analysis and reporting from previous work on the foundations of Empirical and Search Based Software Engineering [11], [13], [14].

4) Mutation Testing: More work is needed to explore the adequacy achievable with LLM-based test generation, and also to use LLM-based techniques to support and enhance test adequacy investigation and assessment. LLMs can be fine-tuned on a fault model, and thereby used to suggest mutants that are highly coupled to real faults, and can thus be used to assess test adequacy.

VI. MAINTENANCE, EVOLUTION AND DEPLOYMENT

Software maintenance and evolution have been important topics of study for many decades. They are concerned with existing code bases from which we seek understanding and business logic extraction, and which we seek to re-engineer, repair and refactor. Maintenance problems such as these all reside within language-rich problem domains. It is therefore unsurprising that this area finds many applications of LLM-based techniques, as we review in this section.

A. Debugging

Kang et al. [140] studied GPT-3.5's fault localisation ability, and found that the LLM could often identify the faulty method on the first try. Wu et al. [141] presented a comprehensive investigation into the capability of GPT-3.5 and GPT-4 for fault localisation accuracy, stability, and explainability. The results demonstrate that GPT-4 achieves 47% higher fault localisation accuracy than the state-of-the-art, but the performance declines dramatically with a longer code context.

Feng and Chen [142] proposed AdbGPT, which reproduces Android bugs from bug reports through prompt engineering with ChatGPT. On a dataset of 88 bug reports, AdbGPT was able to successfully reproduce 81%, outperforming the baselines and ablations. Joshi et al. [143] focused on multilingual debugging and proposed RING, a prompt-based strategy that conceptualizes repair as localization, transformation, and candidate ranking.

To address the data leakage threat in fault localisation and program repair, Wu et al. [144] introduced ConDefects, with 1,254 Java bugs and 1,625 Python bugs that were produced between October 2021 and September 2023. Researchers are allowed to select code samples based on their creation period, thereby allowing them to evaluate the effectiveness of different LLMs according to their training data cut-off date. In addition, there has been work on predicting bug severity with LLMs [145].

B. Program Repair

Repairing bugs has been a topic of much interest for over a decade in the software engineering research community [146], [147], and has already found its way into initial industrial deployment [148].

Much of the work on automated repair has used the generate-and-test approach widely adopted in the field of Genetic Improvement and readily applicable to LLM-based techniques. As a result, LLMs are certain to have a positive impact on automated software repair, but there remain technical challenges in taming the hallucination problem and managing scalability, as we report in this section.

In order to achieve scalability, all generate-and-test approaches need to address the build time problem [149]. LLM-based repair is no exception; the propensity to hallucinate makes it all the more important that the test phase can be executed regularly. It is likely that using ReAct deployment models [95] will help to find efficient and effective engineering trade-offs. When ReAct is applied to repair, the overall approach would alternate between the ‘Reason’ phase (generating candidate fixes) and the ‘Action’ phase (evaluating fixes through testing, which involves the build problem).
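A sketch of that alternation (our own illustration, not any specific tool's implementation) is shown below; `llm_propose_patch`, `apply_patch`, `build`, and `run_tests` are assumed helpers.

```python
def react_repair(buggy_file, failing_tests, llm_propose_patch, apply_patch,
                 build, run_tests, max_rounds=5):
    """Alternate 'Reason' (propose a candidate fix) and 'Action' (build and test it)."""
    feedback = f"Tests failing: {failing_tests}"
    for _ in range(max_rounds):
        patch = llm_propose_patch(buggy_file, feedback)   # Reason: generate a candidate fix
        candidate = apply_patch(buggy_file, patch)
        if not build(candidate):                          # Action: the costly build step
            feedback = "Patch does not compile."
            continue
        result = run_tests(candidate)                     # Action: execute the test suite
        if result.all_passed:
            return patch                                  # plausible fix (still needs review)
        feedback = f"Still failing: {result.failing}"     # observation fed back to the LLM
    return None
```

Most of the cost in such a loop lies in the build-and-test ‘Action’ step, which is precisely the scalability issue discussed here.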
To address this issue, we can refer to the well-established literature on software repair [46], [150], grounded in over two decades of the development of search-based approaches to software engineering [12], [151]. This literature provides the research community with a firm foundation of experience and expertise, making it very well-placed to develop LLM-based generate-and-test approaches to the problem.

Recent work on repair has started to use neural AI models, such as the seminal work of Tufano et al. [152]. More recently, since 2022, there has been rapid development of an embryonic research literature on LLM-based repair. For example, Xia et al. [153] proposed AlphaRepair, which redefines the APR problem as a cloze (or infilling) task, where the LLMs are leveraged to directly fill in correct code based on the bi-directional context of the potentially buggy code portions. AlphaRepair also demonstrates for the first time that LLMs can outperform all prior APR techniques.
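The cloze formulation can be illustrated with a small sketch (a simplified illustration of the general idea, not AlphaRepair's actual prompt format): the suspicious line is replaced by an infill marker and the model completes it from the surrounding context.

```python
INFILL_MARKER = "<FILL_ME>"   # placeholder token; the exact marker is model-specific

def make_cloze_prompt(source_lines, suspicious_line_no):
    """Mask the suspected buggy line so the model fills it in from bidirectional context."""
    masked = list(source_lines)
    masked[suspicious_line_no] = INFILL_MARKER
    return "\n".join(masked)

buggy = [
    "def middle(a, b, c):",
    "    if b < c:",
    "        if a < b:",
    "            return b",
    "        elif a < c:",
    "            return b",   # suspicious line: should probably return a
    "    ...",
]
prompt = make_cloze_prompt(buggy, 5)
# candidate_lines = code_llm.infill(prompt)   # hypothetical call; each candidate is then
# validated against the failing tests, exactly as in generate-and-test repair.
```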
Xia et al. further conducted an empirical study [154] on nine LLMs across five datasets in three different languages. Their findings not only affirmed the superiority of LLM-based APR (especially the cloze-style approach) but also offered a number of practical guidelines. Wei et al. [155] synthesized a patch through the interaction between an LLM and a Completion Engine, and found that the approach surpasses the best-performing baseline by 14 and 16 bugs fixed.

Program repair naturally fits a conversational model of prompt engineering. Xia et al. [156] proposed conversational APR, which alternates between patch generation and validation in a conversational manner. Their evaluation on ten LLMs demonstrated that their approach is superior in both effectiveness and efficiency.

They further proposed ChatRepair [157], showing that the conversational approach fixes 162 out of 337 bugs for only $0.42 per bug, thereby also addressing potential concerns about the computational resources required. Chen et al. [158] introduced SELF-DEBUGGING, which teaches an LLM to debug its predicted code via few-shot learning; SELF-DEBUGGING reports baseline accuracy improvements of up to 12%.
Studies have also reported results for particular classes of bugs. For example, Pearce et al. [159] reported repair results from five commercial LLMs on security bugs; Charalambous et al. [160] combined ChatGPT with formal verification strategies to verify and automatically repair software vulnerabilities; and Cao et al. [161] report ChatGPT results for Deep Learning (DL) program repair.

Repair does not always start with an existing failing test case, but can start with a natural language description of a failure in production. Automation opens the door to faster responses to user-generated bug reports. This is a route to repair that has also been explored for LLMs in the work of Fakhoury et al. [162], who generated functionally correct code edits from natural language issue descriptions. They proposed Defects4J-Nl2fix, a dataset of 283 Java programs from the Defects4J dataset with high-level descriptions of bug fixes. The state-of-the-art LLMs evaluated on this benchmark achieve up to 21% Top-1 and 36% Top-5 accuracy.

Automated repair can also reduce the burden on engineers managing DevOps-style on-call for production systems. For example, Ahmed et al. [163] studied the use of LLM-based root-causing and remediation of 40,000 incidents on Microsoft cloud services. The authors evaluated multiple LLMs using semantic and lexical metrics in zero-shot, fine-tuned, and multitask settings, showing that fine-tuning significantly improves incident response effectiveness.

The ability to perform fine-tuning for a specific task or domain can significantly improve model performance in program repair. Jiang et al. [164] empirically evaluated the performance of 10 different Code Language Models (CLMs) on 4 fault benchmarks, and showed that repair-specific fine-tuning could significantly improve success rates. On average, the 10 CLMs already successfully repaired 72% more faults than state-of-the-art DL-based APR techniques; after fine-tuning, the number increased to 160%. Jin et al. [165] proposed InferFix, which contains an LLM (Codex Cushman) fine-tuned on supervised bug-fix data. InferFix achieves a 76% Top-1 repair accuracy on Java, and over 65% on C#, using the InferredBugs dataset. Berabi et al. [166] introduced TFix, a T5 model fine-tuned on bug-fixing data, reporting that it outperformed existing learning-based approaches. Xia et al. [167] combined LLM fine-tuning and prompting to automate the plastic surgery hypothesis and demonstrated that their approach fixes 89 and 44 bugs (outperforming the baseline by 15 and 8).

LLMs can also help to explain the patches that they generate. Kang et al. [168] proposed AutoSD to provide debugging explanations with LLMs to help developers judge the correctness of patches. They found that AutoSD produced comparable results to existing baselines, with high-quality repair explanations. Sobania [169] studied the capability of GPT-3.5 in explaining the patches generated by a search-based repair tool, ARJA-e, on 30 bugs from Defects4J; 84% of the LLM explanations were found to be correct.

C. Performance Improvement

Since the inception of computer programming, the paramount importance of performance optimisation has been recognised. Indeed, performance optimisation is even mentioned by Ada Lovelace in her nineteenth-century notes on the analytical engine [170]. Much initial practical deployment of optimisation took place in compiler development, through work on optimising compilers [171]. This is the bedrock on which current practical and efficient computation rests, but it is necessarily a one-size-fits-all approach; widely applicable due to its generality, yet suboptimal for bespoke problem domains for the same reason. There has, therefore, also been much work on specific source-to-source transformations to improve optimisation, dating back to the 1970s [172], [173].

For a long time, the focus of this work was on finding suitable sets of meaning-preserving transformations, the motivation being that a correct program can be transformed into a more efficient version of itself, while retaining its correctness. However, more recently, research on program synthesis took a different turn: inspired by Genetic Programming [174], and early results from Automated Program Repair [146], [175], it considered a wider set of transformations in an approach that has come to be known as ‘Genetic Improvement’ [8], [176].

The wider set of transformations may produce incorrect code, but automated testing can filter these out, to ensure sufficient faithfulness to the intended semantics. Furthermore, the freedom to treat existing code as a kind of ‘genetic material’ produced dramatic improvements in non-functional properties, such as execution time, memory and power consumption (e.g., a 70x speed up of a non-trivial gene sequencing system [177]).

Although the potential for artificial intelligence techniques, such as evolutionary algorithms, to improve performance has been well studied, researchers have only just begun to consider the potential for LLM-based performance improvement. In the work by Madaan et al. [178], the authors use CODEGEN and CodeX to suggest functionally correct, Performance-Improving Edits (PIEs), improving the execution time of Python and C++ (already pre-optimised with the maximally optimising compiler option -O3). Similarly, Garg et al. [179] proposed DeepDev-PERF, a performance improvement suggestion approach for C# applications. DeepDev-PERF took the English-pretrained BART-large model and further pretrained it on source code. Kang and Yoo [180] proposed the use of LLMs to suggest objective-specific mutation operators for genetic improvement, and provided demonstrations on improving efficiency and decreasing memory consumption.

Garg et al. [181] proposed RAPGen, which generates zero-shot prompts for LLMs to improve performance. The prompts are generated by retrieving a prompt instruction from a pre-constructed knowledge base of previous performance improvements. Chen et al. [182] used GPT models as baselines for their source code optimisation method, Supersonic, and found that Supersonic improves running time for 26.0% of the programs, compared to only 12.0% for GPT-3.5-Turbo and 4.0% for GPT-4.
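Tying this back to generate-and-test: a candidate performance-improving edit is only worth accepting if it both preserves observable behaviour on the available tests and is measurably faster. A minimal harness (our own sketch, not any of the cited systems) might look like this:

```python
import timeit

def accept_edit(original_fn, candidate_fn, test_inputs, min_speedup=1.1):
    """Accept an LLM-suggested rewrite only if it is output-equivalent on the
    available tests and at least `min_speedup` times faster on a micro-benchmark."""
    for x in test_inputs:
        if original_fn(x) != candidate_fn(x):          # functional faithfulness filter
            return False
    t_orig = timeit.timeit(lambda: [original_fn(x) for x in test_inputs], number=100)
    t_new = timeit.timeit(lambda: [candidate_fn(x) for x in test_inputs], number=100)
    return t_orig / t_new >= min_speedup

# Example: an LLM might propose replacing a quadratic duplicate check with a set-based one.
def has_duplicates_slow(xs):
    return any(x in xs[i + 1:] for i, x in enumerate(xs))

def has_duplicates_fast(xs):
    return len(set(xs)) != len(xs)

print(accept_edit(has_duplicates_slow, has_duplicates_fast,
                  [tuple(range(200)), (1, 2, 3, 1)]))
```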
Cummins et al. [183] focused on the performance of compilers and presented results on LLMs for optimising compiler instructions. Their results demonstrate that a relatively small (7B-parameter) LLM, trained to generate instruction counts and optimised LLVM compiler code, can generate 3% improvements in reducing compiler instruction counts, outperforming the state-of-the-art. Their results are also promising in terms of correctness, with 91% compilable and 70% functionally correct with respect to the original compiler output.

Fig. 5. The Widening Scope of Program Transformation: from ‘correct by construction’ transformations (e.g., peephole optimisation) in the 1970s, through ‘syntactically correct’ transformations (e.g., Genetic Improvement, Automated Program Repair) in the 2010s, to ‘unconstrained’ transformations (e.g., neural machine translation, Large Language Models) in the 2020s.

Over a period of some 50 years, the software engineering community has evolved its conception of what it means to transform an existing software system into an equivalent system that improves performance while retaining functional behaviour. In the 1970s, the strongest concern was correctness, so transformation palettes were defined to consist solely of transformation steps that were (functionally) correct by construction.

However, by 2010 the community was already exploring the application of considerably more relaxed notions of equivalence that merely retain sufficient operational faithfulness to the behaviour of the original. The tight semantic straitjacket of the 1970s was thereby considerably relaxed to allow transformations that might even fail some test cases. During the same period, operational performance became increasingly important. A key underlying principle of this research agenda is that no overall software system can be deemed functionally correct when it is executed on a system in which inefficiency has left insufficient remaining resources. This principle applies even in the (comparatively rare) cases where the software has been fully proven to be functionally correct. As the more pithy slogan has it:

“There is nothing correct about a flat battery” [8].

This evolution of the community's approach to code transformation and synthesis is depicted in Figure 5 (red and yellow regions).

In the context of this increasing relaxation of semantic constraints, we can view LLM-based code optimisation as a further development of this overall direction of travel: code optimised by LLMs may not even be syntactically correct, let alone semantically correct (depicted by the green region of Figure 5).

Despite these correctness challenges, inherent in LLM-based SE, there is a large pool of training data, and LLMs have a propensity to exhibit emergent behaviour. These observations combine to yield surprising results that, although not guaranteed to be correct, can potentially dramatically change performance characteristics in useful ways.

Of course, as we increasingly allow more permissive transformation palettes in the hope of optimising multiple non-functional properties, we simultaneously place far greater reliance upon the ability of testing to provide reassurance of functional faithfulness. Testing is also vital to check for regressions in those non-functional properties that are not targeted by the improvement process. As a result, software testing in general (and automated high-coverage test generation in particular) will become ever more important.
D. Clone Detection and Re-use

There has been much previous work on managed software reuse [184] in order to extract value and avoid duplication, a topic also tackled using LLMs [185]. Software typically contains large numbers of clones, arising from ad hoc re-use, resulting in much work on automated clone detection [186], a topic for which fuzz-based fine-tuned LLMs have also been applied [187].

E. Refactoring

When we refactor code, we generally expect its behaviour to remain unchanged. This is particularly attractive for automated approaches (such as search-based refactoring [188]) because it means that we can simply rely on the Automated Regression Oracle. This ‘automatable oracle for free’ advantage is significant and will also apply to LLM-based refactoring.

Poldrack et al. [75] show that GPT-4 refactoring of existing code can significantly improve code quality according to long-established structural metrics such as Halstead [189] and McCabe [190] complexity. Noever and Williams [83] emphasize the value of AI-driven code assistants in refactoring legacy code and simplifying the explanation or functionality of high-value repositories.
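In practice, exploiting the Automated Regression Oracle can be as simple as gating any LLM-proposed refactoring on the project's existing test suite. The sketch below is an illustration under the assumption that the project uses pytest; it is not a description of any cited tool.

```python
import shutil, subprocess, tempfile
from pathlib import Path

def apply_refactoring_if_safe(repo: Path, file_rel: str, refactored_source: str) -> bool:
    """Accept an LLM-proposed refactoring only if the existing regression suite still passes."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "work"
        shutil.copytree(repo, work)                      # refactor a scratch copy first
        (work / file_rel).write_text(refactored_source)
        result = subprocess.run(["python", "-m", "pytest", "-q"], cwd=work)
        if result.returncode != 0:
            return False                                  # behaviour changed: reject
    (repo / file_rel).write_text(refactored_source)       # regression oracle satisfied
    return True
```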
F. Open Problems in Maintenance and Evolution

Since so many of the subdomains of software maintenance and evolution are concerned with existing legacy system source code, we can expect rapid growth in the application of LLMs. This section outlines some existing open problems in this nascent sub-area of research.

1) Open Problems in Performance Improvement: Much more work is needed on the development of LLM-based techniques for automatically finding performance improvements. As with Genetic Improvement, these need not be confined merely to execution time, but can also consider other non-functional attributes such as power consumption [191]–[193] and memory footprint [194], as well as multi-objective trade-offs between sets of non-functional properties [195]. We expect more work on Genetic Improvement-style LLM-based code optimisation techniques, with the potential for many dramatic advances and breakthroughs.
2) Open Problems in Refactoring: By definition, refactoring does not change semantics, so LLM-based refactoring can rely on the Automated Regression Oracle. It is therefore surprising that there is not already more work on LLM-based refactoring. In this subsection, we outline possible directions.

Design patterns have played a critical role in practical software engineering for three decades [196]. LLMs may help engineers to refactor existing code to use design patterns, while providing developer-friendly explanations and documentation.

Refactoring also becomes necessary whenever new technologies emerge, for example, when an API is updated or a new API becomes available. Although API misuses can be (sometimes automatically [197]) repaired, they remain a common source of software engineering bugs. Automating the process of refactoring for new APIs is less challenging than other code transformations, because of the presence of the Automated Regression Oracle.

Finally, the few-shot learning capabilities of LLMs may enable more bespoke refactoring. The emergent work on LLM-based refactoring has focused on global refactoring according to well-known refactoring patterns. However, programmers often have project-specific refactoring requirements. Up to a third of software engineering effort is spent on largely repetitive, tedious, and potentially error-prone refactoring activities that implement these project-specific refactoring needs. The few-shot learning potential of LLMs may automatically generalise from specific examples, automating what we call ‘bespoke’ refactoring. More work is needed to develop techniques for reliable few-shot-learnt bespoke refactorings.
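A sketch of what such ‘bespoke’ few-shot refactoring could look like is given below; it is purely illustrative, and the example pairs and wording are assumptions rather than an evaluated technique. Project-specific before/after pairs are placed in the prompt and the model is asked to generalise them to new code.

```python
FEW_SHOT_REFACTORING_PROMPT = """Refactor code to follow this project's conventions,
generalising from the examples.

Example 1 (before):
    conn = open_db(); rows = conn.query(sql); conn.close()
Example 1 (after):
    with open_db() as conn:
        rows = conn.query(sql)

Example 2 (before):
    log("ERROR " + msg)
Example 2 (after):
    logger.error(msg)

Now refactor (before):
{code}
Refactored (after):
"""

def bespoke_refactor(code: str, llm_complete) -> str:
    # llm_complete is a hypothetical completion call; its output should still be
    # checked against the regression oracle before it is accepted.
    return llm_complete(FEW_SHOT_REFACTORING_PROMPT.format(code=code))
```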
VII. DOCUMENTATION GENERATION

Most of the work on LLM-based software engineering has focused on the generation of code, but there is also considerable potential for LLM-based documentation generation.

Sun et al. [198] explored how ChatGPT performs on code summarisation of Python code. They used CSN-Python and compared ChatGPT with NCS, CodeBERT, and CodeT5, adopting three widely-used metrics: BLEU, METEOR, and ROUGE-L. Surprisingly, the results show that ChatGPT's performance is significantly worse than the baseline models in terms of BLEU and ROUGE-L.

Ahmed et al. [66] conducted prompt engineering for code summarisation on GPT-3.5, while Geng et al. [199] performed experiments on two Java language datasets, Funcom and TLC, using Codex to generate multiple-intent comments. Gent et al. [200] demonstrate that pre-trained LLMs already have sufficient context to generate multiple different code summaries from different technical perspectives.

A. Open Problems in Documentation Generation and Code Summarization

Many existing code summarization techniques are retrieval-based: the given code is represented in a vector format using a neural representation, which is subsequently used to retrieve the most relevant textual summarization from the corpus. There is a clear limitation to this approach due to the fact that the set of summaries that can be generated is constrained by the training corpus. LLMs may enable automated code summarization that is not restricted to this training corpus, assisted by their natural language processing capabilities.

While this may result in richer and more semantically relevant summaries, we also note that existing evaluation metrics are often lexical in nature, hindering our ability to compare and evaluate richer summaries generated by LLMs [198]. Recent advances in ReAct-based approaches [95] may open up other avenues for greater assurance in the documentation generated, even when it cannot be executed.
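To see why lexical metrics can penalise richer summaries, consider a minimal ROUGE-L-style score based on longest common subsequence. This is a simplified illustration of the metric family, not the exact formulation used in the studies above.

```python
def lcs_len(a, b):
    # classic dynamic programme for longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

ref = "returns the maximum value in the list"
literal = "return maximum value of list"
richer = "scans the list once and reports its largest element, raising on empty input"
print(rouge_l_f1(ref, literal), rouge_l_f1(ref, richer))
# The semantically richer summary scores lower because it shares fewer exact tokens.
```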
VIII. SOFTWARE ANALYTICS AND REPOSITORY MINING

There is a well-established field of software analytics: how to yield insight for human engineers from existing software artefacts [201]. The large amount of software artefact information publicly available online has stimulated the growth of scientific insights gained by Mining Software Repositories (MSR) [202], [203]. While MSR tends to focus on scientific research insights from such mining, software analytics tends to focus on opportunities for organisations to gain insight from the analysis of their own repositories, which can also benefit AI understandability [204].

Hitherto, in both cases, much of the collection, curation and analysis of data has relied upon labour-intensive human analysis. We found no work on the use of LLMs to support this activity. Nevertheless, because many LLMs have already ingested this software artefact data, and are capable of providing reasoning and insight, it seems natural to expect them to play a significant role.

For example, LLMs may identify interesting new MSR research questions, based on their ability to ingest large amounts of data, including research questions and hypotheses that have previously proved interesting to researchers. They may also assist with traceability, which software engineers have great difficulty maintaining [205], [206].

IX. HUMAN COMPUTER INTERACTION

Finding productive interfaces between human engineers and software infrastructure has remained a recurring theme throughout the lifetime of the development of software engineering [207], [208], dating back to the inception of the discipline in the 1960s [209].

We found evidence of many interesting research questions. For example, Vaithilingam et al. [210] reported on the difficulties 24 participants had in understanding, editing, and debugging Copilot-generated code, while Feldt et al. [139] proposed a hierarchy of design architectures for LLM-based software testing agents. Liang et al. [36] surveyed 410 practising software engineers, finding widespread use of LLMs to facilitate low-level programming tasks, but also resistance to using LLMs for more design-level software engineering activities. Feng et al. [211] collected 316K tweets and 3.2K Reddit posts about ChatGPT's code generation to understand social media's attitudes toward AI-assisted coding tools.
They found that fear is the dominant emotion associated with ChatGPT's code generation, overshadowing other emotions such as happiness and surprise. Ahmad et al. [212] explored the way in which a novice software architect could interact with ChatGPT.

X. SOFTWARE ENGINEERING PROCESS

Software engineering concerns not only software products, but also the process by which they are constructed [213]. Previous research on software assistants [207], [214]–[217] is clearly of particular relevance to LLM-based software engineering, a topic some authors have already started to consider. For example, Ross et al. [218] introduced an LLM-based programmers' assistant, evaluating its deployment with 42 participants, while Tian et al. [219] highlighted the attention span limitations of ChatGPT.

XI. SOFTWARE ENGINEERING EDUCATION

Teachers have expressed concern at the difficulties of identifying cases where students have relied on LLMs to construct their assignments [220], while other authors have argued that the long-term impact of LLMs on education will be beneficial [221]. However, our present focus rests more narrowly on the specific impact of LLMs on the field of software engineering education, where the current literature focuses on LLM-based tutorial support.

For example, Jalil et al. [222] explored opportunities for (and issues with) ChatGPT in software testing education. Savelka et al. [223] analysed the effectiveness of three models in answering multiple-choice questions from introductory and intermediate programming courses at the postsecondary level. Several other authors [82], [83], [224] explored the capabilities of CodeX for generating programming exercises and code explanations. Their general finding was that the majority of the generated content was novel, sensible, and useful (see also Section IV-D3).

XII. CROSSCUTTING OPEN RESEARCH TOPICS

A number of patterns emerge from the embryonic literature on LLM-based software engineering. In this section, we outline those that raise open research questions that cut across all software engineering applications.

A. Building and Tuning LLMs for SE

Most of the previous work has treated LLMs as atomic components, with a focus on incorporating these into wider software engineering workflows. While there have been attempts to tailor the behaviour, these have tended to focus on prompt engineering, with a few examples of fine-tuning.

A more challenging but potentially impactful problem lies in training and fine-tuning models specifically for software engineering tasks. Ding et al. [225] trained a BERT-like LLM with execution inputs and dynamic execution traces. They show how this dynamic information improves (by up to 25%) the accuracy of the model for downstream software engineering predictive tasks: vulnerability and clone detection and coverage prediction (full execution path and branch coverage).

More work is needed on new forms of LLMs, specifically tailored for software engineering, that take advantage of software's unique properties and distinguish it from natural language. Dynamic information is one such key differentiator currently missing from most of the work. We expect the next generation of SE-specific LLMs to address this.

An important aspect of building and training LLMs is their energy consumption. LLM capabilities have been associated with their size [226], resulting in rapid growth of model size [227], [228]. The training and development of larger models may have a direct environmental impact [229]. While it has been suggested that model performance depends not only on model size but also on the volume of training data [230], the question of the right model size required to achieve the desired performance remains unclear.

Lighter models may also widen adoption, thereby leading to enhanced deployability. Recently, techniques such as low-rank adaptation (LoRA) [231] and model quantization [232] have shown potential, but they remain to be empirically evaluated with respect to specific applications.
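To give a flavour of why low-rank adaptation is lightweight, the sketch below is a minimal PyTorch illustration of the LoRA idea (not a production recipe): the original weight is frozen and only two small low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = base(x) + (alpha/r) * x A^T B^T, with the base layer frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)             # frozen pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} of {total}")         # a small fraction of the full layer
```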
B. The Need for Dynamic Adaptive Prompt Engineering and Parameter Tuning

Initial work on prompt engineering has demonstrated its potential to considerably improve the software engineering artefacts generated by LLMs. However, as has already been found [58], the results are highly problem-specific, so a one-size-fits-all approach is unrealistic. Furthermore, very few papers report model parameter settings, yet we know that many of these, such as the temperature setting, play a crucial role in determining the nature of the generated LLM output.

As an immediate starting point, it is imperative that authors make a point of conspicuously reporting these parameter settings to support replication. However, we also need more research on dynamic adaptive prompt engineering and model parameter tuning. This research agenda may draw inspiration from existing work on parameter tuning for other dynamic adaptive tasks, such as fuzzing [233]. Dynamic prompt optimisation may also exploit techniques associated with SBSE [12], reformulating prompt optimisation as a multi-objective computational search process.
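A sketch of what such tuning could look like is given below; the configuration space, scoring function and LLM call are assumptions, not an evaluated method. It treats the prompt template and temperature as a search space and evaluates each configuration on a held-out set of tasks.

```python
import itertools
import random

def tune_configuration(prompt_templates, temperatures, tasks, run_llm, score, budget=20):
    """Random search over (template, temperature) pairs, scored on validation tasks.
    `run_llm(template, temperature, task)` and `score(task, output)` are assumed callables;
    a multi-objective variant would instead return a Pareto front over, e.g., quality and cost."""
    space = list(itertools.product(prompt_templates, temperatures))
    best, best_score = None, float("-inf")
    for template, temp in random.sample(space, min(budget, len(space))):
        total = sum(score(task, run_llm(template, temp, task)) for task in tasks)
        if total > best_score:
            best, best_score = (template, temp), total
    return best, best_score
```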

C. Hybridisation

LLMs are seldom most effective when used in isolation, but can be highly effective as part of an overall SE process. More work is needed to understand the design patterns for SE workflows in which LLMs can safely, efficiently and effectively reside. We believe that existing SE theory and practice associated with generate-and-test approaches, such as Automated Repair and Genetic Improvement, are already highly amenable to LLMs.

We expect to see much more work incorporating LLMs into these existing software engineering frameworks. However, more work is required to tailor and extend these frameworks, to best take advantage of the opportunities offered by LLM-based software engineering.
In particular, we expect to see rapid development of work on static and dynamic analyses for prompt engineering and post-processing of LLM responses. We also expect to see hybrid software engineering processes, adapting Continuous Integration pipelines to incorporate LLMs.

D. Harnessing Hallucination

While hallucination has widely been regarded as a problem, as reported in this survey, it may also prove beneficial when applied to software engineering domains. LLM hallucinations are seldom entirely random incorrect responses. Rather, because of their inherent statistical properties, they would be better characterised as ‘plausible futures’, and this may often make them useful when set in the right context.

Hallucination can be repurposed to provide potentially useful suggestions for software enhancement. For example, when hallucinating a test case, the LLM may be repurposed to suggest new features, while a hallucinated code summarisation might indicate potential for (human) code misunderstanding; if the LLM ‘misunderstood’ the code, might not a human also misunderstand it? When the LLM hallucinates a non-existent API, it may be repurposed as a way to suggest refactoring to simplify or extend existing APIs. More work is needed to exploit this positive potential, and to harness hallucination for software improvement.

E. Robust, Reliable, and Stable Evaluation

Hort et al. [234] conducted a review of 293 papers on LLMs for code generation, to determine the degree to which sufficient information was shared to support replication. They found that only 33% shared source code and 27% shared trained artefacts. They also evaluated the papers from the perspective of energy consumption, assessing the degree to which it was possible for an independent researcher to assess the energy consumed during training. They report that approximately 38% (30 out of 79 publications that involved model training) shared sufficient information to estimate their energy consumption during training.

Further evidence that there may be a growing issue with scientific evaluation quality in the literature on LLM-based Software Engineering can be found in the survey of LLM-based Testing by Wang et al. [98]. In their survey, they filtered an initial pool of papers on LLM-based Testing to remove those that did not meet standard evaluation quality constraints. These constraints required papers to include a clear, defined, repeatable evaluation methodology that includes a control or baseline against which to measure effectiveness. This filtration criterion removed more than 90% of the papers that initially met the keyword search criteria.

As these analyses of the literature demonstrate, more work is clearly needed to establish firm scientific foundations for the emerging discipline of LLM-based Software Engineering. Such foundations may draw on existing foundations for Empirical Software Engineering in general and, more specifically, on AI-based Software Engineering, such as SBSE (where there is a natural similarity [105], [235]).

Nevertheless, LLMs have their own unique properties, such as the ability to provide explanations, which will require domain-specific theoretical and empirical scientific foundations.

LLMs inherently exhibit non-deterministic behaviour. Researchers need to carefully design their experiments, configure their LLMs (e.g., evaluating the effects of different distribution sampling strategies), and take non-determinism into account when drawing their conclusions about LLMs. The SBSE literature provides advice on the inferential statistics required to support such evaluation [13], [14].
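For example, following SBSE practice, one might run each configuration many times and report an effect size such as Vargha–Delaney Â12 alongside a statistical test. A minimal sketch of the effect size computation (the standard formula, our own implementation, applied to purely illustrative numbers) is:

```python
def vargha_delaney_a12(xs, ys):
    """Probability that a random run from xs beats a random run from ys
    (0.5 means no difference; ties count half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

# Illustrative branch coverage from 10 repeated runs of two LLM configurations.
coverage_a = [0.71, 0.74, 0.69, 0.73, 0.72, 0.70, 0.75, 0.74, 0.71, 0.73]
coverage_b = [0.68, 0.70, 0.69, 0.67, 0.71, 0.69, 0.70, 0.68, 0.72, 0.69]
print(vargha_delaney_a12(coverage_a, coverage_b))   # > 0.5 suggests A tends to outperform B
```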
We will witness rapid growth in the number and diversity of language models for software engineers in the coming years. Both researchers and practising software engineers will need reliable, efficient and comprehensive benchmarking systems. Benchmarking platforms such as TESTPILOT [116] and platforms such as Papers With Code (https://paperswithcode.com/sota/code-generation-on-humaneval/) will become increasingly important.

As well as generic scientific foundations, benchmarks and evaluation platforms, we also expect to see longitudinal studies of developer behaviour when programming with LLM assistance, so that we can understand the programmer-LLM interaction better and design more effective use case scenarios.

F. Thorough Testing

The problem of hallucination has already been widely studied. It will continue to be a topic of great interest, both within the software engineering community and in the wider computer science community. While it is likely that great progress will be made, the inherent risk of hallucination is unlikely to be completely eradicated, since it is as germane to LLM technology as it is to human intelligence. Fortunately, over more than six decades, software engineers have developed robust automated verification and testing technologies that help to reduce the impact of human mistakes. We expect that such technologies will also carry over to artificial intelligence mistakes.

G. Handling Longer Textual Inputs

The performance of LLMs on large-sized input prompts is likely to be a topic of great interest in the artificial intelligence community [236]. Advances in this area will have a strong impact on software engineering, because of the considerable size of software systems and the consequent opportunities that open up when larger prompts can be handled effectively.

H. Less Well-covered Subdomains of Software Engineering

As our survey reveals, some subdomains of software engineering are notably under-represented in the literature; some surprisingly so. For example, Requirements Engineering and Design (Section III) and Refactoring (Section VI-E) enjoy very little coverage, yet they are surely ripe for consideration, since they rely heavily upon linguistic forms of analysis and the recognition and prediction of patterns.
REFERENCES [20] K. Kheiri and H. Karimi, “Sentimentgpt: Exploiting gpt for advanced
sentiment analysis and its departure from current machine learning,”
[1] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, B. Yin, and X. Hu, 2023, arXiv:2307.10234.
“Harnessing the Power of LLMs in Practice: A Survey on ChatGPT [21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning repre-
and Beyond,” Apr. 2023, arXiv:2304.13712. sentations by back-propagating errors,” nature, vol. 323, no. 6088, pp.
[2] W. Ma, S. Liu, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, and 533–536, 1986.
Y. Liu, “The scope of ChatGPT in software engineering: A thorough [22] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
investigation,” 2023, arXiv:2305.12138. computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[3] S. Anand, A. Bertolino, E. Burke, T. Y. Chen, J. Clark, M. B. Cohen, [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
W. Grieskamp, M. Harman, M. J. Harrold, J. Li, P. McMinn, and Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017,
H. Zhu, “An orchestrated survey of methodologies for automated soft- arXiv:1706.03762.
ware test case generation,” Journal of Systems and Software, vol. 86,
[24] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
no. 8, pp. 1978–2001, August 2013.
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez,
[4] C. Cadar and K. Sen, “Symbolic execution for software testing: Three
A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient
decades later,” Communications of the ACM, vol. 56, no. 2, pp. 82–90,
foundation language models,” 2023, arXiv:2302.13971.
Feb. 2013.
[5] M. Harman, Y. Jia, and Y. Zhang, “Achievements, open problems and [25] S. Zhao, “Github Copilot now has a better AI model
challenges for search based software testing (keynote paper),” in 8th and new capabilities.” [Online]. Available: https://github.blog/
IEEE International Conference on Software Testing, Verification and 2023-02-14-github-copilot-now-has-a-better-ai-model-and-new-capabilities/
Validation (ICST 2015), Graz, Austria, April 2015. [26] “GitHub CEO says Copilot will write 80% of code sooner than
[6] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The later,” https://www.freethink.com/robots-ai/github-copilot, accessed:
oracle problem in software testing: A survey,” IEEE Transactions on 2023-07-27.
Software Engineering, vol. 41, no. 5, pp. 507–525, May 2015. [27] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond,
[7] J. Shin, H. Hemmati, M. Wei, and S. Wang, “Assessing evaluation met- T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy,
rics for neural test oracle generation,” arXiv preprint arXiv:2310.07856, C. d. M. d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl,
2023. S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson,
[8] J. Petke, S. O. Haraldsson, M. Harman, W. B. Langdon, D. R. White, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-
and J. R. Woodward, “Genetic improvement of software: a comprehen- Level Code Generation with AlphaCode,” Science, vol. 378, no. 6624,
sive survey,” IEEE Transactions on Evolutionary Computation, vol. 22, pp. 1092–1097, Dec. 2022, arXiv:2203.07814.
no. 3, pp. 415–432, Jun. 2018. [28] OpenAI, “GPT-4 Technical report,” 2023, arXiv:2303.08774.
[9] A. Cantino, “Prompt engineering tips and tricks with GPT-3,” 2021. [29] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz,
[Online]. Available: https://blog.andrewcantino.com/blog/2021/04/21/ E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi,
prompt-engineering-tips-and-tricks/ M. T. Ribeiro, and Y. Zhang, “Sparks of Artificial General Intelligence:
[10] S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “Llm is like a Early experiments with GPT-4,” Apr. 2023, arXiv:2303.12712.
box of chocolates: the non-determinism of chatgpt in code generation,” [30] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan,
arXiv preprint arXiv:2308.02828, 2023. Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov,
[11] S. Easterbrook, “Empirical research methods for software engineering,” J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez,
in Proceedings of the 22nd IEEE/ACM International Conference on J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and
Automated Software Engineering, ser. ASE ’07. New York, NY, USA: G. Synnaeve, “Code llama: Open foundation models for code,” 2023,
Association for Computing Machinery, 2007, p. 574. arXiv:2308.12950.
[12] M. Harman, A. Mansouri, and Y. Zhang, “Search based software [31] C. J. Neill and P. A. Laplante, “Requirements engineering: the state of
engineering: Trends, techniques and applications,” ACM Computing the practice,” IEEE software, vol. 20, no. 6, pp. 40–45, 2003.
Surveys, vol. 45, no. 1, pp. 11:1–11:61, November 2012. [32] Y. Zhang, A. Finkelstein, and M. Harman, “Search based requirements
[13] A. Arcuri and L. Briand, “A practical guide for using statistical tests optimisation: Existing work and challenges,” in International Working
to assess randomized algorithms in software engineering,” in 33rd Conference on Requirements Engineering: Foundation for Software
International Conference on Software Engineering (ICSE’11). New Quality (REFSQ’08), vol. 5025. Montpellier, France: Springer LNCS,
York, NY, USA: ACM, 2011, pp. 1–10. 2008, pp. 88–94.
[14] M. Harman, P. McMinn, J. Souza, and S. Yoo, “Search based software [33] J. Zhang, Y. Chen, N. Niu, and C. Liu, “A Preliminary Evalua-
engineering: Techniques, taxonomy, tutorial,” in Empirical software tion of ChatGPT in Requirements Information Retrieval,” Apr. 2023,
engineering and verification: LASER 2009-2010, B. Meyer and M. Nor- arXiv:2304.12562.
dio, Eds. Springer, 2012, pp. 1–59, LNCS 7007.
[34] X. Luo, Y. Xue, Z. Xing, and J. Sun, “Prcbert: Prompt learning for
[15] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo,
requirement classification using bert-based pretrained language mod-
D. Lo, J. Grundy, and H. Wang, “Large language models for
els,” in Proceedings of the 37th IEEE/ACM International Conference
software engineering: A systematic literature review,” arXiv preprint
on Automated Software Engineering, 2022, pp. 1–13.
arXiv:2308.10620, 2023.
[16] T. Goyal, J. J. Li, and G. Durrett, “News summarization and evaluation [35] D. Luitel, S. Hassani, and M. Sabetzadeh, “Improving requirements
in the era of gpt-3,” 2023, arXiv:2209.12356. completeness: Automated assistance through large language models,”
[17] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, arXiv preprint arXiv:2308.03784, 2023.
S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, [36] “A large-scale survey on the usability of ai programming assistants:
G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman, Successes and challenges,” in 46th International Conference on Soft-
“Webgpt: Browser-assisted question-answering with human feedback,” ware Engineering (ICSE 2024), April 2024, to appear.
2022, arXiv:2112.09332. [37] M. Bruch, M. Monperrus, and M. Mezini, “Learning from examples
[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari- to improve code completion systems,” in Proceedings of the 7th Joint
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, Meeting of the European Software Engineering Conference and the
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, ACM SIGSOFT Symposium on The Foundations of Software Engi-
D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, neering, ser. ESEC/FSE ’09. New York, NY, USA: Association for
S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, Computing Machinery, 2009, p. 213–222.
I. Sutskever, and D. Amodei, “Language models are few-shot learners,” [38] V. Murali, C. Maddila, I. Ahmad, M. Bolin, D. Cheng, N. Ghor-
in Advances in Neural Information Processing Systems, H. Larochelle, bani, R. Fernandez, and N. Nagappan, “Codecompose: A large-
M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran scale industrial deployment of ai-assisted code authoring,” 2023,
Associates, Inc., 2020, pp. 1877–1901. arXiv:2305.12050.
[19] Q. Xie, Z. Luo, B. Wang, and S. Ananiadou, “A survey for biomedical [39] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong,
text summarization: From pre-trained to large language models,” 2023, W. tau Yih, L. Zettlemoyer, and M. Lewis, “Incoder: A generative
arXiv:2304.08763. model for code infilling and synthesis,” 2023, arXiv:2204.05999.
[40] S. Peng, E. Kalliamvakou, P. Cihon, and M. Demirer, “The impact of [63] J. Li, G. Li, Y. Li, and Z. Jin, “Enabling programming thinking in large
AI on developer productivity: Evidence from GitHub Copilot,” 2023, language models toward code generation,” 2023, arXiv:2305.06599.
arXiv:2302.06590. [64] S. Jiang, Y. Wang, and Y. Wang, “Selfevolve: A code evolution
[41] C. Bird, D. Ford, T. Zimmermann, N. Forsgren, E. Kalliamvakou, framework via large language models,” arXiv preprint arXiv:2306.02907,
T. Lowdermilk, and I. Gazit, “Taking flight with copilot: Early insights 2023.
and opportunities of ai-powered pair-programming tools,” Queue, [65] K. Zhang, Z. Li, J. Li, G. Li, and Z. Jin, “Self-edit: Fault-aware code
vol. 20, no. 6, pp. 35–57, 2022. editor for code generation,” 2023, arXiv:2305.04087.
[42] Z. Manna and R. J. Waldinger, “Toward automatic program synthesis,” [66] T. Ahmed, K. S. Pai, P. Devanbu, and E. T. Barr, “Improving Few-
Communications of the ACM, vol. 14, no. 3, pp. 151–164, 1971. Shot Prompts with Relevant Static Analysis Products,” Apr. 2023,
[43] S. Gulwani, O. Polozov, R. Singh et al., “Program synthesis,” Foun- arXiv:2304.06815.
dations and Trends in Programming Languages, vol. 4, no. 1-2, pp. [67] J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, and H. Hemmati,
1–119, 2017. “Prompt engineering or fine tuning: An empirical assessment of large
[44] A. Hindle, E. Barr, Z. Su, P. Devanbu, and M. Gabel, “On the language models in automated software engineering tasks,” arXiv
naturalness of software,” in International Conference on Software preprint arXiv:2310.10508, 2023.
Engineering (ICSE 2012), Zurich, Switzerland, 2012. [68] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum, and C. Gan,
[45] E. T. Barr, Y. Brun, P. Devanbu, M. Harman, and F. Sarro, “The plastic “Planning with large language models for code generation,” arXiv
surgery hypothesis,” in 22nd ACM SIGSOFT International Symposium preprint arXiv:2303.05510, 2023, to appear, ICLR 2023.