
arXiv:2305.04032v5 [cs.SE] 11 Sep 2023

ToolCoder: Teach Code Generation Models to use API Search Tools

Kechi Zhang, Huangzhao Zhang, Ge Li*, Jia Li ♂, Zhuo Li, Zhi Jin*
Key Lab of High Confidence Software Technology, MoE (Peking University), Beijing, China

* Corresponding authors

Abstract—Automatically generating source code from natural language descriptions has been a growing field of research in recent years. However, current large-scale code generation models often encounter difficulties when selecting appropriate APIs for specific contexts. These models may generate APIs that do not meet requirements or refer to non-existent APIs in third-party libraries, especially for lesser-known or private libraries. Inspired by the process of human developers using tools to search for APIs, we propose ToolCoder, a novel approach that integrates API search tools with existing models to assist in code generation and API selection. To teach our model to use tools, we introduce an automated data annotation method using ChatGPT to add tool usage information into the source code data and fine-tune code generation models. During inference, we integrate API search tools into the generation process so that our model can automatically use the search tool to get suggestions when selecting an API. Our experimental results demonstrate that ToolCoder exhibits excellent performance and generalization across five public and private library code generation benchmarks, with at least a 6.21% improvement on average pass@1 metrics and a 9.64% improvement on average pass@10 metrics compared to state-of-the-art methods. Furthermore, we show that our relatively small ToolCoder model is comparable to one of the current best models, GPT-3.5, highlighting the potential of incorporating programming tools into the code generation process.

Fig. 1. An illustrative example of the process of human programmers selecting the proper API during coding. Programmers summarize their demands into a query (remove single-dimensional entries) and use the search engine tool or documentation search tool to get the proper API suggestion (np.squeeze). (The figure shows a partially written init_model function whose input size is (height, width, extra_dim) and whose output size is (height, width); the incomplete line out = np.? is completed as out = np.squeeze(inp).)

I. INTRODUCTION

Automated code generation has become increasingly important due to the significant effort required to manually write source code, especially for complex software. Deep learning techniques, particularly language models, have shown great promise in generating high-quality source code from natural language requirements. Currently, pre-trained code generation models are considered the state-of-the-art solution for various code generation tasks, such as CodeX [3], ChatGPT [4, 16] and CodeGen [14] models.

Accurately selecting appropriate application programming interfaces (APIs) is essential for pre-trained models to generate code. API selection is crucial for accurately expressing program semantics and efficiently addressing problems. However, there are too many existing third-party libraries and APIs, and new APIs are constantly being developed. Existing models often find it challenging to select APIs accurately and will generate non-existent APIs or APIs that do not meet requirements. For example, according to our preliminary experiments on NumpyEval and PandasEval [26], the popular code generation model CodeGen-2B generates an incorrect API in more than 26% of its outputs. Furthermore, for security and functionality reasons, industrial companies often build private libraries for internal use only. For these private libraries that are not publicly available, the error rate increases to more than 90%. These third-party public libraries and private libraries provide so many APIs that code generation models have never seen, leaving the models unable to generate API-oriented code. Therefore, it is worth exploring methods that improve code generation models to generate accurate source code using these domain-specific or private library APIs.
To assist code generation models in selecting appropriate APIs during the generation process, we draw inspiration from human programmers' perspectives. In most programming scenarios, programmers can use search tools to get suggestions from web sources or library documents when selecting an API. Figure 1 shows an example of human programmers selecting the proper API during coding. When encountering an API usage scenario, programmers summarize their needs into a query and use existing API search tools to search for suitable APIs, such as the Google search engine or documentation search tools for specific libraries. Then, according to the search results, programmers can choose the proper API. This programming process of using tools to retrieve and determine API usage improves programming efficiency and accuracy and reduces the risk of errors. It motivates us to investigate approaches to teach code generation models to use search tools to find suitable APIs.

In this paper, we propose ToolCoder, a low-cost and efficient solution that integrates API search tools into pre-trained code generation models, mimicking how programmers solve this problem. To help models learn to use tools, we propose an automated data annotation method with ChatGPT to add tool usage information into the source code data and use the annotated dataset to fine-tune code generation models. Specifically, our approach utilizes the in-context learning ability of large models to annotate a special tool-augmented dataset at a low cost. We employ parameter-efficient fine-tuning to improve the training efficiency. During inference, we integrate API search tools into the decoding process of our model, allowing the model to learn to use external tools autonomously.
We extensively evaluate our proposed ToolCoder with pass rate metrics [3]. ❶ We evaluate ToolCoder on three public library benchmarks. Our model achieves significant improvements over state-of-the-art baselines, with at least 10.11%, 3.26%, and 1.39% pass@1 gains. Our relatively small ToolCoder is even comparable to one of the current best language models, GPT-3.5. ❷ We further evaluate our model on two private library benchmarks. By switching to the appropriate search tool, our ToolCoder can be easily transferred to these private library scenarios and achieves stable improvement. Our model exhibits better generalization performance and raises at least a 6.21% improvement on the average pass@1 metrics over all five benchmarks. ❸ We also conduct an ablation study to analyze the different settings in our experiments, including the dataset, training, and inference settings. Results prove the effectiveness of the different designs in our approach.

Our contributions in this paper can be summarized as follows:
• To the best of our knowledge, we are the first to incorporate a programming tool into code generation models. Our results highlight the importance of models' ability to use tools.
• We propose an automatic method for annotating datasets in software engineering. This low-cost and efficient annotation framework uses powerful ChatGPT to annotate API datasets from public source code datasets, reducing the manual effort required to create private annotated datasets. Our dataset construction method can also be easily transferred to other tasks.
• We propose ToolCoder, which incorporates the ability to use API search tools into pre-trained code generation models and improves the performance on API-related code generation tasks. Our approach outperforms existing API-oriented baselines on multiple popular API-related code generation benchmarks.

II. MOTIVATING EXAMPLES

In this section, we examine the limitations of current code generation models when selecting a suitable API and how existing search tools can aid in API selection. By exploring these issues, we hope to provide context for our research and explain our motivation for proposing a new approach to address these challenges.

A. Limitations of current code generation models in selecting suitable APIs

Application Programming Interfaces (APIs) are essential to modern software development. APIs allow developers to integrate pre-existing code and services into their applications. Using APIs can significantly reduce the time and effort required to develop complex software systems.
However, selecting the proper API remains challenging for code generation models. Due to the proliferation of third-party libraries and their associated APIs, existing code generation models often struggle to choose the right API. Here we choose the popular model CodeGen-2B to test its performance on API selection. Figure 2 shows three failure examples of the generated code for selecting APIs. ❶ Case 1: The CodeGen-2B model has the potential risk of generating APIs that do not exist, even for such a popular and common third-party library as NumPy. There is no count API in the NumPy library, but CodeGen-2B still generates one. ❷ Case 2: It may also use the wrong API and generate unqualified code. df.sum() returns a Pandas Series rather than the required numeric value. This shows that existing code generation models still have challenges choosing an appropriate API to implement a given requirement. We conduct a statistical experiment to analyze the generated code on two benchmarks, NumpyEval and PandasEval [27], and find that more than 26% of the generated APIs have the problems mentioned above. We also conduct experiments on private library benchmarks such as BeatNumEval [26]. Private libraries are widespread in real code scenarios, and they are not public for security and functional reasons. Experiments show that for APIs in private libraries, the failure rate increases to more than 90%. ❸ Case 3: We find that CodeGen-2B lacks the corresponding private knowledge and concocts an incomprehensible API for the private library BeatNum. This shows that existing code generation models have limitations in API generation and are not specifically optimized for using APIs. In this paper, we aim to address these API selection issues.

Fig. 2. Failure cases of the CodeGen-2B model in selecting APIs, including generating non-existing APIs on public libraries (Case 1), generating unqualified APIs (Case 2), and lacking API-related knowledge on private libraries (Case 3):

    Case 1: Numpy
    a = np.arange(2*3*2).reshape((2,3,2))
    count_value = a.count(2)
    # 'numpy.ndarray' object has no attribute 'count'

    Case 2: Pandas
    df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
    # sum all values and return a numeric value
    sum_value = df.sum()
    # sum_value is not a numeric value

    Case 3: Private library (BeatNum)
    import beatnum as bn
    num_str = bn.numstr([0,33,4444522])
    # module 'beatnum' has no attribute 'numstr'

B. Existing search tools to aid API selection

Drawing inspiration from the approach programmers employ when selecting an API, our study reveals that existing search tools can provide API recommendations. For example, in Figure 1, the developer needs to use the numpy library to remove the extra single dimension in the input size. The developer turns to online search engine tools or library documentation search tools and gets the proper API suggestion np.squeeze. These two kinds of search tools can play a significant role in selecting APIs. A comparison of the two types of search tools is given in Table I. We analyze their use below.

TABLE I
COMPARISONS OF TWO TYPES OF SEARCH TOOLS FOR API SELECTION

                      Online Search Engine                          Documentation Search
Knowledge Resources   Programming community or tutorial websites    Library documentation
                      (StackOverFlow, datagy.io, etc.)
API Type              Public libraries, especially those            Any APIs, including public and
                      well-known and widely-discussed               private libraries
Advantages            Practical and accurate; rich sources;         Wide coverage; detailed
                      keeps updating                                explanation; stable
Example Tools         Google, Bing, DuckDuckGo                      NumPy doc, Pandas doc, private documentation

1) Online Search Engine Tool: Online search engine tools provide rich information on various API usage scenarios. Human programmers share their experience in solving programming problems on community and tutorial websites such as StackOverFlow (https://stackoverflow.com/) and datagy.io (https://datagy.io/), which organize and summarize the API suggestions used for different problems. Formally, these online API suggestions are usually presented as shared programming experience or question-and-answer threads. When other people encounter similar problems, search engines can exploit this information well, and programmers only need to provide a question query. These search engines treat such community websites as knowledge resources and can provide helpful API suggestions, especially for public libraries that are well-known and widely discussed. Since these online programming experiences are constantly updated and cover a variety of practical scenarios, we can often get more accurate API suggestions from these online search engine tools. Mature commercial search engines such as Google (https://www.google.com/) and DuckDuckGo (https://duckduckgo.com/) provide accurate, instant, and fast search responses and are widely used by human programmers.

2) Documentation Search Tool: Since lesser-known public libraries or private libraries have few discussions on online community websites, human programmers also turn to library documentation for API suggestions. Documentation is usually available for both public and private libraries, so it can provide rich information for any API usage scenario. Documentation gives detailed explanations for each API, with broad coverage of the corresponding library. It contains detailed and accurate explanations of the parameters and usage of each API and is the most detailed place to learn how to use an API. Since documentation does not change frequently, its results are more stable. Formally, API information in the documentation is usually given in pairs of an API and its corresponding comment. We can use BM25 [19] or other semantic similarity scores as search metrics to find comments that meet the requirements and return the corresponding API as the final suggestion for coding.
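The documentation search idea above can be sketched with an off-the-shelf BM25 implementation: rank a library's (API, comment) pairs by BM25 similarity to the query and return the best-matching API. The snippet below is a minimal illustration under our own assumptions (the rank_bm25 package and the toy API_DOC corpus are illustrative, not the paper's released tool).

from rank_bm25 import BM25Okapi

# Hypothetical documentation corpus: each entry pairs an API name with its comment.
API_DOC = [
    ("numpy.squeeze", "remove axes of length one from an array"),
    ("numpy.reshape", "gives a new shape to an array without changing its data"),
    ("numpy.cumsum", "return the cumulative sum of the elements along a given axis"),
]

def doc_search(query: str, top_n: int = 1) -> list[str]:
    """Return the APIs whose comments best match the query under BM25."""
    tokenized_comments = [comment.split() for _, comment in API_DOC]
    bm25 = BM25Okapi(tokenized_comments)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(API_DOC)), key=lambda i: scores[i], reverse=True)
    return [API_DOC[i][0] for i in ranked[:top_n]]

print(doc_search("remove single-dimensional entries from an array"))
# prints ['numpy.squeeze'] for this toy corpus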
These various search tools are helpful for programming and selecting an API. Inspired by the API search process of human developers, we aim to incorporate these two types of search tools into code generation models. By letting the code generation model learn to use these online search engine tools or documentation search tools, our models can effectively navigate the vast amount of information available to identify the most relevant APIs. This approach enables our models to match APIs with specific needs more accurately.

III. API SEARCH TOOL

To better present the methodology of our model, we first provide a brief introduction to the API search tool in this section. The proposed API search tool is an abstraction of the existing search tools for API selection and will be used as an external tool for our ToolCoder.

Following the motivating examples in Section II-B, we develop the API search tool for code generation models based on these two categories of sources. ❶ For commonly used public libraries such as numpy and pandas, we use DuckDuckGo as the search engine because it provides a more convenient and automated interface than other search engines. We use the search engine to retrieve relevant content from several online community websites and extract the mentioned APIs with string regex matching. Since this content gives a richer introduction to the API, more accurate API suggestions can be obtained from the search engine. ❷ For lesser-known APIs or those in private libraries, we employ the BM25 score as our retrieval metric to search the corresponding API documentation.
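A rough sketch of the public-library branch is shown below. It assumes the duckduckgo_search package for programmatic in-site queries and a simple regex over the returned snippets; both the package interface and the extraction pattern are illustrative assumptions rather than the exact tooling used in the paper.

import re
from duckduckgo_search import DDGS  # assumed third-party client for DuckDuckGo

# Crude pattern for dotted API mentions such as np.squeeze or pandas.DataFrame.iloc.
API_PATTERN = re.compile(r"\b(?:np|numpy|pd|pandas)\.[A-Za-z_][\w.]*")

def engine_search(query: str, site: str = "datagy.io") -> list[str]:
    """Search a community website for the query and extract the APIs it mentions."""
    apis: list[str] = []
    with DDGS() as ddgs:
        for result in ddgs.text(f"site:{site} {query}", max_results=5):
            # Each result is assumed to be a dict with 'title', 'href', and 'body' fields.
            apis.extend(API_PATTERN.findall(result.get("body", "")))
    # Return the most frequently mentioned APIs first.
    return sorted(set(apis), key=apis.count, reverse=True)

print(engine_search("sum all values in a pandas dataframe"))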
We then abstract the two types of search tools into a unified form: we use the notation APISearch(query) → answer to represent a call to the API search tool, where APISearch is the function name that abstracts the different API search sources, query denotes the search query, and answer indicates the answer returned by the API search tool, which can be referred to for further code generation. In subsequent experiments, we serialize API search tool calls for model input. To differentiate them from regular text, we surround tool calls with special tokens, starting with ⟨API⟩ and ending with ⟨/API⟩. Examples can be viewed in Figure 3. We set ⟨API⟩, ⟨/API⟩, and → as special tokens in our model vocabulary.
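Concretely, registering these markers can look like the following Hugging Face transformers sketch. The literal token strings ("<API>", "</API>", "->") and the CodeGen checkpoint name are our assumptions about how the notation is serialized; they are not taken from a released script.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

# Register the tool-call markers so they are never split into sub-tokens.
tokenizer.add_special_tokens({"additional_special_tokens": ["<API>", "</API>", "->"]})
model.resize_token_embeddings(len(tokenizer))

# A serialized training sample in the unified form described above.
sample = ("sum_value = <API>APISearch(Calculate the sum of the values in the "
          "given array.)->cumsum</API>np.cumsum(df.values)[-1]")
print(tokenizer.tokenize(sample)[:12])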
IV. TOOLCODER

In this section, we present our approach ToolCoder for selecting and using APIs in coding practices. The goal of our approach is to train a model that can effectively select and use appropriate APIs based on existing partial code. To achieve this goal, we decompose our approach into three modules: data annotation, fine-tuning, and inference. The three modules work in a pipeline as shown in Figure 3. We describe the details in the following subsections.

Fig. 3. The pipeline of our approach ToolCoder. The pipeline has three main parts: (1) automatically annotate a tool-augmented dataset with ChatGPT, (2) parameter-efficiently fine-tune an existing pre-trained code generation model (e.g., CodeGen-350M or CodeGen-2B) with the annotated dataset, and (3) inference of the fine-tuned model enhanced with API search tools. For example, annotation rewrites

    samples = multivariate_normal(mean, matrix, N)

into

    samples = <API>APISearch(Generates random samples from a multivariate normal distribution.)->multivariate_normal</API>multivariate_normal(mean, matrix, N)

and at inference time, for the NL input "How do I get the value at an n-th row of a given column name in Pandas?", the model emits APISearch(Selects a single row of data from a DataFrame.), receives pandas.DataFrame.iloc from the tool, and continues with df.iloc[n][column_name].

A. Automatic Data Annotation

In order for the model to learn to use the API search tool, we first need a dataset that includes source code and the associated tool call process. As mentioned in Section III, we abstract the search call process with the notation ⟨API⟩ APISearch(query) → answer ⟨/API⟩. However, such datasets are not readily available. To address this issue, we propose to automatically augment an existing source code dataset with the tool call notation using ChatGPT (gpt-3.5-turbo, https://openai.com/), which has already demonstrated excellent few-shot and even zero-shot learning ability on many different language learning tasks. This low-cost and efficient annotation method reduces the manual effort required to create private annotated datasets. Our data annotation process can be divided into three parts: ❶ base dataset selection, ❷ prompt selection, and ❸ filter and clean.

Base Dataset Selection. For the base dataset, we choose the popular pre-training dataset CodeSearchNet-Python [8]. It is a real-world programming dataset obtained from GitHub without any additional annotations. This dataset is already commonly used by many pre-trained code generation models, so we can ensure as much as possible that our subsequent training will not affect the model's generalization performance on language generation and modeling ability. We use a simple length filtering method and randomly choose nearly 60k function-level source code samples from this dataset as the base dataset for our annotation method.

Prompt Selection. Similar to [20], to help generate the annotated dataset, we need to provide a detailed instruction for ChatGPT that specifies its system role as a data annotator, as shown in Figure 4. To improve the quality of the generated datasets, we manually write three human-written input-output pairs as part of the prompt, covering three libraries: numpy, pandas, and matplotlib. We choose these three libraries as the examples in the prompt because we are skilled in them and they are also commonly used in the base dataset. Based on the selected prompt and base dataset, we ask ChatGPT to annotate the tool-augmented dataset. We generate one annotated sample for each base sample. The automatic annotation process lasted four days.

Fig. 4. An exemplary prompt used to generate API-augmented datasets for the API search tool. In our setting, we selected a total of three human-written input-output pairs as part of the prompt, using three libraries: numpy, pandas, and matplotlib.

    Your task is to add calls to a API Search Tool to a piece of source code. You can
    use an API Search Tool to lookup important third-party APIs from the document.
    The API Search Tool should help you get information required to complete the
    source code and select API. Use the format:
    "<API>APISearch(query)->answer</API>". In the format, "query" is the search input
    that describes the specific role of the API required in this code, and "answer"
    is the search output API. Here are some examples of API calls:
    Input: B = np.reshape(A, (-1, 2))
    Output: B = <API>APISearch(Gives a new shape to an array without changing its
    data.)->np.reshape</API>np.reshape(A, (-1, 2))
    (...another two Input-Output pairs...)
    Input: {code}
    Output:
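The annotation step itself can be scripted against the chat completion API. The sketch below is our own minimal illustration, assuming the legacy (pre-1.0) openai Python client and the Figure 4 prompt stored in INSTRUCTION and FEW_SHOT; the paper does not release this exact script.

import openai  # legacy client interface, assumed for illustration

INSTRUCTION = "Your task is to add calls to a API Search Tool to a piece of source code. ..."
FEW_SHOT = "Input: B = np.reshape(A, (-1, 2))\nOutput: B = <API>APISearch(...)->np.reshape</API>np.reshape(A, (-1, 2))\n..."

def annotate(code: str) -> str:
    """Ask gpt-3.5-turbo to insert <API>APISearch(query)->answer</API> calls into `code`."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": f"{FEW_SHOT}\nInput: {code}\nOutput:"},
        ],
        temperature=0.2,
    )
    return response["choices"][0]["message"]["content"]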
Filter and Clean. After getting all the generated results from ChatGPT, we perform a series of simple filtering operations to remove abnormal data samples. We filter out nested API search calls, limit the number of API search calls per sample to fewer than five, and ensure that at least one call targets an API from a public library. We also filter out samples that differ from the original source code once the API search calls are removed. Furthermore, for the generated API answer in each search call, we check whether it is followed by the corresponding API in the generated code, to ensure that the API search call is closely related to the specific code implementation. Finally, we obtain a cleaned dataset of 53k samples, which is used for subsequent fine-tuning. Table II shows the statistics of the final annotated dataset. We also count the proportion of some third-party library APIs in the dataset for reference in subsequent evaluation experiments. In the left part of Figure 3, we also give an example sample of the final dataset.

TABLE II
STATISTICS OF THE ANNOTATION DATASET

Dataset Size                                  53,000
Avg. annotated API calls per sample           3.2
Avg. length (in words) before annotation      186.24
Avg. length (in words) after annotation       211.49
Proportion of some third-party libraries      NumPy 24%, Pandas 13%, TorchData 0%
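A compact version of these filtering rules can be expressed with regular expressions. The snippet below is a sketch under our own assumptions about the serialized format and the public-library prefixes; the thresholds follow the description above.

import re

CALL_RE = re.compile(r"<API>APISearch\((?P<query>.*?)\)->(?P<answer>.*?)</API>", re.DOTALL)
PUBLIC_PREFIXES = ("np.", "numpy.", "pd.", "pandas.", "plt.", "matplotlib.")

def keep_sample(annotated: str, original: str) -> bool:
    """Return True if a ChatGPT-annotated sample passes the filter-and-clean rules."""
    calls = list(CALL_RE.finditer(annotated))
    answers = [m.group("answer").strip() for m in calls]
    stripped = CALL_RE.sub("", annotated)            # sample with all tool calls removed
    return (
        0 < len(calls) < 5                                        # fewer than five calls
        and "<API>" not in stripped                               # no nested or unmatched markers
        and any(a.startswith(PUBLIC_PREFIXES) for a in answers)   # at least one public-library API
        and stripped == original                                  # code unchanged after stripping
        and all(annotated[m.end():].lstrip().startswith(a)        # answer reused right after the call
                for a, m in zip(answers, calls))
    )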
B. Parameter-efficient Fine-tuning

We leverage the annotated dataset to fine-tune a pre-trained language model so that the model learns to generate the search tool calls itself. To address the challenge of limited computational resources and improve training efficiency, we propose restricting the number of trainable parameters and layers in the pre-trained model and adopting a parameter-efficient fine-tuning approach that can efficiently adapt pre-trained models to new task types. In particular, we apply LoRA [7] to reduce trainable parameters.

Low-Rank Adaptation (LoRA) is a low-dimensional representation-based parameter-efficient tuning method. It injects trainable low-rank matrices into transformer layers to approximate the weight updates. For a pre-trained weight matrix W ∈ R^(d×k), LoRA represents its update with a low-rank decomposition W + δW = W + W_down W_up, where W_down ∈ R^(d×r) and W_up ∈ R^(r×k) are tunable parameters. LoRA generally applies this update to the attention linear projection matrices in the multi-head attention sub-layer of the Transformer. For a specific input x to the linear projection in multi-head attention, LoRA modifies the projection output h as:

    h ← h + s · x W_down W_up,    (1)

where s ≥ 1 is a tunable scalar hyperparameter. The illustration of LoRA is shown in the middle part of Figure 3.

In our training setting, we freeze most of the parameters in the pre-trained model and only apply LoRA on the query and value projections in the attention module of each transformer layer. As a result, we only need to train 0.18% of the parameters in CodeGen-350M and 0.09% in CodeGen-2B. This makes it possible to efficiently fine-tune models on a consumer-level GPU, such as an Nvidia GeForce RTX 2080 (11GB RAM). The parameter-efficient tuning strategy significantly reduces the training computational burden in our experiments. It achieves results comparable to full-parameter training with less computational resources and time. We give a detailed analysis in the ablation experiments in Section VI-C.
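As a concrete picture of Eq. (1), a LoRA-style wrapper around a frozen linear projection can be sketched in PyTorch as follows. The rank, scaling, and the idea of wrapping only the query/value projections mirror the setting described above, while the class itself is our own illustrative code rather than the paper's training script.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: h = x W + s * x W_down W_up."""

    def __init__(self, base: nn.Linear, r: int = 8, s: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad = False
        self.w_down = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.w_up = nn.Parameter(torch.zeros(r, base.out_features))
        self.s = s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.s * (x @ self.w_down @ self.w_up)

# Example: wrap one attention projection (dimensions are illustrative).
proj = nn.Linear(1024, 1024)
lora_proj = LoRALinear(proj, r=8)
trainable = sum(p.numel() for p in lora_proj.parameters() if p.requires_grad)
print(trainable)   # 2 * 1024 * 8 parameters instead of 1024 * 1024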

C. Inference enhanced with Tools

After training on the annotated dataset, the model can generate API search calls during the code generation process. A pseudo-code description of the decoding process with the API search tool is given in Algorithm 1.

During inference, we perform regular decoding until the model produces the ⟨API⟩ token, indicating that it next expects the response for an API call. At this point, we continue the decoding process and record the following generated tokens to obtain the query between APISearch( and )→. Then we interrupt the decoding process, call the API search tool to get a response, and continue the decoding process after inserting both the response and the ⟨/API⟩ token.
Algorithm 1 Inference with API Search Tool
 1: procedure InferWithTool(model, input_nl, maxlen)
 2:     Pass input_nl to the model and get the predicted token
 3:     output ← [token]
 4:     i ← 0
 5:     while i < maxlen do
 6:         token ← the last token of output
 7:         if token = ⟨API⟩ then
 8:             query ← the following generated tokens between APISearch( and )→
 9:             response ← call the API search tool with query
10:             Append ⟨API⟩APISearch(query)→response⟨/API⟩ to output
11:             i ← i + length of the call process
12:         else
13:             Pass token to the model and get the predicted token
14:             Append the predicted token to output
15:             i ← i + 1
16:         end if
17:     end while
18:     return output
19: end procedure
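On top of a Hugging Face causal LM, the same control flow amounts to decoding until the query delimiter of a tool call appears, extracting the query, splicing in the tool response, and resuming generation. The helper below is a simplified sketch under that assumption (greedy decoding, token-by-token re-encoding, and a fine-tuned ToolCoder-style checkpoint; the base CodeGen name is only a placeholder). api_search stands for the unified tool of Section III.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

def generate_with_tool(prompt: str, api_search, max_new_tokens: int = 128) -> str:
    """Decode normally, but pause at '<API>APISearch(...)->' to inject the tool's answer."""
    text = prompt
    for _ in range(max_new_tokens):
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            next_id = int(model(ids).logits[0, -1].argmax())
        text += tokenizer.decode([next_id])
        # A tool call is ready once the query delimiter ')->' has just been produced.
        if "<API>" in text and text.endswith(")->"):
            query = text[text.rfind("APISearch(") + len("APISearch("):-len(")->")]
            text += api_search(query) + "</API>"   # splice in the response and close the call
    return text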
As mentioned in Section III, we adopt different API search sources for different types of API usage. For commonly used public libraries, we use DuckDuckGo, a popular online search engine, to perform in-site content search on several selected websites. For lesser-known or private library APIs, there is no relevant online information, so we employ the BM25 score as our retrieval metric to search the corresponding API documents. We encapsulate these search interfaces so that our ToolCoder can call the search tools with high performance. In our experiments, we keep the search delay within 0.6s to ensure high efficiency during the code generation process.

After the entire inference process is over, we use regular-expression matching to remove the API search part, that is, the text between ⟨API⟩ and ⟨/API⟩, from the generated code. By using API search tools in this way, we can effectively address the challenge of selecting appropriate APIs and reduce the time and effort required for developers to find suitable APIs.

V. EXPERIMENTAL SETUP

To assess the effectiveness of our approach, we perform a large-scale study to answer four research questions. In this section, we describe the details of our study, including datasets, metrics, and baselines.

A. Research Questions

Our study aims to answer four research questions. In RQ1, we compare our ToolCoder to SOTA code generation models on three public library benchmarks. In RQ2, we conduct experiments on two private library benchmarks to show the generalization of our proposed model on those private libraries. In RQ3, we conduct an ablation study to prove the contributions of different modules. In RQ4, we conduct a series of quality measures on the generated results and analyze the effectiveness and limitations of our method through detailed case studies.

RQ1. How does ToolCoder perform compared to SOTA baselines on public library code generation? To evaluate ToolCoder's performance on public library code generation, we conduct experiments on three public library code generation benchmarks, covering numpy, pandas, and torchdata. We compare ToolCoder's performance with existing SOTA code generation baselines.

RQ2. How does ToolCoder perform on private library code generation? We select two private library benchmarks where the pre-trained language models have never encountered any private library APIs, and there is no relevant information available online. We evaluate ToolCoder's performance on these private libraries to demonstrate its generalization and versatility.

RQ3. What are the contributions of different modules in our approach? Our approach pipeline consists of three modules: data annotation, fine-tuning, and inference. To analyze the effectiveness of our approach, we conduct an ablation study by varying settings in our pipeline, including the dataset, training, and inference search settings.

RQ4. How is the quality of our generated code with ToolCoder? We evaluate the quality of the code generated by ToolCoder through a case study analysis. Additionally, we analyze the effectiveness of our method and explain why our model works.

B. Datasets

Our experiments are conducted on three public library benchmarks, PandasEval, NumpyEval, and TorchDataEval, and two private library benchmarks, MonkeyEval and BeatNumEval. We choose these benchmarks to ensure our proposed method can be used in various API selection scenarios.

1) Public library benchmarks: PandasEval [27] is a domain-specific method or block generation benchmark for the Pandas library in Python. PandasEval contains 101 test examples. Each example corresponds to a programming problem of Pandas, containing the context code, the target method body (or block), and multiple test cases. NumpyEval [27] is almost the same as PandasEval, apart from the domain. NumpyEval specifically targets the Numpy library in Python. The benchmark also contains 101 test examples. TorchDataEval [26] is based on the TorchData library in Python. TorchData is a newly released library, which is more likely to be unseen by the pre-trained models. Therefore, this benchmark is proposed to evaluate models against an unseen library and contains 50 test examples. In our experiments, our annotated dataset does not contain API code related to TorchData, as shown in Table II, and our base pre-trained model did not see these data during the pre-training phase, so this benchmark can also be used to demonstrate the generalization ability of our method on APIs that are public but never seen by the code generation model.

2) Private library benchmarks: MonkeyEval [26], modified from PandasEval, is designed to evaluate method generation models against an unseen library. The Monkey library is crafted by modifying all Pandas-related keywords; e.g., "pandas" is converted to "monkey", "dataframe" is converted to "knowledgeframe", etc. The library construction process ensures that no information about the API names of these libraries is leaked in online materials or any training datasets. MonkeyEval converts all examples in PandasEval, leading to 101 test examples. BeatNumEval [26] is modified from NumpyEval in the same way that PandasEval is modified to MonkeyEval. BeatNumEval also has 101 test examples. The pre-trained model has not seen the APIs in MonkeyEval and BeatNumEval, and the online search resources cannot provide any API-related information, so API selection on these benchmarks relies only on the API search tool we built on the documentation of these private libraries.

C. Metrics

Following previous work, we use the pass rate metric pass@k [3] for performance evaluation and take advantage of the provided unit tests to determine the functional correctness of code solutions. For each problem, we submit k code solutions for evaluation. If any of the k code solutions passes all ground-truth test cases, the problem is considered solved. Then pass@k is the percentage of solved problems. In our experiments, we set k = {1, 10}.
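Under this definition, pass@k can be computed directly from the per-problem pass/fail outcomes of the k submitted samples. The helper below is a small sketch of exactly that "any of k passes" rule stated above (not the unbiased estimator used in some other work).

def pass_at_k(results: list[list[bool]]) -> float:
    """results[i][j] is True if the j-th sampled solution for problem i passes all tests."""
    solved = sum(1 for problem in results if any(problem))
    return 100.0 * solved / len(results)

# Example: 3 problems, k = 2 samples each.
outcomes = [[True, False], [False, False], [False, True]]
print(pass_at_k(outcomes))   # 66.66...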
D. Baselines

We select six series of recent code generation models as baselines, including one of the most powerful models, GPT-3.5. These models can be divided into two categories: general models and API-oriented models.

1) General Models: CodeT5 [2] is an encoder-decoder pre-trained model for code-related tasks. It uses an identifier-aware pre-training task and has achieved SOTA results on many general code generation benchmarks. We use CodeT5-base with 220M parameters in our experiments. PyCodeGPT [27] is a decoder-only pre-trained code generation model with 110M parameters. It is initialized from GPT-Neo and continually pre-trained on a large-scale Python code corpus. CodeGen [14] is a series of decoder-only pre-trained code generation models with parameters varying from 350M to 16B. It casts code generation as a multi-turn conversation between a user and a system. CodeGen has shown strong ability on a variety of complex code generation tasks. Due to computational limitations, we use the 350M and 2B versions in our experiments. GPT-3.5 [4, 16] is one of the most powerful generation model families from OpenAI. We use the "gpt-3.5-turbo" model as it is the most cost-effective and performant model in the GPT-3.5 family. As OpenAI states, it has flexible natural language and programming language capabilities (https://platform.openai.com/docs/models/gpt-3-5).

2) API-oriented models: CERT [27] is a generation approach designed for API-related code. CERT contains two modules, the sketcher and the generator, each of which is fine-tuned independently from PyCodeGPT. It first predicts a sketch based on the NL description and then generates the complete code based on the sketch. For each library, CERT requires specially trained weights. We use the released weights as two independent models: CERT-numpy and CERT-pandas. CodeGenAPI [26] is another API-oriented code generation model. It uses a two-stage pipeline to generate code: given an NL description, CodeGenAPI first uses a retriever model initialized with BERT [5] to find APIs from documents, and then uses a generator initialized with CodeGen-350M to generate the complete code based on the retrieved APIs and the problem description. We use the three released settings from their paper: CodeGenAPI, CodeGen-retrieval, and CodeGenAPI-retrieval. The first setting only uses the trained generator without retrieval, and the latter two use the best-performing top-2 retrieval results to assist generation.

E. Implementation Details

Training. Our model is implemented in the PyTorch framework, and we perform all experiments on four RTX 2080 (11GB) GPUs. We initialize ToolCoder with the pre-trained weights of CodeGen-350M and CodeGen-2B. The training batch size is set to 8, and the total number of training epochs is set to 10. We use the validation loss to determine the best checkpoint as the final model.

Tool. When implementing the API search tool, we adopt in-site online search on datagy.io as well as the NumPy (https://numpy.org/doc/), Pandas (https://pandas.pydata.org/docs/), and TorchData (https://pytorch.org/data/) websites using DuckDuckGo for the public library benchmarks. For the private library benchmarks, we use the provided Monkey and BeatNum library documentation to build an API search tool based on the BM25 algorithm. The tool's response used at inference time is the first retrieved API.

Inference. During the model generation process, we use temperature sampling with T = 0.8 and limit the sample budget to 10. Each experiment is run three times with random seeds, and the results are averaged for the final numbers.
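For reference, the sampling configuration above corresponds to a standard Hugging Face generate call. The snippet is a sketch of that setting only; the model name and prompt are placeholders, and it is not the paper's evaluation harness.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

prompt = "# sum all values in a pandas dataframe and return a numeric value\n"
inputs = tokenizer(prompt, return_tensors="pt")

# Temperature sampling with T = 0.8 and a budget of 10 samples per problem.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    num_return_sequences=10,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]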
VI. RESULTS AND ANALYSES

A. RQ1: Results for Public Library API Code Generation

To answer RQ1, we evaluate the baselines and our ToolCoder on NumpyEval, PandasEval, and TorchDataEval; results are shown in Table III. ToolCoder-OnlineTool denotes the performance of our model with the online search engine tool during generation.

TABLE III
PASS RATE OF MODELS ON PUBLIC LIBRARY BENCHMARKS

Model                  Para.   NumpyEval          PandasEval         TorchDataEval
                               pass@1  pass@10    pass@1  pass@10    pass@1  pass@10
General Models
CodeT5                 220M    0       0.1        0       0          0       0
PyCodeGPT              110M    18.04   38.61      12.75   37.62      3.80    14.00
CodeGen-350M           350M    18.51   43.56      16.73   29.70      4.60    14.00
CodeGen-2B             2B      29.10   53.46      30.69   42.57      7.00    18.00
GPT-3.5                -       58.41   66.21      30.09   33.16      6.00    24.00
API-oriented
CERT-numpy             220M    31.47   46.42      16.03   27.72      2.20    14.00
CERT-pandas            220M    18.81   33.66      28.42   48.04      2.80    6.00
CodeGenAPI             350M    16.55   29.48      13.58   34.95      7.19    16.93
CodeGenAPI-retrieval   475M    12.67   27.32      11.25   28.61      10.41   23.50
CodeGen-retrieval      475M    18.30   35.12      9.54    29.02      7.52    16.36
Ours
ToolCoder-OnlineTool   350M    35.64   50.50      22.77   37.62      7.40    20.00
ToolCoder-OnlineTool   2B      41.58   55.44      31.68   47.52      11.80   24.00

We notice that some general code generation models, such as CodeT5, achieve poor results, which shows that public library API selection poses particular challenges for code generation models. The results show that ToolCoder achieves the best results among both general code generation baselines and API-oriented baselines. Even compared with the extremely large model GPT-3.5, our model achieves comparable performance on these public library benchmarks.

Compared with the state-of-the-art API-oriented baselines, our model achieves 10.11%, 3.26%, and 1.39% pass@1 improvements over the best baseline on the three benchmarks. Even when we keep our model smaller than the baselines, as ToolCoder-350M, it still achieves excellent overall performance. Existing API-oriented models mainly focus on training and inference for a single library's API code dataset, so the same model often fails to achieve good results on multiple API benchmarks, as seen with CERT-numpy and CERT-pandas. Our model shows stronger generalization ability and can be applied to various API libraries; it achieves excellent results even on the unseen TorchData library. Our model is trained based on CodeGen models, and the performance of our ToolCoder models is significantly higher than that of the corresponding base CodeGen models, indicating that our training process and tool assistance help models learn to generate API-related code better.

B. RQ2: Results for Private Library API Code Generation

To answer RQ2, we evaluate the baselines and our ToolCoder on MonkeyEval and BeatNumEval. Results are shown in Table IV. ToolCoder-DocTool denotes the performance of our model with the documentation search tool during generation, as these private libraries do not have relevant online resources.

TABLE IV
PASS RATE OF MODELS ON PRIVATE LIBRARY BENCHMARKS

Model                  Para.   MonkeyEval         BeatNumEval
                               pass@1  pass@10    pass@1  pass@10
General Models
CodeT5                 220M    0       0          0       0
CodeGen-350M           350M    0.95    4.90       5.15    11.96
CodeGen-2B             2B      1.59    5.94       5.94    11.88
GPT-3.5                -       2.47    8.91       6.68    17.82
API-oriented
CodeGenAPI             350M    1.19    4.68       4.44    8.24
CodeGenAPI-retrieval   475M    3.41    8.33       5.90    11.79
CodeGen-retrieval      475M    2.46    6.35       6.65    13.68
Ours
ToolCoder-DocTool      350M    2.98    5.94       6.73    12.87
ToolCoder-DocTool      2B      3.02    7.92       6.93    13.86

These private library benchmarks are extremely hard for general code generation models, as the lower pass@1 and pass@10 scores show. With the documentation search tool enabled, our ToolCoder shows stable generalization ability on these two new benchmarks. When compared with the state-of-the-art API-oriented baselines, our model shows comparable performance. Combining this with the excellent performance of our method on the public library benchmarks, the average pass@1 over the five benchmarks of our two ToolCoder variants is 15.10% and 19.00%. On this average pass@1 metric, our ToolCoder outperforms the best baseline, CodeGen-retrieval, which reaches only 8.89%, an improvement of at least 6.21%. For the average pass@10, our model outperforms all API-oriented baselines by at least 9.64%. These results confirm that ToolCoder shows the overall best performance across various API selection scenarios.

Compared with the base pre-trained models CodeGen-350M and CodeGen-2B, our model improves greatly. ToolCoder-350M outperforms the base CodeGen-350M by 2.03% and 1.58% on pass@1 and by 1.04% and 0.91% on pass@10. ToolCoder-2B achieves a similar improvement compared with CodeGen-2B. This shows that documentation search tools can help code generation models select proper APIs during inference, thus improving the quality of the generated code. Compared with the most powerful model, GPT-3.5, our ToolCoder can still achieve better results in some inference settings. The results show that our proposed ToolCoder can assist the API selection process and enhance the ability of the code generation model.

C. RQ3: Ablation Studies

To answer RQ3, we investigate the impact of the different designed modules in our pipeline. We conduct ablation studies that change the dataset, training, and inference settings in our experiments.

1) Dataset Setting: We perform ablation experiments on the dataset construction in Table V. We replace our training dataset with the original dataset, which only contains the regular source code without annotation, referred to as original dataset. We also add an experiment that removes the content of the query in the search call so that its form becomes APISearch()→answer; during inference, we use the question description to search for the API directly. We refer to this ablation as annotation w/o query. We also add the original CodeGen-350M model for comparison, which is not trained on the new dataset.

TABLE V
ABLATION STUDIES ON DATASET SETTINGS. WE CONDUCT EXPERIMENTS ON TOOLCODER-350M.

Dataset Setting        NumpyEval          PandasEval         TorchDataEval
                       pass@1  pass@10    pass@1  pass@10    pass@1  pass@10
ToolCoder-350M         35.64   50.50      22.77   37.62      7.40    20.00
original dataset       19.40   39.60      19.92   38.61      6.00    14.00
annotation w/o query   14.05   43.56      11.68   33.66      3.80    6.00
CodeGen-350M           18.51   43.56      16.73   29.70      4.60    14.00

Results show that our dataset annotation is essential for the improvement. Compared with the model trained on the original dataset, our ToolCoder-350M shows a stable improvement on almost all metrics. The annotation dataset enables our model to use the external search tool for API selection and thus improves the quality of the generated code.
The results also show that it is essential to generate the search query. When we discard the search query in the data construction and instead use the problem description for the API search tools, we observe a drastic drop in the final results, shown as annotation w/o query in Table V. We attribute this to the fact that the problem description is still far from the use of the specific API, so it remains difficult to select the appropriate API with the existing API search tools. We can also confirm that fine-tuning only on the original source code dataset cannot help the model learn to select APIs: comparing CodeGen-350M with the model trained on the original dataset, the additional training on the code dataset does not significantly improve the model's performance. The key to our improvement is annotating the API tool calls into the code dataset to teach the model to use external API search tools.

2) Training Setting: We perform ablation experiments with ToolCoder-350M on the training setting in Table VI. Our experiments compare the performance of two approaches: full parameter training, referred to as full-training, and our proposed method, which utilizes LoRA for parameter-efficient training. We evaluate their performance on the public library benchmarks and record their training costs, including training time and trainable parameters, using two RTX 2080 GPUs.

TABLE VI
ABLATION STUDIES ON TRAINING SETTINGS. WE CONDUCT EXPERIMENTS ON TOOLCODER-350M.

Training Setting   Training Time   Training Para.   NumpyEval          PandasEval         TorchDataEval
                                                    pass@1  pass@10    pass@1  pass@10    pass@1  pass@10
ToolCoder-350M     6h              0.65M            35.64   50.50      22.77   37.62      7.40    20.00
full-training      29h             350M             35.35   58.41      22.67   40.59      6.00    22.00

Results show that our fine-tuning strategy has almost no performance penalty compared with regular full-training. On the public library benchmarks, the difference between the two pass@1 results is within 0.4%. The gap in these results is acceptable, considering the huge savings in training costs. In our experiment settings, our parameter-efficient fine-tuning strategy reduces the training time from 29h to 6h and the trainable parameters from more than 350M to 0.65M. We only need to train 0.18% of the parameters in CodeGen-350M and 0.09% in CodeGen-2B, which makes it possible to efficiently fine-tune models on a consumer-level GPU, such as an Nvidia GeForce RTX 2080 (11GB RAM).

3) Inference Setting: We perform ablation experiments on the inference setting in Table VII. We add experiments that disable the tool in our model. NoTool means that we disable the tool during inference and use our trained model to directly generate an API based on the search query and complete the code. We compare this with our original inference setting on the public and private library benchmarks.

TABLE VII
ABLATION STUDIES ON INFERENCE SETTINGS

(a) On public library benchmarks
Inference Setting   NumpyEval          PandasEval         TorchDataEval
                    pass@1  pass@10    pass@1  pass@10    pass@1  pass@10
OnlineTool-350M     35.64   50.50      22.77   37.62      7.40    20.00
NoTool-350M         33.76   46.53      20.19   35.64      6.00    16.00
OnlineTool-2B       41.58   55.44      31.68   47.52      11.80   24.00
NoTool-2B           38.71   54.45      31.38   44.55      7.50    20.00

(b) On private library benchmarks
Inference Setting   MonkeyEval         BeatNumEval
                    pass@1  pass@10    pass@1  pass@10
OnlineTool-350M     2.98    5.94       6.73    12.87
NoTool-350M         0.29    0.99       1.68    4.95
OnlineTool-2B       3.02    7.92       6.93    13.86
NoTool-2B           0.79    2.97       2.77    8.91

Experiments show that our external tools are essential for improving performance. On the public library benchmarks, the online search engine tool improves pass@1 by 1.88%, 2.57%, and 0.4% for ToolCoder-350M, and by 2.87%, 0.29%, and 4.3% for ToolCoder-2B. The online search engine tool can find similar API usage scenarios and provide accurate API suggestions. On the private library benchmarks, the improvement is more significant. We find that the model itself works poorly on private libraries; however, with the assistance of the documentation search tool, our model can choose suitable private library APIs.

Another interesting observation is that NoTool also achieves relatively good performance on the public library benchmarks. We believe this improvement comes from our dataset annotation process. The additional tool call process in the dataset can be seen as a way of thinking about and choosing the API. This chain of thought in the annotation dataset can help the code generation model better understand the functions and application scenarios of different APIs, thus directly improving API selection. However, for private libraries, since the knowledge of private libraries has not been seen by the code generation model, this form of dataset annotation can hardly bring improvements to the model. With proper API search tools enabled, our ToolCoder can select APIs more accurately and improve further.

D. RQ4: Qualitative Analysis

To answer RQ4, we perform a case study analysis to assess the quality of the generated code. Figure 5 presents code snippets generated on public and private library benchmarks. From the examples, we obtain the following findings: ❶ The generated search query provides more fine-grained technical planning for the solution. The NumpyEval case requires summing values in a dataframe, and the generated query breaks down the requirements, focusing first on summing arrays. It fills the gap between requirements and concrete APIs. ❷ The responses of the search tools play a crucial role in the generated code. The online search engine tool finds the proper API from the correct websites, and the documentation search tool finds the proper API by searching over the API comments.
❸ Our ToolCoder can also make necessary modifications based on the tool response. For example, the online search tool returns the response cumsum, which is not directly defined in the input code; ToolCoder adds the components missing from the response and generates the correct API np.cumsum.

Fig. 5. Case studies of ToolCoder-2B, with the online search engine tool on NumpyEval and the documentation search tool on BeatNumEval.

    Input (NumpyEval/99):
    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
    # What's the best way to sum all values in a Pandas dataframe?
    # the result is a numeric value
    Output:
    sum_value = <API>APISearch(Calculate the sum of the values in the given
    array.)->cumsum</API>np.cumsum(df.values)[-1]
    (online search: "Calculate the sum of the values in the given array"
     -> https://datagy.io/numpy-cumsum/ -> cumsum)

    Input (BeatNumEval/92):
    import beatnum as bn
    master = bn.numset([1,2,3,4,5])
    search = bn.numset([4,2,2,3])
    # Find indices of a list of values in a beatnum numset
    Output:
    out = <API>APISearch(Find indices where elements should be inserted to
    maintain order.)->bn.find_sorted</API>bn.find_sorted(master, search)
    (documentation search: find_sorted - find the indices into a sorted numset a
     such that, if the corresponding elements in v were inserted before the
     indices, the order of a would be preserved)

VII. THREATS TO VALIDITY

Threats to internal validity are related to the roles of the model architecture and the hyper-parameter settings. In our experiments, we perform a small-range grid search on the learning rate and batch size settings. Our ToolCoder-350M model keeps the hyper-parameters the same as the baseline models for a fair comparison.

Threats to external validity are mainly related to the tasks and datasets we choose in this paper. We counter this by evaluating our model on five different benchmarks covering two types of APIs, including public and private library API code generation.

Threats to construct validity include the evaluation metrics we use in this work. We utilize pass rates to accurately evaluate the correctness of the generated code. This metric is adequate for the corresponding tasks and has been adopted by many previous studies.

VIII. RELATED WORK

A. Code Generation

Code generation aims to generate source code that satisfies a given natural language description or requirement. It involves automatically creating source code based on functional requirements, such as natural language descriptions [9] or pseudo-code algorithms [10, 15, 25]. Recently, pre-trained language models have shown impressive capabilities in code generation tasks. Lu et al. [11] adapt the GPT-2 [18] model to source code, resulting in CodeGPT. Chen et al. [3] fine-tune GPT-3 [4] models on code to produce CodeX and GitHub Copilot. OpenAI also produces the GPT-3.5 series of models, which have shown strong generation capabilities in natural language and programming languages. Neither CodeX nor GPT-3.5 is open-sourced, which has led to several attempts to replicate CodeX in industry and academia, resulting in GPT-Neo [1], GPT-J [21], CodeParrot [22], PolyCoder [23], PyCodeGPT [27], InCoder [6], and CodeGen [14]. In our experiments, we choose the CodeGen series of models as our base models for further exploration.

Recently, some work has focused on selecting APIs during code generation. As discussed in Section II-A, existing code generation models still struggle to select appropriate APIs for a given context, especially for private or lesser-known APIs. Existing work [26, 27, 29] has proposed API-oriented code generation methods. They typically use a two-stage pipeline, where the first stage searches for or generates related APIs and the second stage uses them to generate code. We pursue this research line and propose to leverage pre-trained models and API search tools to automate API selection in coding practices. In comparison, our approach has two advantages: ❶ Our method shows strong generalization ability. By setting an appropriate API search tool, our method can quickly adapt to any API-related code generation scenario. ❷ Our method does not require multi-stage generation. Instead, we integrate the API search tool into the decoding process, making our approach more flexible and allowing the API selection process to be closer to the specific code fragment being generated.

B. Tool-Augmented Large Language Models

Recent research in language modeling has explored using external tools to supplement the knowledge stored in the model's weights [12]. These external tools can include other neural networks or even the language model itself, allowing for the composition of different pre-trained models across various modalities, such as the Socratic Model [28]. Alternatively, natural language knowledge can be retrieved from external sources, as demonstrated by WebGPT [13] and ReAct [24] through the use of search APIs. Other approaches, such as Toolformer [20] and ART [17], leverage a combination of search tools, question-answering tools, machine translation tools, calculators, and other tools to solve various NLP tasks. ChatGPT Plugins (https://openai.com/blog/chatgpt-plugins) further demonstrate the potential for language models to integrate with thousands to millions of tools. However, incorporating programming tools into code-related models has not been explored yet. Our paper addresses this gap by abstracting the process of human programmers selecting APIs into a programming tool that augments code generation models.
APIs into a programming tool that augments code generation [4] Zekai Chen, Mariann Micsinai Balan, and Kevin Brown.
models. 2023. Language Models are Few-shot Learners for Prog-
nostic Prediction. CoRR abs/2302.12692 (2023). https:
IX. CONCLUSION

In this paper, we propose ToolCoder, a novel approach incorporating API search tools into the code generation process to assist models in selecting appropriate APIs. We categorize API search tools into two types, online search engine tools and documentation search tools, and abstract them into a unified form. We propose an automatic dataset annotation method to add tool usage information to the source code data, and a parameter-efficient strategy is used to fine-tune the model. During inference, the model decoding process is enhanced with external API search tools for proper API selection. Experiments on public and private library code generation benchmarks show that our ToolCoder outperforms state-of-the-art methods, with at least a 6.21% improvement on average pass@1 metrics. Our experiments also demonstrate the potential of incorporating programming tools into the code generation process, shedding light on this line of future work.
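As one concrete instantiation of the parameter-efficient strategy mentioned above, LoRA [7] adapters can be attached to a causal code LM through the Hugging Face peft library. The checkpoint name and hyper-parameters below are placeholders for illustration, not our exact training configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint: any ~350M causal code LM can stand in for the base model here.
base = "Salesforce/codegen-350M-mono"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA [7]: train small low-rank adapters instead of updating the full weight matrices.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["qkv_proj"],  # attention projection name in CodeGen; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of the weights is trainable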
REFERENCES

[1] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow.

[2] Nghi Bui, Yue Wang, and Steven C. H. Hoi. 2022. Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, 812–823. https://aclanthology.org/2022.findings-emnlp.57

[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021). arXiv:2107.03374 https://arxiv.org/abs/2107.03374

[4] Zekai Chen, Mariann Micsinai Balan, and Kevin Brown. 2023. Language Models are Few-shot Learners for Prognostic Prediction. CoRR abs/2302.12692 (2023). https://doi.org/10.48550/arXiv.2302.12692 arXiv:2302.12692

[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423

[6] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A Generative Model for Code Infilling and Synthesis. CoRR abs/2204.05999 (2022). https://doi.org/10.48550/arXiv.2204.05999 arXiv:2204.05999

[7] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9

[8] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. CoRR abs/1909.09436 (2019). arXiv:1909.09436 http://arxiv.org/abs/1909.09436

[9] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (Eds.). Association for Computational Linguistics, 1643–1652. https://doi.org/10.18653/v1/d18-1192

[10] Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S. Liang. 2019. SPoC: Search-based pseudocode to code. Advances in Neural Information Processing Systems 32 (2019).

[11] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, Joaquin Vanschoren and Sai-Kit Yeung (Eds.).

[12] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. 2023. Augmented Language Models: a Survey. CoRR abs/2302.07842 (2023). https://doi.org/10.48550/arXiv.2302.07842 arXiv:2302.07842

[13] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. WebGPT: Browser-assisted question-answering with human feedback. CoRR abs/2112.09332 (2021). arXiv:2112.09332 https://arxiv.org/abs/2112.09332

[14] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. A Conversational Paradigm for Program Synthesis. CoRR abs/2203.13474 (2022). https://doi.org/10.48550/arXiv.2203.13474 arXiv:2203.13474

[15] Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 574–584.

[16] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. CoRR abs/2203.02155 (2022). https://doi.org/10.48550/arXiv.2203.02155 arXiv:2203.02155

[17] Bhargavi Paranjape, Scott M. Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Túlio Ribeiro. 2023. ART: Automatic multi-step reasoning and tool-use for large language models. CoRR abs/2303.09014 (2023). https://doi.org/10.48550/arXiv.2303.09014 arXiv:2303.09014

[18] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.

[19] Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3 (01 2009), 333–389. https://doi.org/10.1561/1500000019

[20] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. CoRR abs/2302.04761 (2023). https://doi.org/10.48550/arXiv.2302.04761 arXiv:2302.04761

[21] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.

[22] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6

[23] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In MAPS@PLDI 2022: 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA, 13 June 2022, Swarat Chaudhuri and Charles Sutton (Eds.). ACM, 1–10. https://doi.org/10.1145/3520312.3534862

[24] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. CoRR abs/2210.03629 (2022). https://doi.org/10.48550/arXiv.2210.03629 arXiv:2210.03629

[25] Pengcheng Yin and Graham Neubig. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. arXiv preprint arXiv:1810.02720 (2018).

[26] Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022. When Language Model Meets Private Library. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, 277–288. https://aclanthology.org/2022.findings-emnlp.21

[27] Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: Continual Pre-training on Sketches for Library-oriented Code Generation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, Luc De Raedt (Ed.). ijcai.org, 2369–2375. https://doi.org/10.24963/ijcai.2022/329

[28] Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael S. Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. 2022. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. CoRR abs/2204.00598 (2022). https://doi.org/10.48550/arXiv.2204.00598 arXiv:2204.00598

[29] Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao Jiang, and Graham Neubig. 2023. DocPrompting: Generating Code by Retrieving the Docs. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=ZTCxT2t2Ru