ToolCoder: Teach Code Generation Models to Use API Search Tools
Technology, MoE (Peking University)
Beijing, China
[email protected]
selecting the proper API during coding. When encountering an API usage scenario, programmers summarize their needs into a query and use existing API search tools, such as the Google search engine or documentation search tools for specific libraries, to search for suitable APIs. Then, according to the search results, programmers choose the proper API. This programming process, which uses tools to retrieve and determine API usage, improves programming efficiency and accuracy and reduces the risk of errors. It motivates us to investigate approaches that teach code generation models to use search tools to find suitable APIs.

In this paper, we propose ToolCoder, a low-cost and efficient solution that integrates API search tools into pre-trained code generation models, mimicking how programmers solve this problem. To help models learn to use tools, we propose an automated data annotation method that uses ChatGPT to add tool-usage information to source code data, and we use the annotated dataset to fine-tune code generation models. Specifically, our approach utilizes the in-context learning ability of large models to annotate a special tool-augmented dataset at a low cost. We employ parameter-efficient fine-tuning to improve training efficiency. During inference, we integrate API search tools into the decoding process of our model, allowing the model to learn to use external tools autonomously.
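For intuition, the lines below sketch the shape of a tool-augmented training sample. The <API>APISearch(query)->api</API> tags follow the annotated examples shown later in Fig. 3 and Fig. 5; the surrounding code fragment itself is only illustrative.

# Before annotation (original corpus code):
samples = multivariate_normal(mean, matrix, N)

# After annotation: ChatGPT inserts a natural-language search query and the
# chosen API, wrapped in <API>...</API> tags, immediately before the call site.
samples = <API>APISearch(Generates random samples from a multivariate normal distribution.)->multivariate_normal</API> multivariate_normal(mean, matrix, N)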
We extensively evaluate our proposed ToolCoder with pass rate metrics [3]. ❶ We evaluate ToolCoder on three public library benchmarks. Our model achieves significant improvements over state-of-the-art baselines, with gains of at least 10.11%, 3.26%, and 1.39% pass@1. Our relatively small ToolCoder is even comparable to one of the current best language models, GPT-3.5. ❷ We further evaluate our model on two private library benchmarks. By switching to the appropriate search tool, ToolCoder can be easily transferred to these private library scenarios and achieves stable improvement. Our model exhibits better generalization performance and yields at least a 6.21% improvement on the average pass@1 metric across all five benchmarks. ❸ We also conduct an ablation study to analyze the different settings in our experiments, including the dataset, training, and inference settings. The results prove the effectiveness of the different designs in our approach.

Our contributions in this paper can be summarized as follows:
• To the best of our knowledge, we are the first to incorporate a programming tool into code generation models. Our results highlight the importance of models' ability to use tools.
• We propose an automatic method for annotating datasets in software engineering. This low-cost and efficient annotation framework uses ChatGPT to annotate API datasets built from public source code, reducing the manual effort required to create annotated datasets. Our dataset construction method can also be easily transferred to other tasks.
• We propose ToolCoder, which incorporates the ability to use API search tools into pre-trained code generation models and improves performance on API-related code generation tasks. Our approach outperforms existing API-oriented baselines on multiple popular API-related code generation benchmarks.

II. MOTIVATING EXAMPLES

In this section, we examine the limitations of current code generation models in selecting a suitable API and how existing search tools can aid in API selection. By exploring these issues, we hope to provide context for our research and explain our motivation for proposing a new approach to address these challenges.

A. Limitations of current code generation models in selecting suitable APIs

Application Programming Interfaces (APIs) are essential to modern software development. APIs allow developers to integrate pre-existing code and services into their applications. Using APIs can significantly reduce the time and effort required to develop complex software systems.

However, selecting the proper API remains challenging for code generation models. Due to the proliferation of third-party libraries and their associated APIs, existing code generation models often struggle to choose the right API. Here we choose the popular model CodeGen-2B to test its performance on API selection. Figure 2 shows three failure examples of the generated code when selecting APIs. ❶ Case 1: The CodeGen-2B model has the potential risk of generating APIs that do not exist.

Case 3: Private library (BeatNum)
import beatnum as bn
num_str = bn.numstr([0,33,4444522])
# error: module 'beatnum' has no attribute 'numstr'
Fig. 2. Failure cases of the CodeGen-2B model in selecting APIs, including generating non-existing APIs on public libraries (Case 1), generating unqualified APIs (Case 2), and lack of API-related knowledge on private libraries (Case 3).
TABLE I
COMPARISONS OF TWO TYPES OF SEARCH TOOLS FOR API SELECTION.

                      Online Search Engine                           Documentation Search
Knowledge Resources   Programming Community or Tutorial Websites     Library Documentation
                      (StackOverFlow, datagy.io, etc.)

… websites such as StackOverFlow and datagy.io. They organize and summarize the API suggestions used for different problems. Formally, these online API suggestions are usually displayed in the form of programming experience sharing or question and answer. When other people encounter similar problems, search engines can use this information well.

Fig. 3. The pipeline of our approach ToolCoder. The pipeline has three main parts: (1) automatically annotating a tool-augmented dataset with ChatGPT (e.g., samples = <API>APISearch(Generates random samples from a multivariate normal distribution.)->multivariate_normal</API> multivariate_normal(mean, matrix, N)); (2) parameter-efficient fine-tuning of an existing pre-trained code generation model on the annotated dataset (low-rank adaptation: the pre-trained weights are frozen and only a few parameters are fine-tuned, so the model can be trained on a consumer-level GPU); and (3) inference of the fine-tuned model enhanced with API search tools (e.g., the query "Selects a single row of data from a DataFrame" returns pandas.DataFrame.iloc, yielding df.iloc[n][column_name]).
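To make part (3) of the pipeline concrete, the following is a minimal sketch of tool-augmented decoding written against stubbed helpers rather than our released code. It assumes model_step continues decoding until the snippet is finished or the model has just emitted the ")->"" marker of an APISearch call, and that search_api wraps either the online search engine or the documentation search tool and returns the first retrieved API name.

def generate_with_tool(model_step, search_api, prompt, max_tool_calls=5):
    # Decode until the model either finishes or pauses right after ")->",
    # i.e. it is waiting for the API search tool to fill in an API name.
    text = model_step(prompt)
    for _ in range(max_tool_calls):
        if not text.endswith(")->"):
            break  # no pending tool call: the snippet is complete
        # Extract the natural-language query of the pending APISearch call.
        query = text.rsplit("<API>APISearch(", 1)[-1][: -len(")->")]
        api_name = search_api(query)  # e.g. "pandas.DataFrame.iloc"
        # Splice the tool response into the context and resume decoding.
        text = model_step(text + api_name + "</API>")
    return text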
The pre-trained model has not seen the APIs in MonkeyEval and BeatNumEval, and online search resources cannot provide any API-related information, so API selection on these benchmarks relies only on the API search tool we built on the documentation of these private libraries.

C. Metrics

Following previous work, we use the pass rate metric pass@k [3] for performance evaluation and take advantage of the provided unit tests to determine the functional correctness of code solutions. For each problem, we submit k code solutions for evaluation. If any of the k code solutions passes all ground-truth test cases, the problem is considered solved. Then pass@k is the percentage of solved problems. In our experiments, we set k = {1, 10}.
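Read literally, this definition can be computed as in the short sketch below; the data layout (one list of booleans per problem, marking which sampled solutions pass all unit tests) is only an assumption for illustration.

def pass_at_k(per_problem_results, k):
    # per_problem_results: for each problem, a list of booleans, one per sampled
    # solution, indicating whether that solution passed all ground-truth tests.
    solved = sum(1 for passes in per_problem_results if any(passes[:k]))
    return 100.0 * solved / len(per_problem_results)

# Example: pass@1 and pass@10 over three problems with 10 samples each.
results = [[False] * 10, [True] + [False] * 9, [False, True] + [False] * 8]
print(pass_at_k(results, 1), pass_at_k(results, 10))  # ~33.3 and ~66.7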
D. Baselines

We select six series of recent code generation models as baselines, including one of the most powerful models, GPT-3.5. These models can be divided into two categories: general models and API-oriented models.

1) General Models: CodeT5 [2] is an encoder-decoder pre-trained model for code-related tasks. It uses an identifier-aware pre-training task and has achieved state-of-the-art results on many general code generation benchmarks. We use CodeT5-base with 220M parameters in our experiments. PyCodeGPT [27] is a decoder-only pre-trained code generation model with 110M parameters. It is initialized from GPT-Neo and continually pre-trained on a large-scale Python code corpus. CodeGen [14] is a series of decoder-only pre-trained code generation models with parameters varying from 350M to 16B. It casts code generation as a multi-turn conversation between a user and a system and has shown strong ability on a variety of complex code generation tasks. Due to computational limitations, we use the 350M and 2B versions in our experiments. GPT-3.5 [4, 16] is one of the most powerful generation models from OpenAI. We use the "gpt-3.5-turbo" model as it is the most cost-effective and performant model in the GPT-3.5 family; as OpenAI states, it has flexible natural language and programming language capabilities.6

2) API-oriented models: CERT [27] is a generation approach designed for API-related code. CERT contains two modules, a sketcher and a generator, each of which is fine-tuned independently from PyCodeGPT. It first predicts a sketch based on the NL description and then generates the complete code based on the sketch. For each library, CERT requires specially trained weights, so we use the released weights as two independent models: CERT-numpy and CERT-pandas. CodeGenAPI [26] is another API-oriented code generation model. It uses a two-stage pipeline to generate code: given an NL description, CodeGenAPI first uses a retriever model initialized with BERT [5] to find APIs from documents, and then uses a generator initialized with CodeGen-350M to generate the complete code based on the retrieved APIs and the problem description. We use the three released settings from their paper: CodeGenAPI, CodeGen-retrieval, and CodeGenAPI-retrieval. The first setting only uses the trained generator without retrieval, and the latter two use the best-performing top-2 retrieval results to assist generation.

E. Implementation Details

Training. Our model is implemented in the PyTorch framework, and we perform all experiments on four RTX 2080 (11GB) GPUs. We initialize ToolCoder with the pre-trained weights of CodeGen-350M and CodeGen-2B. The training batch size is set to 8, and the total number of training epochs is set to 10. We use the validation loss to determine the best checkpoint as the final model.

Tool. For the API search tool, we adopt in-site online search over datagy.io as well as the NumPy,7 Pandas,8 and TorchData9 websites using DuckDuckGo for the public library benchmarks. For the private library benchmarks, we use the provided Monkey and BeatNum library documentation to build an API search tool based on the BM25 algorithm. The tool's response used during inference is the first retrieved API.
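As an illustration of the documentation search tool, the sketch below ranks documentation entries with BM25 and returns the top hit. It assumes the third-party rank_bm25 package, and the two toy documentation entries (adapted loosely from the BeatNum example later in Fig. 5) stand in for the full library documentation; none of this is our exact implementation.

from rank_bm25 import BM25Okapi  # third-party BM25 implementation (assumed available)

# Toy documentation index: API name -> first line of its documentation.
DOC_INDEX = {
    "bn.find_sorted": "find the indices into a sorted numset such that the order would be preserved",
    "bn.numset": "create a beatnum numset from a python list",
}

apis = list(DOC_INDEX)
corpus = [DOC_INDEX[name].lower().split() for name in apis]
bm25 = BM25Okapi(corpus)

def doc_search(query: str) -> str:
    """Return the first retrieved API for a natural-language query."""
    scores = bm25.get_scores(query.lower().split())
    best = max(range(len(apis)), key=lambda i: scores[i])
    return apis[best]

# e.g. doc_search("Find indices where elements should be inserted to maintain order")
# -> "bn.find_sorted"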
Inference. During generation, we use temperature sampling with T = 0.8 and limit the sample budget to 10. Each experiment is run three times with different random seeds, and the results are averaged.

6 https://platform.openai.com/docs/models/gpt-3-5
7 https://numpy.org/doc/
8 https://pandas.pydata.org/docs/
9 https://pytorch.org/data/

TABLE III
PASS RATE OF MODELS ON PUBLIC LIBRARY BENCHMARKS

                               NumpyEval          PandasEval         TorchDataEval
Model                  Para.   pass@1  pass@10    pass@1  pass@10    pass@1  pass@10
General Models
CodeT5                 220M    0       0.1        0       0          0       0
PyCodeGPT              110M    18.04   38.61      12.75   37.62      3.80    14.00
CodeGen-350M           350M    18.51   43.56      16.73   29.70      4.60    14.00
CodeGen-2B             2B      29.10   53.46      30.69   42.57      7.00    18.00
GPT-3.5                -       58.41   66.21      30.09   33.16      6.00    24.00
API-oriented Models
CERT-numpy             220M    31.47   46.42      16.03   27.72      2.20    14.00
CERT-pandas            220M    18.81   33.66      28.42   48.04      2.80    6.00
CodeGenAPI             350M    16.55   29.48      13.58   34.95      7.19    16.93
CodeGenAPI-retrieval   475M    12.67   27.32      11.25   28.61      10.41   23.50
CodeGen-retrieval      475M    18.30   35.12      9.54    29.02      7.52    16.36
Ours
ToolCoder-OnlineTool   350M    35.64   50.50      22.77   37.62      7.40    20.00
ToolCoder-OnlineTool   2B      41.58   55.44      31.68   47.52      11.80   24.00

VI. RESULTS AND ANALYSES

A. RQ1: Results for Public Library API Code Generation

To answer RQ1, we evaluate the baselines and our ToolCoder on NumpyEval, PandasEval, and TorchDataEval; the results are shown in Table III. ToolCoder-OnlineTool denotes our model equipped with the online search engine tool during inference.
TABLE IV
PASS RATE OF MODELS ON PRIVATE LIBRARY BENCHMARKS

                       MonkeyEval          BeatNumEval
Model          Para.   pass@1  pass@10     pass@1  pass@10
General Models
CodeT5         220M    0       0           0       0
CodeGen-350M   350M    0.95    4.90        5.15    11.96
CodeGen-2B     2B      1.59    5.94        5.94    11.88
GPT-3.5        -       2.47    8.91        6.68    17.82

TABLE V
ABLATION STUDIES ON DATASET SETTINGS. WE CONDUCT EXPERIMENTS ON TOOLCODER-350M.

                        NumpyEval          PandasEval         TorchDataEval
Dataset Setting         pass@1  pass@10    pass@1  pass@10    pass@1  pass@10
ToolCoder-350M          35.64   50.50      22.77   37.62      7.40    20.00
original dataset        19.40   39.60      19.92   38.61      6.00    14.00
annotation w/o query    14.05   43.56      11.68   33.66      3.80    6.00

TABLE VI
ABLATION STUDIES ON TRAINING SETTINGS (TOOLCODER-350M)

                 Time   Trained Para.   NumpyEval          PandasEval         TorchDataEval
                                        pass@1  pass@10    pass@1  pass@10    pass@1  pass@10
ToolCoder-350M   6h     0.65M           35.64   50.50      22.77   37.62      7.40    20.00
full-training    29h    350M            35.35   58.41      22.67   40.59      6.00    22.00

TABLE VII
ABLATION STUDIES ON INFERENCE SETTINGS

                    NumpyEval          PandasEval         TorchDataEval
                    pass@1  pass@10    pass@1  pass@10    pass@1  pass@10
OnlineTool-350M     35.64   50.50      22.77   37.62      7.40    20.00
NoTool-350M         33.76   46.53      20.19   35.64      6.00    16.00
OnlineTool-2B       41.58   55.44      31.68   47.52      11.80   24.00
NoTool-2B           38.71   54.45      31.38   44.55      7.50    20.00
… code dataset cannot help the model learn to select APIs. We compare CodeGen-350M with the model trained on the original dataset: results show that additional training on the code dataset does not significantly improve the model's performance. The key to our improvement is annotating API tool calls into the code dataset to teach the model to use external API search tools.

2) Training Setting: We perform ablation experiments with ToolCoder-350M on the training setting in Table VI. Our experiments compare two approaches: full parameter training, referred to as full-training, and our proposed method, which utilizes LoRA for parameter-efficient training. We evaluate their performance on the public library benchmarks and record their training costs, including training time and trainable parameters, using two RTX 2080 GPUs.

Results show that our fine-tuning strategy has almost no performance penalty compared with regular full-training. On the public library benchmarks, the difference between the two pass@1 results is within 0.4%. This gap is acceptable considering the huge savings in training costs. In our experiment settings, the parameter-efficient fine-tuning strategy reduces the training time from 29h to 6h and the trainable parameters from more than 350M to 0.65M. We only need to train 0.18% of the parameters in CodeGen-350M and 0.09% in CodeGen-2B, which makes it possible to efficiently fine-tune models on a consumer-level GPU such as an Nvidia GeForce RTX 2080 (11GB RAM).
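For reference, a minimal sketch of this parameter-efficient setup with the HuggingFace peft library is shown below; the rank, scaling, dropout, and target module names are illustrative assumptions rather than our exact configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

# Freeze the pre-trained weights and inject small low-rank adapters.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                           # rank of the update matrices (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["qkv_proj"],   # attention projection name in CodeGen (assumed)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # at this scale, well under 1% of parameters are trainable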
3) Inference Setting: We perform ablation experiments on the inference setting in Table VII. We add experiments that disable the tool in our model: NoTool means that we disable the tool during inference and use our trained model to directly generate an API based on the search query and then complete the code. We compare with our original inference setting on the public and private library benchmarks.

The experiments show that our external tools are essential for improving performance. On the public library benchmarks, the online search engine tool improves pass@1 by 1.88%, 2.57%, and 0.4% for ToolCoder-350M, and by 2.87%, 0.29%, and 4.3% for ToolCoder-2B. The online search engine tool can search for similar API usage scenarios and provide accurate API suggestions. On the private library benchmarks, the improvement is more significant. We find the model itself works poorly on private libraries; however, with the assistance of the documentation search tool, our model can choose suitable private library APIs.

Another interesting observation is that NoTool also achieves relatively good performance on the public library benchmarks. We believe this improvement comes from our dataset annotation process. The additional tool call in the dataset can be seen as a way to think about and choose the API. This chain of thought in the annotated dataset helps the code generation model better understand the functions and application scenarios of different APIs, thus directly improving its API selection. However, for private libraries, whose knowledge has not been seen by the code generation model, this form of dataset annotation can hardly bring improvements by itself. With proper API search tools, our ToolCoder can select APIs more accurately and improve further.

D. RQ4: Qualitative Analysis

To answer RQ4, we perform a case study to assess the quality of the generated code. Figure 5 presents code snippets generated on the public and private library benchmarks. From the examples, we obtain the following findings: ❶ The generated search query provides more fine-grained technical planning for the solution. The NumpyEval case requires summing values in a dataframe, and the generated query breaks down the requirements, focusing first on summing arrays. It fills the gap between requirements and concrete APIs. ❷ The responses of the search tools play a crucial role in the generated code. The online search engine tool finds the proper API from the correct websites, and the documentation search tool finds the proper API by searching over the API comments. ❸ Our ToolCoder can also make necessary modifications based on the tool response. For example, the online search tool returns the response cumsum, which is not directly defined in the input code.
Our ToolCoder can add the components that are not in the response and generate the correct API np.cumsum.

Input (NumpyEval/99):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [5, 6, 7], 'B': [7, 8, 9]})
# What's the best way to sum all values in a Pandas dataframe?
# the result is a numeric value
Output:
sum_value = <API>APISearch(Calculate the sum of the values in the given array.)->
🔍 Calculate the sum of the values in the given array
✅ https://datagy.io/numpy-cumsum/ -> cumsum
sum_value = <API>APISearch(Calculate the sum of the values in the given array.)->cumsum</API>
np.cumsum(df.values)[-1]

Input (BeatNumEval/92):
import beatnum as bn
master = bn.numset([1,2,3,4,5])
search = bn.numset([4,2,2,3])
# Find indices of a list of values in a beatnum numset
Output:
out = <API>APISearch(Find indices where elements should be inserted to maintain order.)->
🔍 Find indices where elements should be inserted to maintain order
✅ BeatNum documentation: find_sorted: find the indices into a sorted numset a such that if the corresponding elements in v were inserted before the indices the order of a would be preserved
out = <API>APISearch(Find indices where elements should be inserted to maintain order.)->bn.find_sorted</API>
bn.find_sorted(master, search)

Fig. 5. Case studies of ToolCoder-2B, with the online search engine tool on NumpyEval and the documentation search tool on BeatNumEval.

VII. THREATS TO VALIDITY

Threats to internal validity concern the roles of the model architecture and the hyper-parameter settings. In our experiments, we perform a small-range grid search over the learning rate and batch size. Our ToolCoder-350M model keeps the hyper-parameters the same as the baseline models for a fair comparison.

Threats to external validity are mainly related to the tasks and datasets chosen in this paper. We counter this by evaluating our model on five different benchmarks covering two types of APIs, including public and private library API code generation.

Threats to construct validity concern the evaluation metrics used in this work. We use pass rates to accurately evaluate the correctness of the generated code. This metric is adequate for the corresponding tasks and has been adopted by many previous studies.

VIII. RELATED WORK

A. Code Generation

Code generation aims to generate source code that satisfies a given natural language description or requirement. It involves automatically creating source code based on functional requirements, such as natural language descriptions [9] or pseudo-code algorithms [10, 15, 25]. Recently, pre-trained language models have shown impressive capabilities in code generation tasks. Lu et al. [11] adapt the GPT-2 [18] model to source code, resulting in CodeGPT. Chen et al. [3] fine-tune GPT-3 [4] models on code to produce Codex and GitHub Copilot. OpenAI also produces the GPT-3.5 series of models, which show strong generation capabilities in both natural language and programming languages. Neither Codex nor GPT-3.5 is open-sourced, which has led to several attempts to replicate Codex in industry and academia, resulting in GPT-Neo [1], GPT-J [21], CodeParrot [22], PolyCoder [23], PyCodeGPT [27], InCoder [6], and CodeGen [14]. In our experiments, we choose the CodeGen series of models as our base models for further exploration.

Recently, some work has focused on selecting APIs during code generation. As discussed in Section II-A, existing code generation models still struggle to select appropriate APIs for a given context, especially private or lesser-known APIs. Existing work [26, 27, 29] has proposed API-oriented code generation methods. They typically use a two-stage pipeline, where the first stage searches for or generates related APIs and the second stage uses them to generate code. We pursue this research line and propose to leverage pre-trained models and API search tools to automate API selection in coding practice. In comparison, our approach has two advantages: ❶ Our method shows strong generalization ability; by setting an appropriate API search tool, it can quickly adapt to any API-related code generation scenario. ❷ Our method does not require multi-stage generation. Instead, we integrate the API search tool into the decoding process, making our approach more flexible and allowing the API selection process to be closer to the specific code fragment being generated.

B. Tool-Augmented Large Language Models

Recent research in language modeling has explored using external tools to supplement the knowledge stored in the model's weights [12]. These external tools can include other neural networks or even the language model itself, allowing for the composition of different pre-trained models across various modalities, such as the Socratic Models [28]. Alternatively, natural language knowledge can be retrieved from external sources, as demonstrated by WebGPT [13] and ReAct [24] through the use of search APIs. Other approaches, such as Toolformer [20] and ART [17], leverage a combination of search tools, question-answering tools, machine translation tools, calculators, and other tools to solve various NLP tasks. ChatGPT Plugins10 further demonstrate the potential for language models to integrate with thousands to millions of tools. However, incorporating programming tools into code-related models has not been explored yet. Our paper addresses this gap by abstracting the process of human programmers selecting APIs into a programming tool that augments code generation models.

10 https://openai.com/blog/chatgpt-plugins
IX. CONCLUSION

In this paper, we propose ToolCoder, a novel approach incorporating API search tools into the code generation process to assist models in selecting appropriate APIs. We categorize API search tools into two types, online search engine tools and documentation search tools, and abstract them into a unified form. We propose an automatic dataset annotation method to add tool usage information to the source code data. A parameter-efficient strategy is used to fine-tune the model. During inference, the model decoding process is enhanced with external API search tools for proper API selection. Experiments on public and private library code generation benchmarks show that our ToolCoder outperforms state-of-the-art methods, with at least a 6.21% improvement on the average pass@1 metric. Our experiments also demonstrate the potential of incorporating programming tools into the code generation process, shedding light on this line of future work.

REFERENCES

[1] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. (2021).
[2] Nghi Bui, Yue Wang, and Steven C. H. Hoi. 2022. Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, 812–823. https://aclanthology.org/2022.findings-emnlp.57
[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021). https://arxiv.org/abs/2107.03374
[4] Zekai Chen, Mariann Micsinai Balan, and Kevin Brown. 2023. Language Models are Few-shot Learners for Prognostic Prediction. CoRR abs/2302.12692 (2023). https://doi.org/10.48550/arXiv.2302.12692
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423
[6] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A Generative Model for Code Infilling and Synthesis. CoRR abs/2204.05999 (2022). https://doi.org/10.48550/arXiv.2204.05999
[7] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9
[8] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. CoRR abs/1909.09436 (2019). http://arxiv.org/abs/1909.09436
[9] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (Eds.). Association for Computational Linguistics, 1643–1652. https://doi.org/10.18653/v1/d18-1192
[10] Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. 2019. SPoC: Search-based pseudocode to code. Advances in Neural Information Processing Systems 32 (2019).
[11] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, Joaquin Vanschoren and Sai-Kit Yeung (Eds.).
[12] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. 2023. Augmented Language Models: a Survey. CoRR abs/2302.07842 (2023). https://doi.org/10.48550/arXiv.2302.07842
[13] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. WebGPT: Browser-assisted question-answering with human feedback. CoRR abs/2112.09332 (2021). https://arxiv.org/abs/2112.09332
[14] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. A Conversational Paradigm for Program Synthesis. CoRR abs/2203.13474 (2022). https://doi.org/10.48550/arXiv.2203.13474
[15] Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 574–584.
[16] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. CoRR abs/2203.02155 (2022). https://doi.org/10.48550/arXiv.2203.02155
[17] Bhargavi Paranjape, Scott M. Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Túlio Ribeiro. 2023. ART: Automatic multi-step reasoning and tool-use for large language models. CoRR abs/2303.09014 (2023). https://doi.org/10.48550/arXiv.2303.09014
[18] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
[19] Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3 (2009), 333–389. https://doi.org/10.1561/1500000019
[20] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. CoRR abs/2302.04761 (2023). https://doi.org/10.48550/arXiv.2302.04761
[21] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax
[22] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
[23] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In MAPS@PLDI 2022: 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA, 13 June 2022, Swarat Chaudhuri and Charles Sutton (Eds.). ACM, 1–10. https://doi.org/10.1145/3520312.3534862
[24] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. CoRR abs/2210.03629 (2022). https://doi.org/10.48550/arXiv.2210.03629
[25] Pengcheng Yin and Graham Neubig. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. arXiv preprint arXiv:1810.02720 (2018).
[26] Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022. When Language Model Meets Private Library. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, 277–288. https://aclanthology.org/2022.findings-emnlp.21
[27] Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: Continual Pre-training on Sketches for Library-oriented Code Generation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, Luc De Raedt (Ed.). ijcai.org, 2369–2375. https://doi.org/10.24963/ijcai.2022/329
[28] Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael S. Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. 2022. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. CoRR abs/2204.00598 (2022). https://doi.org/10.48550/arXiv.2204.00598
[29] Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao Jiang, and Graham Neubig. 2023. DocPrompting: Generating Code by Retrieving the Docs. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=ZTCxT2t2Ru