Few-shot training LLMs for project-specific code-summarization

Toufique Ahmed
University of California, Davis
Davis, California, USA
[email protected]

Premkumar Devanbu
University of California, Davis
Davis, California, USA
[email protected]

ABSTRACT
Very large language models (LLMs), such as GPT-3 and Codex, have achieved state-of-the-art performance on several natural-language tasks, and show great promise also for code. A particularly exciting aspect of LLMs is their knack for few-shot and zero-shot learning: they can learn to perform a task with very few examples. Few-shotting has particular synergies in software engineering, where there are a lot of project-specific phenomena. Developers introduce very localized identifier names, APIs, terminology, coding patterns, etc., to suit the needs of each project. These localized linguistic phenomena match the domain concepts, colloquialisms, algorithms, and data suitable to each domain and project, and help other developers read the code. These phenomena can also provide useful cues for machine learning models. However, project-specific data can be quite limited, especially early in the history of a project; thus the few-shot learning capacity of LLMs offers a very attractive option. In this paper, we investigate the use of few-shot training with the very large GPT (Generative Pre-trained Transformer) Codex model, and find evidence suggesting that one can significantly surpass state-of-the-art models for code-summarization, leveraging project-specific training.

KEYWORDS
deep learning, code summarization, large language model

ACM Reference Format:
Toufique Ahmed and Premkumar Devanbu. 2022. Few-shot training LLMs for project-specific code-summarization. In 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22), October 10–14, 2022, Rochester, MI, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3551349.3559555

1 INTRODUCTION
Very large language models (LLMs) are viewed as a revolutionary advance in natural language processing. Models such as GPT-3 [4], which have over 150 billion parameters, are trained using a simple, autoregressive, predict-the-next-token regime over enormous corpora. Codex [5], for example, is a similar 12-billion-parameter model trained on code. While such models certainly perform very well indeed at the task of prediction (e.g., for code completion), they are also quite good at other tasks, such as generating code from docstrings, and vice versa, after suitable fine-tuning [5].

One of the most exciting aspects of LLMs is zero-, one-, or few-shot training. In this line of work, the LLM is not subject to conventional fine-tuning (as is most typical with BERT, T5, RoBERTa, etc. [6, 16, 18]) using a sizeable number of on-task training examples (typically in the range of 100–100,000 examples); rather it is given a prefix, comprising just a handful of input-output pairs, and then is prompted with a query input (sans output). In this (highly sample-efficient) regime, LLMs are known to perform surprisingly well. Most remarkably, few-shot training does not require any weight adjustment whatsoever. Rather, the LLM leverages the information in the first part of the prompt to condition itself to perform the task reflected in the few examples. This works because the massive capacity (billions of parameters!) of the model allows it to condition its generative behaviour on the given prompt in extremely varied, subtle & flexible ways. An example two-shot training prompt, for the task of English-German translation, might be:

    The sentence "how are you?" in German
    is "wie geht es?". The sentence "See
    you later!" in German is "Bis Bald!".
    The sentence "How much is that apple?"
    in German is<submit>

If prompted with this, when one hits the submit button, GPT-3 responds "Wie viel kostet diese Apfel?", which is a good translation¹. Likewise, LLMs are known to be capable of few-shot learning on a wide range of tasks, including question-answering, natural language inference, summarization, etc. It should be noted that few-shot learning is very challenging indeed, and the aptitude of LLMs to learn to perform different tasks in this regime is quite phenomenal². Interestingly, few-shot learning has a peculiar and interesting salience for software engineering: for dealing with project-specific linguistic phenomena.

Each software project is designed to meet needs in some specific business or technical domain; in each domain, there are conventions that prescribe specific coding concepts, colloquialisms and idioms. Scientific applications, business applications, government-domain applications, all come with specialized terminology and concepts. These conventions (and associated vocabulary) are almost always directly adopted into software applications in the domain, and are used in all textual artifacts relating to the project: documentation, issue reporting, identifiers, etc. In addition, there are algorithms and data-structures that are specific to projects and domains, and these would be reflected in coding patterns that developers in that project will recognize. Most engineers experienced in a given domain are very well aware of this: different projects leverage different domain-specific concepts, and these are reflected in identifier naming, API calls, and coding patterns. But can we exploit this in machine learning applications in software engineering?

¹ Actual output from the GPT-3 showcase, obtained from the text-davinci-002 model, at https://beta.openai.com/playground
² See https://www.nytimes.com/2022/04/15/magazine/ai-language.html
It's been well-known right from the outset that language modeling for code has to deal with project-specific phenomena [11, 12, 23]. The sticking point here, however, is that project-specific data, especially early on in a project's history, may be quite limited in volume; older deep-learning models require O(10^4) or even O(10^5) samples that are specific to a project or domain to learn the local features. Even BERT-style foundation models require a lot of training examples. Such examples may be hard to find on a project-specific basis, even early in the history of a project. Even if enough examples exist, retraining a big model for each new project can be cumbersome, but also necessary (thanks to the "catastrophic forgetting" problem [8]). The few-shot learning capacity of very-large language models offers a work-around. These models can make do with just a handful of training examples; furthermore, retraining is not really cumbersome: one can just change the prompt. In addition, the very limited training requirement suggests that we might (in the future) localize to even just a file, or even just a method. We therefore believe that the few-shot setting has tremendous potential to be useful in project-specific settings in software engineering.

In this paper, we primarily focus on comment synthesis. This application has the advantage of being both quite useful and well-studied. There has been quite a bit of work investigating various kinds of models (RNNs, Transformers, Foundation Models, etc.), and there are good benchmarks available. We therefore use this problem as a test-bed to investigate the following questions.

(1) Does the few-shot learning capacity of large language models extend to the task of code summarization?
(2) Can this few-shot learning capacity be extended to same-project learning on this same task?
(3) How does the performance of LLMs in the above two settings compare with that of state-of-the-art models?

2 BACKGROUND AND RELATED WORK
Developers spend around 59% of their time comprehending or understanding others' work or their own prior work [25]. Good-quality comments can benefit developers by contributing to both the development and maintenance process [21]. Surprisingly, misaligned and outdated comments are very common in SE projects. Apart from helping write new comments, automated code summarization could potentially help update misaligned and outdated comments. This has motivated the study of automated code summarization tools.

Code summarization bears a strong resemblance to Neural Machine Translation (NMT) (e.g., translating English to German). Inspired by NMT, machine-learning researchers in the SE domain have adopted a neural encoder-decoder framework for code summarization tasks. The earliest work using RNN models [22], and the newest work based on foundation models [3], all leverage encoder-decoder models. However, the advent of very highly parametrized LLMs (with > 150 billion parameters) suggests a path away from encoder-decoder models, towards the use of decoder-only models (like Codex) for a task like code summarization.

Large language models (including Codex) have been applied to the code-summarization (sometimes called "docstring generation") task. Fried et al. [9] introduce a large language model, InCoder, and try zero-shot training on the CodeXGLUE Python dataset. They achieved impressive results; but fine-tuned models like CodeT5 [24], CodeBERT [7], and PLBART [1] can still outperform the zero-shot setting. Chen et al. [5] fine-tuned Codex on the code summarization task and proposed a new model, Codex-D. However, they used a very small human-evaluation dataset for Codex-D and didn't use BLEU-4, which is recommended by the CodeXGLUE benchmark. This work did not entirely clarify Codex-D's performance relative to other pre-trained models. None of the above works reported the performance of few-shot training or investigated the effectiveness of same-project few-shot training, as we do below.

3 METHODOLOGY
We present our approach to summarizing code in this section. We also discuss the dataset used for evaluation, and explain our design choices. Figure 1 presents our simple few-shot-based approach to producing code summaries using the Codex model. There are four major steps, as follows. In the following, we assume f_i, s_i refers to an indexed pair: the i-th function (code) and the i-th summary (natural-language text).

(1) We prepend n functions (cross-project / same-project), each followed by a comment, to the target function for which the model is to generate the comment. Thus the prompt is structured as f_1, s_1, f_2, s_2, ..., f_n, s_n, f_q, where the f_i, s_i pairs for i ≤ n constitute the "few shot" training examples, and f_q refers to the "query" function for which the model is to generate a summary s_q. Each comment has a starting and ending symbol (i.e., <s> & </s>). We finalize the input by appending a comment starting symbol (<s>) at the end of the target function. (A minimal sketch of this pipeline appears after the Dataset description below.)
(2) After that, we send the prompt to the Codex model.
(3) We receive the output from the model. The output may contain additional text after the comment, because we have to fix the output length before processing the input.
(4) Finally, we prepare the target comment using the comment ending symbol (</s>).

Dataset. We use the CodeXGLUE [17] code summarization benchmark. It should be noted that this dataset is unrelated to the Codex model. CodeXGLUE is originally adapted from the CodeSearchNet [13] dataset. It's multilingual, with data from six different languages (i.e., Ruby, JavaScript, Java, Go, PHP, Python). Quite a number of papers using foundation models [1, 2, 7, 10, 24] have been evaluated on this dataset for the code summarization task, so it constitutes a good benchmark. However, we could not assess the complete dataset because we only have limited access (20 requests/min) to the private beta version of the Codex model; at our university, we did not have the resources to replicate such a large model. However, we could try to get evidence relevant to our research question; we randomly chose just 1000 examples from the test sets of all six languages. To properly compare with other foundation models, we also report the performance of those models on the same collection of samples. We randomly chose ten samples from the training set for few-shot training with Codex. Note that CodeXGLUE is a properly deduplicated dataset and uses cross-project splits for the training, testing, and dev sets [20].
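To make steps (1)-(4) above concrete, the following is a minimal sketch of the prompt assembly and output clipping, written against the legacy openai Python client; the shot list, helper names, and example functions are hypothetical illustrations, not the authors' actual code. The decoding parameters match those reported under Design Choices below (temperature 0, top_p 1.0, 50 completion tokens).

    import openai  # legacy (pre-1.0) client, shown for illustration only

    # Hypothetical few-shot pool: (function, summary) pairs, drawn either from the
    # cross-project training set or from earlier commits of the same project.
    shots = [
        ("def add(a, b):\n    return a + b", "Add two numbers."),
        ("def is_empty(xs):\n    return len(xs) == 0", "Check whether a list is empty."),
        # ... ten such pairs in the paper's setting
    ]
    target_code = "def sub(a, b):\n    return a - b"

    # Step (1): each shot is the function followed by its comment wrapped in
    # <s> ... </s>; the query function comes last, followed by an opening <s>.
    prompt = ""
    for code, summary in shots:
        prompt += f"{code}\n<s> {summary} </s>\n"
    prompt += f"{target_code}\n<s>"

    # Steps (2)-(3): send the prompt to Codex and take the raw completion.
    response = openai.Completion.create(
        engine="code-davinci-002",
        prompt=prompt,
        temperature=0,
        top_p=1.0,
        max_tokens=50,
    )
    raw = response["choices"][0]["text"]

    # Step (4): clip everything after the comment-ending symbol.
    generated_summary = raw.split("</s>")[0].strip()
    print(generated_summary)

One could equivalently pass a stop sequence of "</s>" to the completion API and let the service truncate the output; the paper only states that the summary was clipped at </s> after generation.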

We also evaluated the Codex model on same-project few-shot training. We have earlier shown that the performance of deep learning models depends on the identifiers for the code summarization task [2]. Vocabularies of a project are highly local, and functions from the same project are likely to share the same set of identifiers [11, 23]. We chose four Python projects and four Java projects from the test set of CodeXGLUE. To have a fair comparison with the prior foundation models, we had to restrict ourselves to the test set of CodeXGLUE. After choosing the projects, we retrieved the creation date for each sample using "git blame --ignore-rev". We sorted the functions according to the creation date and ensured that only historical data was used for few-shot training, to prevent data leakage from future samples.

[Figure 1: Pipeline for generating comments. (a) Ten few-shot examples, each a function (code_1 ... code_10) followed by its comment wrapped in <s> ... </s>; (b) the target function followed by an opening <s>. Steps: i) concatenate (b) after (a); ii) feed the result to the LLM (Codex); iii) receive the output from the LLM (comment_target </s> possibly followed by extraneous code); iv) extract the target comment up to the closing </s>.]

Selecting number of few-shot samples. We use "code-davinci-002", the largest model in the Codex series; it can accommodate prompts up to 4000 tokens in length. Our access to the private beta version of the model enables few-shotting (fine-tuning with weight adjustment on the actual neural model is not yet possible, and is beyond the scope of this paper). Therefore, our few-shot training was limited to 4000 tokens. We found that we could safely fit 10-15 sequences in the prompt and ask the model to generate the comment for us. We tried 5, 10, and 15 samples for few-shot training on 1000 test samples from the CodeXGLUE Java code summarization dataset and achieved 19.76, 21.88, and 21.46 BLEU-4, respectively. We use 10 shots for the rest of this work, because it requires less time while giving the best performance. Also, note that using too much data for few-shot training or fine-tuning may cause catastrophic forgetting in the model [14]. We also discuss the performance of zero-shot and one-shot training in Section 4.4.

Design Choices. Several parameters need to be fixed to get output from Codex. Temperature is one of the crucial parameters; a higher temperature enables the model to take more risks. Following the recommendation of the OpenAI documentation, we set the temperature to 0 because we aimed for well-defined answers³. We also used the default value of 1.0 for top_p and a max_token count of 50. The majority of the summaries are less than 50 tokens long. However, the model does continue generating tokens even after completing the summary. We clipped the summary using the comment ending symbol (</s>). Note that several other parameters can be altered to generate more creative summaries. We were not able to fully explore hyper-parameter tuning due to API access limits.

³ https://beta.openai.com/docs/api-reference/completions/create

4 RESULTS
We present our performance data illustrating the use of cross-project and same-project few-shot training with the LLM Codex. Our results suggest that a) Codex's performance is quite impressive, in some cases substantially exceeding the baselines; b) Codex (with just a few examples from the same project) in some cases can go even further.

4.1 Cross-project few-shot
As mentioned earlier, CodeXGLUE is a cross-project dataset. To show the effectiveness of few-shot training, we randomly chose 10 samples from the CodeXGLUE training set for each language. We prepended these 10 samples to a chosen (query) sample from the test set, and asked the model to complete the resulting prompt. Following prior work, we use smoothed BLEU-4 [15] as the evaluation metric. We compared our approach with CodeBERT, GraphCodeBERT, CodeT5, and the PolyGlot versions of the CodeBERT and GraphCodeBERT models. Table 1 suggests that Codex, few-shotted for code summarization, can outperform competitive models. We observed more than +2 BLEU-4 improvement for JavaScript and Go. Roy et al. show that BLEU-4 improvements of more than +2 points are reasonable proxies for human-perceptible preference [19]. This result suggests that LLMs like Codex are really sample-efficient. All the baselines are fine-tuned with 24K-251K examples per language, whereas the LLM outperforms all of them with just 10 samples!

Observation 1. With 10 samples, Codex outperforms all fine-tuned foundation models (CodeT5, CodeBERT, GraphCodeBERT, PolyGlot CodeBERT, and PolyGlot GraphCodeBERT) in all six programming languages, even though the fine-tuned models are trained with thousands of examples.

4.2 Same-project few-shot
Our hypothesis is that same-project few-shotting will show benefits, since projects tend to follow a distinctive coding and documentation style. Our data (previous section) suggests that cross-project few-shot can surpass prior pre-trained models by a significant margin with only 10 samples. We will replace those 10 cross-project few-shot training samples with 10 samples from the same project (respecting time-series ordering, so as to avoid leakage between the training and test examples) and observe the performance. We believe that even with a few samples, the Codex model will be able to produce significant improvements to the output.
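To illustrate the time-ordered selection of same-project shots described above, here is a minimal sketch; the record layout, function name, and the choice of taking the most recent prior examples are assumptions made for illustration, since the paper specifies only that samples were sorted by creation date and restricted to historical data.

    from datetime import datetime
    from typing import List, Tuple

    # Hypothetical record: (creation_date, function_code, reference_summary),
    # with creation dates recovered via "git blame --ignore-rev" as in the paper.
    Record = Tuple[datetime, str, str]

    def select_same_project_shots(project_history: List[Record],
                                  target_date: datetime,
                                  n_shots: int = 10) -> List[Tuple[str, str]]:
        """Return up to n_shots (code, summary) pairs created strictly before the
        target function, so no future data leaks into the few-shot prompt."""
        earlier = [r for r in project_history if r[0] < target_date]
        earlier.sort(key=lambda r: r[0])  # oldest -> newest
        return [(code, summary) for _, code, summary in earlier[-n_shots:]]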

Language | CodeBERT | PolyGlot CodeBERT | GraphCodeBERT | PolyGlot GraphCodeBERT | CodeT5 | Codex | Improvement in % (CodeT5 to Codex) | p-value
Java     | 18.8  | 20.22 | 18.52 | 19.94 | 19.78 | 21.88 | 10.61% | <0.01
Python   | 17.73 | 18.19 | 17.35 | 18.33 | 19.98 | 20.76 | 3.94%  | 0.03
Ruby     | 12.61 | 14.64 | 12.6  | 14.9  | 15.33 | 16.95 | 10.52% | <0.01
JS       | 14.30 | 16.34 | 15.21 | 15.92 | 15.98 | 18.42 | 15.23% | <0.01
Go       | 18.5  | 19.18 | 18.71 | 19.3  | 19.91 | 22.65 | 13.73% | <0.01
PHP      | 25.88 | 26.46 | 25.97 | 26.54 | 26.32 | 26.63 | 1.17%  | 0.27
Average  | 17.97 | 19.17 | 18.06 | 19.16 | 19.55 | 21.22 | 8.52%  | <0.01

Scores are smoothed BLEU-4. The p-value is calculated with a pairwise two-sample Wilcoxon signed-rank test between CodeT5 and Codex.

Table 1: Comparison to existing models on the CodeXGLUE dataset

Language | Project | # of test samples | CodeBERT | PolyGlot CodeBERT | GraphCodeBERT | PolyGlot GraphCodeBERT | CodeT5 | Codex (cross-project) | Codex (same-project) | Improvement in % (cross-project to same-project) | p-value
Java   | wildfly/wildfly               | 431 | 17.56 | 19.04 | 17.18 | 18.41 | 18.22 | 19.28 | 19.65 | 1.92%  | 0.03
Java   | orientechnologies/orientdb    | 423 | 15.7  | 16.86 | 16.65 | 16.42 | 17.76 | 20.11 | 22.34 | 11.06% | 0.17
Java   | ngageoint/geopackage-android  | 260 | 31.17 | 31.27 | 33.27 | 29.94 | 29.99 | 26.97 | 39.46 | 46.31% | <0.01
Java   | RestComm/jain-slee            | 222 | 16.07 | 16.22 | 15.71 | 16.21 | 18    | 18.91 | 19.29 | 2.01%  | 0.08
Python | apache/airflow                | 530 | 17.95 | 17.61 | 17.51 | 17.85 | 18.85 | 22.23 | 23.03 | 3.60%  | 0.22
Python | tensorflow/probability       | 513 | 17.88 | 18.29 | 16.76 | 18.39 | 18.61 | 20.52 | 22.74 | 10.82% | <0.01
Python | h2oai/h2o-3                   | 254 | 15.65 | 15.92 | 14.44 | 14.94 | 17.07 | 18.98 | 19.65 | 3.48%  | 0.28
Python | chaoss/grimoirelab-perceval   | 222 | 26.51 | 25.77 | 25.8  | 27.37 | 24.61 | 26.95 | 28.82 | 6.94%  | 0.04
Average |                              |     | 19.81 | 20.12 | 19.67 | 19.94 | 20.39 | 21.74 | 24.37 | 12.09% | <0.01

Scores are smoothed BLEU-4. The p-value is calculated by performing a pairwise two-sample Wilcoxon signed-rank test between Codex (cross-project) and Codex (same-project).

Table 2: Effectiveness of same-project few-shot training for code summarization

Table 2 shows that we outperform all the models, even the Codex model with cross-project data, for all the projects under consideration. The performance went up from 21.65 BLEU-4 to 24.37 BLEU-4 (12.56% improvement) for the Codex models, which exhibits the effectiveness of few-shot training.

Observation 2. Same-project few-shot training improves the Codex model's performance for all 8 projects.

4.3 Testing statistical significance of improvements
We performed a one-sided pairwise Wilcoxon signed-rank test to see the impact of few-shot training in a large language model. We compare the CodeT5 model with Codex in the cross-project few-shot training setup because CodeT5 is the best-performing model among the pre-trained models. We compare the cross-project and same-project Codex output in the same-project setup because we are interested in how much few-shot training can improve the model's performance. For the cross-project setup, we observe 1%-15% improvement for all six programming languages (see Table 1). We also found substantial statistically significant improvement for four languages. Though we failed to find any significant improvement for Python and PHP, Codex few-shot training still outperforms the traditional fine-tuned pre-trained models with 10 samples. For same-project training, we found statistically significant improvement over cross-project Codex for 2 projects (Table 2), even though we improved for all 8 projects (2% to 46% improvement). However, for both settings, we observe overall statistically significant improvements.

Observation 3. Though we did not observe statistically significant results for all programming languages and all projects, we observe overall statistically significant improvements.

4.4 Zero-shot and one-shot training
Terms like zero-shot and one-shot training are getting popular with large language models. However, our data suggests that zero-shot does not work as well for tasks like code summarization. The Codex model works left to right and predicts future tokens only. With zero-shot training, the model is less capable at tasks it was not trained to do. For instance, the docstring usually appears before the code, and Codex is trained on GitHub data. So, the model may be able to generate code when prompted with a docstring, even without seeing any examples. This is not the case for code summarization, which has the reverse default ordering: here, the input to the model is the code, and the docstring is the output. We need a few samples to teach Codex to generate the docstring after the code. However, we did try both zero-shot and one-shot training with Codex and achieved only 2.96 and 6.22 BLEU-4 on average; we omit details due to this poor performance.

Observation 4. Zero-shot and one-shot training in Codex do not work for the code summarization task.
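As a concrete illustration of the evaluation and significance test described in Sections 4.1 and 4.3, the sketch below scores two systems with a smoothed sentence-level BLEU-4 and runs a one-sided paired Wilcoxon signed-rank test. NLTK's smoothing is used here as a stand-in for the CodeXGLUE smoothed-BLEU script (the exact smoothing may differ), scipy's wilcoxon as a stand-in for the authors' test implementation, and the example data are hypothetical; a real run would use the roughly 1000 test examples per language.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from scipy.stats import wilcoxon

    # Hypothetical per-example data: references and the outputs of two systems.
    references = ["returns the sum of two numbers",
                  "opens the given file for reading",
                  "closes the database connection",
                  "converts a string to lower case"]
    codet5_out = ["return sum of numbers", "open file",
                  "close connection", "convert string"]
    codex_out  = ["returns the sum of two numbers", "opens a file for reading",
                  "closes the database connection", "converts a string to lowercase"]

    smooth = SmoothingFunction().method4  # one of NLTK's smoothing variants

    def bleu4(ref: str, hyp: str) -> float:
        # Sentence-level BLEU-4 over whitespace tokens, with smoothing.
        return sentence_bleu([ref.split()], hyp.split(),
                             weights=(0.25, 0.25, 0.25, 0.25),
                             smoothing_function=smooth)

    codet5_scores = [bleu4(r, h) for r, h in zip(references, codet5_out)]
    codex_scores  = [bleu4(r, h) for r, h in zip(references, codex_out)]

    # One-sided paired test: are Codex's per-example scores greater than CodeT5's?
    stat, p_value = wilcoxon(codex_scores, codet5_scores, alternative="greater")
    print(f"W={stat:.3f}, p={p_value:.4f}")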

5 THREATS
Code summarization using Codex poses fewer direct safety & security threats than other problems such as code generation. Docstrings or comments are never executed as part of the program; however, they could lead to problems if they were to mislead programmers.

There is a risk that our test data has already been seen by Codex during its very large-scale pre-training; LLMs are pre-trained on enormous datasets. The training dataset was unavailable to us at the time, and so we couldn't account for this risk. However, there are a couple of observations that offer suggestive evidence that the model hasn't just previously memorized our test data: first, its performance in a zero- or one-shot setting is in most cases quite abysmal. Second, the performance does smoothly improve, as expected, in most cases up to around 10 training samples embedded in the prompt. This suggests that the model's conditioned generative ability improves with more training samples; the prior that the model internally computes and uses to condition its comment generation (p(comments | code)) is gradually improving with more training samples, suggesting that it is actually generalizing from the few shots, rather than just regurgitating an example it has seen before.

6 CONCLUSION
Large language models are gaining popularity and are getting even larger every few months. In this paper, we investigated the effectiveness of few-shot training for the code summarization task and found that, with just ten samples, it can significantly outperform a fine-tuned model trained with thousands of samples. This sample efficiency also opens the door to using same-project samples, which are known to share vocabulary and other critical internal properties of the project. We observed the impact of same-project few-shot training and found that few-shot Codex in the same-project setting performs better than in the cross-project setting, and the overall improvement is statistically significant. Applying same-project data is very promising and feasible, because ten samples for a task like summarization can be generated within a few hours of the development process. We believe that same-project few-shot training with LLMs can benefit other SE tasks as well. Finally, the code summarization dataset is made available anonymously at https://doi.org/10.5281/zenodo.6592064.

This work is supported by NSF CISE MEDIUM 2107592 and NSF CISE LARGE 1414172. Ahmed is also supported by the College of Engineering Dean's Distinguished Fellowship at UC Davis.

REFERENCES
[1] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 2655–2668. https://www.aclweb.org/anthology/2021.naacl-main.211
[2] Toufique Ahmed and Premkumar Devanbu. 2022. Multilingual training for software engineering. In Proceedings of the 44th International Conference on Software Engineering. 1443–1455.
[3] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258 (2021).
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[7] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 1536–1547.
[8] Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3, 4 (1999), 128–135.
[9] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A Generative Model for Code Infilling and Synthesis. arXiv preprint arXiv:2204.05999 (2022).
[10] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. GraphCodeBERT: Pre-training Code Representations with Data Flow. In International Conference on Learning Representations.
[11] Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling source code?. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 763–773.
[12] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE).
[13] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[14] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
[15] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[16] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[17] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. CoRR abs/2102.04664 (2021).
[18] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[19] Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. 2021. Reassessing automatic evaluation metrics for code summarization tasks. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1105–1116.
[20] Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, and Hongbin Sun. 2022. On the Evaluation of Neural Code Summarization. ICSE.
[21] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. 43–52.
[22] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[23] Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. 2014. On the localness of software. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 269–280.
[24] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708.
[25] Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E Hassan, and Shanping Li. 2017. Measuring program comprehension: A large-scale field study with professionals. IEEE Transactions on Software Engineering 44, 10 (2017), 951–976.
