Update 2021.07.30: We updated the code completion dataset with literals normalized to avoid exposing sensitive information.
Here is the introduction and pipeline for the token-level code completion task.
Predict the next code token given the context of previous tokens. Models are evaluated by token-level accuracy.
Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool could improve software developers' productivity. We provide code completion evaluation tasks at two granularities: token level and line level. Here we introduce token-level code completion. The token-level task is analogous to language modeling. Models should be able to predict the next token of arbitrary type.
We collect and provide two datasets for code completion: one in Python, the other in Java.
- python 3.7
- javalang == 0.13.0
We use the py150 dataset from Raychev's OOPSLA 2016 paper Probabilistic Model for Code with Decision Trees.
To download and preprocess the dataset, navigate to the dataset/py150 directory and run:
bash download_and_extract.sh
python preprocess.py --base_dir=py150_files --output_dir=token_completion
We use the Java corpus dataset mined by Allamanis and Sutton in their MSR 2013 paper Mining Source Code Repositories at Massive Scale using Language Modeling. We follow the same split and preprocessing as Karampatsis's ICSE 2020 paper Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code.
To download the preprocessed dataset, navigate to the dataset/javaCorpus directory and run:
bash download.sh
python preprocess.py --base_dir=token_completion --output_dir=token_completion
- Tokenization is applied since we focus on token-level completion.
- We normalize uncommon literals for a better user experience. Developers sometimes leave their names, IP addresses, or phone numbers in their code, and we don't want models to focus on these string or numeric literals, so we replace them with special tokens. Since frequently used literals may carry useful information, e.g. "__main__" or "utf-8", we preserve the 200 most frequent string literals and the 30 most frequent numeric literals. These are normalized to tokens in the "<STR_LIT:utf-8>" format, while uncommon literals are replaced by <STR_LIT> or <NUM_LIT>.
- We add <s> and </s> to mark the start and the end of one piece of code. <EOL> is added in the Python corpus to mark the end of a line, since Python has no ; or } to mark the end of a statement as Java does.
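The literal normalization described above can be sketched roughly as follows. This is a minimal illustration and not the repository's preprocess.py: the helper names `build_keep_set` and `normalize_literal` are hypothetical, and the `<NUM_LIT:...>` form for frequent numeric literals is an assumption made by analogy with the `<STR_LIT:utf-8>` format.

```python
from collections import Counter


def build_keep_set(literals, top_k):
    """Return the top_k most frequent literal values (hypothetical helper)."""
    return {value for value, _ in Counter(literals).most_common(top_k)}


def normalize_literal(value, kind, keep):
    """Map a literal to its special token.

    kind is "STR" or "NUM"; keep is the set of frequent literals to preserve.
    Frequent literals keep their value inside the token, others are masked.
    """
    if value in keep:
        return f"<{kind}_LIT:{value}>"
    return f"<{kind}_LIT>"


# Toy list of string literals. The dataset keeps the 200 most frequent strings
# and the 30 most frequent numbers; here only the top 2 are kept for brevity.
strings = ["utf-8", "__main__", "utf-8", "alice@example.com", "__main__", "utf-8"]
keep_str = build_keep_set(strings, top_k=2)

print(normalize_literal("utf-8", "STR", keep_str))
print(normalize_literal("alice@example.com", "STR", keep_str))
```

In this toy run the frequent literal keeps its value while the e-mail address collapses to the generic token, which is the privacy effect the normalization aims for.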
The code corpus is saved in txt files. Each line is one tokenized code snippet:
<s> from __future__ import unicode_literals <EOL> from django . db import models , migrations <EOL> class Migration ( migrations . Migration ) : <EOL> dependencies = [ <EOL> ] <EOL> operations = [ <EOL> migrations . CreateModel ( <EOL> name = '<STR_LIT>' , <EOL> fields = [ <EOL> ( '<STR_LIT:id>' , models . AutoField ( verbose_name = '<STR_LIT>' , serialize = False , auto_created = True , primary_key = True ) ) , <EOL> ( '<STR_LIT:name>' , models . CharField ( help_text = b'<STR_LIT>' , max_length = <NUM_LIT> ) ) , <EOL> ( '<STR_LIT:image>' , models . ImageField ( help_text = b'<STR_LIT>' , null = True , upload_to = b'<STR_LIT>' , blank = True ) ) , <EOL> ] , <EOL> options = { <EOL> '<STR_LIT>' : ( '<STR_LIT:name>' , ) , <EOL> '<STR_LIT>' : '<STR_LIT>' , <EOL> } , <EOL> bases = ( models . Model , ) , <EOL> ) , <EOL> ] </s>
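As a rough illustration of this format, the sketch below splits one corpus line back into per-line token lists using the <s>, </s>, and <EOL> markers. The function name `split_into_lines` is hypothetical and is not part of the repository; the sketch assumes tokens are separated by single spaces, as in the example above.

```python
def split_into_lines(snippet):
    """Split one tokenized snippet into source lines, dropping <s>, </s>, <EOL>."""
    tokens = snippet.split()
    # Strip the snippet boundary markers if they are present.
    if tokens and tokens[0] == "<s>":
        tokens = tokens[1:]
    if tokens and tokens[-1] == "</s>":
        tokens = tokens[:-1]

    lines, current = [], []
    for tok in tokens:
        if tok == "<EOL>":
            # <EOL> closes the current source line.
            lines.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        lines.append(current)
    return lines


example = "<s> import json <EOL> json . load ( f ) </s>"
for line in split_into_lines(example):
    print(" ".join(line))
```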
Data statistics of the py150 dataset are shown in the table below. Since the original py150 dataset has no dev set, we select 5,000 files from the original train set as the dev set.
| Data Split | #Files | #Tokens |
|---|---|---|
| Train | 95,000 | 72.1M |
| Dev | 5,000 | 4.4M |
| Test | 50,000 | 37.3M |
Data statistics of the GitHub Java Corpus dataset are shown in the table below:
| Data Split | #Files | #Tokens |
|---|---|---|
| Train | 12,934 | 15.7M |
| Dev | 7,176 | 3.8M |
| Test | 8,268 | 5.3M |
We provide a script to evaluate predictions for this task and report the accuracy score. You can run the script like this:
python evaluator/evaluator.py -a=evaluator/answers.txt -p=evaluator/predictions.txt
The output is:
Total 5315204 tokens, accuracy: 76.45
The answer file is in the same format as the preprocessed dev dataset file. A valid prediction file is expected to be a txt file with the same number of lines as the answer file, and each line should contain the same number of tokens (split by spaces) as the corresponding line in the answer file. Note that <s>, </s>, and <EOL> are not evaluated, so you don't need to worry about how to predict the first token; you can put any token you like in that position. For example, one line in the answer file is:
<s> import json <EOL> json . load ( f ) </s>
And the corresponding line in your prediction file may be:
. import numpy <EOL> json . dump ( open ) <EOL>
The accuracy on this line is 62.5%: of the 8 evaluated tokens (all except <s>, <EOL>, and </s>), 5 are predicted correctly.
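The scoring rule for a single line can be sketched as below. This is a simplified illustration of the rule as described here, not the code of evaluator/evaluator.py, which may handle further cases; the names `SKIPPED` and `line_counts` are hypothetical.

```python
# Tokens that appear in the answer file but are not scored.
SKIPPED = {"<s>", "</s>", "<EOL>"}


def line_counts(answer_line, prediction_line):
    """Return (correct, total) over the evaluated tokens of one line."""
    gold = answer_line.split()
    pred = prediction_line.split()
    if len(gold) != len(pred):
        raise ValueError(
            f"token count mismatch: {len(gold)} in answer, {len(pred)} in prediction"
        )
    correct = total = 0
    for g, p in zip(gold, pred):
        if g in SKIPPED:
            continue  # position is ignored, whatever was predicted there
        total += 1
        correct += int(g == p)
    return correct, total


answer = "<s> import json <EOL> json . load ( f ) </s>"
prediction = ". import numpy <EOL> json . dump ( open ) <EOL>"
correct, total = line_counts(answer, prediction)
print(f"{correct}/{total} = {100 * correct / total:.1f}%")  # 5/8 = 62.5%
```

Corpus-level accuracy would then be the sum of `correct` divided by the sum of `total` over all lines.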
We provide CodeGPT, a Transformer-based language model pre-trained on programming languages (PL). CodeGPT shares the same model architecture and training objective as GPT-2, consisting of 12 layers of Transformer decoders. We pre-train monolingual models on the Python and Java corpora from the CodeSearchNet dataset, which includes 1.1M Python functions and 1.6M Java methods. Each function or method in the training dataset consists of a function signature and a function body. Some functions also contain NL docstrings. The dataset statistics are shown below:
| Language | #Functions | #Tokens |
|---|---|---|
| Python | 1,144,977 | 119.0M |
| Java | 1,554,613 | 169.4M |
We release two CodeGPT models for each programming language. One model is pre-trained from scratch: its BPE (byte pair encoding) vocabulary is newly obtained on the code corpus and its parameters are randomly initialized. The other model is a domain-adaptive one, which uses the GPT-2 model as the starting point and is continually trained on the code corpus. The second model therefore has the same vocabulary as GPT-2 and inherits its natural language understanding ability, so it might perform better on natural-language-related tasks. We call the second model CodeGPT-adapted and regard it as the default one.
All the models are publicly available on the Hugging Face website. The model names are CodeGPT-small-py, CodeGPT-small-java, CodeGPT-small-py-adaptedGPT2, and CodeGPT-small-java-adaptedGPT2.
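A minimal sketch of loading one of these models with the transformers library might look like the following. The helper names `model_id` and `predict_next_token` are hypothetical. The `microsoft/` prefix appears in this README's fine-tuning script for the from-scratch models and is assumed here to apply to the adapted ones as well. Note that the model predicts BPE sub-tokens, so one predicted sub-token is not necessarily a whole code token.

```python
def model_id(lang, adapted=False):
    """Build a Hugging Face model identifier for lang in {"py", "java"}.

    The "microsoft/" prefix is an assumption for the adapted variants.
    """
    if lang not in ("py", "java"):
        raise ValueError("lang must be 'py' or 'java'")
    suffix = "-adaptedGPT2" if adapted else ""
    return f"microsoft/CodeGPT-small-{lang}{suffix}"


def predict_next_token(context, lang="py", adapted=True):
    """Return the greedy prediction for the sub-token that follows context."""
    # Imported lazily so model_id() works without torch/transformers installed.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    name = model_id(lang, adapted)
    tokenizer = GPT2Tokenizer.from_pretrained(name)  # downloads on first use
    model = GPT2LMHeadModel.from_pretrained(name)
    model.eval()

    input_ids = torch.tensor([tokenizer.encode(context)])
    with torch.no_grad():
        logits = model(input_ids)[0]  # shape: (1, sequence_length, vocab_size)
    next_id = int(logits[0, -1].argmax())
    return tokenizer.decode([next_id])


print(model_id("java"))
print(model_id("py", adapted=True))
```

Calling `predict_next_token("import json <EOL> json .")` would download the model weights and return its top-ranked continuation; the result depends on the checkpoint, so no output is shown here.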
- python 3.6 or 3.7
- torch>=1.4.0
- transformers>=2.5.0 and < 4.0.0
- fuzzywuzzy
To fine-tune CodeGPT on the javaCorpus dataset for code completion with multiple GPUs on a single machine, navigate to the code directory and run:
LANG=java # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=microsoft/CodeGPT-small-java # microsoft/CodeGPT-small-py for py150
LOGFILE=completion_javaCorpus.log
PER_NODE_GPU=YOUR_GPU_NUM # modify YOUR_GPU_NUM
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm.py \
--data_dir=$DATADIR \
--lit_file=$LITFILE \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=1024 \
--do_train \
--gpu_per_node $PER_NODE_GPU \
--learning_rate=8e-5 \
--weight_decay=0.01 \
--evaluate_during_training \
--per_gpu_train_batch_size=2 \
--per_gpu_eval_batch_size=4 \
--gradient_accumulation_steps=4 \
--num_train_epochs=5 \
--logging_steps=100 \
--save_steps=1000 \
--seed=42 \
--overwrite_output_dir \
--not_pretrain
We stop at 50,000 steps in the py150 experiment, which takes 25 hours, and at 2,000 steps on the Java dataset, which takes 2 hours. Both experiments run on 2 NVIDIA P100 GPUs.
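To relate the batch-size flags in the command above to the amount of data seen per optimizer step, a back-of-the-envelope calculation is sketched below. It assumes that run_lm.py follows the common convention in which the effective batch is the per-GPU batch size times the gradient accumulation steps times the number of GPUs, and that every sequence fills the full block size; neither is confirmed here, and `effective_batch` is a hypothetical helper.

```python
def effective_batch(per_gpu_batch, grad_accum_steps, num_gpus, block_size):
    """Return (sequences, tokens) processed per optimizer step.

    Assumes the usual data-parallel convention and fully packed sequences.
    """
    sequences = per_gpu_batch * grad_accum_steps * num_gpus
    return sequences, sequences * block_size


# Values taken from the fine-tuning command, with 2 GPUs as in the reported runs.
seqs, toks = effective_batch(
    per_gpu_batch=2, grad_accum_steps=4, num_gpus=2, block_size=1024
)
print(seqs, toks)  # 16 16384
```

If you train on a different number of GPUs, adjusting the accumulation steps so that this product stays the same is one way to keep the setup comparable.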
It's recommended to run evaluation on a single GPU. The predictions will be saved at $OUTPUTDIR/predictions.txt.
export CUDA_VISIBLE_DEVICES=0
LANG=java # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=../save/javaCorpus/checkpoint # directory of your saved model
LOGFILE=completion_javaCorpus_eval.log
python -u run_lm.py \
--data_dir=$DATADIR \
--lit_file=$LITFILE \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=1024 \
--do_eval \
--per_gpu_eval_batch_size=16 \
--logging_steps=100 \
--seed=42
Inference might take 60 minutes on the py150 dataset and 15 minutes on the Java corpus on a single NVIDIA P100.
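Before scoring a predictions file, it can help to check that it meets the format requirements described earlier (same number of lines, and the same number of space-separated tokens per line, as the answer file). The sketch below is one way to do that; `find_format_errors` is a hypothetical helper, not part of the repository, and it works on lists of lines so that you can feed it the contents of both files.

```python
def find_format_errors(answer_lines, prediction_lines):
    """Return a list of human-readable problems; empty if the format is valid."""
    errors = []
    if len(answer_lines) != len(prediction_lines):
        errors.append(
            f"line count differs: {len(answer_lines)} answers "
            f"vs {len(prediction_lines)} predictions"
        )
        return errors  # per-line checks are meaningless if lines are misaligned
    pairs = zip(answer_lines, prediction_lines)
    for number, (gold, pred) in enumerate(pairs, start=1):
        n_gold, n_pred = len(gold.split()), len(pred.split())
        if n_gold != n_pred:
            errors.append(f"line {number}: expected {n_gold} tokens, got {n_pred}")
    return errors


answers = ["<s> import json <EOL> json . load ( f ) </s>"]
good = [". import numpy <EOL> json . dump ( open ) <EOL>"]
bad = ["import numpy"]

print(find_format_errors(answers, good))
print(find_format_errors(answers, bad))
```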
Results on py150:
| Model | Accuracy |
|---|---|
| LSTM + BPE | 61.94 |
| Transformer (12L) | 74.48 |
| GPT-2 | 75.90 |
| CodeGPT | 76.58 |
| CodeGPT-adapted | 76.60 |
Results on javaCorpus:
| Model | Accuracy |
|---|---|
| LSTM + BPE | 58.92 |
| Transformer (12L) | 65.18 |
| GPT-2 | 75.40 |
| CodeGPT | 76.79 |
| CodeGPT-adapted | 77.73 |
If you use code completion datasets, please also cite the following papers in addition to our CodeXGLUE:
@article{raychev2016probabilistic,
title={Probabilistic Model for Code with Decision Trees},
author={Raychev, Veselin and Bielik, Pavol and Vechev, Martin},
journal={ACM SIGPLAN Notices},
pages={731--747},
year={2016},
publisher={ACM New York, NY, USA}
}
@inproceedings{allamanis2013mining,
title={Mining Source Code Repositories at Massive Scale using Language Modeling},
author={Allamanis, Miltiadis and Sutton, Charles},
booktitle={2013 10th Working Conference on Mining Software Repositories (MSR)},
pages={207--216},
year={2013},
organization={IEEE}
}