CodeXGLUE -- Code Completion (token level)

Update 2021.07.30: We updated the code completion dataset with literals normalized to avoid exposing sensitive information.

This page introduces the token-level code completion task and its pipeline.

Task Definition

Predict the next code token given the context of previous tokens. Models are evaluated by token-level accuracy.

Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool can improve software developers' productivity. We provide code completion evaluation tasks at two granularities: token level and line level. Here we introduce token-level code completion. The token-level task is analogous to language modeling: models should be able to predict the next token of arbitrary type.
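
To make the language-modeling analogy concrete, the sketch below lists the (context, target) pairs that one tokenized snippet yields. The helper name `next_token_examples` is illustrative and not part of this repository.

```python
from typing import Iterator, List, Tuple


def next_token_examples(tokens: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (context, target) pairs: every prefix predicts the token after it."""
    for i in range(1, len(tokens)):
        yield tokens[:i], tokens[i]


snippet = "<s> import json <EOL> json . load ( f ) </s>".split()
for context, target in next_token_examples(snippet):
    print(" ".join(context), "->", target)
```

A sequence of n tokens gives n - 1 prediction targets, one for every position after the first.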

Dataset

We collect and provide two datasets for code completion: one in Python, the other in Java.

Dependency

python 3.7
javalang == 0.13.0

py150 dataset

We use the py150 dataset from Raychev et al.'s OOPSLA 2016 paper Probabilistic Model for Code with Decision Trees.

To download and preprocess the dataset, navigate to the dataset/py150 directory and run:

bash download_and_extract.sh
python preprocess.py --base_dir=py150_files --output_dir=token_completion

Github Java Corpus

We use the Java corpus mined by Allamanis and Sutton in their MSR 2013 paper Mining Source Code Repositories at Massive Scale using Language Modeling. We follow the same split and preprocessing as Karampatsis et al.'s ICSE 2020 paper Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code.

To download the preprocessed dataset, navigate to the dataset/javaCorpus directory and run:

bash download.sh
python preprocess.py --base_dir=token_completion --output_dir=token_completion

Data Preprocessing

Tokenization is applied, since we focus on token-level completion.
We normalize uncommon literals for a better user experience. Developers sometimes leave their names, IP addresses, or phone numbers in their code, and we don't want models to focus on these string or numeric literals, so we replace them with special tokens. Because frequently used literals may carry useful information, e.g. "__main__" or "utf-8", we preserve the 200 most frequent string literals and the 30 most frequent numeric literals. These are normalized to tokens in the "<STR_LIT:utf-8>" format, while uncommon literals are replaced by <STR_LIT> or <NUM_LIT>.
We add <s> and </s> to mark the start and the end of each piece of code.
<EOL> is added in the Python corpus to mark the end of a line, since Python has no ; or } to mark the end of a statement as Java does.
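
The steps above can be sketched for Python source with the standard `tokenize` module. This is a minimal illustration, not the repository's preprocess.py: the sets `COMMON_STR` and `COMMON_NUM` are hypothetical stand-ins for the frequent-literal lists, and the `<NUM_LIT:1>` form for frequent numbers is assumed by analogy with the string format.

```python
import io
import tokenize

# Hypothetical stand-ins for the frequent-literal lists, which in practice
# would be derived from corpus statistics.
COMMON_STR = {"__main__", "utf-8"}
COMMON_NUM = {"0", "1"}
SKIP = {tokenize.COMMENT, tokenize.INDENT, tokenize.DEDENT,
        tokenize.ENCODING, tokenize.ENDMARKER}


def normalize_string(literal: str) -> str:
    """Replace the body of a string literal, keeping its prefix and quotes."""
    start = min(i for i, ch in enumerate(literal) if ch in "'\"")
    prefix, rest = literal[:start], literal[start:]
    quote = rest[:3] if rest[:3] in ("'''", '"""') else rest[0]
    body = rest[len(quote):-len(quote)]
    tag = f"<STR_LIT:{body}>" if body in COMMON_STR else "<STR_LIT>"
    return f"{prefix}{quote}{tag}{quote}"


def normalize(source: str) -> str:
    """Tokenize Python source and emit one normalized, space-separated line."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            if out and out[-1] != "<EOL>":  # collapse consecutive line breaks
                out.append("<EOL>")
        elif tok.type == tokenize.STRING:
            out.append(normalize_string(tok.string))
        elif tok.type == tokenize.NUMBER:
            out.append(f"<NUM_LIT:{tok.string}>" if tok.string in COMMON_NUM
                       else "<NUM_LIT>")
        else:
            out.append(tok.string)
    if out and out[-1] == "<EOL>":
        out.pop()  # no line marker right before </s>
    return " ".join(["<s>", *out, "</s>"])


print(normalize('import json\nx = "utf-8"\nip = "10.0.0.7"\nport = 8080\n'))
```

The sketch leaves out escaping of preserved string bodies that contain spaces, which a real pipeline would need to keep exactly one token per space-separated field.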

Data Format

Code corpora are saved as txt files. Each line is one tokenized code snippet:

<s> from __future__ import unicode_literals <EOL> from django . db import models , migrations <EOL> class Migration ( migrations . Migration ) : <EOL> dependencies = [ <EOL> ] <EOL> operations = [ <EOL> migrations . CreateModel ( <EOL> name = '<STR_LIT>' , <EOL> fields = [ <EOL> ( '<STR_LIT:id>' , models . AutoField ( verbose_name = '<STR_LIT>' , serialize = False , auto_created = True , primary_key = True ) ) , <EOL> ( '<STR_LIT:name>' , models . CharField ( help_text = b'<STR_LIT>' , max_length = <NUM_LIT> ) ) , <EOL> ( '<STR_LIT:image>' , models . ImageField ( help_text = b'<STR_LIT>' , null = True , upload_to = b'<STR_LIT>' , blank = True ) ) , <EOL> ] , <EOL> options = { <EOL> '<STR_LIT>' : ( '<STR_LIT:name>' , ) , <EOL> '<STR_LIT>' : '<STR_LIT>' , <EOL> } , <EOL> bases = ( models . Model , ) , <EOL> ) , <EOL> ] </s>
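
A line in this format can be turned back into per-line token lists by dropping <s> and </s> and splitting on <EOL>. The helper below is a small sketch; its name `split_lines` is illustrative and not part of this repository.

```python
from typing import List


def split_lines(record: str) -> List[List[str]]:
    """Split one corpus line into source lines, dropping <s> and </s>."""
    tokens = [t for t in record.split() if t not in ("<s>", "</s>")]
    lines, current = [], []
    for tok in tokens:
        if tok == "<EOL>":
            lines.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        lines.append(current)
    return lines


record = "<s> import json <EOL> json . load ( f ) </s>"
for line in split_lines(record):
    print(" ".join(line))
```

For the Java corpus, which carries no <EOL> markers, the same call returns the whole snippet as a single list.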

Data Statistics

Data statistics of the py150 dataset are shown in the table below. Since the original py150 dataset has no dev set, we select 5,000 files from the original train set as the dev set.

Data Split	#Files	#Tokens
Train	95,000	72.1M
Dev	5,000	4.4M
Test	50,000	37.3M

Data statistics of the Github Java Corpus dataset are shown in the table below:

Data Split	#Files	#Tokens
Train	12,934	15.7M
Dev	7,176	3.8M
Test	8,268	5.3M

Evaluator

We provide a script that evaluates predictions for this task and reports the accuracy score. You can run the script like this:

python evaluator/evaluator.py -a=evaluator/answers.txt -p=evaluator/predictions.txt

The output is:

Total 5315204 tokens, accuracy: 76.45

Input Format

The answer file is in the same format as the preprocessed dev dataset file. A legal prediction file is expected to be a txt file with the same number of lines as the answer file, and each line should contain the same number of tokens (split by space) as the corresponding line in the answer file. Note that <s>, </s>, and <EOL> are not evaluated, so you don't need to worry about how to predict the first token; you can put any token you like in the first position. For example, one line in the answer file is:

<s> import json <EOL> json . load ( f ) </s>

And the corresponding line in your prediction file may be:

. import numpy <EOL> json . dump ( open ) <EOL>

The accuracy on this line is 62.5%: 5 of the 8 evaluated tokens are predicted correctly.
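
The scoring rule can be sketched as follows. This is an illustration rather than the code of evaluator/evaluator.py; it assumes a position is skipped whenever the answer token is <s>, </s>, or <EOL>, and the name `token_accuracy` is made up for the example.

```python
from typing import Iterable, Tuple

IGNORED = {"<s>", "</s>", "<EOL>"}


def token_accuracy(answers: Iterable[str], predictions: Iterable[str]) -> Tuple[int, float]:
    """Return (number of evaluated tokens, accuracy in percent)."""
    answers, predictions = list(answers), list(predictions)
    if len(answers) != len(predictions):
        raise ValueError("answer and prediction files differ in line count")
    total = correct = 0
    for line_no, (gold_line, pred_line) in enumerate(zip(answers, predictions), 1):
        gold, pred = gold_line.split(), pred_line.split()
        if len(gold) != len(pred):
            raise ValueError(f"line {line_no}: token counts differ")
        for g, p in zip(gold, pred):
            if g in IGNORED:
                continue  # positions holding special tokens are not scored
            total += 1
            correct += (g == p)
    return total, (100.0 * correct / total if total else 0.0)


total, acc = token_accuracy(
    ["<s> import json <EOL> json . load ( f ) </s>"],
    [". import numpy <EOL> json . dump ( open ) <EOL>"],
)
print(f"Total {total} tokens, accuracy: {acc:.2f}")  # Total 8 tokens, accuracy: 62.50
```

The two length checks mirror the requirements on a legal prediction file: the same number of lines, and the same number of tokens on each line.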

Pipeline

CodeGPT

We provide CodeGPT, a Transformer-based language model pre-trained on programming languages (PL). CodeGPT shares the same model architecture and training objective as GPT-2, consisting of 12 layers of Transformer decoders. We pre-train monolingual models on the Python and Java corpora from the CodeSearchNet dataset, which include 1.1M Python functions and 1.6M Java methods. A function or method in the training dataset consists of a function signature and a function body. Some functions also contain NL docstrings. The dataset statistics are shown below:

Language	#Functions	#Tokens
Python	1,144,977	119.0M
Java	1,554,613	169.4M

We release two CodeGPT models for each programming language. One model is pre-trained from scratch: the BPE (byte pair encoding) vocabulary is newly obtained on the code corpus and the model parameters are randomly initialized. The other model is a domain-adaptive one, which uses the GPT-2 model as the starting point and is continually trained on the code corpus. The second model therefore has the same vocabulary as GPT-2 and inherits its natural language understanding ability, so it might perform better on natural-language-related tasks. We call the second model CodeGPT-adapted and regard it as the default one.

All the models are publicly available on the Hugging Face website. The model names are CodeGPT-small-py, CodeGPT-small-java, CodeGPT-small-py-adaptedGPT2, and CodeGPT-small-java-adaptedGPT2.

Dependency

python 3.6 or 3.7
torch>=1.4.0
transformers>=2.5.0 and < 4.0.0
fuzzywuzzy

Fine-tune

To fine-tune CodeGPT on the javaCorpus dataset for code completion on multiple GPUs on a single machine, navigate to the code directory and run:

LANG=java                       # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=microsoft/CodeGPT-small-java        # microsoft/CodeGPT-small-py for py150
LOGFILE=completion_javaCorpus.log
PER_NODE_GPU=YOUR_GPU_NUM       # modify YOUR_GPU_NUM

python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_train \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=8e-5 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=2 \
        --per_gpu_eval_batch_size=4 \
        --gradient_accumulation_steps=4 \
        --num_train_epochs=5 \
        --logging_steps=100 \
        --save_steps=1000 \
        --seed=42 \
        --overwrite_output_dir \
        --not_pretrain

We stop at 50,000 steps in the py150 experiment, which takes 25 hours, and at 2,000 steps in the Java experiment, which takes 2 hours. Both experiments run on 2 NVIDIA P100 GPUs.
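
As a rough guide to the batch settings in the command above, the arithmetic below estimates how many sequences contribute to one optimizer update. It assumes the usual data-parallel setup in which every GPU process handles its own per-GPU batch and gradients are accumulated before each update; the function name `effective_batch` is illustrative.

```python
def effective_batch(per_gpu_batch: int, accumulation_steps: int, num_gpus: int) -> int:
    """Sequences contributing to one optimizer update under data parallelism."""
    return per_gpu_batch * accumulation_steps * num_gpus


# Values from the fine-tuning command, run on 2 GPUs.
sequences = effective_batch(per_gpu_batch=2, accumulation_steps=4, num_gpus=2)
max_tokens = sequences * 1024  # --block_size=1024 caps the tokens per sequence
print(sequences, max_tokens)  # 16 16384
```

With a different GPU count, adjusting --gradient_accumulation_steps is one way to keep this product unchanged.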

Evaluation && Inference

It's recommended to run evaluation on a single GPU. The predictions will be saved at $OUTPUTDIR/predictions.txt.

export CUDA_VISIBLE_DEVICES=0
LANG=java                       # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=../save/javaCorpus/checkpoint       # directory of your saved model
LOGFILE=completion_javaCorpus_eval.log

python -u run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_eval \
        --per_gpu_eval_batch_size=16 \
        --logging_steps=100 \
        --seed=42

Inference might take 60 minutes on the py150 dataset and 15 minutes on the Java corpus on a single NVIDIA P100.

Result

py150

Model	Accuracy
LSTM + BPE	61.94
Transformer (12L)	74.48
GPT-2	75.90
CodeGPT	76.58
CodeGPT-adapted	76.60

javaCorpus

Model	Accuracy
LSTM + BPE	58.92
Transformer (12L)	65.18
GPT-2	75.40
CodeGPT	76.79
CodeGPT-adapted	77.73

Reference

If you use the code completion datasets, please also cite the following papers in addition to our CodeXGLUE:

@article{raychev2016probabilistic,
  title={Probabilistic Model for Code with Decision Trees},
  author={Raychev, Veselin and Bielik, Pavol and Vechev, Martin},
  journal={ACM SIGPLAN Notices},
  pages={731--747},
  year={2016},
  publisher={ACM New York, NY, USA}
}

@inproceedings{allamanis2013mining,
  title={Mining Source Code Repositories at Massive Scale using Language Modeling},
  author={Allamanis, Miltiadis and Sutton, Charles},
  booktitle={2013 10th Working Conference on Mining Software Repositories (MSR)},
  pages={207--216},
  year={2013},
  organization={IEEE}
}
