This directory contains the dataset and pipeline for the text-to-code generation task.
The task is to generate the source code of a class member function in Java, given a natural language description and the class environment. The class environment is the programmatic context provided by the rest of the class, including other member variables and member functions. Models are evaluated by exact match and BLEU.
This is a challenging task because the desired code can vary greatly depending on the functionality the class provides. Models must (a) have a deep understanding of the NL description and map it to environment variables, library API calls, and user-defined methods in the class, and (b) decide on the structure of the resulting code.
We use the CONCODE dataset, a widely used code generation dataset from Iyer et al.'s EMNLP 2018 paper Mapping Language to Code in Programmatic Context.
We downloaded the published dataset and followed the authors' preprocessing script. You can find the preprocessed data in the dataset/concode directory.
Data statistics of the CONCODE dataset are shown in the table below:
| Split | #Examples |
|---|---|
| Train | 100,000 |
| Dev | 2,000 |
| Test | 2,000 |
The code corpus is saved in JSON lines format; each line is a JSON object:

```json
{
  "nl": "Increment this vector in this place. con_elem_sep double[] vecElement con_elem_sep double[] weights con_func_sep void add(double)",
  "code": "public void inc ( ) { this . add ( 1 ) ; }"
}
```
The `nl` field combines the natural language description and the class environment. Elements of the class environment are separated by the special tokens `con_elem_sep` (member variables) and `con_func_sep` (member functions).
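As an illustration, the `nl` field can be split back into its parts with a few lines of Python. This is a sketch for understanding the format, not part of the released pipeline; the helper name `parse_example` is ours.

```python
import json

# Separator tokens used in the "nl" field of the CONCODE dataset.
CON_ELEM_SEP = "con_elem_sep"   # precedes each member variable
CON_FUNC_SEP = "con_func_sep"   # precedes each member-function signature

def parse_example(line):
    """Split one CONCODE JSON line into the NL description, the member
    variables, and the member-function signatures of the class environment."""
    ex = json.loads(line)
    # Everything before the first con_func_sep is the description plus variables.
    head, _, tail = ex["nl"].partition(CON_FUNC_SEP)
    desc, *variables = [p.strip() for p in head.split(CON_ELEM_SEP)]
    functions = [p.strip() for p in tail.split(CON_FUNC_SEP)] if tail.strip() else []
    return {"description": desc, "variables": variables,
            "functions": functions, "code": ex["code"]}
```

Applied to the example above, this yields the description "Increment this vector in this place.", the variables `double[] vecElement` and `double[] weights`, and the function signature `void add(double)`.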
We provide a script to evaluate predictions for this task; it reports exact match and BLEU scores. You can run the script like this:

```shell
python evaluator/evaluator.py -a=evaluator/answers.json -p=evaluator/predictions.txt
```

The output is:

```
BLEU: 16.68, EM: 17.0
```
The CodeBLEU score can be calculated with the CodeBLEU script.
The answer file is in the same format as the dev set JSON lines file. A legal prediction file is a plain text file with the same number of lines as the answer file; each line is the model prediction for the corresponding input in the answer file. For example, one line in the answer file is:

```json
{
  "nl": "Increment this vector in this place. con_elem_sep double[] vecElement con_elem_sep double[] weights con_func_sep void add(double)",
  "code": "public void inc ( ) { this . add ( 1 ) ; }"
}
```

And the corresponding line in your prediction file is:

```
public void inc ( ) { this . add ( 1 ) ; }
```
We provide a pipeline for this task with the CodeGPT and CodeGPT-adapted models. Dependencies:
- python 3.6 or 3.7
- torch==1.4.0
- transformers>=2.5.0
To fine-tune CodeGPT on the CONCODE dataset for text-to-code generation on multiple GPUs on a single machine, navigate to the code directory and run:
```shell
LANG=java
DATADIR=../dataset/concode
OUTPUTDIR=../save/concode
PRETRAINDIR=microsoft/CodeGPT-small-java-adaptedGPT2 # will download the pre-trained CodeGPT model
LOGFILE=text2code_concode.log
PER_NODE_GPU=YOUR_GPU_NUM # modify YOUR_GPU_NUM
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run.py \
    --data_dir=$DATADIR \
    --langs=$LANG \
    --output_dir=$OUTPUTDIR \
    --pretrain_dir=$PRETRAINDIR \
    --log_file=$LOGFILE \
    --model_type=gpt2 \
    --block_size=512 \
    --do_train \
    --node_index 0 \
    --gpu_per_node $PER_NODE_GPU \
    --learning_rate=5e-5 \
    --weight_decay=0.01 \
    --evaluate_during_training \
    --per_gpu_train_batch_size=6 \
    --per_gpu_eval_batch_size=12 \
    --gradient_accumulation_steps=2 \
    --num_train_epochs=30 \
    --logging_steps=100 \
    --save_steps=5000 \
    --overwrite_output_dir \
    --seed=42
```

We stop at 60,000 steps, which takes 22 hours on two NVIDIA P100 GPUs.
It is recommended to run evaluation on the dev set on a single GPU. The predictions on the dev set will be saved in $OUTPUTDIR/dev.output.
```shell
export CUDA_VISIBLE_DEVICES=0
LANG=java
DATADIR=../dataset/concode
OUTPUTDIR=../save/concode
PRETRAINDIR=../save/concode/checkpoint
LOGFILE=text2code_concode_eval.log
python -u run.py \
    --data_dir=$DATADIR \
    --langs=$LANG \
    --output_dir=$OUTPUTDIR \
    --pretrain_dir=$PRETRAINDIR \
    --log_file=$LOGFILE \
    --model_type=gpt2 \
    --block_size=512 \
    --do_eval \
    --logging_steps=100 \
    --seed=42
```

It is recommended to run inference on the test set on a single GPU. The predictions will be saved in $OUTPUTDIR/test.output.
```shell
export CUDA_VISIBLE_DEVICES=0
LANG=java
DATADIR=../dataset/concode
OUTPUTDIR=../save/concode
PRETRAINDIR=../save/concode/checkpoint
LOGFILE=text2code_concode_infer.log
python -u run.py \
    --data_dir=$DATADIR \
    --langs=$LANG \
    --output_dir=$OUTPUTDIR \
    --pretrain_dir=$PRETRAINDIR \
    --log_file=$LOGFILE \
    --model_type=gpt2 \
    --block_size=512 \
    --do_infer \
    --logging_steps=100 \
    --seed=42
```

It might take 40 minutes for inference on a single NVIDIA P100.
The results on the CONCODE test set are shown below:
| Model | EM | BLEU | CodeBLEU |
|---|---|---|---|
| Seq2Seq | 3.05 | 21.31 | 26.39 |
| Seq2Action+MAML | 10.05 | 24.40 | 29.46 |
| Iyer-Simp+200 idioms | 12.20 | 26.60 | - |
| GPT-2 | 17.35 | 25.37 | 29.69 |
| CodeGPT | 18.25 | 28.69 | 32.71 |
| CodeGPT-adapted | 20.10 | 32.79 | 35.98 |
If you use the CONCODE dataset, please also cite the following paper in addition to our CodeXGLUE:
```
@article{iyer2018mapping,
  title={Mapping language to code in programmatic context},
  author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:1808.09588},
  year={2018}
}
```