This directory contains the dataset and pipeline for the text-to-code generation task.
The task is to generate the source code of a class member function in Java, given a natural language description and the class environment. The class environment is the programmatic context provided by the rest of the class, including other member variables and member functions. Models are evaluated by exact match and BLEU.
This is a challenging task because the desired code can vary greatly depending on the functionality the class provides. Models must (a) have a deep understanding of the NL description and map it to environment variables, library API calls, and user-defined methods in the class, and (b) decide on the structure of the resulting code.
We use the CONCODE dataset, a widely used code generation dataset from Iyer et al.'s EMNLP 2018 paper Mapping Language to Code in Programmatic Context.
We downloaded the published dataset and followed the authors' preprocessing script. You can find the preprocessed data in the dataset/concode directory.
Data statistics of the CONCODE dataset are shown in the table below:
| Split | #Examples |
|---|---|
| Train | 100,000 |
| Dev | 2,000 |
| Test | 2,000 |
The code corpus is saved in JSON lines format; each line is a JSON object:

```json
{
  "nl": "Increment this vector in this place. con_elem_sep double[] vecElement con_elem_sep double[] weights con_func_sep void add(double)",
  "code": "public void inc ( ) { this . add ( 1 ) ; }"
}
```
The `nl` field combines the natural language description and the class environment. Elements of the class environment are separated by the special tokens `con_elem_sep` (member variables) and `con_func_sep` (member functions).
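As an illustration, the `nl` field can be split back into its parts with a few lines of Python. This is a sketch for understanding the format, not part of the released pipeline; the helper name `parse_example` is ours.

```python
import json

# Separator tokens used in the "nl" field of the CONCODE dataset.
CON_ELEM_SEP = "con_elem_sep"   # precedes each member variable
CON_FUNC_SEP = "con_func_sep"   # precedes each member-function signature

def parse_example(line):
    """Split one CONCODE JSON line into the NL description, the member
    variables, and the member-function signatures of the class environment."""
    ex = json.loads(line)
    # Everything before the first con_func_sep is the description plus variables.
    head, _, tail = ex["nl"].partition(CON_FUNC_SEP)
    desc, *variables = [p.strip() for p in head.split(CON_ELEM_SEP)]
    functions = [p.strip() for p in tail.split(CON_FUNC_SEP)] if tail.strip() else []
    return {"description": desc, "variables": variables,
            "functions": functions, "code": ex["code"]}
```

Applied to the example above, this yields the description "Increment this vector in this place.", the variables `double[] vecElement` and `double[] weights`, and the function signature `void add(double)`.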
We provide a script to evaluate predictions for this task; it reports exact match and BLEU scores. You can run the script like this:

```shell
python evaluator/evaluator.py -a=evaluator/answers.json -p=evaluator/predictions.txt
```

The output is:

```
BLEU: 16.68, EM: 17.0
```
The CodeBLEU score can be calculated with the CodeBLEU script.
The answer file is in the same format as the dev set JSON lines file. A legal prediction file is a plain text file with the same number of lines as the answer file; each line is the model prediction for the corresponding input in the answer file. For example, one line in the answer file is:

```json
{
  "nl": "Increment this vector in this place. con_elem_sep double[] vecElement con_elem_sep double[] weights con_func_sep void add(double)",
  "code": "public void inc ( ) { this . add ( 1 ) ; }"
}
```

And the corresponding line in your prediction file is:

```
public void inc ( ) { this . add ( 1 ) ; }
```
We provide a pipeline for this task with the CodeGPT and CodeGPT-adapted models. Dependencies:
- python 3.6 or 3.7
- torch==1.4.0
- transformers>=2.5.0
To fine-tune CodeGPT on the CONCODE dataset for text-to-code generation on multiple GPUs on a single machine, navigate to the code directory and run:
```shell
LANG=java
DATADIR=../dataset/concode
OUTPUTDIR=../save/concode
PRETRAINDIR=microsoft/CodeGPT-small-java-adaptedGPT2 # will download the pre-trained CodeGPT model
LOGFILE=text2code_concode.log
PER_NODE_GPU=YOUR_GPU_NUM # modify YOUR_GPU_NUM
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run.py \
    --data_dir=$DATADIR \
    --langs=$LANG \
    --output_dir=$OUTPUTDIR \
    --pretrain_dir=$PRETRAINDIR \
    --log_file=$LOGFILE \
    --model_type=gpt2 \
    --block_size=512 \
    --do_train \
    --node_index 0 \
    --gpu_per_node $PER_NODE_GPU \
    --learning_rate=5e-5 \
    --weight_decay=0.01 \
    --evaluate_during_training \
    --per_gpu_train_batch_size=6 \
    --per_gpu_eval_batch_size=12 \
    --gradient_accumulation_steps=2 \
    --num_train_epochs=30 \
    --logging_steps=100 \
    --save_steps=5000 \
    --overwrite_output_dir \
    --seed=42
```

We stop at 60,000 steps, which takes 22 hours on two NVIDIA P100 GPUs.
It is recommended to run evaluation on the dev set on a single GPU. The predictions on the dev set will be saved in $OUTPUTDIR/dev.output.
```shell
export CUDA_VISIBLE_DEVICES=0
LANG=java
DATADIR=../dataset/concode
OUTPUTDIR=../save/concode
PRETRAINDIR=../save/concode/checkpoint
LOGFILE=text2code_concode_eval.log
python -u run.py \
    --data_dir=$DATADIR \
    --langs=$LANG \
    --output_dir=$OUTPUTDIR \
    --pretrain_dir=$PRETRAINDIR \
    --log_file=$LOGFILE \
    --model_type=gpt2 \
    --block_size=512 \
    --do_eval \
    --logging_steps=100 \
    --seed=42
```

It is recommended to run inference on the test set on a single GPU. The predictions will be saved in $OUTPUTDIR/test.output.
```shell
export CUDA_VISIBLE_DEVICES=0
LANG=java
DATADIR=../dataset/concode
OUTPUTDIR=../save/concode
PRETRAINDIR=../save/concode/checkpoint
LOGFILE=text2code_concode_infer.log
python -u run.py \
    --data_dir=$DATADIR \
    --langs=$LANG \
    --output_dir=$OUTPUTDIR \
    --pretrain_dir=$PRETRAINDIR \
    --log_file=$LOGFILE \
    --model_type=gpt2 \
    --block_size=512 \
    --do_infer \
    --logging_steps=100 \
    --seed=42
```

It might take 40 minutes for inference on a single NVIDIA P100.
The results on the CONCODE test set are shown below:
| Model | EM | BLEU | CodeBLEU |
|---|---|---|---|
| Seq2Seq | 3.05 | 21.31 | 26.39 |
| Seq2Action+MAML | 10.05 | 24.40 | 29.46 |
| Iyer-Simp+200 idioms | 12.20 | 26.60 | - |
| GPT-2 | 17.35 | 25.37 | 29.69 |
| CodeGPT | 18.25 | 28.69 | 32.71 |
| CodeGPT-adapted | 20.10 | 32.79 | 35.98 |
If you use the CONCODE dataset, please also cite the following paper in addition to our CodeXGLUE:
```
@article{iyer2018mapping,
  title={Mapping language to code in programmatic context},
  author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:1808.09588},
  year={2018}
}
```