EDIT-Bench is a code editing benchmark built on real code edits gathered from VSCode.
Quick Links
📄 Paper: arXiv 2511.04486
📦 Dataset: Hugging Face /copilot-arena/EDIT-Bench
🏆 Leaderboard: waynechi.com/edit-bench
Evaluating new models on EDIT-Bench is easy!
- Install Docker.
- Provide your generated code edits.
- Modify and run the script at examples/run_experiment.py
As an example, we pre-generated code edits for gpt-o3-mini.
Generations in the expected format can be found in generations/whole_file/gpt-o3-mini.
To evaluate these generations:
bash run_experiment.sh examples/run_experiment.py

You will find the results in example_results/gpt-o3-mini.json.
The core function used to run our experiments is:
test_edits(gen_path=GENERATION_PATH, split=SPLIT, output_file=OUTPUT_FILE)

We provide the simplest example in examples/run_experiment.py; however, you can customize the file in multiple ways.
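For reference, a minimal script along these lines might look like the following sketch; the split name is an assumption, and the paths simply mirror the example above:

# minimal sketch of a run_experiment.py-style script; adjust paths and split to your setup
import os

from edit_bench.evaluation import test_edits

# WORKDIR points at this repo, which is mounted inside the container as /projects
WORKDIR = os.environ.get("WORKDIR", "/projects")

GENERATION_PATH = os.path.join(WORKDIR, "generations/whole_file/gpt-o3-mini")
SPLIT = "test"  # assumption: substitute the HF split you want to evaluate
OUTPUT_FILE = os.path.join(WORKDIR, "example_results/gpt-o3-mini.json")

if __name__ == "__main__":
    test_edits(gen_path=GENERATION_PATH, split=SPLIT, output_file=OUTPUT_FILE)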
You need to generate files before running our tests. We've provided an example at examples/generate_and_run_experiment.py that shows how to do both at once.
bash run_experiment.sh examples/generate_and_run_experiment.py

This uses the prompts/whole_file.txt prompt, which is the baseline used in our paper.
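The prompt file is an f-string-style template; the variables it can reference are documented in the generate_files spec below. Conceptually it gets filled in roughly like this (the template text here is a stand-in, not the actual contents of prompts/whole_file.txt):

# illustration of how a prompt template is filled in; the template text is a stand-in
template = (
    "Rewrite the entire {lang} file below to follow the user's instruction.\n\n"
    "Original code:\n{original_code}\n\n"
    "Highlighted code:\n{highlighted_code}\n\n"
    "Instruction: {instruction}\n\n"
    "Return the full updated file."
)

prompt = template.format(
    lang="python",
    original_code="def add(a, b):\n    return a - b\n",
    highlighted_code="return a - b",
    instruction="Fix the bug so the function adds the two numbers.",
)
print(prompt)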
For a complete end-to-end generation and testing script using OpenRouter and OpenAI, see examples/openrouter_experiment.py and examples/openai_experiment.py. These scripts take a YAML file as the first argument and run the experiment with the configuration inside the YAML (a sketch of such a config is shown below, after the results-viewing commands). For example:
bash run_experiment.sh examples/openai_experiment.py configs/gpt-5-high.yaml

To view experiment results, use the provided scripts/display_results_csv.py script by passing in the directory containing your results:
python3 scripts/display_results_csv.py <path_to_json_dir>

Many optional arguments are available to change the formatting and the information shown (e.g. the --csv flag returns the data in CSV form, and --split restricts the output to the questions in a specific split).
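The exact YAML schema is defined by the example scripts above; for illustration only, a config along those lines could look roughly like the following, where every key name is an assumption rather than the real schema:

# hypothetical experiment config (key names are assumptions; see examples/openai_experiment.py for the real schema)
model: gpt-5
reasoning_effort: high
prompt_path: prompts/whole_file.txt
split: test
generations_path: generations/whole_file/gpt-5-high
output_file: results/gpt-5-high.json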
All experiments are executed using the run_experiment.sh shell script, which serves as the main command-line interface for the framework.
This script handles building the Docker container and running experiments inside it.
All environment variables used in the Docker container are defined in the EditBench.config file.
By default, the script builds and runs the Docker container, then executes the given Python file (along with its command-line arguments) inside the container.
bash ./run_experiment.sh <path to python file> [args for python file]

To help with debugging, we also provide build and shell commands:
# Force rebuild the Docker container
bash ./run_experiment.sh build

# Create an interactive session (useful for debugging)
bash ./run_experiment.sh shell
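For reference, EditBench.config is where you define the environment variables forwarded into the container (for example, the API keys used by the example scripts). The variable names below are assumptions for illustration, not a definitive list:

# hypothetical EditBench.config contents (variable names are assumptions)
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...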
Writing Your Own Inference & Testing Script
Experiments run inside Docker containers, and the edit_bench package provides convenient functions for running experiments. The Docker container is an isolated execution environment that mounts this repo as /projects (accessible via the WORKDIR env variable). Edits made in this repo are synced with the repo inside Docker.
The two functions you need from edit_bench.evaluation are:
- generate_files - Generates code files for the specified model
- test_edits - Runs tests for the specified model's generations
The end-to-end examples (e.g. examples/openai_experiment.py) show practical uses of these functions. The specs for these functions:
generate_files(fn, prompt_path, generations_path, split)
- This function loads data from HF and uses fn in multiple threads to generate solutions to each problem. It skips problem_ids that already exist in generations_path.
- fn(prompt, lang) is a function that takes a prompt string and a programming-language string and returns the model's generation for that prompt. The lang string makes parsing the output easier.
- prompt_path is the path to the prompt f-string; see prompts/ for examples. The f-string has access to the variables lang (programming language), original_code, instruction (the user instruction), and highlighted_code.
- generations_path is the directory for generated outputs. Generations are stored by problem_id name. Set the path prefix to /projects (accessible via the WORKDIR env variable) for the generations to persist outside of Docker.
- split is the set of questions to use from HF.
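As a concrete illustration, a custom inference function and a generate_files call might look like the sketch below. The OpenAI client, the model name, the split, and the output path are assumptions for illustration and are not part of EDIT-Bench itself:

# sketch: defining fn(prompt, lang) and calling generate_files (model name, split, and paths are placeholders)
import os

from openai import OpenAI  # assumption: any client that returns generated text works here
from edit_bench.evaluation import generate_files

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fn(prompt, lang):
    # Return the model's generation for this prompt; lang is available if you
    # want to post-process the output for a specific language.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

WORKDIR = os.environ.get("WORKDIR", "/projects")

generate_files(
    fn=fn,
    prompt_path="prompts/whole_file.txt",
    generations_path=os.path.join(WORKDIR, "generations/whole_file/my-model"),
    split="test",  # assumption: choose the HF split you want
)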
test_edits(gen_path, split, output_file)
- This function tests the generations in the gen_path directory and writes the results (as JSON) to output_file. The tests will not run if more than 90% of results are already present in the output file, which strongly indicates that the tests were already run.
- gen_path is where the generations are located.
- split is the HF split to use.
- output_file is the location of the outputs. Use /projects (accessible via the WORKDIR env variable) to ensure results persist between Docker runs.
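Putting the two together, a minimal end-to-end script might generate and then test in one pass; the fn stub, split, and paths below are placeholders to adapt:

# sketch of an end-to-end script: generate, then test (fn, split, and paths are placeholders)
import os

from edit_bench.evaluation import generate_files, test_edits

WORKDIR = os.environ.get("WORKDIR", "/projects")
SPLIT = "test"  # assumption: choose the HF split you want
GEN_PATH = os.path.join(WORKDIR, "generations/whole_file/my-model")

def fn(prompt, lang):
    # Placeholder: call your model of choice here and return its generation.
    raise NotImplementedError

generate_files(fn, "prompts/whole_file.txt", GEN_PATH, SPLIT)
test_edits(
    gen_path=GEN_PATH,
    split=SPLIT,
    output_file=os.path.join(WORKDIR, "results/my-model.json"),
)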
For questions and feedback, please open an issue or feel free to reach out directly!
Wayne Chi Twitter • GitHub • Website
Valerie Chen Twitter • GitHub • Website
Ryan Shar Twitter • GitHub • Website
This project is licensed under the Apache 2.0 License.
@misc{chi2025editbenchevaluatingllmabilities,
title={EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits},
author={Wayne Chi and Valerie Chen and Ryan Shar and Aditya Mittal and Jenny Liang and Wei-Lin Chiang and Anastasios Nikolas Angelopoulos and Ion Stoica and Graham Neubig and Ameet Talwalkar and Chris Donahue},
year={2025},
eprint={2511.04486},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2511.04486},
}