
EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

EDIT-Bench is a code editing benchmark built on real code edits gathered from VSCode.



Quick Links
📄 Paper: arXiv 2511.04486
📦 Dataset: Hugging Face /copilot-arena/EDIT-Bench
🏆 Leaderboard: waynechi.com/edit-bench

Quick Start

Evaluating new models on EDIT-Bench is easy!

  1. Install Docker.
  2. Provide your generated code edits.
  3. Modify and run the script at examples/run_experiment.py.

Example

As an example, we pre-generated code edits for gpt-o3-mini. Generations in the expected format can be found at generations/whole_file/gpt-o3-mini.

To evaluate these generations:

bash run_experiment.sh examples/run_experiment.py

You will find the results in example_results/gpt-o3-mini.json.

Customize Experiments

The core function used to run our experiments is:

test_edits(gen_path=GENERATION_PATH, split=SPLIT, output_file=OUTPUT_FILE)

We provide the simplest example in examples/run_experiment.py, but you can customize the file in multiple ways.
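
A customized script might look roughly like the minimal sketch below; the paths and split name are illustrative placeholders rather than files shipped with the repo.

from edit_bench.evaluation import test_edits

GENERATION_PATH = "generations/whole_file/my-model"   # directory containing your generated edits
SPLIT = "full"                                        # placeholder HF split name
OUTPUT_FILE = "example_results/my-model.json"         # where the JSON results are written

test_edits(gen_path=GENERATION_PATH, split=SPLIT, output_file=OUTPUT_FILE)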

Generation Example

You need to generate files before running our tests. We've provided an example of how to do both at once at examples/generate_and_run_experiment.py.

bash run_experiment.sh examples/generate_and_run_experiment.py

This uses the prompts/whole_file.txt prompt, which is the baseline used in our paper.

Other Examples

For complete end-to-end generation and testing scripts using OpenRouter and OpenAI, see examples/openrouter_experiment.py and examples/openai_experiment.py. These scripts take a YAML file as the first argument and run the experiment with the configuration inside the YAML. For example:

bash run_experiment.sh examples/openai_experiment.py configs/gpt-5-high.yaml

To view experiment results, use the provided scripts/display_results_csv.py script, passing in the directory containing your results:

python3 scripts/display_results_csv.py <path_to_json_dir>

Several optional arguments change the formatting and the information displayed (e.g., the --csv flag returns the data in CSV form, and --split restricts the data to the questions in a specific split).
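
For instance, a plausible invocation (the split name is a placeholder) is:

python3 scripts/display_results_csv.py example_results/ --csv --split <split>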

Extra Information

All experiments are executed using the run_experiment.sh shell script, which serves as the main command-line interface for the framework. This script handles building the Docker container and running experiments inside the container. All environment variables to be used in the Docker container are defined in the EditBench.config file.

Commands in run_experiment

By default, the bash script will build and run the Docker container, then execute the given Python file along with its command-line arguments inside the container.

bash ./run_experiment.sh <path to python file> [args for python file]

To help with debugging, we also provide the build and shell commands:

# Force rebuild the Docker container
bash ./run_experiment.sh build

# Create an interactive session (useful for debugging)
bash ./run_experiment.sh shell

Writing Your Own Inference & Testing Script

Experiments run inside Docker containers, and the edit_bench package provides convenient functions for running experiments. The Docker container is an isolated execution environment that mounts this repo inside the container at /projects (accessible via the WORKDIR env variable). Edits made in this repo are synced with the copy inside Docker.

The two functions you need from edit_bench.evaluation are:

  • generate_files - Generates code files for the specified model
  • test_edits - Runs tests for the specified model's generations

The end-to-end examples (e.g., examples/openai_experiment.py) demonstrate practical uses of these functions. Their specifications follow.

generate_files(fn, prompt_path, generations_path, split)

  • This function loads data from HF and uses fn in multiple threads to generate a solution to each problem. The function ignores problem_ids that already exist in generations_path.
  • fn(prompt, lang) is a function that takes a prompt string and a programming-language string and returns the model's generation for that prompt. The lang string makes parsing the output easier. A minimal sketch is shown after this list.
  • prompt_path is the path to the prompt f-string. See prompts/ for examples. The f-string has access to the variables lang (programming language), original_code, instruction (user instruction), and highlighted_code.
  • generations_path is the directory for generated outputs. The generations are stored by problem_id name. Set the path prefix to /projects (accessible via the WORKDIR env variable) for the generations to persist outside of Docker.
  • split is the set of questions to use from HF.
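
As a rough illustration, a custom fn and a call to generate_files might look like the sketch below. It assumes the OpenAI Python client, and the model name, output directory, and split are placeholders; see examples/openai_experiment.py for the full version.

from openai import OpenAI
from edit_bench.evaluation import generate_files

client = OpenAI()  # assumes OPENAI_API_KEY is set (e.g. via EditBench.config)

def fn(prompt, lang):
    # prompt is the formatted f-string from prompt_path; lang is the programming language
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

generate_files(
    fn,
    prompt_path="prompts/whole_file.txt",
    generations_path="/projects/generations/whole_file/my-model",  # placeholder output directory
    split="full",  # placeholder split name
)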

test_edits(gen_path, split, output_file)

  • This function tests the generations in the gen_path directory and writes the results (as JSON) to output_file. The tests will not run if more than 90% of the results are already present in the output file (which strongly indicates that the tests were already run).
  • gen_path is the directory where the generations are located.
  • split is the HF split to use.
  • output_file is the location of the outputs. Use a /projects prefix (accessible via the WORKDIR env variable) to ensure results persist between Docker runs.
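
Putting the two functions together, an end-to-end script in the spirit of examples/generate_and_run_experiment.py could be sketched as follows. Paths are built from WORKDIR so outputs persist outside the container; the model directory and split name are placeholders.

import os
from edit_bench.evaluation import generate_files, test_edits

workdir = os.environ.get("WORKDIR", "/projects")  # the repo mounted inside the container
gen_path = os.path.join(workdir, "generations", "whole_file", "my-model")   # placeholder
output_file = os.path.join(workdir, "example_results", "my-model.json")     # placeholder

generate_files(fn, "prompts/whole_file.txt", gen_path, "full")  # fn as sketched above
test_edits(gen_path=gen_path, split="full", output_file=output_file)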

Contact

For questions and feedback, please open an issue or feel free to reach out directly!

Authors

Wayne Chi
Valerie Chen
Ryan Shar

License

This project is licensed under the Apache 2.0 License.

Citation

@misc{chi2025editbenchevaluatingllmabilities,
      title={EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits}, 
      author={Wayne Chi and Valerie Chen and Ryan Shar and Aditya Mittal and Jenny Liang and Wei-Lin Chiang and Anastasios Nikolas Angelopoulos and Ion Stoica and Graham Neubig and Ameet Talwalkar and Chris Donahue},
      year={2025},
      eprint={2511.04486},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2511.04486}, 
}
