CodeTaste is a benchmark for evaluating AI agents on real-world code refactoring tasks and measuring their alignment with human developer choices. It builds per-instance execution environments, runs agents in locked-down containers, and evaluates their performance with tests and static analysis rules.
Check out our paper for more details: *CodeTaste: Can LLMs Generate Human-Level Code Refactorings?*
- Python: >= 3.10 (managed via Poetry >= 2.0.0).
- Containerization: Podman (for secure and reproducible executions).
- Hardware: The execution scripts run multiple containers in parallel by default (the number of workers is configurable). Ensure that you have sufficient RAM, CPU, and storage. We recommend running the evaluation and inference harness on an x86_64 machine with 16 CPU cores and 32 GB of RAM per parallel worker. You should have at least 500 GB of free disk space to accommodate the base images, instance-specific runtime images, and outputs.
- Utilities: `unzip` (for `codetaste100.zip`) and `zip` (for combining split `outputs.zip` parts).
- API Keys:
  - Anthropic: Set via `ANTHROPIC_API_KEY`. Required by default for the agent used for environment creation (bootstrap) and for the judge (multiplan inference). Note: can be overridden by editing the judge/bootstrap source files.
  - Your Agent's Key: Set via `API_KEY_PASSED_TO_AGENT` for inference.
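Because a missing key only surfaces mid-run, it can save time to fail fast before launching. A minimal sketch, assuming only the two variable names listed above (the helper itself is ours, not part of the harness):

```python
import os

# Variable names taken from the requirements list above.
REQUIRED_VARS = ["ANTHROPIC_API_KEY", "API_KEY_PASSED_TO_AGENT"]

def missing_keys(env=os.environ):
    """Return the required API-key variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```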
Ensure the user socket is enabled so the Python client can communicate with the daemon:

```shell
systemctl --user enable --now podman.socket
export DOCKER_HOST=unix:///run/user/$(id -u)/podman/podman.sock
```

By default, images are pulled from `ghcr.io/logic-star-ai/codetaste`. Override with:

```shell
export CODETASTE_IMAGE_REPOSITORY=ghcr.io/logic-star-ai/codetaste
```

```shell
# 0. Deactivate any existing virtual environment to enable in-project .venv creation
# 1. Install dependencies
git clone git@github.com:logic-star-ai/codetaste.git && cd codetaste
poetry install && source .venv/bin/activate
# 2. Download benchmark artifacts (assets, instance_images, metadata, pseudo-agents)
curl -L -o codetaste100.zip "https://github.com/logic-star-ai/refactoring-benchmark/releases/download/v1.0.0/codetaste100.zip"
unzip -o codetaste100.zip -d . && rm codetaste100.zip
# 3. Pull the base container image
podman pull ghcr.io/logic-star-ai/codetaste/benchmark-base-all:latest
# 4. Verify that you can run tests
pytest
```

Run your agent against the benchmark. Results are written and cached by default in `outputs/`.
```shell
chmod +x ./entrypoint.sh  # Required for runtime images
python -m refactoring_benchmark.cli.inference \
    --instances 100 \
    --agent-dir ./agents/your-agent \
    --description-type instructed \
    --output-dir ./outputs/instructed/direct \
    --env API_KEY_PASSED_TO_AGENT="$API_KEY_PASSED_TO_AGENT"
# Optional: append --plan or --multiplan
```

Handling Inference Errors: We fully restart inference for an instance (i.e. delete the corresponding instance output directory `outputs/<description_type>/<mode>/<owner>/<repo>/<hash>/<agent_id>/` and rerun inference) if either (1) the agent does not produce results due to an unexpected error (e.g. the LLM provider cannot be reached), or (2) the agent fails to place its plan(s) under the expected path.
Applies the generated `prediction.diff`, computes the static analysis rules, and runs the test suite up to 5 times to account for flakiness.

```shell
python -m refactoring_benchmark.cli.evaluate \
    --instances 100 \
    --agent-id <your-agent-id> \
    --output-dir ./outputs/instructed/direct
```

Adjust and run `run_agent_description.sh` to run both inference and evaluation in one go.
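The "up to 5 runs" retry for flaky tests can be sketched as follows. Treating an instance as passing when any attempt succeeds is one plausible policy; the actual harness may aggregate differently:

```python
def run_with_retries(run_suite, max_attempts: int = 5) -> bool:
    """Rerun a test suite up to `max_attempts` times to absorb flakiness.

    `run_suite` is any zero-argument callable returning True on a green run.
    Stops early on the first success.
    """
    for _ in range(max_attempts):
        if run_suite():
            return True
    return False
```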
After this step, the `outputs/` directory is populated with the inference and evaluation results.
```
outputs/
└── <description_type>/                 # instructed | open
    └── <mode>/                         # direct | plan | multiplan
        └── <owner>/<repo>/<hash8>/<agent_id>/
            ├── prediction.diff             # The generated patch
            ├── inference_metadata.json     # Cost, finish reasons, etc.
            └── evaluation/
                ├── evaluation_result.json
                ├── rules_positive.sarif    # IFR+ data
                ├── rules_negative.sarif    # IFR- data
                ├── test_output.txt         # Raw test logs
                └── rule_output.txt         # Raw rule logs
```
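Given this layout, aggregating results is a plain directory walk. A sketch that relies only on the tree shown above; the schema inside `evaluation_result.json` is not documented here, so the results are collected as-is:

```python
import json
from pathlib import Path

def collect_evaluation_results(outputs_dir: str) -> dict:
    """Map each instance directory to its parsed evaluation_result.json.

    Walks <description_type>/<mode>/<owner>/<repo>/<hash8>/<agent_id>/
    as shown in the tree above.
    """
    results = {}
    pattern = "*/*/*/*/*/*/evaluation/evaluation_result.json"
    for path in sorted(Path(outputs_dir).glob(pattern)):
        instance_dir = path.parent.parent  # .../<agent_id>/
        results[str(instance_dir.relative_to(outputs_dir))] = json.loads(path.read_text())
    return results
```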
Generate the plots and tables used in the CodeTaste paper.

```shell
chmod +x run_analyze.sh
./run_analyze.sh
```

There are four core metrics we evaluate:
- Pass: Checks whether the model's patch preserves functional integrity, using the repository's test suite.
- IFR: Measures whether the patch follows the intended refactoring, using static analysis checks.
- Alignment: A combined score that rewards rule compliance only when the tests pass.
- Change Precision: Measures how well the patch avoids unrelated changes outside the intended refactoring scope. The golden-commit reference solutions achieve 57.5%.
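As a reading aid, the last two metrics can be formalized roughly as below. These are our own illustrative simplifications, not the paper's exact definitions:

```python
def alignment(tests_pass: bool, ifr_compliant: bool) -> bool:
    """Rule compliance counts only when the test suite is green (0/1 sketch)."""
    return tests_pass and ifr_compliant

def change_precision(changed_lines: set, intended_lines: set) -> float:
    """Share of changed lines that fall inside the intended refactoring scope."""
    if not changed_lines:
        return 1.0  # an empty patch touches nothing outside scope
    return len(changed_lines & intended_lines) / len(changed_lines)
```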
We list top-performing agentic systems on our leaderboard. If you want your results included, please share a brief description of your approach, the corresponding `outputs/` (with traces) and `plots/` directories, and a link to the project's homepage. Please contact us at alex@logicstar.ai for submissions.
If you want independent verification of the results, we also require the `agent/` directory containing the exact agent implementation used for inference.
Inclusion in the leaderboard is handled on a best-effort basis; we cannot guarantee inclusion or timely processing of your request.
To reproduce the exact plots and tables found in the CodeTaste paper without rerunning the inference pipeline, you can use our precomputed outputs.
- Download Precomputed Outputs: Run the included script to download and extract the evaluation data into the `outputs/` directory, or download `outputs.{zip,z01,z02,z03}` from our GitHub Releases and extract them manually.

```shell
chmod +x .github/download_outputs.sh
./.github/download_outputs.sh
```

- Generate Plots and Tables: Regenerate all analytical charts.

```shell
chmod +x run_analyze.sh
./run_analyze.sh
```

For detailed guides on specific phases of the pipeline, refer to our documentation:
- `docs/bootstrap.md` - Building runtime images.
- `docs/inference.md` - Running agents to generate patches.
- `docs/evaluation.md` - Applying patches, running tests, and computing IFR.
- `docs/analysis.md` - Aggregating metrics and generating plots.
- `docs/benchmarking-your-agent.md` - Guide to testing your own custom agent.
We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and welcome contributions to CodeTaste! If you have suggestions for improvements, new features, or want to report issues, please open an issue or submit a pull request on our GitHub repository.
```bibtex
@misc{codetaste2026,
  title={CodeTaste: Can LLMs Generate Human-Level Code Refactorings?},
  author={Alex Thillen and Niels Mündler and Veselin Raychev and Martin Vechev},
  year={2026},
  eprint={2603.04177},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2603.04177},
}
```
This project is licensed under the MIT License - see the LICENSE file for details.