A QuixBugs Benchmark Study
Author: Manish Singh Project: Automated Code Correction Agent Development
This project presents the development and evaluation of an LLM-powered agent for automated detection and correction of single-line defects in Python programs using the QuixBugs benchmark.
-
Achieved Accuracy: ✅ 86% (43/50 programs successfully corrected)
-
Benchmark: QuixBugs (University of Washington)
-
Defect Classes: 14 categories analyzed & targeted repair strategies implemented
-
Performance: Competitive with existing Automated Program Repair (APR) techniques
-
Future Goal: Integration with MMAPR framework for achieving 93–96% accuracy
-
Errors are fixed iteratively with up to 5 attempts per program.
├── AIML_CODEDEBU_FINAL_MANISH_SINGH_23_CS_244.ipynb # Final implementation notebook
├── AIML_CODEDEBU.ipynb # Experimental prototype notebook
├── tester.py # Automated test runner
├── requirements.txt # Dependencies
├── README.md # Documentation
└── /images # Results & workflow images
- Economic Impact: Software bugs cost ~$2.84 trillion annually
- Time Drain: Debugging consumes 50–75% of dev time
- Limitations of Tools: No existing solution addresses both syntactic & semantic bugs simultaneously
- Case Example: Windows Blue Screen of Death caused by a single-line pointer error
- Context preservation while fixing defects
- Avoiding false positives and overcorrections
- Handling semantic vs syntactic bugs
- Ensuring test coverage for all edge cases
- Overcoming API rate limits and model simplicity trade-offs
We identified 14 defect classes in QuixBugs:
| Defect Class | Frequency | Example | Repair Strategy |
|---|---|---|---|
| Off-by-one error | 28% | range(len(arr)-1) → range(len(arr)) |
Boundary adjustment |
| Incorrect operators | 22% | < → <=, == → is |
Operator replacement |
| Missing null checks | 15% | if x is not None: |
Defensive programming |
| Logical conditionals | 12% | and → or |
Logic operator correction |
| Variable initialization | 8% | Wrong defaults | Scope-aware init |
| Other | 15% | Mixed patterns | Case-specific fixes |
agent_executor = create_react_agent(model, tools)
tools = [
run_python_code, # Code execution & validation
run_python_code_from_file, # File-based testing
iterative_fix_and_test # Multi-attempt repair workflow
]🛠 Workflow
- Detect defect (pattern classification)
- Generate repair suggestion
- Run automated tests with
pytest - Iterate with error feedback (max 5 attempts)
- Save fixed program
- Total Programs: 50
- Corrected: 43 (86%)
- Failed Repairs: 7
- Avg Attempts per Fix: 1.8
- Fix Time: ⏱️ 5–10s (vs human 15–30 min)
- Off-by-one errors → 92%
- Comparison operators → 95%
- Null checks → 78%
- Logical conditionals → 83%
- Variable initialization → 100%
- Complex multi-line logic dependencies
- Ambiguous defect classification
- Rare edge cases not in training data
- Example: shortest_path_lengths.py failed due to state initialization missing
📍 Insert Example Error vs Fixed Code Image Here

| Method | QuixBugs Accuracy | Key Limitation |
|---|---|---|
| Our Agent | 86% | Complex dependencies |
| GenProg | 65% | Weak semantic understanding |
| Prophet | 58% | Pattern overfitting |
| CodeT5 | 72% | Single-attempt limitation |
| Human Expert | 100% | Time-intensive |
- Multi-modal input (ASTs, error traces, natural language)
- Few-shot learning with peer programs
- Ensemble repair with multiple LLM backends (GPT-4, Gemini, CodeT5)
📍 Inspired and implemented based on

- Multi-language support (Python → Java, C++, JS)
- Static & dynamic analysis integration (SonarQube, CodeQL)
- CI/CD pipeline & IDE plugin integration
- MMAPR Framework Research Paper
- QuixBugs Benchmark – University of Washington
- ACM Computing Surveys, 2021 – Automated Program Repair
- OpenAI Codex Technical Report, 2021
- LangChain Documentation
- Agentic Paper
git clone https://github.com/<your-repo>/LLM-CodeCorrection-Agent.git
cd LLM-CodeCorrection-Agent
pip install -r requirements.txtpython tester.py --file buggy_code.pypytest tests/- Agent fixing buggy code
- Test results from
pytest - Comparison graphs
✔️ Achieved 86% success rate on QuixBugs ✔️ Developed 14-class defect taxonomy ✔️ Implemented iterative agentic repair workflow ✔️ Established roadmap for MMAPR integration (93–96% accuracy)
✨ This work demonstrates that LLM-powered agents can bridge the gap between traditional APR tools and human-level expertise in automated debugging.


