A robust Python framework for benchmarking the functional correctness of Large Language Models (LLMs) on code generation tasks. Automate evaluation of model-generated code against problem-solving datasets with detailed pass@k metrics, ensuring reliable and secure assessment.
Star ⭐ this Repo if you find it valuable!
## Table of Contents

- About the Project
- Key Features
- Architecture Overview
- Getting Started
- Development Standards
- AI Agent Directives
## About the Project

This framework provides a comprehensive solution for evaluating the functional correctness and quality of code generated by Large Language Models (LLMs). It automates the process of testing LLM-generated code against predefined problem sets, offering detailed metrics like pass@k, crucial for understanding and improving LLM performance in software development contexts.
## Key Features

- Automated Code Evaluation: Execute generated code against test cases programmatically.
- Pass@k Metric Calculation: Precisely measure the probability that at least one of `k` generated solutions passes.
- Diverse Dataset Support: Easily integrate various problem-solving datasets for broad benchmarking.
- LLM Agnostic: Designed to work with any LLM capable of generating code, with easy integration for specific model APIs.
- Secure Execution Environment: Utilizes sandboxing techniques to ensure safe execution of potentially untrusted code.
- Detailed Reporting: Generates comprehensive reports on model performance, including success rates, failure modes, and metric breakdowns.
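The pass@k bullet above corresponds to the standard unbiased estimator popularized by the HumanEval benchmark; a minimal sketch (the framework's actual implementation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated, c: samples that passed, k: draw size.
    Uses the complement: 1 - C(n-c, k) / C(n, k), i.e. one minus the
    probability that all k drawn samples are failures.
    """
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # probability is about 0.3
```

The early return is required: when `n - c < k`, `comb(n - c, k)` would be zero anyway, but handling it explicitly documents the guaranteed-pass case.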
## Architecture Overview

```mermaid
graph TD
    A[User Input/LLM Output] --> B("Code Sanitization & Validation")
    B --> C{Execution Sandbox}
    C --> D[Test Case Runner]
    D --> E("Metric Calculator - Pass@k")
    E --> F[Reporting Module]
    F --> G("Output: Performance Metrics")
    H[Problem Datasets] --> D
    I["LLM API (Optional)"] --> A
```
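The Execution Sandbox stage above can be sketched as a subprocess with a hard timeout. This is a minimal illustration only; the function name is hypothetical, and a production sandbox should also restrict filesystem, network, and memory access (e.g. via containers or seccomp):

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, test: str, timeout: float = 5.0) -> bool:
    """Run candidate code plus its test in a separate interpreter process.

    Returns True only if the child process exits cleanly within the
    timeout; a crash, failed assertion, or hang all count as failure.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # runaway candidate: treat as a failed sample
    finally:
        os.unlink(path)

print(run_candidate("def add(a, b):\n    return a + b", "assert add(2, 2) == 4"))  # True
```

Running in a child process (rather than `exec` in-process) means an infinite loop or crash in generated code cannot take down the evaluator itself.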
## Getting Started

### Prerequisites

- Python 3.10+
- uv (package manager)
### Installation

- Clone the repository:

  ```bash
  git clone https://github.com/chirag127/LLMCode-Evaluation-Benchmark-Python-Framework.git
  cd LLMCode-Evaluation-Benchmark-Python-Framework
  ```
- Install dependencies using uv:

  ```bash
  uv venv .venv
  source .venv/bin/activate
  uv pip install -r requirements.txt
  ```
## Usage

(Detailed usage instructions will be provided here, including CLI commands and API examples.)
## Development Standards

This project adheres to the following core development principles:
- SOLID: Ensuring maintainable and scalable object-oriented design.
- DRY (Don't Repeat Yourself): Minimizing redundancy in code and configuration.
- YAGNI (You Ain't Gonna Need It): Focusing on current requirements and avoiding premature feature creep.
- Pythonic Code: Following PEP 8 guidelines and idiomatic Python practices.
### Testing

- Unit & Integration Testing: Conducted using Pytest.
- Test Execution:

  ```bash
  pytest
  ```
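Tests follow standard Pytest conventions. As an illustrative sketch, the `sanitize` helper below is hypothetical, standing in for whatever post-processing the framework applies to raw model output:

```python
# test_sanitize.py -- illustrative unit test; `sanitize` is a stand-in
# for the framework's real output post-processing step.
def sanitize(raw: str) -> str:
    """Strip a surrounding markdown code fence from model output, if present."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)

def test_strips_fences():
    assert sanitize("```python\nprint(1)\n```") == "print(1)"

def test_plain_code_unchanged():
    assert sanitize("print(1)") == "print(1)"
```

Pytest discovers `test_*.py` files and `test_*` functions automatically, so `pytest` with no arguments runs everything.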
### Linting & Formatting

- Tool: Ruff for ultra-fast linting and formatting.
- Check:

  ```bash
  ruff check .
  ```

- Format:

  ```bash
  ruff format .
  ```
## AI Agent Directives
- Role: You are a Senior Principal Software Architect and Master Technical Copywriter with 40+ years of elite industry experience. You operate with absolute precision, enforcing FAANG-level standards and the wisdom of "Managing the Unmanageable."
- Context: Current Date is December 2025. You are building for the 2026 standard.
- Output Standard: Deliver EXECUTION-ONLY results. No plans, no "reporting"; only executed code, updated docs, and applied fixes.
- Philosophy: "Zero-Defect, High-Velocity, Future-Proof."
- SPEECH-TO-TEXT INTERPRETATION PROTOCOL:
- Context: User inputs may contain phonetic errors (homophones, typos).
- Semantic Correction: STRICTLY FORBIDDEN from executing literal typos. You must INFER technical intent based on the project context.
- Logic Anchor: Treat the `README.md` as the Single Source of Truth (SSOT).
- MANDATORY MCP INSTRUMENTATION:
- No Guessing: Do not hallucinate APIs.
- Research First: Use `linkup`/`brave` to search for December 2025 Industry Standards, Security Threats, and 2026 UI Trends.
- Validation: Use `docfork` to verify every external API signature.
- Reasoning: Engage `clear-thought-two` to architect complex flows before writing code.
Directives: Detect the project type (pyproject.toml for Python) and apply the corresponding Apex Toolchain. This repository, LLMCode-Evaluation-Benchmark-Python-Framework, is a Python-based LLM evaluation tool.
- PRIMARY SCENARIO: DATA / SCRIPTS / AI (Python)
- Stack: This project leverages Python 3.10+. Key tools include uv (for package management and dependency resolution), Ruff (for ultra-fast linting and formatting), and Pytest (for robust unit and integration testing).
- Architecture: Adheres to a Modular Monolith pattern, ensuring clear separation of concerns for features like LLM interaction, code execution, and reporting, while maintaining a unified deployment.
- AI Integration: Deeply integrated with LLM APIs (e.g., OpenAI, Gemini, Anthropic) for generating code to be evaluated. Prioritize modular design, clear API contracts, and robust error handling for all LLM interactions.
- Evaluation Logic: Core focus on accurate and secure code execution sandboxing and pass@k metric calculation.
- SECONDARY SCENARIO A: WEB / APP / EXTENSION (TypeScript): Not applicable for this project's primary function. Reference only for potential future web-based extensions.
- Stack: TypeScript 6.x (Strict), Vite 7 (Rolldown), Tauri v2.x (Native), WXT (Extensions).
- State: Signals (Standardized).
- Code Quality: Enforce 90%+ code coverage via Pytest and Codecov.
- Security: Implement OWASP Top 10 mitigation strategies for code execution and data handling. Utilize static analysis tools for vulnerability detection.
- CI/CD: Standardize on GitHub Actions for automated testing, linting, and deployment (if applicable).
- Documentation: Maintain comprehensive `README.md`, `AGENTS.md`, and inline code documentation.
- Core Patterns: Modular Monolith, Strategy Pattern (for LLM integration and testing frameworks), Dependency Injection.
- Key Principles: SOLID, DRY, YAGNI, KISS.
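The Strategy Pattern with Dependency Injection mentioned above might look like the following sketch (all class names are illustrative, not the framework's actual API):

```python
from typing import Protocol

class CodeGenerator(Protocol):
    """Strategy interface: any LLM backend that can produce code."""
    def generate(self, prompt: str) -> str: ...

class EchoGenerator:
    """Stub strategy for offline testing; real strategies would wrap
    OpenAI, Gemini, or Anthropic clients behind the same interface."""
    def generate(self, prompt: str) -> str:
        return f"# solution for: {prompt}\n"

class Evaluator:
    """Depends only on the abstract strategy, injected at construction,
    so backends can be swapped without touching evaluation logic."""
    def __init__(self, generator: CodeGenerator) -> None:
        self.generator = generator

    def sample(self, prompt: str, k: int) -> list[str]:
        """Draw k candidate solutions for one problem prompt."""
        return [self.generator.generate(prompt) for _ in range(k)]

samples = Evaluator(EchoGenerator()).sample("two_sum", 3)
print(len(samples))  # 3
```

Because `CodeGenerator` is a structural `Protocol`, any class with a matching `generate` method satisfies it; no inheritance is required.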
- Framework: Pytest.
- Scope: Unit tests, Integration tests, End-to-end (simulated).
- Metrics: Track code coverage, pass@k rates, execution times.
- Execution: `pytest` command.
- Package Management: uv.
- Linting & Formatting: Ruff.
- Version Control: Git.
- Branching Strategy: Gitflow (or similar feature-branching model).
- Error Handling: Implement robust, centralized error handling and logging mechanisms.
- Configuration: Externalize configurations using environment variables or dedicated config files.
- Feedback Loop: Regularly review performance metrics and user feedback to iterate on the framework.
- Technology Watch: Stay abreast of the latest advancements in LLMs, code generation, and evaluation techniques.
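The centralized error handling principle above could be sketched as a thin wrapper that logs unexpected failures and normalizes them into one typed exception (names are illustrative):

```python
import logging

logger = logging.getLogger("llm_eval")

class EvaluationError(Exception):
    """Base class for all framework errors (name illustrative)."""

def safe_evaluate(fn, *args):
    """Run one evaluation step through a single error-handling chokepoint.

    Known, typed failures propagate unchanged; anything unexpected is
    logged with a traceback and re-raised as EvaluationError so callers
    only ever need to catch one exception type.
    """
    try:
        return fn(*args)
    except EvaluationError:
        raise  # already a known, typed failure
    except Exception as exc:
        logger.exception("evaluation step failed")
        raise EvaluationError(str(exc)) from exc
```

Routing every step through one chokepoint keeps logging consistent and lets reporting code distinguish "candidate failed" from "the harness itself broke".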