A robust Python framework for benchmarking the functional correctness of Large Language Models (LLMs) on code generation tasks. Automate evaluation of model-generated code against problem-solving datasets with detailed pass@k metrics, ensuring reliable and secure assessment.
Star ⭐ this Repo if you find it valuable!
## Table of Contents

- About the Project
- Key Features
- Architecture Overview
- Getting Started
- Development Standards
- AI Agent Directives
## About the Project

This framework provides a comprehensive solution for evaluating the functional correctness and quality of code generated by Large Language Models (LLMs). It automates the process of testing LLM-generated code against predefined problem sets, offering detailed metrics like pass@k, crucial for understanding and improving LLM performance in software development contexts.
## Key Features

- Automated Code Evaluation: Execute generated code against test cases programmatically.
- Pass@k Metric Calculation: Precisely measure the probability that at least one of `k` generated solutions passes.
- Diverse Dataset Support: Easily integrate various problem-solving datasets for broad benchmarking.
- LLM Agnostic: Designed to work with any LLM capable of generating code, with easy integration for specific model APIs.
- Secure Execution Environment: Utilizes sandboxing techniques to ensure safe execution of potentially untrusted code.
- Detailed Reporting: Generates comprehensive reports on model performance, including success rates, failure modes, and metric breakdowns.
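The pass@k bullet above corresponds to the standard unbiased estimator popularized by the HumanEval benchmark; a minimal sketch (the framework's actual implementation may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated, c: samples that passed, k: draw size.
    Uses the complement: 1 - C(n-c, k) / C(n, k), i.e. one minus the
    probability that all k drawn samples are failures.
    """
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # probability is about 0.3
```

The early return is required: when `n - c < k`, `comb(n - c, k)` would be zero anyway, but handling it explicitly documents the guaranteed-pass case.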
## Architecture Overview

```mermaid
graph TD
    A[User Input/LLM Output] --> B("Code Sanitization & Validation")
    B --> C{Execution Sandbox}
    C --> D[Test Case Runner]
    D --> E("Metric Calculator - Pass@k")
    E --> F[Reporting Module]
    F --> G("Output: Performance Metrics")
    H[Problem Datasets] --> D
    I["LLM API (Optional)"] --> A
```
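The Execution Sandbox stage above can be sketched as a subprocess with a hard timeout. This is a minimal illustration only; the function name is hypothetical, and a production sandbox should also restrict filesystem, network, and memory access (e.g. via containers or seccomp):

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, test: str, timeout: float = 5.0) -> bool:
    """Run candidate code plus its test in a separate interpreter process.

    Returns True only if the child process exits cleanly within the
    timeout; a crash, failed assertion, or hang all count as failure.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # runaway candidate: treat as a failed sample
    finally:
        os.unlink(path)

print(run_candidate("def add(a, b):\n    return a + b", "assert add(2, 2) == 4"))  # True
```

Running in a child process (rather than `exec` in-process) means an infinite loop or crash in generated code cannot take down the evaluator itself.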
## Getting Started

### Prerequisites

- Python 3.10+
- uv (package manager)
### Installation

- Clone the repository:

  ```bash
  git clone https://github.com/chirag127/LLMCode-Evaluation-Benchmark-Python-Framework.git
  cd LLMCode-Evaluation-Benchmark-Python-Framework
  ```
- Install dependencies using uv:

  ```bash
  uv venv .venv
  source .venv/bin/activate
  uv pip install -r requirements.txt
  ```
## Usage

(Detailed usage instructions will be provided here, including CLI commands and API examples.)
## Development Standards

This project adheres to the following core development principles:
- SOLID: Ensuring maintainable and scalable object-oriented design.
- DRY (Don't Repeat Yourself): Minimizing redundancy in code and configuration.
- YAGNI (You Ain't Gonna Need It): Focusing on current requirements and avoiding premature feature creep.
- Pythonic Code: Following PEP 8 guidelines and idiomatic Python practices.
### Testing

- Unit & Integration Testing: Conducted using Pytest.
- Test Execution:

  ```bash
  pytest
  ```
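Tests follow standard Pytest conventions. As an illustrative sketch, the `sanitize` helper below is hypothetical, standing in for whatever post-processing the framework applies to raw model output:

```python
# test_sanitize.py -- illustrative unit test; `sanitize` is a stand-in
# for the framework's real output post-processing step.
def sanitize(raw: str) -> str:
    """Strip a surrounding markdown code fence from model output, if present."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)

def test_strips_fences():
    assert sanitize("```python\nprint(1)\n```") == "print(1)"

def test_plain_code_unchanged():
    assert sanitize("print(1)") == "print(1)"
```

Pytest discovers `test_*.py` files and `test_*` functions automatically, so `pytest` with no arguments runs everything.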
### Linting & Formatting

- Tool: Ruff for ultra-fast linting and formatting.
- Check:

  ```bash
  ruff check .
  ```

- Format:

  ```bash
  ruff format .
  ```
## AI Agent Directives
- Role: You are a Senior Principal Software Architect and Master Technical Copywriter with 40+ years of elite industry experience. You operate with absolute precision, enforcing FAANG-level standards and the wisdom of "Managing the Unmanageable."
- Context: Current Date is December 2025. You are building for the 2026 standard.
- Output Standard: Deliver EXECUTION-ONLY results. No plans, no "reporting"; only executed code, updated docs, and applied fixes.
- Philosophy: "Zero-Defect, High-Velocity, Future-Proof."
- SPEECH-TO-TEXT INTERPRETATION PROTOCOL:
- Context: User inputs may contain phonetic errors (homophones, typos).
- Semantic Correction: STRICTLY FORBIDDEN from executing literal typos. You must INFER technical intent based on the project context.
- Logic Anchor: Treat the `README.md` as the Single Source of Truth (SSOT).
- MANDATORY MCP INSTRUMENTATION:
- No Guessing: Do not hallucinate APIs.
- Research First: Use `linkup`/`brave` to search for December 2025 Industry Standards, Security Threats, and 2026 UI Trends.
- Validation: Use `docfork` to verify every external API signature.
- Reasoning: Engage `clear-thought-two` to architect complex flows before writing code.
Directives: Detect the project type (pyproject.toml for Python) and apply the corresponding Apex Toolchain. This repository, LLMCode-Evaluation-Benchmark-Python-Framework, is a Python-based LLM evaluation tool.
- PRIMARY SCENARIO: DATA / SCRIPTS / AI (Python)
- Stack: This project leverages Python 3.10+. Key tools include uv (for package management and dependency resolution), Ruff (for ultra-fast linting and formatting), and Pytest (for robust unit and integration testing).
- Architecture: Adheres to a Modular Monolith pattern, ensuring clear separation of concerns for features like LLM interaction, code execution, and reporting, while maintaining a unified deployment.
- AI Integration: Deeply integrated with LLM APIs (e.g., OpenAI, Gemini, Anthropic) for generating code to be evaluated. Prioritize modular design, clear API contracts, and robust error handling for all LLM interactions.
- Evaluation Logic: Core focus on accurate and secure code execution sandboxing and pass@k metric calculation.
- SECONDARY SCENARIO A: WEB / APP / EXTENSION (TypeScript): Not applicable for this project's primary function. Reference only for potential future web-based extensions.
- Stack: TypeScript 6.x (Strict), Vite 7 (Rolldown), Tauri v2.x (Native), WXT (Extensions).
- State: Signals (Standardized).
- Code Quality: Enforce 90%+ code coverage via Pytest and Codecov.
- Security: Implement OWASP Top 10 mitigation strategies for code execution and data handling. Utilize static analysis tools for vulnerability detection.
- CI/CD: Standardize on GitHub Actions for automated testing, linting, and deployment (if applicable).
- Documentation: Maintain comprehensive `README.md`, `AGENTS.md`, and inline code documentation.
- Core Patterns: Modular Monolith, Strategy Pattern (for LLM integration and testing frameworks), Dependency Injection.
- Key Principles: SOLID, DRY, YAGNI, KISS.
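The Strategy Pattern with Dependency Injection mentioned above might look like the following sketch (all class names are illustrative, not the framework's actual API):

```python
from typing import Protocol

class CodeGenerator(Protocol):
    """Strategy interface: any LLM backend that can produce code."""
    def generate(self, prompt: str) -> str: ...

class EchoGenerator:
    """Stub strategy for offline testing; real strategies would wrap
    OpenAI, Gemini, or Anthropic clients behind the same interface."""
    def generate(self, prompt: str) -> str:
        return f"# solution for: {prompt}\n"

class Evaluator:
    """Depends only on the abstract strategy, injected at construction,
    so backends can be swapped without touching evaluation logic."""
    def __init__(self, generator: CodeGenerator) -> None:
        self.generator = generator

    def sample(self, prompt: str, k: int) -> list[str]:
        """Draw k candidate solutions for one problem prompt."""
        return [self.generator.generate(prompt) for _ in range(k)]

samples = Evaluator(EchoGenerator()).sample("two_sum", 3)
print(len(samples))  # 3
```

Because `CodeGenerator` is a structural `Protocol`, any class with a matching `generate` method satisfies it; no inheritance is required.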
- Framework: Pytest.
- Scope: Unit tests, Integration tests, End-to-end (simulated).
- Metrics: Track code coverage, pass@k rates, execution times.
- Execution: `pytest` command.
- Package Management: uv.
- Linting & Formatting: Ruff.
- Version Control: Git.
- Branching Strategy: Gitflow (or similar feature-branching model).
- Error Handling: Implement robust, centralized error handling and logging mechanisms.
- Configuration: Externalize configurations using environment variables or dedicated config files.
- Feedback Loop: Regularly review performance metrics and user feedback to iterate on the framework.
- Technology Watch: Stay abreast of the latest advancements in LLMs, code generation, and evaluation techniques.
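The centralized error handling principle above could be sketched as a thin wrapper that logs unexpected failures and normalizes them into one typed exception (names are illustrative):

```python
import logging

logger = logging.getLogger("llm_eval")

class EvaluationError(Exception):
    """Base class for all framework errors (name illustrative)."""

def safe_evaluate(fn, *args):
    """Run one evaluation step through a single error-handling chokepoint.

    Known, typed failures propagate unchanged; anything unexpected is
    logged with a traceback and re-raised as EvaluationError so callers
    only ever need to catch one exception type.
    """
    try:
        return fn(*args)
    except EvaluationError:
        raise  # already a known, typed failure
    except Exception as exc:
        logger.exception("evaluation step failed")
        raise EvaluationError(str(exc)) from exc
```

Routing every step through one chokepoint keeps logging consistent and lets reporting code distinguish "candidate failed" from "the harness itself broke".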