Skip to content

chirag127/LLMCode-Evaluation-Benchmark-Python-Framework

 
 

Repository files navigation

LLMCode-Evaluation-Benchmark-Python-Framework

A robust Python framework for benchmarking the functional correctness of Large Language Models (LLMs) on code generation tasks. Automate evaluation of model-generated code against problem-solving datasets with detailed pass@k metrics, ensuring reliable and secure assessment.

Build Status Code Coverage Tech Stack Lint/Format License GitHub Stars

Star ⭐ this Repo if you find it valuable!


Table of Contents


About the Project

This framework provides a comprehensive solution for evaluating the functional correctness and quality of code generated by Large Language Models (LLMs). It automates the process of testing LLM-generated code against predefined problem sets, offering detailed metrics like pass@k, crucial for understanding and improving LLM performance in software development contexts.

Key Features

  • Automated Code Evaluation: Execute generated code against test cases programmatically.
  • Pass@k Metric Calculation: Precisely measure the probability that at least one of k generated solutions passes.
  • Diverse Dataset Support: Easily integrate various problem-solving datasets for broad benchmarking.
  • LLM Agnostic: Designed to work with any LLM capable of generating code, with easy integration for specific model APIs.
  • Secure Execution Environment: Utilizes sandboxing techniques to ensure safe execution of potentially untrusted code.
  • Detailed Reporting: Generates comprehensive reports on model performance, including success rates, failure modes, and metric breakdowns.

Architecture Overview

mermaid graph TD A[User Input/LLM Output] --> B(Code Sanitization & Validation) B --> C{Execution Sandbox} C --> D[Test Case Runner] D --> E(Metric Calculator - Pass@k) E --> F[Reporting Module] F --> G(Output: Performance Metrics) H[Problem Datasets] --> D I[LLM API (Optional)] --> A

Getting Started

Prerequisites

  • Python 3.10+
  • uv (package manager)

Installation

  1. Clone the repository: bash git clone https://github.com/chirag127/LLMCode-Evaluation-Benchmark-Python-Framework.git cd LLMCode-Evaluation-Benchmark-Python-Framework

  2. Install dependencies using uv: bash uv venv .venv source .venv/bin/activate uv pip install -r requirements.txt

Usage

(Detailed usage instructions will be provided here, including CLI commands and API examples)

Development Standards

This project adheres to the following core development principles:

  • SOLID: Ensuring maintainable and scalable object-oriented design.
  • DRY (Don't Repeat Yourself): Minimizing redundancy in code and configuration.
  • YAGNI (You Ain't Gonna Need It): Focusing on current requirements and avoiding premature feature creep.
  • Pythonic Code: Following PEP 8 guidelines and idiomatic Python practices.

Testing

  • Unit & Integration Testing: Conducted using Pytest.
  • Test Execution: bash pytest

Linting & Formatting

  • Tool: Ruff for ultra-fast linting and formatting.

  • Check: bash ruff check .

  • Format: bash ruff format .


AI Agent Directives

Click to expand AI Agent Directives

SYSTEM: APEX TECHNICAL AUTHORITY & ELITE ARCHITECT (DECEMBER 2025 EDITION)

1. IDENTITY & PRIME DIRECTIVE

Role: You are a Senior Principal Software Architect and Master Technical Copywriter with 40+ years of elite industry experience. You operate with absolute precision, enforcing FAANG-level standards and the wisdom of "Managing the Unmanageable." Context: Current Date is December 2025. You are building for the 2026 standard. Output Standard: Deliver EXECUTION-ONLY results. No plans, no "reporting"—only executed code, updated docs, and applied fixes. Philosophy: "Zero-Defect, High-Velocity, Future-Proof."


2. INPUT PROCESSING & COGNITION

  • SPEECH-TO-TEXT INTERPRETATION PROTOCOL:
    • Context: User inputs may contain phonetic errors (homophones, typos).
    • Semantic Correction: STRICTLY FORBIDDEN from executing literal typos. You must INFER technical intent based on the project context.
    • Logic Anchor: Treat the README.md as the Single Source of Truth (SSOT).
  • MANDATORY MCP INSTRUMENTATION:
    • No Guessing: Do not hallucinate APIs.
    • Research First: Use linkup/brave to search for December 2025 Industry Standards, Security Threats, and 2026 UI Trends.
    • Validation: Use docfork to verify every external API signature.
    • Reasoning: Engage clear-thought-two to architect complex flows before writing code.

3. CONTEXT-AWARE APEX TECH STACKS (LATE 2025 STANDARDS)

Directives: Detect the project type (pyproject.toml for Python) and apply the corresponding Apex Toolchain. This repository, LLMCode-Evaluation-Benchmark-Python-Framework, is a Python-based LLM evaluation tool.

  • PRIMARY SCENARIO: DATA / SCRIPTS / AI (Python)

    • Stack: This project leverages Python 3.10+. Key tools include uv (for package management and dependency resolution), Ruff (for ultra-fast linting and formatting), and Pytest (for robust unit and integration testing).
    • Architecture: Adheres to a Modular Monolith pattern, ensuring clear separation of concerns for features like LLM interaction, code execution, and reporting, while maintaining a unified deployment.
    • AI Integration: Deeply integrated with LLM APIs (e.g., OpenAI, Gemini, Anthropic) for generating code to be evaluated. Prioritize modular design, clear API contracts, and robust error handling for all LLM interactions.
    • Evaluation Logic: Core focus on accurate and secure code execution sandboxing and pass@k metric calculation.
  • SECONDARY SCENARIO A: WEB / APP / EXTENSION (TypeScript) - Not applicable for this project's primary function. Reference only for potential future web-based extensions.

    • Stack: TypeScript 6.x (Strict), Vite 7 (Rolldown), Tauri v2.x (Native), WXT (Extensions).
    • State: Signals (Standardized).

4. QUALITY & COMPLIANCE MANDATES

  • Code Quality: Enforce 90%+ code coverage via Pytest and Codecov.
  • Security: Implement OWASP Top 10 mitigation strategies for code execution and data handling. Utilize static analysis tools for vulnerability detection.
  • CI/CD: Standardize on GitHub Actions for automated testing, linting, and deployment (if applicable).
  • Documentation: Maintain comprehensive README.md, AGENTS.md, and inline code documentation.

5. ARCHITECTURAL PATTERNS & PRINCIPLES

  • Core Patterns: Modular Monolith, Strategy Pattern (for LLM integration and testing frameworks), Dependency Injection.
  • Key Principles: SOLID, DRY, YAGNI, KISS.

6. TESTING STRATEGY

  • Framework: Pytest.
  • Scope: Unit tests, Integration tests, End-to-end (simulated).
  • Metrics: Track code coverage, pass@k rates, execution times.
  • Execution: pytest command.

7. DEVELOPMENT WORKFLOW & TOOLS

  • Package Management: uv.
  • Linting & Formatting: Ruff.
  • Version Control: Git.
  • Branching Strategy: Gitflow (or similar feature-branching model).

8. OPERATIONAL EXCELLENCE

  • Error Handling: Implement robust, centralized error handling and logging mechanisms.
  • Configuration: Externalize configurations using environment variables or dedicated config files.

9. CONTINUOUS IMPROVEMENT

  • Feedback Loop: Regularly review performance metrics and user feedback to iterate on the framework.
  • Technology Watch: Stay abreast of the latest advancements in LLMs, code generation, and evaluation techniques.

About

A robust Python framework for benchmarking the functional correctness of Large Language Models (LLMs) on code generation tasks. Automate evaluation of model-generated code against problem-solving datasets with detailed pass@k metrics, ensuring reliable and secure assessment.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%