CrisisAI: GGUF Model Evaluation Framework# CrisisAI: GGUF Model Evaluation Framework

Documentation## 1. Problem Statement

All documentation has been moved to the docs/ directory for better organization.In an emergency or crisis situation where communication infrastructure (internet, cellular networks) is unavailable, access to reliable information can be critical. The advent of small, efficient Large Language Models (LLMs) in formats like GGUF makes it possible to run a helpful AI assistant entirely offline on a device like a smartphone or laptop.

Quick StartHowever, with a multitude of open-source models available, a key challenge arises: How do we select the most effective and reliable GGUF model for providing safe, practical, and clear advice to a non-expert user under stress?

Quick Start Guide - Get started in 5 minutes
README - Complete project overview and setup instructionsAn ideal model must not only be accurate but also avoid dangerous "hallucinations," correct common misconceptions, and present information in a simple, step-by-step manner. This project provides a systematic framework to test, evaluate, and compare different GGUF models for this specific, high-stakes use case.

Documentation## 2. The Solution: A Systematic Evaluation Pipeline

Batch Organization - How batch testing and results organization works
Automated Batch Testing - Complete automation guideThis project implements a multi-stage pipeline to rigorously evaluate and compare LLMs. It uses a powerful online model (Google's Gemini) as an objective "expert" to score the performance of smaller, offline models, moving from subjective impressions to data-driven analysis.
Evaluation Viewer Guide - Interactive HTML viewer documentation
Architecture - System architecture and workflow diagramsThe workflow is managed by a series of scripts:

Development- Question Generation: We start with a curated list of plausible, non-expert questions covering various off-grid crisis scenarios (crisis_questions.json).

System Ready - Feature summary and status- Automated Testing: A Python script (run_crisis_test.py) connects to a local GGUF model (served via LM Studio) and records its answers to every question. This is repeated for each model being tested.
Timing Improvements - Performance measurement details- Expert Assessment: A second script (evaluate_models.py) aggregates all the answers and sends them, one question at a time, to the Gemini API. Gemini is prompted to provide its own "ideal" answer and then score each of the smaller models' answers on a comparative scale of 0-10, providing a justification for each score.
Fixes Applied - Technical fixes and solutions- Visual Reporting: The final script (generate_report.py) processes the evaluation data and generates a single, self-contained, interactive HTML report (evaluation_report.html) for easy analysis and comparison.

Code Files## 3. How to Use This Framework

batch_test_models.py - Automated batch testing scriptFollow these steps to evaluate your own GGUF models.
llm-crisis-questions-test.py - Core testing script
test-evaluation.py - Gemini evaluation script### Prerequisites
eval_batch.py - Batch evaluation helper
evaluation-viewer.html - Interactive HTML viewer (requires local server)- Python 3.7+: Ensure Python and Pip are installed.
html-report-generator.py - Static HTML report generator- LM Studio: Download and install from lmstudio.ai. This is used to easily serve GGUF models via a local server.
GGUF Models: Download the GGUF model files you wish to test.

Quick Start- Gemini API Key: Obtain an API key from Google AI Studio.

Automated Testing: python batch_test_models.py### Step 1: Initial Setup
Evaluate Results: python eval_batch.py
View Results: Run python serve-viewer.py and open the displayed URLClone or download this project's files into a single directory.

See docs/QUICK_START.md for detailed instructions.Install the required Python library:

pip install requests

Customize the questions in crisis_questions.json if desired.

Step 2: Test Your Local GGUF Models

🚀 RECOMMENDED: Automated Batch Testing

Test multiple models automatically in one run:

python batch_test_models.py

This will:

List all available models in LM Studio
Let you select models using checkboxes (Space to select, Enter to confirm)
Create a timestamped batch folder (e.g., test_results/2025-10-09_1/)
Automatically load, test, and unload each model
Save all results to the batch folder

See AUTOMATED_BATCH_TESTING.md for complete automation guide.

Manual Testing (Alternative)

For testing individual models manually:

Load Model in LM Studio

Open LM Studio, load a GGUF model, and navigate to the Local Server tab (<-->).

Start Server

Click Start Server.

Run Test Script

Open a terminal in the project directory and run the script:

# Test/Dry-Run
python .\llm-crisis-questions-test.py

# Or specify model name (for result file naming)
python .\llm-crisis-questions-test.py --model-name "Mistral-7B-Instruct-v0.3.Q4_K_M"

Step 3: Evaluate Results

Automated Batch Evaluation

After running batch tests, evaluate the results:

# List available batches
python eval_batch.py --list

# Evaluate latest batch (default)
python eval_batch.py

# Evaluate specific batch
python eval_batch.py --batch 2025-10-09_1

See BATCH_ORGANIZATION.md for batch folder organization guide.

Manual Evaluation (Alternative)

For manually collected results:

Set API Key

Set your Gemini API key as an environment variable.

macOS/Linux: export GEMINI_API_KEY="YOUR_API_KEY"
Windows (CMD): set GEMINI_API_KEY="YOUR_API_KEY"
Windows (PowerShell): $env:GEMINI_API_KEY="YOUR_API_KEY"

Run Evaluation Script

With all your *_results.json files in the directory, run the evaluation script:

python test-evaluation.py

Step 4: View the Visual Report

Step 4: View Interactive Results

Start a local web server and open the viewer:

# Start Python's built-in web server
python -m http.server 8000

# Then open in your browser:
# http://localhost:8000/evaluation-viewer.html

This will:

Serve the HTML viewer at http://localhost:8000/evaluation-viewer.html
Automatically load all evaluation reports from eval_results/
Display chronological tabs for each evaluation report
Provide interactive analysis and comparison tools

Features:

Automatic loading of all evaluation reports
Chronological tab navigation (newest first)
Performance summary with model rankings
Detailed question-by-question analysis
Interactive score breakdowns

Alternative: Generate Static HTML Report

If you prefer a pre-generated static report:

python html-report-generator.py

This creates evaluation_report.html with embedded data.

4. File Descriptions

Core Scripts

batch_test_models.py: 🚀 [RECOMMENDED] Automated batch testing script. Lists models, provides checkbox selection UI, automatically loads/unloads models via LM Studio CLI, creates timestamped batch folders, and runs tests sequentially.
llm-crisis-questions-test.py: Core testing script that sends crisis questions to a loaded LLM and records answers. Can be run standalone or called by batch script.
test-evaluation.py: Uses Gemini API to evaluate and score model answers. Supports both flat structure and batch folders via BATCH_FOLDER environment variable.
eval_batch.py: Helper script for batch evaluation. Lists available batches, auto-detects latest, and wraps test-evaluation.py with proper environment setup.
html-report-generator.py: Generates static HTML report with embedded evaluation data (legacy approach).

Viewer

evaluation-viewer.html: 🚀 [RECOMMENDED] Interactive standalone HTML viewer. Open in any browser, drag & drop JSON report files, view results instantly. Works completely offline - no Python or server needed!

Data Files

Crisis-Questions.json: Structured JSON file containing crisis scenario questions used for testing.
test_results/YYYY-MM-DD_N/: Batch folders containing test results (format: model-name_timestamp.json and *_runinfo.json).
gemini_evaluation_report.json: Generated master file with Gemini's expert answers and comparative scores.
evaluation_report.html: Generated interactive report for analysis.

Documentation

AUTOMATED_BATCH_TESTING.md: Complete guide for automated batch testing with LM Studio CLI integration.
BATCH_ORGANIZATION.md: Comprehensive guide for batch folder organization, evaluation workflows, and examples.
TIMING_IMPROVEMENTS.md: Documentation of timing accuracy improvements (excludes file I/O overhead).
FIXES_APPLIED.md: Record of Unicode encoding fixes for Windows subprocess handling.

Utility Scripts

list_gemini_models.py: Lists available Gemini models for evaluation.
check_test_status.py: Monitors running batch tests and shows progress.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
docs		docs
eval_results		eval_results
old		old
test_results		test_results
.gitignore		.gitignore
Crisis-Questions.json		Crisis-Questions.json
LICENSE		LICENSE
README.md		README.md
TEST_RUN_ANALYSIS.md		TEST_RUN_ANALYSIS.md
add_sizes_to_report.py		add_sizes_to_report.py
aggregated_answers.json		aggregated_answers.json
analyze_test_run.py		analyze_test_run.py
analyze_test_run_correct.py		analyze_test_run_correct.py
batch_test_models.py		batch_test_models.py
check_errors.py		check_errors.py
check_qwen_completeness.py		check_qwen_completeness.py
debug_names.py		debug_names.py
debug_null_sizes.py		debug_null_sizes.py
debug_report.py		debug_report.py
enrich_report_with_sizes.py		enrich_report_with_sizes.py
eval_batch.py		eval_batch.py
evaluation-viewer.html		evaluation-viewer.html
generate_reports_index.py		generate_reports_index.py
llm-crisis-questions-test.py		llm-crisis-questions-test.py
models_config.json		models_config.json
models_config_old.json		models_config_old.json
requirements.txt		requirements.txt
serve_viewer.py		serve_viewer.py
test-evaluation.py		test-evaluation.py
test_all_models.py		test_all_models.py
test_model_mapping.py		test_model_mapping.py
test_path_map.py		test_path_map.py
test_q8_regex.py		test_q8_regex.py
test_regex.py		test_regex.py
test_resolve.py		test_resolve.py
test_variant_id.py		test_variant_id.py
test_variants.py		test_variants.py
update_model_sizes.py		update_model_sizes.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CrisisAI: GGUF Model Evaluation Framework# CrisisAI: GGUF Model Evaluation Framework

Documentation## 1. Problem Statement

Quick StartHowever, with a multitude of open-source models available, a key challenge arises: How do we select the most effective and reliable GGUF model for providing safe, practical, and clear advice to a non-expert user under stress?

Documentation## 2. The Solution: A Systematic Evaluation Pipeline

Development- Question Generation: We start with a curated list of plausible, non-expert questions covering various off-grid crisis scenarios (crisis_questions.json).

Code Files## 3. How to Use This Framework

Quick Start- Gemini API Key: Obtain an API key from Google AI Studio.

Step 2: Test Your Local GGUF Models

🚀 RECOMMENDED: Automated Batch Testing

Manual Testing (Alternative)

Load Model in LM Studio

Start Server

Run Test Script

Step 3: Evaluate Results

Automated Batch Evaluation

Manual Evaluation (Alternative)

Set API Key

Run Evaluation Script

Step 4: View the Visual Report

Step 4: View Interactive Results

Alternative: Generate Static HTML Report

4. File Descriptions

Core Scripts

Viewer

Data Files

Documentation

Utility Scripts

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CrisisAI: GGUF Model Evaluation Framework# CrisisAI: GGUF Model Evaluation Framework

Documentation## 1. Problem Statement

Quick StartHowever, with a multitude of open-source models available, a key challenge arises: How do we select the most effective and reliable GGUF model for providing safe, practical, and clear advice to a non-expert user under stress?

Documentation## 2. The Solution: A Systematic Evaluation Pipeline

Development- Question Generation: We start with a curated list of plausible, non-expert questions covering various off-grid crisis scenarios (crisis_questions.json).

Code Files## 3. How to Use This Framework

Quick Start- Gemini API Key: Obtain an API key from Google AI Studio.

Step 2: Test Your Local GGUF Models

🚀 RECOMMENDED: Automated Batch Testing

Manual Testing (Alternative)

Load Model in LM Studio

Start Server

Run Test Script

Step 3: Evaluate Results

Automated Batch Evaluation

Manual Evaluation (Alternative)

Set API Key

Run Evaluation Script

Step 4: View the Visual Report

Step 4: View Interactive Results

Alternative: Generate Static HTML Report

4. File Descriptions

Core Scripts

Viewer

Data Files

Documentation

Utility Scripts

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages