Test results are now organized in timestamped batch folders for better management and tracking.
test_results/
├── 2025-10-09_1/                  ← First batch run on Oct 9
│   ├── smollm2-360m-instruct_2025-10-09_22-30-00.json
│   ├── smollm2-360m-instruct_2025-10-09_22-30-00_runinfo.json
│   ├── phi-3.5-mini-instruct_2025-10-09_22-35-15.json
│   └── phi-3.5-mini-instruct_2025-10-09_22-35-15_runinfo.json
├── 2025-10-09_2/                  ← Second batch run on Oct 9
│   ├── qwen2.5-1.5b-instruct_2025-10-09_23-15-00.json
│   └── ...
└── 2025-10-10_1/                  ← First batch run on Oct 10
    └── ...
Format: YYYY-MM-DD_N, where N starts at 1 and increments for each additional run on the same day
python batch_test_models.py

Output:
🚀 Starting Batch Test Run - 3 model(s)
Started at: 2025-10-09 22:30:00
Results folder: 2025-10-09_1 ← Auto-created batch folder
[1/3] Testing: smollm2-360m-instruct
→ Unloading previous model...
→ Loading model...
✓ Model loaded
→ Running crisis questions test...
✓ Completed in 180 seconds
...
✨ All done!
Results saved to: test_results/2025-10-09_1/
python eval_batch.py --list

Output:
📊 Available Batch Folders:
2025-10-10_1 (5 models)
2025-10-09_2 (3 models)
2025-10-09_1 (8 models)
python eval_batch.py

Output:
Using latest batch: 2025-10-10_1
======================================================================
Evaluating Batch: 2025-10-10_1
Location: test_results\2025-10-10_1
Models: 5
======================================================================
Using batch folder from environment: test_results\2025-10-10_1
Found 5 model result files
...
python eval_batch.py --batch 2025-10-09_1

Output:
======================================================================
Evaluating Batch: 2025-10-09_1
Location: test_results\2025-10-09_1
Models: 8
======================================================================
...
New Function:
def create_batch_folder() -> str:
    """
    Create a new batch folder in test_results with format YYYY-MM-DD_N
    where N is incremental for same-day runs.
    """
    today = datetime.now().strftime("%Y-%m-%d")
    # Find existing folders for today
    existing = [...]
    # Get next number
    next_num = max(existing) + 1 if existing else 1
    batch_folder_name = f"{today}_{next_num}"
    ...

Usage:
- Creates batch folder at start of run
- Passes folder path to llm-crisis-questions-test.py
- All results saved to that batch folder
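
The body of create_batch_folder is elided above. As a minimal sketch of how the numbering rule could work, assuming batch folders sit directly under the results root and are named exactly YYYY-MM-DD_N, the hypothetical helper next_batch_folder below scans for folders with today's date and creates the next one. The function name, the results_root and today parameters, and the regular expression are illustrative assumptions, not the script's actual code.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional


def next_batch_folder(results_root: str, today: Optional[str] = None) -> Path:
    """Create and return the next YYYY-MM-DD_N folder under results_root.

    Hypothetical helper: `today` can be injected to make the function testable.
    """
    root = Path(results_root)
    root.mkdir(parents=True, exist_ok=True)
    today = today or datetime.now().strftime("%Y-%m-%d")

    # Collect the numeric suffixes of folders already created for this date.
    pattern = re.compile(rf"^{re.escape(today)}_(\d+)$")
    existing = []
    for entry in root.iterdir():
        match = pattern.match(entry.name)
        if entry.is_dir() and match:
            existing.append(int(match.group(1)))

    # Same rule as the snippet above: highest existing number plus one.
    next_num = max(existing) + 1 if existing else 1
    batch_path = root / f"{today}_{next_num}"
    batch_path.mkdir()
    return batch_path
```

Comparing the suffix as an integer rather than as text matters once a day has ten or more runs, since "10" sorts before "2" as a string.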
Updated Signature:
def main(model_name: str | None = None, results_dir: str | None = None):
    """
    Args:
        model_name: Name to use for the model in output files
        results_dir: Directory to save results in (defaults to RESULTS_DIR)
    """
    output_dir = results_dir if results_dir else RESULTS_DIR
    ...

Change:
- Accepts optional results_dir parameter
- Saves to specified directory instead of default
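
To illustrate how the results_dir fallback could feed into the file names shown in the folder tree, here is a small sketch. It assumes the default RESULTS_DIR is "test_results" and that each run writes a results file plus a matching _runinfo.json file stamped with the same time. The helper build_output_paths and its now parameter are hypothetical and are not part of llm-crisis-questions-test.py.

```python
import os
from datetime import datetime
from typing import Optional, Tuple

RESULTS_DIR = "test_results"  # assumed default, mirroring the layout above


def build_output_paths(
    model_name: str,
    results_dir: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Tuple[str, str]:
    """Return (results_path, runinfo_path) for one model run.

    Hypothetical helper: falls back to RESULTS_DIR when no batch folder is given.
    """
    output_dir = results_dir if results_dir else RESULTS_DIR
    # Timestamp layout copied from the example file names: YYYY-MM-DD_HH-MM-SS.
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    base = os.path.join(output_dir, f"{model_name}_{stamp}")
    return base + ".json", base + "_runinfo.json"
```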
New Support:
BATCH_FOLDER = os.getenv('BATCH_FOLDER')
if BATCH_FOLDER:
    INPUT_FILE_PATTERN = os.path.join(BATCH_FOLDER, '*.json')
    print(f"Using batch folder from environment: {BATCH_FOLDER}")
else:
    INPUT_FILE_PATTERN = os.path.join('test_results', '*.json')

Changes:
- Checks for BATCH_FOLDER environment variable
- Filters out _runinfo.json files automatically
- Can be controlled via eval_batch.py wrapper
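
The snippet above shows how the glob pattern is chosen but not how the _runinfo.json files are skipped. One plausible way to do that filtering, offered as a sketch rather than the actual code in test-evaluation.py, is to drop any match whose name ends in _runinfo.json. The helper name find_result_files is an assumption.

```python
import glob
import os
from typing import List


def find_result_files(pattern: str) -> List[str]:
    """Return files matching pattern, skipping the *_runinfo.json companions.

    Hypothetical helper; the real script may filter differently.
    """
    matches = glob.glob(pattern)
    return sorted(p for p in matches if not p.endswith("_runinfo.json"))


# Same selection rule as the snippet above: batch folder if set, else flat layout.
batch_folder = os.getenv("BATCH_FOLDER")
input_pattern = os.path.join(batch_folder or "test_results", "*.json")
result_files = find_result_files(input_pattern)
print(f"Found {len(result_files)} model result files")
```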
Purpose: Helper script to easily evaluate batch folders
Features:
- List all available batches
- Auto-detect latest batch
- Evaluate specific batch
- Sets environment variable for test-evaluation.py
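
The features listed above can be sketched in a few functions. This is an illustration under stated assumptions, not the contents of eval_batch.py: the names list_batches, batch_sort_key and evaluate_batch are hypothetical, and the sketch assumes test-evaluation.py is launched as a child process that reads BATCH_FOLDER from its environment.

```python
import os
import re
import subprocess
import sys
from typing import List, Optional, Tuple

BATCH_NAME = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d+)$")


def batch_sort_key(name: str) -> Tuple[str, int]:
    """Sort by date, then by run number compared as an integer."""
    match = BATCH_NAME.match(name)
    if match is None:
        raise ValueError(f"Not a batch folder name: {name}")
    return match.group(1), int(match.group(2))


def list_batches(results_root: str = "test_results") -> List[str]:
    """Return batch folder names under results_root, newest first."""
    if not os.path.isdir(results_root):
        return []
    names = [
        n for n in os.listdir(results_root)
        if BATCH_NAME.match(n) and os.path.isdir(os.path.join(results_root, n))
    ]
    return sorted(names, key=batch_sort_key, reverse=True)


def evaluate_batch(
    batch: Optional[str] = None,
    results_root: str = "test_results",
    script: str = "test-evaluation.py",
) -> int:
    """Run the evaluation script on one batch (latest if none is named)."""
    batches = list_batches(results_root)
    if not batches:
        raise FileNotFoundError(f"No batch folders in {results_root}")
    chosen = batch or batches[0]
    if chosen not in batches:
        raise ValueError(f"Unknown batch: {chosen}")
    # Hand the folder to the child process through its environment.
    env = dict(os.environ, BATCH_FOLDER=os.path.join(results_root, chosen))
    return subprocess.run([sys.executable, script], env=env).returncode
```

Sorting on the parsed run number keeps 2025-10-09_10 ahead of 2025-10-09_2 when picking the latest batch, which a plain string sort would get wrong.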
- Each batch test run is isolated
- Easy to identify when tests were run
- Multiple runs per day supported
- Compare different batch runs easily
- Keep historical results
- Track model improvements over time
- No mixed results in one folder
- Easy to delete old batches
- Clear naming convention
- Scripts auto-create folders
- Automatic incrementing
- No manual folder creation needed
# Select and test models
python batch_test_models.py
# Select: smollm2-360m, phi-3.5-mini, qwen2.5-1.5b
# Results saved to: test_results/2025-10-09_1/
# Evaluate the batch
python eval_batch.py
# Uses latest: 2025-10-09_1

# Morning batch - small models
python batch_test_models.py
# Select: smollm2-360m, gemma-2-2b
# → Saved to: test_results/2025-10-09_1/
# Afternoon batch - larger models
python batch_test_models.py
# Select: phi-4-reasoning-plus, qwen3-4b
# → Saved to: test_results/2025-10-09_2/
# List all batches
python eval_batch.py --list
# Shows both: 2025-10-09_1 and 2025-10-09_2
# Evaluate specific batch
python eval_batch.py --batch 2025-10-09_1

# Evaluate yesterday's batch
python eval_batch.py --batch 2025-10-08_1
# Evaluate today's batch
python eval_batch.py --batch 2025-10-09_1
# Compare the generated reports

Old flat structure still works!
If you have old results in test_results/*.json (not in batch folders):
- test-evaluation.py still finds them when run directly
- eval_batch.py only looks at batch folders
- You can move old files into a batch folder manually:
# Create a batch folder for old results
mkdir test_results/2025-10-08_1
# Move old files
move test_results/*.json test_results/2025-10-08_1/

| Task | Command |
|---|---|
| Run batch tests | python batch_test_models.py |
| List batches | python eval_batch.py --list |
| Eval latest | python eval_batch.py |
| Eval specific | python eval_batch.py --batch 2025-10-09_1 |
| Check test status | python check_test_status.py |
Solution: Run python batch_test_models.py first to create batch results.
Solution: Check folder name with python eval_batch.py --list
Solution: Run python test-evaluation.py directly (without eval_batch.py)
✅ Organized - Each batch run in its own folder
✅ Automatic - Folders created automatically with incrementing numbers
✅ Easy Evaluation - Simple commands to evaluate any batch
✅ Historical - Keep and compare multiple test runs
✅ Backward Compatible - Old results still work
Your test results are now much better organized! 🎯