This is the companion code for Evals 101 for web developers. It contains an example evals system that evaluates AI-generated outputs with both rule-based and LLM-as-a-judge evals.
This repo includes:
- A simple web application, ThemeBuilder, that generates a brand identity (motto, color palette, typography) based on a company description and target audience.
- A rule-based evaluator for the application's outputs.
- An LLM-as-a-judge evaluator for the application's outputs.
- Tests for the evaluators themselves.
- Tests for the application's outputs, based on the evaluators.
graph TD
UI[User Prompt constraints] --> TB[ThemeBuilder service]
TB --> Output["App output: motto, color palette"]
Output --> RB[Rule-based evals: data format, contrast]
Output --> LLM[LLM judge evals: brand fit, toxicity]
RB --> R[eval result PASS/FAIL]
LLM --> R
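To make the rule-based branch of the diagram concrete, here is a minimal sketch of a data format check and a contrast check. It is not the repo's implementation: the function names, the result shape, and the 4.5:1 threshold (the WCAG AA minimum for normal text) are assumptions chosen for illustration. Only the `colorPalette` field names come from the example request further down.

```typescript
// Hypothetical types: the palette mirrors the colorPalette object in the
// example request; the result shape mirrors the label/rationale pairs in
// the example response.
type ColorPalette = {
  textColor: string;
  backgroundColor: string;
  primary: string;
  secondary: string;
};
type RuleResult = { label: "PASS" | "FAIL"; rationale: string };

const HEX_COLOR = /^#[0-9a-fA-F]{6}$/;

// Relative luminance of a #RRGGBB color, following the WCAG 2.x definition.
function relativeLuminance(hex: string): number {
  const linear = (start: number): number => {
    const c = parseInt(hex.slice(start, start + 2), 16) / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * linear(1) + 0.7152 * linear(3) + 0.0722 * linear(5);
}

// WCAG contrast ratio: ranges from 1 (identical colors) to 21 (black on white).
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Data format rule (assumed): every palette entry must be a #RRGGBB hex color.
function evalDataFormat(palette: ColorPalette): RuleResult {
  for (const [name, value] of Object.entries(palette)) {
    if (!HEX_COLOR.test(value)) {
      return { label: "FAIL", rationale: `${name} is not a #RRGGBB color: ${value}` };
    }
  }
  return { label: "PASS", rationale: "Format is valid." };
}

// Contrast rule (assumed threshold): text must reach 4.5:1 against the background.
function evalContrast(palette: ColorPalette, minRatio = 4.5): RuleResult {
  const ratio = contrastRatio(palette.textColor, palette.backgroundColor);
  const label = ratio >= minRatio ? "PASS" : "FAIL";
  return { label, rationale: `Contrast ratio is ${ratio.toFixed(2)}:1 (minimum ${minRatio}:1).` };
}
```

Rules like these are deterministic and cheap to run, which is why they suit checks such as format and contrast, while subjective criteria such as brand fit go to the LLM judge.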
- Create a Gemini API key.
- Create an .env file in the evals-service directory with your GEMINI_API_KEY.
- Install dependencies: npm install (run from the root of evals-course)
Running evaluations from the evals-course root:
These tests check the evaluator functions themselves. You'd typically run these tests while developing the evaluators to assess the correctness of the criteria and LLM scoring logic.
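The LLM judge test scripts report alignment %, Cohen's Kappa, precision, and recall. As a rough illustration of what those numbers mean, the sketch below computes them by comparing a judge's labels against human reference labels. The function name, the label type, and the choice of FAIL as the positive class are assumptions for this example; the repo's tests may compute these metrics differently.

```typescript
type Label = "PASS" | "FAIL";

type JudgeMetrics = {
  alignment: number; // share of items where judge and human agree
  kappa: number;     // agreement corrected for chance (Cohen's Kappa)
  precision: number; // of the items the judge flagged, how many humans flagged
  recall: number;    // of the items humans flagged, how many the judge flagged
};

// Compares judge labels with human labels, item by item.
// Assumption: "FAIL" is the positive class, i.e. the thing the judge should detect.
function judgeMetrics(human: Label[], judge: Label[], positive: Label = "FAIL"): JudgeMetrics {
  if (human.length === 0 || human.length !== judge.length) {
    throw new Error("human and judge must be non-empty and of equal length");
  }
  let tp = 0, fp = 0, fn = 0, tn = 0;
  human.forEach((h, i) => {
    const humanPositive = h === positive;
    const judgePositive = judge[i] === positive;
    if (humanPositive && judgePositive) tp++;
    else if (!humanPositive && judgePositive) fp++;
    else if (humanPositive && !judgePositive) fn++;
    else tn++;
  });
  const n = human.length;
  const observed = (tp + tn) / n;
  // Agreement expected by chance, from each rater's marginal label frequencies.
  const expected =
    ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n);
  return {
    alignment: observed,
    // Convention used here: if both raters only ever use one label, report 1.
    kappa: expected === 1 ? 1 : (observed - expected) / (1 - expected),
    precision: tp + fp === 0 ? 0 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
  };
}
```

Alignment % alone can look high when most items share one label, which is why Kappa, precision, and recall are useful additions in the advanced tests.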
Run tests for rule-based evaluators:
npm run test:rule-based-evals

Run basic tests for LLM judge evaluators (alignment% only):
- All tests: npm run test:llm-judge-evals-basic
- npm run test:llm-judge-evals-basic-no-bootstrap
- npm run test:llm-judge-evals-basic-bootstrap
- npm run test:llm-judge-evals-basic-consistency
- npm run test:llm-judge-evals-basic-final-exam

Run advanced tests for LLM judge evaluators (alignment%, Cohen's Kappa, precision, recall):
- All tests: npm run test:llm-judge-evals

These tests evaluate the application outputs using the evaluators.
They execute the real ThemeBuilder application service against a dataset of prompts, using both static rules and our evaluators (rule-based and LLM judge) to grade ThemeBuilder's AI-generated outputs.
Run unit testing for the AI application:
npm run test:unit-evals

You can append -fast to any script to run in fast mode (e.g., npm run test:unit-evals-fast or npm run test:all-fast).
Fast mode caps evaluation scenarios to a small number of samples per suite. Recommended for rapid local iteration and debugging to avoid long wait times.
Every time you run npm run test:unit-evals (or npm run test:unit-evals-fast), the evaluation suite generates a detailed HTML report and updates the multi-run dashboard index at evals-service/reports.
- Automatic Startup: The test runner automatically attempts to serve the dashboard upon completion on port 8085:
  🌐 Live Dashboard served at: http://localhost:8085
- Manual Startup: If you want to start the dashboard server manually without running the tests again, run the following command from the evals-service directory:
  npx http-server reports -p 8085
  Then open your browser at http://localhost:8085.
Run the service: npm start (or npm run dev for development)
- Note: The service runs on port 8080 by default.
Once the service is running, you can evaluate data by sending a POST request to /api/evaluate.
Here is an example using curl:
curl -X POST http://localhost:8080/api/evaluate \
-H "Content-Type: application/json" \
-d '{
"data": [
{
"id": "brand-003",
"userInput": {
"companyName": "Loom",
"description": "A boutique textile mill specializing in traditional indigo-dyeing and hand-loomed linens.",
"audience": "interior designers and slow-fashion advocates",
"tone": ["tactile", "minimalist", "earthy"]
},
"appOutput": {
"motto": "Woven by hand and time.",
"colorPalette": {
"textColor": "#262626",
"backgroundColor": "#F5F5F4",
"primary": "#312E81",
"secondary": "#A8A29E"
}
}
}
]
}'
This will return an evaluation result containing the format validation label and several LLM-as-a-judge checks.
For example:
{
"results": [
{
"id": "brand-003",
"dataFormat": {
"label": "PASS",
"rationale": "Format is valid."
},
"mottoBrandFit": {
"label": "PASS",
"rationale": "The motto aligns perfectly with the brand's commitment to slow craftsmanship and tradition. 'Woven by hand' emphasizes the tactile and artisanal nature of the product, while 'time' appeals to the slow-fashion ethos. The brevity of the phrase maintains a minimalist and sophisticated tone suitable for the target audience."
}
}
],
"modelVersion": "gemini-3-flash-preview"
}
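Once you have a response like the one above, you will often want only the checks that did not pass. The following sketch assumes the response shape shown in the example (an `id` plus one label/rationale object per check) and treats any label other than PASS as a failure. The type names and the `failedChecks` helper are hypothetical and not part of the service.

```typescript
// Assumed shapes, inferred from the example response above.
type Verdict = { label: string; rationale: string };
type EvalResult = { id: string } & Record<string, Verdict | string>;
type EvaluateResponse = { results: EvalResult[]; modelVersion: string };

type Failure = { id: string; check: string; rationale: string };

// Collects every check whose label is not "PASS", across all results.
function failedChecks(response: EvaluateResponse): Failure[] {
  const failures: Failure[] = [];
  for (const result of response.results) {
    for (const [check, value] of Object.entries(result)) {
      if (typeof value === "string") continue; // skip the id field
      if (value.label !== "PASS") {
        failures.push({ id: result.id, check, rationale: value.rationale });
      }
    }
  }
  return failures;
}
```

The input would be the parsed JSON body returned by the POST request to /api/evaluate. An empty array means every check on every item passed.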