Evals 101 for web developers

This is the companion code for Evals 101 for web developers. This repo includes an example evals system that evaluates AI-generated outputs, including rule-based and LLM-as-a-judge evals.

Overview

This repo includes:

A simple web application, ThemeBuilder, that generates a brand identity (motto, color palette, typography) based on a company description and target audience.
A rule-based evaluator for the application's outputs.
An LLM-as-a-judge evaluator for the application's outputs.
Tests for the evaluators themselves.
Tests for the application's outputs, based on the evaluators.

graph TD
    UI[User Prompt constraints] --> TB[ThemeBuilder service]
    TB --> Output["App output: motto, color palette"]
    Output --> RB[Rule-based evals: data format, contrast]
    Output --> LLM[LLM judge evals: brand fit, toxicity]
    RB --> R[eval result PASS/FAIL]
    LLM --> R
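To make the rule-based branch of the diagram concrete, here is a minimal sketch of a contrast check. It is not the repo's implementation: the function names and the default 4.5:1 threshold (the WCAG AA minimum for normal-size text) are assumptions for illustration. The `textColor` and `backgroundColor` fields and the `label`/`rationale` result shape mirror the example request and response shown later in this README.

```javascript
// Hypothetical rule-based evaluator. Function names and the default
// threshold are illustrative assumptions, not taken from the repo.

// Convert "#RRGGBB" to WCAG relative luminance (0 = black, 1 = white).
function relativeLuminance(hex) {
  const match = /^#([0-9a-f]{6})$/i.exec(hex);
  if (!match) throw new Error(`Invalid hex color: ${hex}`);
  const [r, g, b] = [0, 2, 4].map((i) => {
    const channel = parseInt(match[1].slice(i, i + 2), 16) / 255;
    // sRGB linearization as defined by WCAG 2.x.
    return channel <= 0.03928
      ? channel / 12.92
      : Math.pow((channel + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG contrast ratio: ranges from 1 (identical colors) to 21 (black on white).
function contrastRatio(hexA, hexB) {
  const [lighter, darker] = [relativeLuminance(hexA), relativeLuminance(hexB)]
    .sort((a, b) => b - a);
  return (lighter + 0.05) / (darker + 0.05);
}

// Deterministic PASS/FAIL check on the palette's text/background pair.
function evaluateContrast(colorPalette, minRatio = 4.5) {
  const ratio = contrastRatio(colorPalette.textColor, colorPalette.backgroundColor);
  return {
    label: ratio >= minRatio ? "PASS" : "FAIL",
    rationale: `Contrast ratio is ${ratio.toFixed(2)}:1 (minimum ${minRatio}:1).`,
  };
}
```

Because a check like this is deterministic, it needs no model call and can run on every output, which is why it sits alongside the LLM judge rather than inside it.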

Set up

Create a Gemini API key.
Create a .env file in the evals-service directory with your GEMINI_API_KEY.
Install dependencies: npm install (run from the root of evals-course)

Testing and running evals

Run all of the following commands from the evals-course root.

Testing the evaluators

These tests check the evaluator functions themselves. You'd typically run these tests while developing the evaluators to assess the correctness of the criteria and LLM scoring logic.

Rule-based

Run tests for rule-based evaluators:

npm run test:rule-based-evals

Basic LLM judge

Run basic tests for LLM judge evaluators (alignment% only):

All tests:

npm run test:llm-judge-evals-basic
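As a rough illustration of the alignment % metric, the sketch below compares judge labels against human reference labels and reports the share that match. The function name and input shape (two parallel arrays of labels) are assumptions. The repo's tests may compute this differently.

```javascript
// Hypothetical helper, not taken from the repo.
// Returns the percentage of items where the judge's label matches the human label.
function alignmentPercent(judgeLabels, humanLabels) {
  if (judgeLabels.length === 0 || judgeLabels.length !== humanLabels.length) {
    throw new Error("Label arrays must be non-empty and the same length.");
  }
  const matches = judgeLabels.filter((label, i) => label === humanLabels[i]).length;
  return (matches / judgeLabels.length) * 100;
}

// Three of four labels agree, so alignment is 75%.
console.log(alignmentPercent(
  ["PASS", "FAIL", "PASS", "PASS"],
  ["PASS", "FAIL", "FAIL", "PASS"]
)); // 75
```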

Basic judge no bootstrap

npm run test:llm-judge-evals-basic-no-bootstrap

Basic judge with bootstrap

npm run test:llm-judge-evals-basic-bootstrap
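Assuming "bootstrap" here refers to bootstrap resampling (this README does not say so explicitly), the idea can be sketched as follows: resample the labeled examples with replacement many times, recompute alignment on each resample, and read a confidence interval off the resulting distribution. The function name, option names, and defaults are all hypothetical.

```javascript
// Hypothetical percentile-bootstrap sketch, not the repo's implementation.
// Estimates a confidence interval for judge/human alignment (as a 0..1 fraction).
function bootstrapAlignmentCI(
  judgeLabels,
  humanLabels,
  { resamples = 1000, confidence = 0.95, random = Math.random } = {}
) {
  const n = judgeLabels.length;
  if (n === 0 || n !== humanLabels.length) {
    throw new Error("Label arrays must be non-empty and the same length.");
  }
  // 1 where judge and human agree, 0 otherwise.
  const agreement = judgeLabels.map((label, i) => (label === humanLabels[i] ? 1 : 0));

  const estimates = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Draw n examples with replacement and average their agreement.
    for (let i = 0; i < n; i++) {
      sum += agreement[Math.floor(random() * n)];
    }
    estimates.push(sum / n);
  }
  estimates.sort((a, b) => a - b);

  // Percentile method: take the empirical quantiles of the resampled estimates.
  const tail = (1 - confidence) / 2;
  const quantile = (q) => estimates[Math.min(resamples - 1, Math.floor(q * resamples))];
  return {
    point: agreement.reduce((a, b) => a + b, 0) / n,
    lower: quantile(tail),
    upper: quantile(1 - tail),
  };
}
```

A wide interval suggests the labeled dataset is too small to trust a single alignment number. The `random` option exists only so that a seeded generator can be passed in for reproducible runs.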

Basic judge self-consistency

npm run test:llm-judge-evals-basic-consistency
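One plausible reading of self-consistency is: ask the judge the same question several times and measure how often it returns its own majority label. The sketch below follows that reading. The helper names, the `judge` callback, and the default of five runs are assumptions, not the repo's API.

```javascript
// Hypothetical sketch, not taken from the repo.
// Given the labels from repeated judge runs on one input, return the
// majority label and the fraction of runs that agree with it.
function consistencyOfLabels(labels) {
  if (labels.length === 0) throw new Error("Need at least one label.");
  const counts = {};
  for (const label of labels) counts[label] = (counts[label] || 0) + 1;
  const [majorityLabel, majorityCount] = Object.entries(counts)
    .sort((a, b) => b[1] - a[1])[0];
  return { majorityLabel, consistency: majorityCount / labels.length };
}

// `judge` is any async function that maps an input to a label such as "PASS" or "FAIL".
async function selfConsistency(judge, input, runs = 5) {
  const labels = [];
  for (let i = 0; i < runs; i++) {
    labels.push(await judge(input));
  }
  return consistencyOfLabels(labels);
}
```

For example, `consistencyOfLabels(["PASS", "PASS", "FAIL", "PASS"])` yields a majority label of `"PASS"` with a consistency of 0.75. A judge that flips labels on identical input is hard to trust, whatever its alignment score.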

Basic judge final exam

npm run test:llm-judge-evals-basic-final-exam

Advanced LLM judge

Run advanced tests for LLM judge evaluators (alignment%, Cohen's Kappa, precision, recall):

All tests:

npm run test:llm-judge-evals
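For reference, the metrics named above can be derived from a binary confusion matrix of judge labels against human labels. This is a generic sketch of the standard formulas, not the repo's code. The function name is hypothetical, and treating `"PASS"` as the positive class by default is an assumption (pass `"FAIL"` instead if the judge's job is to catch failures).

```javascript
// Hypothetical sketch of standard agreement metrics for a binary judge.
// Any label other than `positive` is treated as the negative class.
function judgeMetrics(judgeLabels, humanLabels, positive = "PASS") {
  const n = judgeLabels.length;
  if (n === 0 || n !== humanLabels.length) {
    throw new Error("Label arrays must be non-empty and the same length.");
  }
  let tp = 0, fp = 0, fn = 0, tn = 0;
  judgeLabels.forEach((label, i) => {
    const predicted = label === positive;
    const actual = humanLabels[i] === positive;
    if (predicted && actual) tp++;
    else if (predicted && !actual) fp++;
    else if (!predicted && actual) fn++;
    else tn++;
  });

  // Observed agreement, and the agreement expected by chance from the marginals.
  const observed = (tp + tn) / n;
  const expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n);

  return {
    alignment: observed,
    precision: tp + fp === 0 ? NaN : tp / (tp + fp),
    recall: tp + fn === 0 ? NaN : tp / (tp + fn),
    // Cohen's kappa is undefined when chance agreement is already 1.
    kappa: expected === 1 ? NaN : (observed - expected) / (1 - expected),
  };
}
```

Cohen's Kappa is useful next to raw alignment because it discounts agreement that would occur by chance, for instance when nearly every example in the dataset is a PASS.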

Evaluating application outputs

These tests evaluate the application outputs using the evaluators. They execute the real ThemeBuilder application service against a dataset of prompts and use both kinds of evaluator (rule-based and LLM judge) to grade ThemeBuilder's AI-generated outputs.

Run unit testing for the AI application:

npm run test:unit-evals

Fast mode

You can append -fast to any script to run in fast mode (e.g., npm run test:unit-evals-fast or npm run test:all-fast).

Fast mode caps evaluation scenarios to a small number of samples per suite. It is recommended for rapid local iteration and debugging, where a full run would mean long wait times.

Evals Dashboard UI & Reports

Every time you run npm run test:unit-evals (or npm run test:unit-evals-fast), the evaluation suite generates a detailed HTML report and updates the multi-run dashboard index at evals-service/reports.

Viewing the Dashboard

Automatic Startup: The test runner automatically attempts to serve the dashboard upon completion on port 8085:
```
🌐 Live Dashboard served at: http://localhost:8085
```
Manual Startup: If you want to start the dashboard server manually without running the tests again, run the following command from the evals-service directory:
```
npx http-server reports -p 8085
```
Then open your browser at http://localhost:8085.

Running the eval service

Run the service: npm start (or npm run dev for development)

Note: The service runs on port 8080 by default.

Once the service is running, you can evaluate data by sending a POST request to /api/evaluate.

Here is an example using curl:

curl -X POST http://localhost:8080/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      {
        "id": "brand-003",
        "userInput": {
          "companyName": "Loom",
          "description": "A boutique textile mill specializing in traditional indigo-dyeing and hand-loomed linens.",
          "audience": "interior designers and slow-fashion advocates",
          "tone": ["tactile", "minimalist", "earthy"]
        },
        "appOutput": {
          "motto": "Woven by hand and time.",
          "colorPalette": {
            "textColor": "#262626",
            "backgroundColor": "#F5F5F4",
            "primary": "#312E81",
            "secondary": "#A8A29E"
          }
        }
      }
    ]
  }'

This will return an evaluation result containing the format validation label and several LLM-as-a-judge checks.

For example:

{
  "results": [
    {
      "id": "brand-003",
      "dataFormat": {
        "label": "PASS",
        "rationale": "Format is valid."
      },
      "mottoBrandFit": {
        "label": "PASS",
        "rationale": "The motto aligns perfectly with the brand's commitment to slow craftsmanship and tradition. 'Woven by hand' emphasizes the tactile and artisanal nature of the product, while 'time' appeals to the slow-fashion ethos. The brevity of the phrase maintains a minimalist and sophisticated tone suitable for the target audience."
      }
    }
  ],
  "modelVersion": "gemini-3-flash-preview"
}
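The same request can be made from JavaScript. The sketch below assumes Node 18 or later (for the built-in `fetch`) and a service running on the default port. The helper names `evaluateBatch` and `failedChecks` are hypothetical, and the response handling relies only on the shape shown in the example above: an `id` plus one object per check, each with a `label` and `rationale`.

```javascript
// Hypothetical client helpers, not part of the repo.

// POST a batch of { id, userInput, appOutput } items to the eval service.
async function evaluateBatch(data, baseUrl = "http://localhost:8080") {
  const response = await fetch(`${baseUrl}/api/evaluate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ data }),
  });
  if (!response.ok) {
    throw new Error(`Evaluation request failed with status ${response.status}`);
  }
  return response.json();
}

// For each result, list the names of the checks whose label is not "PASS".
function failedChecks(evaluation) {
  return evaluation.results.map(({ id, ...checks }) => ({
    id,
    failed: Object.entries(checks)
      .filter(([, check]) =>
        check && typeof check.label === "string" && check.label !== "PASS")
      .map(([name]) => name),
  }));
}
```

Applied to the example response above, `failedChecks` returns `[{ id: "brand-003", failed: [] }]`, since both `dataFormat` and `mottoBrandFit` passed.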
