A reverse proxy for llama.cpp that enables automatic KV cache warmup for templated prompts.
llama.cpp server running with KV cache support:

```bash
cd /path/to/llama.cpp
./llama-server -m models/model.gguf --port 8081 --slot-save-path ./kv_cache
```

Note: `--slot-save-path` is required for KV cache warmup to work.
Option 1: Build from source

```bash
git clone https://github.com/okuvshynov/bioproxy
cd bioproxy
go build -o bioproxy ./cmd/bioproxy
```

Option 2: Cross-platform builds
Build binaries for multiple platforms:

```bash
./build.sh
```

This creates optimized binaries in the build/ directory:

- `bioproxy-darwin-arm64` - Apple Silicon (M1/M2/M3 Macs)
- `bioproxy-linux-arm64` - Linux ARM64 (aarch64)
Each binary includes a SHA256 checksum file (.sha256) for verification.
To add more platforms, edit build.sh and add calls to build_platform with desired GOOS and GOARCH values.
1. Use the example configuration:
The repository includes example configuration and templates in the examples/ directory:
```bash
# Copy example config to use as a starting point
cp examples/config.json config.json

# Or use it directly
./bioproxy -config examples/config.json
```

The example includes:

- `examples/config.json` - Full configuration with template mappings
- `examples/templates/code_assistant.txt` - Coding assistant template
- `examples/templates/debug_helper.txt` - Debugging helper with file inclusion
- `examples/templates/debugging_guide.txt` - Included debugging reference
2. Run bioproxy:
```bash
./bioproxy -config examples/config.json
```

You should see:

```
Starting bioproxy - llama.cpp reverse proxy with KV cache warmup

Configuration:
  Proxy listening on: http://localhost:8088
  Backend llama.cpp: http://localhost:8081
  Admin server: http://localhost:8089
  Warmup interval: 30s
  Templates: 2 configured

INFO: Creating template watcher...
INFO: Added template @code from examples/templates/code_assistant.txt (needs warmup)
INFO: Added template @debug from examples/templates/debug_helper.txt (needs warmup)
INFO: Starting warmup manager...
INFO: Warmup manager background loop started

Servers are running!
```
Immediately after startup, you'll see the warmup process:
```
INFO: Performing initial warmup check...
INFO: Checking templates for changes...
INFO: Found 2 template(s) that need warmup: [@code @debug]
INFO: Starting warmup for @code
INFO: Sending warmup request for @code
INFO: Warmup request completed for @code (1.2s)
INFO: Template @code warmup complete
```
Templates are now ready to use! The warmup happens immediately on startup instead of waiting for the first interval.
Use templates by prefixing your message with a configured template prefix:
curl http://localhost:8088/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "@code How do I reverse a string in Python?"}]
}'The @code prefix triggers template substitution. The proxy:
- Detects the
@codeprefix - Processes the template with your message
- Restores the pre-warmed KV cache (if needed)
- Sends the expanded template to llama.cpp
- Streams the response back to you
The pre-warmed KV cache makes the first response much faster!
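The prefix-detection step can be sketched in Go (an illustrative sketch, not bioproxy's actual code; `matchPrefix` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"strings"
)

// matchPrefix reports whether a user message starts with one of the
// configured template prefixes. On a match it returns the prefix and
// the rest of the message with the prefix and surrounding space stripped.
func matchPrefix(message string, prefixes map[string]string) (prefix, rest string, ok bool) {
	for p := range prefixes {
		if strings.HasPrefix(message, p) {
			return p, strings.TrimSpace(strings.TrimPrefix(message, p)), true
		}
	}
	// No prefix matched: the message is forwarded unchanged.
	return "", message, false
}

func main() {
	prefixes := map[string]string{"@code": "examples/templates/code_assistant.txt"}
	p, rest, ok := matchPrefix("@code How do I reverse a string in Python?", prefixes)
	fmt.Println(p, "|", rest, "|", ok) // prints "@code | How do I reverse a string in Python? | true"
}
```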
Run without configuration for basic proxying:
```bash
./bioproxy
```

Test the proxy:

```bash
# Health check
curl http://localhost:8088/health

# Chat completion
curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check metrics
curl http://localhost:8089/metrics
```

The admin server (default port 8089) exposes Prometheus metrics.
Key metrics:
- `bioproxy_requests_total{prefix="@code"}` - Total requests per template prefix
- `bioproxy_warmup_total{prefix="@code"}` - Completed warmup operations
- `bioproxy_warmup_cancellations_total{prefix="@code"}` - Warmups cancelled by user requests
- `bioproxy_kv_cache_saves_total{prefix="@code"}` - KV cache save operations
- `bioproxy_kv_cache_restores_total{prefix="@code"}` - KV cache restore operations
Example output:
```
bioproxy_requests_total{prefix="@code"} 42
bioproxy_warmup_total{prefix="@code"} 5
bioproxy_warmup_cancellations_total{prefix="@code"} 2
```
Request Prioritization:
When a user request arrives while a warmup is in progress, the warmup is automatically cancelled so the user request is served without delay. The `bioproxy_warmup_cancellations_total` metric tracks how often this occurs.
## Command-Line Options
```bash
./bioproxy --help
```

Options:

- `-config` - Path to config file (default: `~/.config/bioproxy/config.json`)
- `-host` - Proxy host (overrides config)
- `-port` - Proxy port (overrides config)
- `-admin-host` - Admin server host (overrides config)
- `-admin-port` - Admin server port (overrides config)
- `-backend` - Backend llama.cpp URL (overrides config)
Example:
```bash
./bioproxy -config config.json -port 9000
```

See examples/config.json for a complete example.
Required fields:
- `backend_url` - llama.cpp server URL
Optional fields:
- `proxy_host` - Proxy bind address (default: "localhost")
- `proxy_port` - Proxy port (default: 8088)
- `admin_host` - Admin bind address (default: "localhost")
- `admin_port` - Admin port (default: 8089)
- `warmup_check_interval` - Template check interval in seconds (default: 30)
- `prefixes` - Template prefix mappings (object of prefix → file path)
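Putting these fields together, a minimal config might look like this (illustrative values based on the defaults above; only `backend_url` is required):

```json
{
  "backend_url": "http://localhost:8081",
  "proxy_port": 8088,
  "admin_port": 8089,
  "warmup_check_interval": 30,
  "prefixes": {
    "@code": "examples/templates/code_assistant.txt",
    "@debug": "examples/templates/debug_helper.txt"
  }
}
```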
Templates use <{...}> placeholders. See examples/templates/ for working examples.
Message placeholder:
```
System prompt here.

User: <{message}>
Assistant:
```
Example: See examples/templates/code_assistant.txt
File inclusion:
```
Reference documentation: <{examples/templates/debugging_guide.txt}>

Problem: <{message}>
```
Example: See examples/templates/debug_helper.txt
When processed, the file content replaces the placeholder:
```
Reference documentation: Common debugging steps:
1. Reproduce the issue consistently
2. Isolate the problem area
...

Problem: [user's actual message]
```
Note: Placeholder replacement is non-recursive - patterns in substituted content are NOT processed. This prevents infinite loops and unexpected behavior.
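The single-pass behavior can be sketched in Go (an illustrative sketch, not bioproxy's actual code; `expand` and `lookup` are hypothetical names):

```go
package main

import (
	"fmt"
	"strings"
)

// expand makes a single, left-to-right pass over the template: each
// <{...}> placeholder is replaced exactly once, and substituted values
// are appended verbatim, never re-scanned for placeholders.
func expand(template string, lookup func(key string) string) string {
	var out strings.Builder
	for {
		start := strings.Index(template, "<{")
		if start < 0 {
			out.WriteString(template)
			return out.String()
		}
		end := strings.Index(template[start:], "}>")
		if end < 0 { // unterminated placeholder: emit the rest as-is
			out.WriteString(template)
			return out.String()
		}
		out.WriteString(template[:start])
		key := template[start+2 : start+end]
		out.WriteString(lookup(key)) // value goes to output, not back into the scan
		template = template[start+end+2:]
	}
}

func main() {
	lookup := func(k string) string { return "a value containing <{message}> literally" }
	fmt.Println(expand("User: <{message}>", lookup))
	// prints "User: a value containing <{message}> literally"
}
```

Because the scanner only advances through the original template, a placeholder inside a substituted value can never trigger another substitution, which is what rules out infinite loops.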
```
Client → Proxy (8088) → llama.cpp (8081)
              ↓
           Metrics
              ↓
     Admin Server (8089)
              ↓
     Template Watcher
              ↓
      Warmup Manager
```
Components:
- Proxy (port 8088) - Forwards requests to llama.cpp, collects metrics
- Admin (port 8089) - Health status and Prometheus metrics
- Template Watcher - Monitors template files for changes
- Warmup Manager - Automatically warms up changed templates
- Reverse proxy - Forwards all requests to llama.cpp with minimal overhead
- Template injection - Automatically injects templates when user messages start with a configured @prefix
- Smart KV cache - State tracking optimizes saves/restores (95% reduction in disk I/O)
- Immediate warmup - Templates warm up on startup, no waiting for the first interval
- Request prioritization - User requests automatically cancel warmup operations for instant response
- Atomic admission control - Race-free state machine ensures correct request coordination
- Admin endpoints - Health and Prometheus metrics on a separate port
- Template system - File-based templates with message substitution and file inclusion
- Template monitoring - Detects file changes via hash comparison
- Automatic warmup - Background process warms templates at configurable intervals
- Streaming support - Full SSE streaming for chat completions
- Cross-platform builds - Build script for darwin/arm64 and linux/arm64 binaries
- Phase 1: Basic Proxy - Request forwarding and metrics
- Phase 2: Admin Server - Health and metrics endpoints
- Phase 3: Template System - File watching and processing
- Phase 4: Warmup Manager - Automatic KV cache warmup
- Phase 5: Template Injection - Intercept @prefix in user messages
- Phase 6: Smart KV Cache - State tracking to optimize save/restore operations
Future Enhancements:
- GitHub Actions automated releases (auto-build on tags)
- Additional platform support (linux/amd64, darwin/amd64, windows/amd64)
- Multi-backend load balancing
Unit tests (fast, no llama.cpp needed):
```bash
go test ./...
```

Manual tests (requires llama.cpp with --slot-save-path):

```bash
# See docs/MANUAL_TESTING.md for complete guide
go clean -testcache && go test -tags=manual -v ./...
```

```
bioproxy/
├── cmd/bioproxy/      - Main executable
├── internal/
│   ├── config/        - Configuration management
│   ├── proxy/         - Reverse proxy with template injection
│   ├── admin/         - Admin server (health, metrics)
│   ├── template/      - Template watching and processing
│   ├── warmup/        - KV cache warmup manager
│   ├── state/         - Backend state tracking for KV cache optimization
│   └── admission/     - Atomic admission control for request coordination
├── examples/          - Example configuration and templates
│   ├── config.json    - Example configuration file
│   └── templates/     - Example template files
└── docs/              - Documentation
    ├── WARMUP_DESIGN.md  - Warmup architecture design
    └── MANUAL_TESTING.md - Manual testing guide
```
- docs/WARMUP_DESIGN.md - Complete warmup architecture and design decisions
- docs/MANUAL_TESTING.md - Guide for running manual tests with llama.cpp
- internal/*/README.md - Module-specific documentation
See LICENSE file.