A reverse proxy for llama.cpp that enables automatic KV cache warmup for templated prompts.
llama.cpp server running with KV cache support:

```bash
cd /path/to/llama.cpp
./llama-server -m models/model.gguf --port 8081 --slot-save-path ./kv_cache
```

Note: `--slot-save-path` is required for KV cache warmup to work.
Option 1: Build from source

```bash
git clone https://github.com/okuvshynov/bioproxy
cd bioproxy
go build -o bioproxy ./cmd/bioproxy
```

Option 2: Cross-platform builds
Build binaries for multiple platforms:

```bash
./build.sh
```

This creates optimized binaries in the build/ directory:

- `bioproxy-darwin-arm64` - Apple Silicon (M1/M2/M3 Macs)
- `bioproxy-linux-arm64` - Linux ARM64 (aarch64)
Each binary includes a SHA256 checksum file (.sha256) for verification.
To add more platforms, edit build.sh and add calls to build_platform with desired GOOS and GOARCH values.
1. Use the example configuration:
The repository includes example configuration and templates in the examples/ directory:
```bash
# Copy example config to use as a starting point
cp examples/config.json config.json

# Or use it directly
./bioproxy -config examples/config.json
```

The example includes:

- `examples/config.json` - Full configuration with template mappings
- `examples/templates/code_assistant.txt` - Coding assistant template
- `examples/templates/debug_helper.txt` - Debugging helper with file inclusion
- `examples/templates/debugging_guide.txt` - Included debugging reference
2. Run bioproxy:
```bash
./bioproxy -config examples/config.json
```

You should see:

```
Starting bioproxy - llama.cpp reverse proxy with KV cache warmup

Configuration:
  Proxy listening on: http://localhost:8088
  Backend llama.cpp: http://localhost:8081
  Admin server: http://localhost:8089
  Warmup interval: 30s
  Templates: 2 configured

INFO: Creating template watcher...
INFO: Added template @code from examples/templates/code_assistant.txt (needs warmup)
INFO: Added template @debug from examples/templates/debug_helper.txt (needs warmup)
INFO: Starting warmup manager...
INFO: Warmup manager background loop started

Servers are running!
```
Immediately after startup, you'll see the warmup process:
```
INFO: Performing initial warmup check...
INFO: Checking templates for changes...
INFO: Found 2 template(s) that need warmup: [@code @debug]
INFO: Starting warmup for @code
INFO: Sending warmup request for @code
INFO: Warmup request completed for @code (1.2s)
INFO: Template @code warmup complete
```
Templates are now ready to use! The warmup happens immediately on startup instead of waiting for the first interval.
Use templates by prefixing your message with a configured template prefix:
curl http://localhost:8088/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "@code How do I reverse a string in Python?"}]
}'The @code prefix triggers template substitution. The proxy:
- Detects the
@codeprefix - Processes the template with your message
- Restores the pre-warmed KV cache (if needed)
- Sends the expanded template to llama.cpp
- Streams the response back to you
The pre-warmed KV cache makes the first response much faster!
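The prefix-detection step can be sketched in Go (an illustrative sketch, not bioproxy's actual code; `matchPrefix` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"strings"
)

// matchPrefix reports whether a user message starts with one of the
// configured template prefixes. On a match it returns the prefix and
// the rest of the message with the prefix and surrounding space stripped.
func matchPrefix(message string, prefixes map[string]string) (prefix, rest string, ok bool) {
	for p := range prefixes {
		if strings.HasPrefix(message, p) {
			return p, strings.TrimSpace(strings.TrimPrefix(message, p)), true
		}
	}
	// No prefix matched: the message is forwarded unchanged.
	return "", message, false
}

func main() {
	prefixes := map[string]string{"@code": "examples/templates/code_assistant.txt"}
	p, rest, ok := matchPrefix("@code How do I reverse a string in Python?", prefixes)
	fmt.Println(p, "|", rest, "|", ok) // prints "@code | How do I reverse a string in Python? | true"
}
```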
Run without configuration for basic proxying:
```bash
./bioproxy
```

Test the proxy:

```bash
# Health check
curl http://localhost:8088/health

# Chat completion
curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check metrics
curl http://localhost:8089/metrics
```

The admin server (default port 8089) exposes Prometheus metrics.
Key metrics:
- `bioproxy_requests_total{prefix="@code"}` - Total requests per template prefix
- `bioproxy_warmup_total{prefix="@code"}` - Completed warmup operations
- `bioproxy_warmup_cancellations_total{prefix="@code"}` - Warmups cancelled by user requests
- `bioproxy_kv_cache_saves_total{prefix="@code"}` - KV cache save operations
- `bioproxy_kv_cache_restores_total{prefix="@code"}` - KV cache restore operations
Example output:
```
bioproxy_requests_total{prefix="@code"} 42
bioproxy_warmup_total{prefix="@code"} 5
bioproxy_warmup_cancellations_total{prefix="@code"} 2
```
Request Prioritization:
When a user request arrives while a warmup is in progress, the warmup is automatically cancelled so the user request is served without delay. The `bioproxy_warmup_cancellations_total` metric tracks how often this occurs.
## Command-Line Options
```bash
./bioproxy --help
```

Options:

- `-config` - Path to config file (default: `~/.config/bioproxy/config.json`)
- `-host` - Proxy host (overrides config)
- `-port` - Proxy port (overrides config)
- `-admin-host` - Admin server host (overrides config)
- `-admin-port` - Admin server port (overrides config)
- `-backend` - Backend llama.cpp URL (overrides config)
Example:
```bash
./bioproxy -config config.json -port 9000
```

See examples/config.json for a complete example.
Required fields:
- `backend_url` - llama.cpp server URL
Optional fields:
- `proxy_host` - Proxy bind address (default: "localhost")
- `proxy_port` - Proxy port (default: 8088)
- `admin_host` - Admin bind address (default: "localhost")
- `admin_port` - Admin port (default: 8089)
- `warmup_check_interval` - Template check interval in seconds (default: 30)
- `prefixes` - Template prefix mappings (object of prefix → file path)
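Putting these fields together, a minimal config might look like this (illustrative values based on the defaults above; only `backend_url` is required):

```json
{
  "backend_url": "http://localhost:8081",
  "proxy_port": 8088,
  "admin_port": 8089,
  "warmup_check_interval": 30,
  "prefixes": {
    "@code": "examples/templates/code_assistant.txt",
    "@debug": "examples/templates/debug_helper.txt"
  }
}
```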
Templates use <{...}> placeholders. See examples/templates/ for working examples.
Message placeholder:
```
System prompt here.

User: <{message}>
Assistant:
```
Example: See examples/templates/code_assistant.txt
File inclusion:
```
Reference documentation: <{examples/templates/debugging_guide.txt}>

Problem: <{message}>
```
Example: See examples/templates/debug_helper.txt
When processed, the file content replaces the placeholder:
```
Reference documentation: Common debugging steps:
1. Reproduce the issue consistently
2. Isolate the problem area
...

Problem: [user's actual message]
```
Note: Placeholder replacement is non-recursive - patterns in substituted content are NOT processed. This prevents infinite loops and unexpected behavior.
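The single-pass behavior can be sketched in Go (an illustrative sketch, not bioproxy's actual code; `expand` and `lookup` are hypothetical names):

```go
package main

import (
	"fmt"
	"strings"
)

// expand makes a single, left-to-right pass over the template: each
// <{...}> placeholder is replaced exactly once, and substituted values
// are appended verbatim, never re-scanned for placeholders.
func expand(template string, lookup func(key string) string) string {
	var out strings.Builder
	for {
		start := strings.Index(template, "<{")
		if start < 0 {
			out.WriteString(template)
			return out.String()
		}
		end := strings.Index(template[start:], "}>")
		if end < 0 { // unterminated placeholder: emit the rest as-is
			out.WriteString(template)
			return out.String()
		}
		out.WriteString(template[:start])
		key := template[start+2 : start+end]
		out.WriteString(lookup(key)) // value goes to output, not back into the scan
		template = template[start+end+2:]
	}
}

func main() {
	lookup := func(k string) string { return "a value containing <{message}> literally" }
	fmt.Println(expand("User: <{message}>", lookup))
	// prints "User: a value containing <{message}> literally"
}
```

Because the scanner only advances through the original template, a placeholder inside a substituted value can never trigger another substitution, which is what rules out infinite loops.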
```
Client → Proxy (8088) → llama.cpp (8081)
              ↓
           Metrics
              ↓
     Admin Server (8089)
              ↓
     Template Watcher
              ↓
      Warmup Manager
```
Components:
- Proxy (port 8088) - Forwards requests to llama.cpp, collects metrics
- Admin (port 8089) - Health status and Prometheus metrics
- Template Watcher - Monitors template files for changes
- Warmup Manager - Automatically warms up changed templates
- Reverse proxy - Forwards all requests to llama.cpp with minimal overhead
- Template injection - Automatically injects templates when user messages start with a configured @prefix
- Smart KV cache - State tracking optimizes saves/restores (95% reduction in disk I/O)
- Immediate warmup - Templates warm up on startup, no waiting for the first interval
- Request prioritization - User requests automatically cancel warmup operations for instant response
- Atomic admission control - Race-free state machine ensures correct request coordination
- Admin endpoints - Health and Prometheus metrics on a separate port
- Template system - File-based templates with message substitution and file inclusion
- Template monitoring - Detects file changes via hash comparison
- Automatic warmup - Background process warms templates at configurable intervals
- Streaming support - Full SSE streaming for chat completions
- Cross-platform builds - Build script for darwin/arm64 and linux/arm64 binaries
- Phase 1: Basic Proxy - Request forwarding and metrics
- Phase 2: Admin Server - Health and metrics endpoints
- Phase 3: Template System - File watching and processing
- Phase 4: Warmup Manager - Automatic KV cache warmup
- Phase 5: Template Injection - Intercept @prefix in user messages
- Phase 6: Smart KV Cache - State tracking to optimize save/restore operations
Future Enhancements:
- GitHub Actions automated releases (auto-build on tags)
- Additional platform support (linux/amd64, darwin/amd64, windows/amd64)
- Multi-backend load balancing
Unit tests (fast, no llama.cpp needed):
```bash
go test ./...
```

Manual tests (requires llama.cpp with --slot-save-path):

```bash
# See docs/MANUAL_TESTING.md for complete guide
go clean -testcache && go test -tags=manual -v ./...
```

```
bioproxy/
├── cmd/bioproxy/      - Main executable
├── internal/
│   ├── config/        - Configuration management
│   ├── proxy/         - Reverse proxy with template injection
│   ├── admin/         - Admin server (health, metrics)
│   ├── template/      - Template watching and processing
│   ├── warmup/        - KV cache warmup manager
│   ├── state/         - Backend state tracking for KV cache optimization
│   └── admission/     - Atomic admission control for request coordination
├── examples/          - Example configuration and templates
│   ├── config.json    - Example configuration file
│   └── templates/     - Example template files
└── docs/              - Documentation
    ├── WARMUP_DESIGN.md  - Warmup architecture design
    └── MANUAL_TESTING.md - Manual testing guide
```
- docs/WARMUP_DESIGN.md - Complete warmup architecture and design decisions
- docs/MANUAL_TESTING.md - Guide for running manual tests with llama.cpp
- internal/*/README.md - Module-specific documentation
See LICENSE file.