bioproxy

A reverse proxy for llama.cpp that enables automatic KV cache warmup for templated prompts.

Quick Start

Prerequisites

llama.cpp server running with KV cache support:

cd /path/to/llama.cpp
./llama-server -m models/model.gguf --port 8081 --slot-save-path ./kv_cache

Note: --slot-save-path is required for KV cache warmup to work.

Installation

Option 1: Build from source

git clone https://github.com/okuvshynov/bioproxy
cd bioproxy
go build -o bioproxy ./cmd/bioproxy

Option 2: Cross-platform builds

Build binaries for multiple platforms:

./build.sh

This creates optimized binaries in the build/ directory:

  • bioproxy-darwin-arm64 - Apple Silicon (M1/M2/M3 Macs)
  • bioproxy-linux-arm64 - Linux ARM64 (aarch64)

Each binary includes a SHA256 checksum file (.sha256) for verification.

To add more platforms, edit build.sh and add calls to build_platform with desired GOOS and GOARCH values.

Setup with Templates

1. Use the example configuration:

The repository includes example configuration and templates in the examples/ directory:

# Copy example config to use as a starting point
cp examples/config.json config.json

# Or use it directly
./bioproxy -config examples/config.json

The example includes:

  • examples/config.json - Full configuration with template mappings
  • examples/templates/code_assistant.txt - Coding assistant template
  • examples/templates/debug_helper.txt - Debugging helper with file inclusion
  • examples/templates/debugging_guide.txt - Included debugging reference

2. Run bioproxy:

./bioproxy -config examples/config.json

You should see:

🚀 Starting bioproxy - llama.cpp reverse proxy with KV cache warmup

Configuration:
  Proxy listening on: http://localhost:8088
  Backend llama.cpp:  http://localhost:8081
  Admin server:       http://localhost:8089
  Warmup interval:    30s
  Templates:          2 configured

INFO: Creating template watcher...
INFO: Added template @code from examples/templates/code_assistant.txt (needs warmup)
INFO: Added template @debug from examples/templates/debug_helper.txt (needs warmup)
INFO: Starting warmup manager...
INFO: Warmup manager background loop started

✅ Servers are running!

Immediately after startup, you'll see the warmup process:

INFO: Performing initial warmup check...
INFO: Checking templates for changes...
INFO: Found 2 template(s) that need warmup: [@code @debug]
INFO: Starting warmup for @code
INFO: Sending warmup request for @code
INFO: Warmup request completed for @code (1.2s)
INFO: Template @code warmup complete

Templates are now ready to use! The warmup happens immediately on startup instead of waiting for the first interval.

Using Templates

Use templates by prefixing your message with a configured template prefix:

curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "@code How do I reverse a string in Python?"}]
  }'

The @code prefix triggers template substitution. The proxy:

  1. Detects the @code prefix
  2. Processes the template with your message
  3. Restores the pre-warmed KV cache (if needed)
  4. Sends the expanded template to llama.cpp
  5. Streams the response back to you

The pre-warmed KV cache makes the first response much faster!
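The prefix-detection and substitution steps can be sketched roughly as follows. This is a simplified illustration only; injectTemplate, its signature, and the map layout are hypothetical, not bioproxy's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// injectTemplate checks a user message against configured prefixes and, on a
// match, substitutes the remainder of the message into the template's
// <{message}> placeholder. (Illustrative sketch, not bioproxy's real code.)
func injectTemplate(prefixes map[string]string, userMsg string) string {
	for prefix, tmpl := range prefixes {
		if strings.HasPrefix(userMsg, prefix+" ") {
			body := strings.TrimPrefix(userMsg, prefix+" ")
			return strings.ReplaceAll(tmpl, "<{message}>", body)
		}
	}
	return userMsg // no prefix matched: forward the message unchanged
}

func main() {
	prefixes := map[string]string{
		"@code": "You are a coding assistant.\n\nUser: <{message}>\nAssistant:",
	}
	fmt.Println(injectTemplate(prefixes, "@code How do I reverse a string in Python?"))
}
```

Because the expanded prompt always begins with the same template text, it lines up with the KV cache that was pre-warmed for that template.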

Basic Usage (Without Templates)

Run without configuration for basic proxying:

./bioproxy

Test the proxy:

# Health check
curl http://localhost:8088/health

# Chat completion
curl http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Check metrics
curl http://localhost:8089/metrics

Monitoring and Metrics

The admin server (default port 8089) exposes Prometheus metrics:

Key metrics:

  • bioproxy_requests_total{prefix="@code"} - Total requests per template prefix
  • bioproxy_warmup_total{prefix="@code"} - Completed warmup operations
  • bioproxy_warmup_cancellations_total{prefix="@code"} - Warmups cancelled by user requests
  • bioproxy_kv_cache_saves_total{prefix="@code"} - KV cache save operations
  • bioproxy_kv_cache_restores_total{prefix="@code"} - KV cache restore operations

Example output:

bioproxy_requests_total{prefix="@code"} 42
bioproxy_warmup_total{prefix="@code"} 5
bioproxy_warmup_cancellations_total{prefix="@code"} 2

Request Prioritization: When a user request arrives while a warmup is in progress, the warmup is automatically cancelled to ensure instant response. The warmup_cancellations_total metric tracks how often this occurs.


Command-Line Options

./bioproxy --help

Options:

  • -config - Path to config file (default: ~/.config/bioproxy/config.json)
  • -host - Proxy host (overrides config)
  • -port - Proxy port (overrides config)
  • -admin-host - Admin server host (overrides config)
  • -admin-port - Admin server port (overrides config)
  • -backend - Backend llama.cpp URL (overrides config)

Example:

./bioproxy -config config.json -port 9000

Configuration Reference

See examples/config.json for a complete example.

Required fields:

  • backend_url - llama.cpp server URL

Optional fields:

  • proxy_host - Proxy bind address (default: "localhost")
  • proxy_port - Proxy port (default: 8088)
  • admin_host - Admin bind address (default: "localhost")
  • admin_port - Admin port (default: 8089)
  • warmup_check_interval - Template check interval in seconds (default: 30)
  • prefixes - Template prefix mappings (object of prefix → file path)
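
Putting these fields together, a complete config might look like the sketch below; examples/config.json in the repository is the authoritative version:

```json
{
  "backend_url": "http://localhost:8081",
  "proxy_host": "localhost",
  "proxy_port": 8088,
  "admin_host": "localhost",
  "admin_port": 8089,
  "warmup_check_interval": 30,
  "prefixes": {
    "@code": "examples/templates/code_assistant.txt",
    "@debug": "examples/templates/debug_helper.txt"
  }
}
```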

Template Syntax

Templates use <{...}> placeholders. See examples/templates/ for working examples.

Message placeholder:

System prompt here.

User: <{message}>
Assistant:

Example: See examples/templates/code_assistant.txt

File inclusion:

Reference documentation: <{examples/templates/debugging_guide.txt}>

Problem: <{message}>

Example: See examples/templates/debug_helper.txt

When processed, the file content replaces the placeholder:

Reference documentation: Common debugging steps:
1. Reproduce the issue consistently
2. Isolate the problem area
...

Problem: [user's actual message]

Note: Placeholder replacement is non-recursive - patterns in substituted content are NOT processed. This prevents infinite loops and unexpected behavior.
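Non-recursive behavior falls out naturally from a single-pass replacement: the input is scanned once, so placeholders inside substituted values are never re-examined. A sketch (expand and its signature are illustrative, not bioproxy's actual code):

```go
package main

import (
	"fmt"
	"regexp"
)

var placeholder = regexp.MustCompile(`<\{([^}]*)\}>`)

// expand replaces each <{...}> placeholder exactly once. Because the input
// is scanned in a single pass, placeholders occurring inside substituted
// values are left as literal text.
func expand(tmpl string, values map[string]string) string {
	return placeholder.ReplaceAllStringFunc(tmpl, func(m string) string {
		key := placeholder.FindStringSubmatch(m)[1]
		if v, ok := values[key]; ok {
			return v
		}
		return m // unknown placeholder: leave untouched
	})
}

func main() {
	out := expand("Problem: <{message}>", map[string]string{
		"message": "this stays literal: <{message}>",
	})
	fmt.Println(out)
}
```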

Architecture

Client → Proxy (8088) → llama.cpp (8081)
            ↓
        Metrics
            ↓
    Admin Server (8089)
            ↓
    Template Watcher
            ↓
    Warmup Manager

Components:

  • Proxy (port 8088) - Forwards requests to llama.cpp, collects metrics
  • Admin (port 8089) - Health status and Prometheus metrics
  • Template Watcher - Monitors template files for changes
  • Warmup Manager - Automatically warms up changed templates

Current Features

  • ✅ Reverse proxy - Forwards all requests to llama.cpp with minimal overhead
  • ✅ Template injection - Automatically injects templates when user messages start with @prefix
  • ✅ Smart KV cache - State tracking optimizes saves/restores (95% reduction in disk I/O)
  • ✅ Immediate warmup - Templates warm up on startup, no waiting for first interval
  • ✅ Request prioritization - User requests automatically cancel warmup operations for instant response
  • ✅ Atomic admission control - Race-free state machine ensures correct request coordination
  • ✅ Admin endpoints - Health and Prometheus metrics on separate port
  • ✅ Template system - File-based templates with message substitution and file inclusion
  • ✅ Template monitoring - Detects file changes via hash comparison
  • ✅ Automatic warmup - Background process warms templates at configurable intervals
  • ✅ Streaming support - Full SSE streaming for chat completions
  • ✅ Cross-platform builds - Build script for darwin/arm64 and linux/arm64 binaries

Roadmap

Phase 1: ✅ Basic Proxy - Request forwarding and metrics
Phase 2: ✅ Admin Server - Health and metrics endpoints
Phase 3: ✅ Template System - File watching and processing
Phase 4: ✅ Warmup Manager - Automatic KV cache warmup
Phase 5: ✅ Template Injection - Intercept @prefix in user messages
Phase 6: ✅ Smart KV Cache - State tracking to optimize save/restore operations

Future Enhancements:

  • GitHub Actions automated releases (auto-build on tags)
  • Additional platform support (linux/amd64, darwin/amd64, windows/amd64)
  • Multi-backend load balancing

Development

Running Tests

Unit tests (fast, no llama.cpp needed):

go test ./...

Manual tests (requires llama.cpp with --slot-save-path):

# See docs/MANUAL_TESTING.md for complete guide
go clean -testcache && go test -tags=manual -v ./...

Project Structure

bioproxy/
├── cmd/bioproxy/          - Main executable
├── internal/
│   ├── config/           - Configuration management
│   ├── proxy/            - Reverse proxy with template injection
│   ├── admin/            - Admin server (health, metrics)
│   ├── template/         - Template watching and processing
│   ├── warmup/           - KV cache warmup manager
│   ├── state/            - Backend state tracking for KV cache optimization
│   └── admission/        - Atomic admission control for request coordination
├── examples/             - Example configuration and templates
│   ├── config.json       - Example configuration file
│   └── templates/        - Example template files
└── docs/                 - Documentation
    ├── WARMUP_DESIGN.md  - Warmup architecture design
    └── MANUAL_TESTING.md - Manual testing guide

Documentation

Design notes live in the docs/ directory: docs/WARMUP_DESIGN.md covers the warmup architecture, and docs/MANUAL_TESTING.md covers manual testing.

License

See LICENSE file.
