WhisperLiveKit
Real-time & local speech-to-text, translation, and speaker diarization. With server & web UI.
Real-time, Fully Local Speech-to-Text with Speaker Identification
PyPI v0.2.7 · 18k installations · Python 3.9–3.13 · License: MIT / dual-licensed
Real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend.
Powered by Leading Research:
SimulStreaming (SOTA 2025) - Ultra-low latency transcription with AlignAtt policy
WhisperStreaming (SOTA 2023) - Low latency transcription with LocalAgreement policy
Streaming Sortformer (SOTA 2025) - Advanced real-time speaker diarization
Diart (SOTA 2021) - Real-time speaker diarization
Silero VAD (2024) - Enterprise-grade Voice Activity Detection
Why not just run a simple Whisper model on every audio batch? Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.
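To give a rough idea of what such a policy does, here is a minimal sketch of LocalAgreement-style buffering (the approach behind WhisperStreaming): keep a growing audio buffer, re-transcribe it on every new chunk, and commit only the prefix on which two consecutive hypotheses agree. This is purely illustrative and not WhisperLiveKit's actual implementation; transcribe is a hypothetical stand-in for any Whisper backend.

def common_prefix(a: list[str], b: list[str]) -> list[str]:
    # Longest shared word prefix of two hypotheses.
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

class LocalAgreementBuffer:
    def __init__(self, transcribe):
        self.transcribe = transcribe          # hypothetical ASR callable: bytes -> str
        self.audio_buffer = b""
        self.previous: list[str] = []         # hypothesis from the previous pass
        self.committed: list[str] = []        # words considered final

    def on_new_chunk(self, chunk: bytes) -> list[str]:
        # Re-decode the whole buffer and commit only the stable prefix.
        self.audio_buffer += chunk
        hypothesis = self.transcribe(self.audio_buffer).split()
        agreed = common_prefix(self.previous, hypothesis)
        newly_committed = agreed[len(self.committed):]
        self.committed = agreed
        self.previous = hypothesis
        return newly_committed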
Architecture
The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.
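The VAD gating can be pictured as follows: run a cheap voice-activity check on each incoming chunk and only invoke the expensive ASR backend when speech is present. A minimal sketch using the public Silero VAD torch.hub interface, purely illustrative and not WhisperLiveKit's internal code (run_asr and the file name are hypothetical placeholders):

import torch

def run_asr(wav):
    # Hypothetical stand-in for handing audio to the Whisper backend
    ...

# Load Silero VAD once (small model, CPU-friendly)
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio('incoming_chunk.wav', sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

if speech:
    run_asr(wav)   # speech detected: hand off to the costly backend
else:
    pass           # silence: skip ASR entirely and save compute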
Installation & Quick Start
pip install whisperlivekit
FFmpeg is required and must be installed before using WhisperLiveKit
OS How to install
Ubuntu/Debian sudo apt install ffmpeg
macOS brew install ffmpeg
Windows Download the .exe from https://ffmpeg.org/download.html and add it to your PATH
Quick Start
1. Start the transcription server:
whisperlivekit-server --model base --language en
2. Open your browser and navigate to http://localhost:8000. Start speaking and watch your words appear in real time!
See tokenizer.py for the list of all available languages.
For HTTPS requirements, see the Parameters section for SSL configuration options.
Optional Dependencies
Optional pip install
Speaker diarization with Sortformer git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
Speaker diarization with Diart diart
Original Whisper backend whisper
Improved timestamps backend whisper-timestamped
Apple Silicon optimization backend mlx-whisper
OpenAI API backend openai
See Parameters & Configuration below on how to use them.
Usage Examples
Command-line Interface: Start the transcription server with various options:
# Use a better model than the default (small)
whisperlivekit-server --model large-v3
# Advanced configuration with diarization and language
whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language fr
Python API Integration: Check basic_server for a more complete example of how to use the functions and classes.
from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from contextlib import asynccontextmanager
import asyncio

transcription_engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the (heavy) model once at startup and share it across connections
    global transcription_engine
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
    yield

app = FastAPI(lifespan=lifespan)

async def handle_websocket_results(websocket: WebSocket, results_generator):
    # Forward transcription results to the client as they are produced
    async for response in results_generator:
        await websocket.send_json(response)
    await websocket.send_json({"type": "ready_to_stop"})

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    global transcription_engine
    # Create a new AudioProcessor for each connection, passing the shared engine
    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
    results_generator = await audio_processor.create_tasks()
    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
    await websocket.accept()
    while True:
        message = await websocket.receive_bytes()
        await audio_processor.process_audio(message)
Frontend Implementation: The package includes an HTML/JavaScript implementation here. You can also get it in Python with from whisperlivekit import get_inline_ui_html and page = get_inline_ui_html().
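For instance, a minimal sketch of serving that page from the FastAPI app in the example above (the route path is an arbitrary choice):

from fastapi.responses import HTMLResponse
from whisperlivekit import get_inline_ui_html

@app.get("/")
async def serve_ui():
    # Return the bundled web UI so the browser can connect to the /asr WebSocket
    return HTMLResponse(get_inline_ui_html())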
Parameters & Configuration
A large number of parameters can be changed. But what should you change?
the --model size. List and recommendations here
the --language. List here. If you use auto, the model attempts to detect the language automatically, but it tends to bias towards English.
the --backend. You can switch to --backend faster-whisper if simulstreaming does not work correctly or if you prefer to avoid the dual-license requirements.
--warmup-file, if you have one
--host, --port, --ssl-certfile, --ssl-keyfile, if you set up a server
--diarization, if you want to use it.
I don't recommend changing the rest, but all the options are listed below.
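As a quick illustration, a customized launch combining the recommended flags above might look like this (hostname and certificate paths are placeholders):

whisperlivekit-server --model large-v3 --language en --backend faster-whisper \
  --diarization --host 0.0.0.0 --port 8000 \
  --ssl-certfile /path/to/cert.pem --ssl-keyfile /path/to/key.pem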
Parameter Description Default
--model Whisper model size. small
--language Source language code or auto auto
--task transcribe or translate transcribe
--backend Processing backend simulstreaming
--min-chunk-size Minimum audio chunk size (seconds) 1.0
--no-vac Disable Voice Activity Controller False
--no-vad Disable Voice Activity Detection False
--warmup-file Audio file path for model warmup jfk.wav
--host Server host address localhost
--port Server port 8000
--ssl-certfile Path to the SSL certificate file (for HTTPS support) None
--ssl-keyfile Path to the SSL private key file (for HTTPS support) None
WhisperStreaming backend options Description Default
--confidence-validation Use confidence scores for faster validation False
--buffer_trimming Buffer trimming strategy ( sentence or segment ) segment
SimulStreaming backend options Description Default
--frame-threshold AlignAtt frame threshold (lower = faster, higher = more accurate) 25
--beams Number of beams for beam search (1 = greedy decoding) 1
--decoder Force decoder type ( beam or greedy ) auto
--audio-max-len Maximum audio buffer length (seconds) 30.0
--audio-min-len Minimum audio length to process (seconds) 0.0
--cif-ckpt-path Path to CIF model for word boundary detection None
--never-fire Never truncate incomplete words False
--init-prompt Initial prompt for the model None
--static-init-prompt Static prompt that doesn't scroll None
--max-context-tokens Maximum context tokens None
--model-path Direct path to the .pt model file. Downloaded if not found ./base.pt
--preloaded-model-count Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) 1
Diarization options Description Default
--diarization Enable speaker identification False
--diarization-backend diart or sortformer sortformer
--segmentation-model Hugging Face model ID for the Diart segmentation model. Available models pyannote/segmentation-3.0
--embedding-model Hugging Face model ID for the Diart embedding model. Available models speechbrain/spkrec-ecapa-voxceleb
For diarization using Diart, you need access to pyannote.audio models:
1. Accept user conditions for the pyannote/segmentation model
2. Accept user conditions for the pyannote/segmentation-3.0 model
3. Accept user conditions for the pyannote/embedding model
4. Login with HuggingFace: huggingface-cli login
Deployment Guide
To deploy WhisperLiveKit in production:
1. Server Setup: Install a production ASGI server and launch it with multiple workers
pip install uvicorn gunicorn
gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
2. Frontend: Host your customized version of the HTML example and make sure the WebSocket URL points to your server
3. Nginx Configuration (recommended for production):
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://localhost:8000;
        proxy_http_version 1.1;                    # required for the WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
4. HTTPS Support: For secure deployments, use "wss://" instead of "ws://" in the WebSocket URL
Docker
Deploy the application easily using Docker with GPU or CPU support.
Prerequisites
Docker installed on your system
For GPU support: NVIDIA Docker runtime installed
Quick Start
With GPU acceleration (recommended):
docker build -t wlk .
docker run --gpus all -p 8000:8000 --name wlk wlk
CPU only:
docker build -f Dockerfile.cpu -t wlk .
docker run -p 8000:8000 --name wlk wlk
Advanced Usage
Custom configuration:
# Example with custom model and language
docker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr
Memory Requirements
Large models: Ensure your Docker runtime has sufficient memory allocated
Customization
--build-arg Options:
EXTRAS="whisper-timestamped" - Add extras to the image's installation (no spaces). Remember to set necessary container
options!
HF_PRECACHE_DIR="./.cache/" - Pre-load a model cache for faster first-time start
HF_TKN_FILE="./token" - Add your Hugging Face Hub access token to download gated models
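For example, a build that bakes in an extra backend, a pre-populated model cache, and a Hugging Face token file might look like this (the argument values come from the list above; adjust paths to your setup):

docker build -t wlk \
  --build-arg EXTRAS="whisper-timestamped" \
  --build-arg HF_PRECACHE_DIR="./.cache/" \
  --build-arg HF_TKN_FILE="./token" .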
Use Cases
Capture discussions in real time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, automatically transcribe podcasts or videos for content creation, or transcribe support calls with speaker identification for customer service, and more.