# telegram-summarizer

A Telegram bot that automatically transcribes and summarizes voice messages longer than 10 seconds using local AI models.
## Features

- 🎤 Transcribes voice messages using OpenAI Whisper large-v3 (optimized for Portuguese)
- 📝 Summarizes transcriptions using Ollama with llama3.1:8b (strict factual mode)
- ⚡ Processes voice messages > 10 seconds automatically
- 🔒 All processing done locally (no external APIs)
- 🇧🇷 Excellent Portuguese language support
## Requirements

- Python 3.10+
- uv - fast Python package installer
- Ollama - for running the local LLM

  ```bash
  # Install Ollama (macOS)
  brew install ollama

  # Start the Ollama service
  ollama serve

  # Pull the model (in another terminal)
  ollama pull llama3.1:8b
  ```

- FFmpeg - for audio conversion

  ```bash
  # macOS
  brew install ffmpeg
  ```

## Installation

1. Install dependencies:

   ```bash
   uv pip install -e .
   ```

2. Create a Telegram bot:
   - Talk to @BotFather on Telegram
   - Create a new bot with `/newbot`
   - Copy the bot token

3. Set environment variables:

   ```bash
   export TELEGRAM_BOT_TOKEN="your-token-here"

   # Optional: set a custom Ollama host (defaults to http://localhost:11434)
   export OLLAMA_HOST="http://localhost:11434"
   ```

4. Add the bot to a group:
   - Add your bot to a Telegram group
   - Make sure the bot has permission to read messages
## Usage

Run the bot directly:

```bash
telegram-summarizer
# or
python -m telegram_summarizer
```

Or run it with Docker:

```bash
docker run -d \
  -e TELEGRAM_BOT_TOKEN=your-token \
  -e OLLAMA_HOST=http://host.docker.internal:11434 \
  -v whisper-cache:/data/.cache \
  ghcr.io/caarlos0/telegram-summarizer:latest
```

The `-v whisper-cache:/data/.cache` volume persists the ~3GB Whisper model between restarts.
## How it works

The bot will:

- Listen for voice messages in groups
- Ignore messages of 10 seconds or shorter
- For longer messages:
  - React with 🙉 while processing
  - Transcribe the audio and extract its core idea
  - Reply only if relevant content is found
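
The steps above can be sketched as a small pipeline. This is an illustrative sketch, not the actual code in `__main__.py`: the function names and the injected `transcribe`/`summarize` callables are hypothetical stand-ins for the real Whisper and Ollama calls.

```python
# Illustrative sketch of the voice-message pipeline; names are hypothetical.
from typing import Callable, Optional

MIN_DURATION = 10  # seconds; mirrors the `voice.duration <= 10` check


def should_process(duration: int) -> bool:
    """Only messages strictly longer than MIN_DURATION are handled."""
    return duration > MIN_DURATION


def handle_voice(
    duration: int,
    transcribe: Callable[[], str],    # Whisper large-v3 in the real bot
    summarize: Callable[[str], str],  # llama3.1:8b via Ollama in the real bot
) -> Optional[str]:
    """Return the reply text, or None when the bot stays silent."""
    if not should_process(duration):
        return None  # 10 seconds or shorter: ignored
    text = transcribe()
    summary = summarize(text)
    return summary or None  # reply only if relevant content was found
```

Injecting the transcription and summarization steps keeps the duration logic testable without loading any model.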
## Configuration

Environment variables:

- `TELEGRAM_BOT_TOKEN` - your bot token (required)
- `OLLAMA_HOST` - Ollama server URL (optional, defaults to `http://localhost:11434`)
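
A minimal sketch of reading these two variables, assuming the documented default; `load_config` is a hypothetical helper, not the bot's actual code:

```python
import os


def load_config(env=os.environ) -> dict:
    """Hypothetical helper: read the documented environment variables."""
    token = env.get("TELEGRAM_BOT_TOKEN")
    if not token:
        raise RuntimeError("TELEGRAM_BOT_TOKEN is required")
    return {
        "token": token,
        # Optional, with the documented default:
        "ollama_host": env.get("OLLAMA_HOST", "http://localhost:11434"),
    }
```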
Code customization - edit `src/telegram_summarizer/__main__.py`:

- `whisper.load_model("large-v3")` - current: best quality for Portuguese. Change to `medium`, `small`, or `base` for faster processing
- `llama3.1:8b` - current model. Use `llama3.2:3b` (faster) or `llama3.2:1b` (even faster but less reliable)
- `voice.duration <= 10` - change the minimum duration threshold
## Models

Current configuration (optimized for Portuguese accuracy):

- Whisper `large-v3`: ~3GB, best quality, especially for Portuguese
- `llama3.1:8b`: ~4.7GB, excellent accuracy with strict factual prompting

Alternative models (faster but less accurate):

- Whisper: `medium` (~1.5GB), `small` (~500MB), `base` (~140MB)
- Ollama: `llama3.2:3b` (~2GB), `llama3.2:1b` (~1.3GB)

For even better quality:

- Ollama: `llama3.1:70b` (~40GB) if you have powerful hardware
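
The size/quality trade-offs above can be summarized as presets. This table is illustrative only; the bot itself hard-codes its models in `__main__.py`:

```python
# Illustrative presets pairing the Whisper and Ollama models listed above.
PRESETS = {
    "fastest":  ("base",     "llama3.2:1b"),  # ~140MB + ~1.3GB
    "fast":     ("small",    "llama3.2:3b"),  # ~500MB + ~2GB
    "balanced": ("medium",   "llama3.1:8b"),  # ~1.5GB + ~4.7GB
    "best":     ("large-v3", "llama3.1:8b"),  # ~3GB + ~4.7GB (current default)
}


def pick_models(preset: str = "best") -> tuple[str, str]:
    """Return (whisper_model, ollama_model) for a preset."""
    return PRESETS[preset]
```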
## License

MIT