Tortoise is a text-to-speech program built with the following priorities:
- Strong multi-voice capabilities.
- Highly realistic prosody and intonation.
This repo contains all the code needed to run Tortoise TTS in inference mode.
Manuscript: https://arxiv.org/abs/2305.07243
A live demo is hosted on Hugging Face Spaces. If you'd like to avoid a queue, please duplicate the Space and add a GPU. Please note that CPU-only spaces do not work for this demo.
https://huggingface.co/spaces/Manmay/tortoise-tts
New! A web UI for easy TTS generation, with a visual interface, real-time progress tracking, and an audio playlist.
```shell
conda create --name tortoise python=3.11 numba inflect -y
conda activate tortoise
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install -e .
pip install flask soundfile

# Start the web interface
python web_ui.py
```

Then open http://localhost:5000 in your browser!
```shell
pip install tortoise-tts
```

If you would like to install the latest development version, you can also install it directly from the git repository:
```shell
pip install git+https://github.com/neonbjb/tortoise-tts
```

I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder, both known for their low sampling rates. On a K80, expect to generate a medium-sized sentence every 2 minutes.
Well... not so slow anymore! We can now reach a real-time factor (RTF) of 0.25-0.3 on 4 GB of VRAM, and with streaming we can get under 500 ms of latency!
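For context, the real-time factor quoted above is the ratio of generation time to the duration of the audio produced, so an RTF of 0.25 means one second of audio takes a quarter of a second to generate. A minimal sketch of the arithmetic:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced (lower is faster)."""
    return generation_seconds / audio_seconds

# An RTF of 0.25 means a 10-second clip is generated in 2.5 seconds.
print(real_time_factor(2.5, 10.0))  # 0.25
```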
See this page for a large list of example outputs.
A cool application of Tortoise + GPT-3 (not affiliated with this repository): https://twitter.com/lexman_ai. Unfortunately, this project no longer seems to be active.
If you want to use this on your own computer, you must have an NVIDIA GPU.
Tip
On Windows, I highly recommend using the Conda installation method. I have been told that if you do not do this, you will spend a lot of time chasing dependency problems.
First, install miniconda: https://docs.conda.io/en/latest/miniconda.html
Then run the following commands, using the Anaconda Prompt as your terminal (or any other terminal configured to work with conda).
This will:
- create a conda environment with the minimal dependencies specified
- activate the environment
- install PyTorch with the command provided here: https://pytorch.org/get-started/locally/
- clone tortoise-tts
- change the current directory to tortoise-tts
- install tortoise in editable mode via pip
```shell
conda create --name tortoise python=3.11 numba inflect -y
conda activate tortoise
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.29.2
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install -e .
pip install flask soundfile==0.12.1
```

Important

Python 3.11 is required; PyTorch does not support Python 3.13+. Pin soundfile==0.12.1 to avoid compatibility issues.
Optionally, PyTorch can be installed in the base environment so that other conda environments can use it too. To do this, simply run the `pip install torch...` line before activating the tortoise environment.
Note
When you want to use tortoise-tts, you will always have to ensure the tortoise conda environment is activated.
An easy way to hit the ground running and a good jumping-off point, depending on your use case.
```shell
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts

docker build . -t tts

docker run --gpus all \
    -e TORTOISE_MODELS_DIR=/models \
    -v /mnt/user/data/tortoise_tts/models:/models \
    -v /mnt/user/data/tortoise_tts/results:/results \
    -v /mnt/user/data/.cache/huggingface:/root/.cache/huggingface \
    -v /root:/work \
    -it tts
```

This gives you an interactive terminal in an environment that's ready to do some TTS. Now you can explore the different interfaces that tortoise exposes for TTS.
For example:
```shell
cd app
conda activate tortoise
time python tortoise/do_tts.py \
    --output_path /results \
    --preset ultra_fast \
    --voice geralt \
    --text "Time flies like an arrow; fruit flies like a banana."
```

On macOS 13+ with M1/M2 chips you need to install the nightly version of PyTorch. As stated on the official page, you can do:

```shell
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
```

Be sure to do that after you activate the environment. If you don't use conda, the commands would look like this:
```shell
python3.10 -m venv .venv
source .venv/bin/activate
pip install numba inflect psutil
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install .
```

Be aware that DeepSpeed is disabled on Apple Silicon, since it does not work there; the --use_deepspeed flag is ignored.
You may need to prepend PYTORCH_ENABLE_MPS_FALLBACK=1 to the commands below to make them work, since MPS does not support all of the operations in PyTorch.
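If you are scripting Tortoise from Python rather than the shell, the same fallback can be enabled by setting the environment variable before torch is imported (a small sketch; PyTorch only reads the variable once, at import time):

```python
import os

# Must be set before `import torch`, since PyTorch reads it at import time.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# import torch  # unsupported MPS ops will now fall back to the CPU
```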
The easiest way to use Tortoise TTS! A full-featured web interface with:
- **4-column layout:** settings, content, audio playlist, and debug console
- **Stage-based progress tracking:** real-time progress with accurate stage detection
- **Audio playlist:** persistent playlist with playback controls (localStorage)
- **Smart file management:** auto-save to the Music folder with intelligent naming
- **Voice management:** upload custom voices, delete voices, batch-generate .pth files
- **Service controls:** restart, stop, open the output folder
- **Debug console:** real-time color-coded logs
- **Cancel generation:** stop long-running generations
- **System monitoring:** CPU, RAM, and GPU usage
```shell
# Start the web UI
python web_ui.py

# Or use the provided scripts
start_webui.bat   # Windows batch file
start_webui.ps1   # PowerShell script
```

Then open http://localhost:5000 in your browser.
**Output location:** Files are automatically saved to `%USERPROFILE%\Music\Tortoise Output`

**File naming:** `{voice}-{preset}-{candidates}x-{number}.wav` (e.g., `tom-fast-1x-001.wav`)
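The naming scheme above can be reproduced with a simple format string. A sketch (illustrative only, not code from the repo; the per-voice counter is assumed to be tracked elsewhere):

```python
def output_filename(voice: str, preset: str, candidates: int, index: int) -> str:
    """Build a name matching the {voice}-{preset}-{candidates}x-{number}.wav scheme."""
    return f"{voice}-{preset}-{candidates}x-{index:03d}.wav"

print(output_filename("tom", "fast", 1, 1))  # tom-fast-1x-001.wav
```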
Two ways to add your own voice with fully automatic preprocessing!
Upload audio files - the system automatically processes them:
- Splits into 10-second segments
- Resamples to 22050 Hz
- Normalizes to the [-1, 1] range (critical for voice matching)
- Converts to standard 16-bit PCM WAV format
- Converts stereo to mono
- Validates audio quality (detects silent or corrupted files)
- Automatically generates a .pth file for instant loading!
Record directly in the browser:
- Record 7-10 clips of 30 seconds each
- Read random paragraphs displayed on screen
- Automatic processing and .pth generation
- Voice ready instantly!
Quick start:
1. Click the "Manage Voices" tab in the web UI
2. Choose "Upload Files" or "Record Voice"
3. Upload 70+ seconds of audio OR record 7-10 clips
4. The system automatically processes the audio and creates a .pth file
5. The voice is ready instantly - select it and generate!
- **Critical minimum:** 5 segments (50 seconds) - poor quality
- **Recommended minimum:** 7 segments (70 seconds) - acceptable quality
- **Optimal:** 10+ segments (100+ seconds) - best quality
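Since the preprocessor splits uploads into 10-second segments, these tiers follow directly from the total recording length. A sketch of that mapping (illustrative only, not code from the repo):

```python
SEGMENT_SECONDS = 10

def quality_tier(total_seconds: float) -> str:
    """Map total recorded audio to the documented quality tiers."""
    segments = int(total_seconds // SEGMENT_SECONDS)
    if segments < 5:
        return "insufficient"
    if segments < 7:
        return "poor"
    if segments < 10:
        return "acceptable"
    return "best"

print(quality_tier(70))   # acceptable
print(quality_tier(120))  # best
```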
See VOICE_CLONING_GUIDE.md for detailed instructions and best practices.
For faster voice loading, pre-compute conditioning latents (.pth files):
```shell
# Generate .pth for a single voice
python tortoise\get_conditioning_latents.py --voice VOICE_NAME --output_path tortoise\voices\VOICE_NAME

# Batch generate .pth for all voices (Windows)
generate_all_pth.bat   # Batch script
generate_all_pth.ps1   # PowerShell script
```

Pre-computed .pth files reduce voice loading time from 10-30 seconds to effectively instant.
This script allows you to speak a single phrase with one or more voices.
```shell
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
```

```shell
python tortoise/socket_server.py
```

will listen at port 5000
This script provides tools for reading large amounts of text.
```shell
python tortoise/read_fast.py --textfile <your text to be read> --voice random
```

This script provides tools for reading large amounts of text.
```shell
python tortoise/read.py --textfile <your text to be read> --voice random
```

This will break the textfile up into sentences and convert them to speech one at a time, outputting a series of spoken clips as they are generated. Once all the clips are generated, it combines them into a single file and outputs that as well.
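The flow read.py follows can be sketched in a few lines: split the text into sentences, synthesize each one, then concatenate the clips. This is an illustration of the idea, not the repo's actual implementation (`synthesize` stands in for a Tortoise call, and the real sentence splitter is more careful):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def read_text(text, synthesize):
    # Synthesize each sentence independently, then combine into one output.
    clips = [synthesize(sentence) for sentence in split_sentences(text)]
    return b"".join(clips)

print(split_sentences("Hello there. How are you?"))  # ['Hello there.', 'How are you?']
```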
Sometimes Tortoise screws up an output. You can regenerate any bad clips by re-running read.py with the --regenerate argument.
Tortoise can be used programmatically, like so:

```python
from tortoise import api
from tortoise.utils import audio

# clips_paths: a list of paths to short reference wav clips of the target voice
reference_clips = [audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```

To use DeepSpeed:

```python
tts = api.TextToSpeech(use_deepspeed=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```

To use the KV cache:

```python
tts = api.TextToSpeech(kv_cache=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```

To run the model in float16:

```python
tts = api.TextToSpeech(half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```

For faster runs, use all three:

```python
tts = api.TextToSpeech(use_deepspeed=True, kv_cache=True, half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```

Warning
Tortoise TTS is slow by design. The autoregressive architecture processes sequentially, making it 10-100x slower than diffusion-only models like Stable Diffusion.
Expected generation times (RTX 3060, batch_size=4):
- `ultra_fast`: ~30 seconds (16 autoregressive samples)
- `fast`: ~8 minutes (96 samples, 24 batches)
- `standard`: ~20 minutes (256 samples, 64 batches)
- `high_quality`: ~40+ minutes (256 samples, 100 diffusion steps)
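The batch counts above are just the autoregressive sample counts divided by the batch size (4 in this table):

```python
def num_batches(samples: int, batch_size: int = 4) -> int:
    # e.g. the `fast` preset: 96 samples / 4 per batch = 24 batches
    return samples // batch_size

print(num_batches(96))   # 24
print(num_batches(256))  # 64
```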
Optimization tips:
- Use `ultra_fast` for quick testing
- Pre-compute voice .pth files (instant loading vs. 10-30 seconds)
- Reduce `autoregressive_batch_size` if experiencing system instability (default: 4)
- Monitor GPU memory usage; reduce candidates or the preset if OOM errors occur
See SPEED_GUIDE.md for detailed performance information.
- `WEB_UI_README.md` - Comprehensive web UI documentation
- `VOICE_CLONING_GUIDE.md` - NEW! Voice cloning with automatic preprocessing
- `QUICK_START.md` - Quick reference guide
- `SPEED_GUIDE.md` - Performance characteristics explained
- `TROUBLESHOOTING_RESULTS.md` - Common issues and solutions
- `SOUNDFILE_FIX.md` - soundfile compatibility fix
- `voice_customization_guide.md` - Creating custom voices
- `Advanced_Usage.md` - Advanced features and techniques
This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen:
- Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
- Ramesh et al who authored the DALLE paper, which is the inspiration behind Tortoise.
- Nichol and Dhariwal who authored the revision of the code that drives the diffusion model.
- Jang et al who developed and open-sourced univnet, the vocoder this repo uses.
- Kim and Jung who implemented the univnet PyTorch model.
- lucidrains who writes awesome open source pytorch models, many of which are used here.
- Patrick von Platen whose guides on setting up wav2vec were invaluable to building my dataset.
- dil-mange-amore who developed the comprehensive Web UI with stage-based progress tracking, audio playlist, debug console, voice management, batch processing scripts, and extensive documentation improvements.
Tortoise was built entirely by the author (James Betker) using their own hardware. Their employer was not involved in any facet of Tortoise's development.
Tortoise TTS is licensed under the Apache 2.0 license.
If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.