1. Download CosyVoice2 with git (downloading the zip also works, but the zip archive does not contain the submodules)
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If you failed to clone the submodule due to network failures, please run the following command until success
cd CosyVoice
git submodule update --init --recursive
2. Install Conda
Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
3. Create a conda environment for CosyVoice
conda create -n cosyvoicev2 -y python=3.10
conda activate cosyvoicev2
# pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platforms.
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
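Before moving on, it can help to confirm that the new environment is the one actually in use. The following is a minimal sketch, not part of CosyVoice: the helper name check_environment and the list of modules to probe are assumptions chosen for illustration, and the check only tests whether the modules can be located, not whether they work.

```python
import importlib.util
import sys


def check_environment(required=("torch", "torchaudio", "pynini"), python=(3, 10)):
    """Return a list of human-readable problems; an empty list means all checks passed.

    The default module names are an assumption about what the install steps provide.
    """
    problems = []
    if tuple(sys.version_info[:2]) != tuple(python):
        problems.append(
            "expected Python {}.{}, found {}.{}".format(*python, *sys.version_info[:2])
        )
    for name in required:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append("module not installed: " + name)
    return problems
```

Calling check_environment() inside the activated cosyvoicev2 environment and printing the result should give an empty list if the steps above succeeded.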
4. Download the models
# Download the models with git; make sure git lfs is installed
mkdir pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
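The pretrained_models directory must be created inside the CosyVoice directory. The downloaded repositories are:
- CosyVoice2-0.5B
  The larger model of the CosyVoice2 series, with 500 million (0.5B) parameters. Compared with the base version it may offer better synthesis quality, multilingual support, or voice expressiveness, which suits cases that need high-fidelity speech generation.
- CosyVoice-300M
  The base CosyVoice model, with 300 million (300M) parameters. It provides the core speech synthesis capability and fits general speech generation tasks.
- CosyVoice-300M-SFT
  SFT (Supervised Fine-Tuning) means the model has been fine-tuned under supervision. It is usually built on the base model and optimized for specific tasks or data, so it may perform better in those domains.
- CosyVoice-300M-Instruct
  Instruct means the model has been instruction fine-tuned, so natural-language instructions can control attributes such as speaking style and emotion, which makes synthesis more interactive and controllable.
- CosyVoice-ttsfrd
  Not a speech model. It holds the resources for ttsfrd, the optional text-normalization front end. When ttsfrd cannot be imported, CosyVoice falls back to WeTextProcessing.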
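Because the clone commands depend on Git LFS, a clone made without it leaves small text stubs where the weight files should be. The sketch below scans a model directory for such stubs by looking for the header line that Git LFS pointer files begin with. The helper name find_lfs_pointers is hypothetical, and this is only one possible check, not something the CosyVoice project ships.

```python
from pathlib import Path

# Git LFS pointer files are small text files that start with this line.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip git metadata and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_HEADER)) == LFS_HEADER:
                pointers.append(path)
    return pointers
```

If find_lfs_pointers("pretrained_models/CosyVoice2-0.5B") returns any paths, running git lfs pull inside that model directory should replace the stubs with the real files.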
5. Start the web service; the --model_dir argument selects the model to use
python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct
Then open http://localhost:50000/ in a browser.
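The traceback in section 8.2 shows webui.py trying the CosyVoice class first, then CosyVoice2, and finally raising "no valid model_type!". It also shows CosyVoice2 looking for cosyvoice2.yaml in the model directory. A quick pre-flight check along those lines might look like the sketch below. The function name detect_model_type is hypothetical, and the assumption that version 1 models ship a cosyvoice.yaml file is mine rather than something these notes confirm.

```python
import os


def detect_model_type(model_dir):
    """Guess which class a model directory needs from its YAML config file."""
    # cosyvoice2.yaml is the file CosyVoice2 looks for (see the traceback in 8.2).
    if os.path.exists(os.path.join(model_dir, "cosyvoice2.yaml")):
        return "CosyVoice2"
    # Assumption: version 1 models are configured by cosyvoice.yaml.
    if os.path.exists(os.path.join(model_dir, "cosyvoice.yaml")):
        return "CosyVoice"
    return None
```

A result of None for the directory passed to --model_dir suggests the download is incomplete or the path is wrong.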
6. Calling the model from code
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, load_vllm=False, fp16=False)
# NOTE if you want to reproduce the results on https://funaudiollm.github.io/cosyvoice2, please add text_frontend=False during inference
# zero_shot usage
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# save zero_shot spk for future usage
assert cosyvoice.add_zero_shot_spk('希望你以后能够做的比我还好呦。', prompt_speech_16k, 'my_zero_shot_spk') is True
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '', '', zero_shot_spk_id='my_zero_shot_spk', stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
cosyvoice.save_spkinfo()
# fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L248
for i, j in enumerate(cosyvoice.inference_cross_lingual('在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。', prompt_speech_16k, stream=False)):
    torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# instruct usage
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# bistream usage, you can use generator as input, this is useful when using text llm model as input
# NOTE you should still have some basic sentence split logic because llm can not handle arbitrary sentence length
def text_generator():
    yield '收到好友从远方寄来的生日礼物,'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐,'
    yield '笑容如花儿般绽放。'
for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
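The note in the bistream example asks for "some basic sentence split logic" when the text comes from an LLM. One possible shape for that logic is sketched below: it splits after sentence-ending punctuation and hard-wraps anything still longer than a limit. The function name split_sentences, the punctuation set, and the 60-character default are all assumptions for illustration, not values taken from CosyVoice.

```python
import re

# Zero-width split point right after Chinese or Western sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[。!?;!?;])")


def split_sentences(text, max_chars=60):
    """Yield sentence-sized chunks, hard-wrapping any chunk longer than max_chars."""
    for sentence in _SENTENCE_END.split(text):
        sentence = sentence.strip()
        # Wrap overly long sentences so no chunk exceeds max_chars.
        while len(sentence) > max_chars:
            yield sentence[:max_chars]
            sentence = sentence[max_chars:]
        if sentence:
            yield sentence
```

Since the example passes a generator as the first argument of inference_zero_shot, split_sentences(llm_output) could take the place of text_generator() there, where llm_output stands for whatever text the upstream model produced.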
C:\Users\Administrator\anaconda3\envs\cosyvoice\python.exe D:\LLM\CosyVoice\CosyVoice\test.py
failed to import ttsfrd, use WeTextProcessing instead
Traceback (most recent call last):
File "C:\Users\Administrator\anaconda3\envs\cosyvoice\lib\pydoc.py", line 439, in safeimport
module = __import__(path)
File "D:\LLM\CosyVoice\CosyVoice\cosyvoice\flow\flow_matching.py", line 17, in <module>
from matcha.models.components.flow_matching import BASECFM
ModuleNotFoundError: No module named 'matcha.models'; 'matcha' is not a package
During handling of the above exception, another exception occurred:
Fix
pip install Matcha-TTS
Install the Matcha-TTS dependencies in the conda environment
(cosyvoice) D:\LLM\CosyVoice\CosyVoice\third_party\Matcha-TTS>pip install -r requirements.txt
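The failing import is matcha.models.components.flow_matching, which the script expects to find under third_party/Matcha-TTS through the sys.path.append call. One possible cause, which these notes do not confirm, is that the submodule directory is empty (for example after a zip download) or that the script runs from a different working directory. The sketch below checks for that. The helper name matcha_submodule_ready is hypothetical.

```python
from pathlib import Path


def matcha_submodule_ready(repo_root="."):
    """Return True if third_party/Matcha-TTS contains the matcha/models sources."""
    models_dir = Path(repo_root) / "third_party" / "Matcha-TTS" / "matcha" / "models"
    return models_dir.is_dir()
```

If this returns False for the CosyVoice directory, running git submodule update --init --recursive from step 1 again is worth trying before installing Matcha-TTS with pip.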
7. Deploy with Docker (this did not run successfully for me)
cd runtime/python
docker build -t cosyvoice:v2.0 .
# change pretrained_models/CosyVoice-300M to pretrained_models/CosyVoice-300M-Instruct if you want to use instruct inference
# for grpc usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v2.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python server.py --port 50000 --max_conc 4 --model_dir pretrained_models/CosyVoice-300M-Instruct && sleep infinity"
cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
# for fastapi usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v2.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python server.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct && sleep infinity"
cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
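Since the container runs detached, a failed start is easy to miss. A small probe of the published port can tell whether anything is listening before the client is run. This is a generic TCP check, sketched under the assumption that the service is mapped to port 50000 on the local machine as in the commands above. The name port_open is hypothetical.

```python
import socket


def port_open(host="127.0.0.1", port=50000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If port_open() stays False after the container has had time to load the model, docker logs for that container is the next place to look.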
8.1 Generation fails with "The system cannot find the file specified"
Try installing ffmpeg
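On Windows this message commonly appears when a program tries to launch an executable that is not on PATH, which is presumably why ffmpeg is the suspect here. A quick way to see whether Python can find it is sketched below. The helper name ffmpeg_path is hypothetical.

```python
import shutil


def ffmpeg_path():
    """Return the full path of the ffmpeg executable found on PATH, or None."""
    return shutil.which("ffmpeg")
```

A result of None from inside the activated conda environment means ffmpeg still needs to be installed or added to PATH.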
8.2 Dialects cannot be synthesized
Install the ttsfrd dependency
(cosyvoicev2) D:\LLM\CosyVoice\CosyVoice\pretrained_models\CosyVoice-ttsfrd>pip install ttsfrd_dependency-0.1-py3-none-any.whl
Processing d:\llm\cosyvoice\cosyvoice\pretrained_models\cosyvoice-ttsfrd\ttsfrd_dependency-0.1-py3-none-any.whl
Installing collected packages: ttsfrd-dependency
Successfully installed ttsfrd-dependency-0.1
Install ttsfrd
(cosyvoicev2) D:\LLM\CosyVoice\CosyVoice\pretrained_models\CosyVoice-ttsfrd>pip install ttsfrd
Collecting ttsfrd
Downloading ttsfrd-0.1.0-py3-none-any.whl.metadata (291 bytes)
Downloading ttsfrd-0.1.0-py3-none-any.whl (1.3 kB)
Installing collected packages: ttsfrd
Successfully installed ttsfrd-0.1.0
After the installation, startup fails with an error
(cosyvoicev2) D:\LLM\CosyVoice\CosyVoice>python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct
C:\Users\Administrator\anaconda3\envs\cosyvoicev2\lib\site-packages\diffusers\models\lora.py:393: FutureWarning: `LoRACompatibleLinear` is deprecated and will be removed in version 1.0.0. Use of `LoRACompatibleLinear` is deprecated. Please switch to PEFT backend by installing PEFT: `pip install peft`.
deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
2025-06-29 16:52:14,383 INFO input frame rate=50
C:\Users\Administrator\anaconda3\envs\cosyvoicev2\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
warnings.warn(
Traceback (most recent call last):
File "D:\LLM\CosyVoice\CosyVoice\webui.py", line 188, in <module>
cosyvoice = CosyVoice(args.model_dir)
File "D:\LLM\CosyVoice\CosyVoice\cosyvoice\cli\cosyvoice.py", line 41, in __init__
self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
File "D:\LLM\CosyVoice\CosyVoice\cosyvoice\cli\frontend.py", line 65, in __init__
self.frd = ttsfrd.TtsFrontendEngine()
AttributeError: module 'ttsfrd' has no attribute 'TtsFrontendEngine'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\LLM\CosyVoice\CosyVoice\webui.py", line 191, in <module>
cosyvoice = CosyVoice2(args.model_dir)
File "D:\LLM\CosyVoice\CosyVoice\cosyvoice\cli\cosyvoice.py", line 152, in __init__
raise ValueError('{} not found!'.format(hyper_yaml_path))
ValueError: pretrained_models/CosyVoice-300M-Instruct/cosyvoice2.yaml not found!
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\LLM\CosyVoice\CosyVoice\webui.py", line 193, in <module>
raise TypeError('no valid model_type!')
TypeError: no valid model_type!
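The cause was that the wrong ttsfrd had been installed. The package fetched from PyPI below is only a 1.3 kB wheel and does not provide TtsFrontendEngine:
(cosyvoicev2) D:\LLM\CosyVoice\CosyVoice\pretrained_models\CosyVoice-ttsfrd>pip install ttsfrd
Run pip uninstall ttsfrd to remove the wrong package.
There is no ttsfrd build for Windows.
Use WeTextProcessing in place of ttsfrd, although dialects then cannot be synthesized.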
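The traceback shows frontend.py calling ttsfrd.TtsFrontendEngine(), so an installed module without that attribute is the wrong package. The sketch below classifies the situation before webui.py is started. The function name ttsfrd_status and its three return values are assumptions made for illustration.

```python
import importlib


def ttsfrd_status(module_name="ttsfrd"):
    """Classify the installed ttsfrd module as 'missing', 'wrong-package', or 'ok'."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "missing"
    # frontend.py calls ttsfrd.TtsFrontendEngine(), so that attribute must exist.
    return "ok" if hasattr(module, "TtsFrontendEngine") else "wrong-package"
```

With "missing", CosyVoice logs "failed to import ttsfrd, use WeTextProcessing instead" and carries on, whereas "wrong-package" is the state that produced the AttributeError above.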
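9. Final solution for dialects: use an integrated (all-in-one) package
With it, Sichuanese output works.
Calling the API
promoteAudio/sichuan.wav is a sample Sichuanese audio clip
prompt_text is the transcript of that sample clip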
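from gradio_client import Client, file  # newer gradio_client releases name this helper handle_file

# Assumption: the integrated package's web UI is reachable at this address; adjust it to your setup.
client = Client("http://localhost:50000/")
# command holds the text to synthesize.
result = client.predict(
    tts_text=command,
    mode_checkbox_group="自然语言控制",
    prompt_text="要不要嘛 要不要得 你要爪子 老子数到三 把你龟儿卡撕烂 莫挨老子 个人爬 滚",
    prompt_wav_upload=file('promoteAudio/sichuan.wav'),
    prompt_wav_record=file('https://github.com/gradio-app/gradio/raw/main/test/test_files/audio_sample.wav'),
    instruct_text="用四川话说这段话",
    seed=0,
    stream="false",
    speed=1,
    api_name="/generate_audio"
)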
Web usage