CosyVoice2

iffy1

Last modified 2025-06-30 23:43:06

Licensed under CC 4.0 BY-SA

First published 2025-06-28 22:18:10

Article link: https://blog.csdn.net/iffy1/article/details/148982942

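1. Download CosyVoice2 with git (downloading the zip also works, but a zip download does not include the third_party/Matcha-TTS submodule)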

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If you failed to clone the submodule due to network failures, please run the following command until success
cd CosyVoice
git submodule update --init --recursive

2. Install conda

Install Conda: please see https://docs.conda.io/en/latest/miniconda.html

3. Create a conda environment for CosyVoice

conda create -n cosyvoicev2 -y python=3.10
conda activate cosyvoicev2
# pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platforms.
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

4. Download the models

# Download the models with git; make sure git lfs is installed
mkdir pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd

Run these commands from the CosyVoice directory so that pretrained_models is created inside it.

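CosyVoice2-0.5B
The larger model of the CosyVoice2 series, with 0.5 billion (0.5B) parameters. Compared with the base version it may offer better synthesis quality, multilingual support, or timbre expressiveness, and it suits scenarios that need high-fidelity speech generation.
CosyVoice-300M
The base CosyVoice model, with 300 million (300M) parameters. It provides the core speech-synthesis capability and suits general-purpose generation tasks.
CosyVoice-300M-SFT
SFT (Supervised Fine-Tuning) means the model has undergone supervised fine-tuning. It is typically built on the base model and tuned for specific tasks or data, so it may perform better in particular domains.
CosyVoice-300M-Instruct
Instruct means the model is instruction-tuned, so natural-language instructions can control the style, emotion, and similar attributes of the synthesized speech.
CosyVoice-ttsfrd
Not a speech model. It holds the resources and wheel files for ttsfrd, an optional text-normalization frontend. When ttsfrd cannot be imported, CosyVoice falls back to WeTextProcessing, as the "failed to import ttsfrd, use WeTextProcessing instead" log line in section 6 shows.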
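
Before launching anything, it can help to see which of the downloaded directories are actually loadable models. The sketch below guesses the model family from the config file in each directory. The name cosyvoice2.yaml comes from the traceback in section 8.2; cosyvoice.yaml for the 300M models is an assumption about the upstream layout, and both helper functions are hypothetical names, not part of CosyVoice.

```python
from pathlib import Path

# Assumption: config file name per model family. cosyvoice2.yaml appears in the
# traceback in section 8.2; cosyvoice.yaml is assumed for the 300M models.
CONFIG_BY_FAMILY = {
    "CosyVoice2": "cosyvoice2.yaml",
    "CosyVoice": "cosyvoice.yaml",
}


def detect_model_family(model_dir):
    """Return 'CosyVoice2', 'CosyVoice', or None if no known config is present."""
    root = Path(model_dir)
    for family, config_name in CONFIG_BY_FAMILY.items():
        if (root / config_name).is_file():
            return family
    return None


def summarize_models(pretrained_root="pretrained_models"):
    """Map each subdirectory name to its detected family (None if unknown)."""
    root = Path(pretrained_root)
    if not root.is_dir():
        return {}
    return {p.name: detect_model_family(p) for p in sorted(root.iterdir()) if p.is_dir()}
```

A directory that maps to None (CosyVoice-ttsfrd would, under these assumptions) is not something to pass to --model_dir.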

5. Start the web UI; --model_dir selects the model to use

python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M

python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct

Then open http://localhost:50000/ in a browser.
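
If the page does not load, a quick way to tell "server not started yet" apart from "browser problem" is to test the port directly. This is a minimal sketch using only the standard library; is_port_open is a hypothetical helper name, and the defaults simply mirror the --port 50000 used above.

```python
import socket


def is_port_open(host="localhost", port=50000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Usage: print("web UI reachable:", is_port_open())
```

Model loading can take a while, so a False result shortly after launch does not necessarily mean the server failed.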

6. Calling from code

import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio


cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, load_vllm=False, fp16=False)

# NOTE if you want to reproduce the results on https://funaudiollm.github.io/cosyvoice2, please add text_frontend=False during inference
# zero_shot usage
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# save zero_shot spk for future usage
assert cosyvoice.add_zero_shot_spk('希望你以后能够做的比我还好呦。', prompt_speech_16k, 'my_zero_shot_spk') is True
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '', '', zero_shot_spk_id='my_zero_shot_spk', stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
cosyvoice.save_spkinfo()

# fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L248
for i, j in enumerate(cosyvoice.inference_cross_lingual('在他讲述那个荒诞故事的过程中，他突然[laughter]停下来，因为他自己也被逗笑了[laughter]。', prompt_speech_16k, stream=False)):
    torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# instruct usage
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# bistream usage, you can use generator as input, this is useful when using text llm model as input
# NOTE you should still have some basic sentence split logic because llm can not handle arbitrary sentence length
def text_generator():
    yield '收到好友从远方寄来的生日礼物，'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐，'
    yield '笑容如花儿般绽放。'
for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
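
The NOTE above asks for basic sentence-splitting logic before text is fed in as a generator. One possible sketch, assuming punctuation-based splitting is good enough for the input at hand: split_sentences is a hypothetical helper, and the max_chars default of 40 is an arbitrary illustrative limit, not a value taken from CosyVoice.

```python
import re

# Zero-width split points right after common Chinese and Western punctuation.
_BOUNDARY = re.compile(r"(?<=[，。！？；,.!?;])")


def split_sentences(text, max_chars=40):
    """Yield pieces of text cut at punctuation, each at most max_chars long."""
    for piece in _BOUNDARY.split(text):
        piece = piece.strip()
        # Fall back to hard cuts when a long stretch has no punctuation.
        while len(piece) > max_chars:
            yield piece[:max_chars]
            piece = piece[max_chars:]
        if piece:
            yield piece
```

The generator can be passed where text_generator() is used above, for example cosyvoice.inference_zero_shot(split_sentences(long_text), ...). Because it also splits on ".", decimal numbers such as 3.14 would be cut in two, so real input may need a more careful rule.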

C:\Users\Administrator\anaconda3\envs\cosyvoice\python.exe D:\LLM\CosyVoice\CosyVoice\test.py 
failed to import ttsfrd, use WeTextProcessing instead
Traceback (most recent call last):
  File "C:\Users\Administrator\anaconda3\envs\cosyvoice\lib\pydoc.py", line 439, in safeimport
    module = __import__(path)
  File "D:\LLM\CosyVoice\CosyVoice\cosyvoice\flow\flow_matching.py", line 17, in <module>
    from matcha.models.components.flow_matching import BASECFM
ModuleNotFoundError: No module named 'matcha.models'; 'matcha' is not a package

During handling of the above exception, another exception occurred:

Fix

pip install Matcha-TTS

Install the Matcha-TTS dependencies in the conda environment

(cosyvoice) D:\LLM\CosyVoice\CosyVoice\third_party\Matcha-TTS>pip install -r requirements.txt
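
One likely cause of this error is an empty third_party/Matcha-TTS directory, which is what a clone without --recursive leaves behind. The script adds that directory to sys.path and the failing import is matcha.models.components.flow_matching, so the sketch below checks whether the corresponding file exists. matcha_submodule_ready is a hypothetical helper, and the file path is inferred from the import statement rather than confirmed against the repository.

```python
from pathlib import Path


def matcha_submodule_ready(repo_root="."):
    """Return True if third_party/Matcha-TTS holds the module the traceback needs."""
    # Path inferred from `from matcha.models.components.flow_matching import BASECFM`.
    target = (Path(repo_root) / "third_party" / "Matcha-TTS"
              / "matcha" / "models" / "components" / "flow_matching.py")
    return target.is_file()
```

If this returns False from the repository root, running git submodule update --init --recursive (step 1) is worth trying before installing Matcha-TTS from pip. Note also that sys.path.append('third_party/Matcha-TTS') is a relative path, so the script has to be run from the repository root for it to take effect.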

7. Docker deployment (did not run successfully)

cd runtime/python
docker build -t cosyvoice:v2.0 .
# change pretrained_models/CosyVoice-300M to pretrained_models/CosyVoice-300M-Instruct if you want to use instruct inference

# for grpc usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v2.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python server.py --port 50000 --max_conc 4 --model_dir pretrained_models/CosyVoice-300M-Instruct && sleep infinity"
cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>

# for fastapi usage
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v2.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python server.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct && sleep infinity"
cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>

8.1 Generation fails with "The system cannot find the file specified"

Try installing ffmpeg. Windows builds are available from:

https://www.gyan.dev/ffmpeg/builds/

https://github.com/BtbN/FFmpeg-Builds/releases
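
On Windows this message commonly appears when a program tries to launch an executable that is not on PATH, so after installing ffmpeg it is worth confirming that the environment running CosyVoice can actually see it. A minimal sketch with the standard library; tool_on_path is a hypothetical helper name.

```python
import shutil


def tool_on_path(name="ffmpeg"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)

# Usage: print(tool_on_path() or "ffmpeg not found on PATH")
```

If the result is None even after installation, the ffmpeg bin directory probably still needs to be added to PATH, and the terminal restarted so the change is picked up.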

8.2 Dialects cannot be spoken

Install the ttsfrd dependency

(cosyvoicev2) D:\LLM\CosyVoice\CosyVoice\pretrained_models\CosyVoice-ttsfrd>pip install ttsfrd_dependency-0.1-py3-none-any.whl
Processing d:\llm\cosyvoice\cosyvoice\pretrained_models\cosyvoice-ttsfrd\ttsfrd_dependency-0.1-py3-none-any.whl
Installing collected packages: ttsfrd-dependency
Successfully installed ttsfrd-dependency-0.1

Install ttsfrd

(cosyvoicev2) D:\LLM\CosyVoice\CosyVoice\pretrained_models\CosyVoice-ttsfrd>pip install ttsfrd
Collecting ttsfrd
  Downloading ttsfrd-0.1.0-py3-none-any.whl.metadata (291 bytes)
Downloading ttsfrd-0.1.0-py3-none-any.whl (1.3 kB)
Installing collected packages: ttsfrd
Successfully installed ttsfrd-0.1.0

After installation, startup fails with an error

(cosyvoicev2) D:\LLM\CosyVoice\CosyVoice>python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct
C:\Users\Administrator\anaconda3\envs\cosyvoicev2\lib\site-packages\diffusers\models\lora.py:393: FutureWarning: `LoRACompatibleLinear` is deprecated and will be removed in version 1.0.0. Use of `LoRACompatibleLinear` is deprecated. Please switch to PEFT backend by installing PEFT: `pip install peft`.
  deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
2025-06-29 16:52:14,383 INFO input frame rate=50
C:\Users\Administrator\anaconda3\envs\cosyvoicev2\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
  warnings.warn(
Traceback (most recent call last):
  File "D:\LLM\CosyVoice\CosyVoice\webui.py", line 188, in <module>
    cosyvoice = CosyVoice(args.model_dir)
  File "D:\LLM\CosyVoice\CosyVoice\cosyvoice\cli\cosyvoice.py", line 41, in __init__
    self.frontend = CosyVoiceFrontEnd(configs['get_tokenizer'],
  File "D:\LLM\CosyVoice\CosyVoice\cosyvoice\cli\frontend.py", line 65, in __init__
    self.frd = ttsfrd.TtsFrontendEngine()
AttributeError: module 'ttsfrd' has no attribute 'TtsFrontendEngine'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\LLM\CosyVoice\CosyVoice\webui.py", line 191, in <module>
    cosyvoice = CosyVoice2(args.model_dir)
  File "D:\LLM\CosyVoice\CosyVoice\cosyvoice\cli\cosyvoice.py", line 152, in __init__
    raise ValueError('{} not found!'.format(hyper_yaml_path))
ValueError: pretrained_models/CosyVoice-300M-Instruct/cosyvoice2.yaml not found!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\LLM\CosyVoice\CosyVoice\webui.py", line 193, in <module>
    raise TypeError('no valid model_type!')
TypeError: no valid model_type!

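The cause was that the wrong ttsfrd package had been installed. The command below downloaded ttsfrd 0.1.0 from the package index (a 1.3 kB wheel, per the install log above), and that package has no TtsFrontendEngine.

(cosyvoicev2) D:\LLM\CosyVoice\CosyVoice\pretrained_models\CosyVoice-ttsfrd>pip install ttsfrd

Uninstall the wrong version with pip uninstall ttsfrd.

There is no ttsfrd build for Windows.

Use WeTextProcessing in place of ttsfrd; the drawback is that dialects cannot be spoken.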
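
To tell the three situations apart (no ttsfrd, the wrong ttsfrd, a usable ttsfrd) without launching the web UI, the check can be scripted. This sketch relies only on what the traceback shows, namely that frontend.py calls ttsfrd.TtsFrontendEngine(); ttsfrd_status is a hypothetical helper and its return labels are made up for illustration.

```python
import importlib


def ttsfrd_status(module_name="ttsfrd"):
    """Classify the installed ttsfrd as 'missing', 'wrong-package', or 'ok'."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "missing"
    # frontend.py calls ttsfrd.TtsFrontendEngine(), as shown in the traceback.
    if hasattr(module, "TtsFrontendEngine"):
        return "ok"
    return "wrong-package"
```

Judging by the log line in section 6, "missing" is harmless because CosyVoice then falls back to WeTextProcessing, whereas "wrong-package" reproduces the AttributeError above and calls for pip uninstall ttsfrd.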

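9. Final solution for dialects: use the all-in-one package

It can output Sichuanese.

Calling the API

promoteAudio/sichuan.wav is a sample Sichuanese audio clip.

prompt_text is the transcript of that sample clip.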

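from gradio_client import Client, file  # newer gradio_client releases replace file() with handle_file()

# Assumption: the all-in-one package serves its web UI at this address; adjust to match yours.
client = Client("http://localhost:50000/")
command = "..."  # placeholder: the text to synthesize

result = client.predict(
    tts_text=command,
    mode_checkbox_group="自然语言控制",  # UI mode label: natural-language control
    prompt_text="要不要嘛  要不要得 你要爪子  老子数到三 把你龟儿卡撕烂  莫挨老子  个人爬 滚",
    prompt_wav_upload=file('promoteAudio/sichuan.wav'),
    prompt_wav_record=file('https://github.com/gradio-app/gradio/raw/main/test/test_files/audio_sample.wav'),
    instruct_text="用四川话说这段话",  # "say this passage in Sichuanese"
    seed=0,
    stream="false",
    speed=1,
    api_name="/generate_audio"
)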

Using the web UI