[Paper Reading] Grad-TTS

小方abc

License: CC 4.0 BY-SA

Tags: paper reading

Article link: https://blog.csdn.net/qq_38657433/article/details/146138681

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

Abstract
Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with score-based decoder producing mel-spectrograms by gradually transforming noise predicted by encoder and aligned with text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters and allows to make this reconstruction flexible by explicitly controlling trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score. The code is publicly available at https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS.

1. Introduction

Deep generative modelling proved to be effective in various machine learning fields, and speech synthesis is no exception. Modern text-to-speech (TTS) systems often consist of two parts designed as deep neural networks: the first part converts the input text into time-frequency domain acoustic features (feature generator), and the second one synthesizes raw waveform conditioned on these features (vocoder).

The introduction of the conventional state-of-the-art autoregressive models such as Tacotron2 (Shen et al., 2018) used for feature generation and WaveNet (van den Oord et al., 2016) used as vocoder marked the beginning of the neural TTS era.

Later, other popular generative modelling frameworks such as Generative Adversarial Networks (Goodfellow et al., 2014) and Normalizing Flows (Rezende & Mohamed, 2015) were used in the design of TTS engines for parallel generation with comparable quality of the synthesized speech.

Since the publication of the WaveNet paper (2016), there have been various attempts to propose a parallel non-autoregressive vocoder, which could synthesize high-quality speech. Popular architectures based on Normalizing Flows like Parallel WaveNet (van den Oord et al., 2018) and WaveGlow (Prenger et al., 2019) managed to accelerate inference while keeping synthesis quality at a very high level but demonstrated fast synthesis on GPU devices only.

Eventually, parallel GAN-based vocoders such as Parallel Wave-GAN (Yamamoto et al., 2020), MelGAN (Kumar et al., 2019), and HiFi-GAN (Kong et al., 2020) greatly improved the performance of waveform generation on CPU devices. Furthermore, the latter model is reported to produce speech samples of state-of-the-art quality outperforming WaveNet.

Among feature generators, Tacotron2 (Shen et al., 2018) and Transformer-TTS (Li et al., 2019) enabled highly natural speech synthesis. Producing acoustic features frame by frame, they achieve almost perfect mel-spectrogram reconstruction from input text. Nonetheless, they often suffer from computational inefficiency and pronunciation issues coming from attention failures.

Addressing these problems, such models as FastSpeech (Ren et al., 2019) and Parallel Tacotron (Elias et al., 2020) substantially improved inference speed and pronunciation robustness by utilizing non-autoregressive architectures and building hard monotonic alignments from estimated token lengths. However, in order to learn character duration, they still require pre-computed alignment from the teacher model.

Finally, the recently proposed Non-Attentive Tacotron framework (Shen et al., 2020) managed to learn durations implicitly by employing the Variational Autoencoder concept.

Glow-TTS feature generator (Kim et al., 2020) based on Normalizing Flows can be considered as one of the most successful attempts to overcome pronunciation and computational latency issues typical for autoregressive solutions. Glow-TTS model made use of Monotonic Alignment Search algorithm (an adoption of Viterbi training (Rabiner, 1989) finding the most likely hidden alignment between two sequences) proposed to map the input text to mel-spectrograms efficiently.
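To make the idea of a Viterbi-style alignment search concrete, here is a minimal dynamic-programming sketch. It is an illustration under stated assumptions, not the Glow-TTS implementation: the function name `monotonic_alignment_search` and the input `log_lik` (a hypothetical table where `log_lik[i][j]` is the log-likelihood of mel frame `j` under text token `i`) are made up for this example, and the alignment is assumed to start at the first token, end at the last one, and never skip a token.

```python
import math


def monotonic_alignment_search(log_lik):
    """Return, for every frame j, the index of the token it is aligned to.

    log_lik[i][j] is a (hypothetical) log-likelihood of frame j under token i.
    The path is monotonic, starts at token 0, ends at the last token and
    never skips a token.
    """
    n_tokens, n_frames = len(log_lik), len(log_lik[0])
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # best[i][j]: best total log-likelihood of a path ending at (token i, frame j)
    best = [[neg_inf] * n_frames for _ in range(n_tokens)]
    best[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = best[i][j - 1]                                # same token as previous frame
            advance = best[i - 1][j - 1] if i > 0 else neg_inf   # move on to the next token
            best[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if i > 0 and j > 0 and best[i - 1][j - 1] >= best[i][j - 1]:
            i -= 1
    return alignment
```

Counting how often each token index occurs in the returned list gives a per-token duration, which is the kind of information a duration predictor could be trained on.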

The alignment learned by Glow-TTS is intentionally designed to avoid some of the pronunciation problems models like Tacotron2 suffer from. Also, in order to enable parallel synthesis, Glow-TTS borrows encoder architecture from Transformer-TTS (Li et al., 2019) and decoder architecture from Glow (Kingma & Dhariwal, 2018).

Thus, compared with Tacotron2, Glow-TTS achieves much faster inference making fewer alignment mistakes. Besides, in contrast to other parallel TTS solutions such as FastSpeech, Glow-TTS does not require an external aligner to obtain token duration information as Monotonic Alignment Search (MAS) operates in an unsupervised way.

Lately, another family of generative models called Diffusion Probabilistic Models (DPMs) (Sohl-Dickstein et al., 2015) has started to prove its capability to model complex data distributions such as images (Ho et al., 2020), shapes (Cai et al., 2020), graphs (Niu et al., 2020), handwriting (Luhman & Luhman, 2020).

The basic idea behind DPMs is as follows: we build a forward diffusion process by iteratively destroying original data until we get some simple distribution (usually standard normal), and then we try to build a reverse diffusion parameterized with a neural network so that it follows the trajectories of the reverse-time forward diffusion.

Stochastic calculus offers a continuous easy-to-use framework for training DPMs (Song et al., 2021) and, which is perhaps more important, provides a number of flexible inference schemes based on numerical differential equation solvers.

As far as text-to-speech applications are concerned, two vocoders representing the DPM family showed impressive results in raw waveform reconstruction: WaveGrad (Chen et al., 2021) and DiffWave (Kong et al., 2021) were shown to reproduce the fine-grained structure of human speech and match strong autoregressive baselines such as WaveNet in terms of synthesis quality while at the same time requiring much fewer sequential operations.

However, despite such success in neural vocoding, no feature generator based on diffusion probabilistic modelling is known so far.

This paper introduces Grad-TTS, an acoustic feature generator with a score-based decoder using recent diffusion probabilistic modelling insights. In Grad-TTS, MAS-aligned encoder outputs are passed to the decoder that transforms Gaussian noise parameterized by these outputs into a mel-spectrogram.

To cope with the task of reconstructing data from Gaussian noise with varying parameters, we write down a generalized version of conventional forward and reverse diffusions. One of the remarkable features of our model is that it provides explicit control of the trade-off between output mel-spectrogram quality and inference speed.

In particular, we find that Grad-TTS is capable of generating mel-spectrograms of high quality with only as few as ten iterations of reverse diffusion, which makes it possible to outperform Tacotron2 in terms of speed on GPU devices.

Additionally, we show that it is possible to train Grad-TTS as an end-to-end TTS pipeline (i.e., vocoder and feature generator are combined in a single model) by replacing its output domain from mel-spectrogram to raw waveform.

2. Diffusion probabilistic modelling

Loosely speaking, a process of the diffusion type is a stochastic process that satisfies a stochastic differential equation (SDE)
$dX_{t}=b(X_{t},t)dt+a(X_{t},t)dW_{t},\qquad(1)$
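where $W_{t}$ is the standard Brownian motion, $t\in[0,T]$ for some finite time horizon $T$, and the coefficients $b$ and $a$ (called drift and diffusion, respectively) satisfy certain measurability conditions. A rigorous definition can be found in (Liptser & Shiryaev, 1978).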
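One way to get a feel for equation (1) is to simulate it numerically. The sketch below uses the Euler-Maruyama scheme for a scalar process; the function name `euler_maruyama` and its arguments are chosen for this illustration only and are not taken from the paper.

```python
import math
import random


def euler_maruyama(b, a, x0, horizon, n_steps, rng):
    """Simulate one path of dX = b(X, t) dt + a(X, t) dW on [0, horizon]."""
    h = horizon / n_steps
    x = x0
    path = [x0]
    for k in range(n_steps):
        t = k * h
        dw = rng.gauss(0.0, math.sqrt(h))  # Brownian increment over a step of size h
        x = x + b(x, t) * h + a(x, t) * dw
        path.append(x)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    # Drift pulls X towards 0, constant diffusion coefficient.
    path = euler_maruyama(lambda x, t: -x, lambda x, t: 0.5, 1.0, 1.0, 1000, rng)
    print(len(path), path[-1])
```

With the diffusion coefficient set to zero the scheme reduces to the explicit Euler method for an ordinary differential equation, which is a convenient sanity check.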

It is easy to find a stochastic process of this type whose terminal distribution $\mathrm{Law}(X_{T})$ converges to the standard normal $\mathcal{N}(0,I)$ as $T\rightarrow\infty$ for any initial data distribution $\mathrm{Law}(X_{0})$. In fact, there are lots of such processes…

The goal of diffusion probabilistic modelling is to find a reverse diffusion such that its trajectories closely follow those of the forward diffusion but in reverse time order.

Stochastic calculus offers a continuous easy-to-use framework for training DPMs and provides flexible inference schemes based on numerical differential equation solvers.

2.1. Forward diffusion

We define the forward diffusion process as an SDE transforming any data distribution into Gaussian noise:
$dX_{t}=\frac{1}{2}\Sigma^{-1}(\mu-X_{t})\beta_{t}dt+\sqrt{\beta_{t}}dW_{t},\quad t\in[0,T]\qquad(2)$
Its solution is:
$X_{t}=\left(I-e^{-\frac{1}{2}\Sigma^{-1}\int_{0}^{t}\beta_{s}ds}\right)\mu+e^{-\frac{1}{2}\Sigma^{-1}\int_{0}^{t}\beta_{s}ds}X_{0}+\int_{0}^{t}\sqrt{\beta_{s}}e^{-\frac{1}{2}\Sigma^{-1}\int_{s}^{t}\beta_{u}du}dW_{s}.\qquad(3)$
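A worked step that the notes leave implicit: equation (3) is linear in $X_{0}$ plus a Gaussian stochastic integral, so the conditional law of $X_{t}$ given $X_{0}$ is Gaussian. The symbol $\rho$ below is simply a label for the conditional mean, and $\lambda(\Sigma,t)$ is the covariance that reappears in the loss function. The covariance follows from the Itô isometry, assuming $\Sigma$ is symmetric positive definite so that it commutes with its own matrix exponential:
$\rho(X_{0},\Sigma,\mu,t)=\left(I-e^{-\frac{1}{2}\Sigma^{-1}\int_{0}^{t}\beta_{s}ds}\right)\mu+e^{-\frac{1}{2}\Sigma^{-1}\int_{0}^{t}\beta_{s}ds}X_{0},$
$\lambda(\Sigma,t)=\int_{0}^{t}\beta_{s}e^{-\Sigma^{-1}\int_{s}^{t}\beta_{u}du}ds=\Sigma\left(I-e^{-\Sigma^{-1}\int_{0}^{t}\beta_{s}ds}\right),$
$X_{t}\mid X_{0}\sim\mathcal{N}\left(\rho(X_{0},\Sigma,\mu,t),\lambda(\Sigma,t)\right).$
The second equality holds because the derivative of $e^{-\Sigma^{-1}\int_{s}^{t}\beta_{u}du}$ with respect to $s$ is $\Sigma^{-1}\beta_{s}e^{-\Sigma^{-1}\int_{s}^{t}\beta_{u}du}$, so the integrand is an exact derivative multiplied by $\Sigma$.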

When $\lim_{t\rightarrow\infty} e^{-\int_{0}^{t}\beta_{s}ds}=0$, we have $X_{t}\mid X_{0}\xrightarrow{d}\mathcal{N}(\mu,\Sigma)$.
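The convergence can be checked numerically in the simplest setting. The sketch below assumes scalar data, $\Sigma=1$, and a linear noise schedule $\beta_{t}=\beta_{0}+(\beta_{1}-\beta_{0})t$ on $[0,1]$; the schedule and the values `BETA0 = 0.05`, `BETA1 = 20.0` are assumptions made for this illustration, not something stated in the text above. Under those assumptions the solution (3) gives a Gaussian with a closed-form mean and variance, so $X_{t}$ can be sampled directly without simulating the SDE.

```python
import math
import random

BETA0, BETA1 = 0.05, 20.0  # assumed linear schedule, chosen for illustration


def beta_integral(t):
    """Integral of beta_s over [0, t] for beta_s = BETA0 + (BETA1 - BETA0) * s."""
    return BETA0 * t + 0.5 * (BETA1 - BETA0) * t * t


def forward_marginal(x0, mu, t):
    """Mean and variance of X_t given X_0 = x0 for scalar data and Sigma = 1."""
    decay = math.exp(-0.5 * beta_integral(t))
    mean = (1.0 - decay) * mu + decay * x0
    var = 1.0 - math.exp(-beta_integral(t))
    return mean, var


def sample_xt(x0, mu, t, rng):
    """Draw X_t given X_0 = x0 directly from its Gaussian conditional law."""
    mean, var = forward_marginal(x0, mu, t)
    return rng.gauss(mean, math.sqrt(var))


if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0.0, 0.25, 0.5, 1.0):
        print(t, forward_marginal(3.0, -1.0, t), sample_xt(3.0, -1.0, t, rng))
```

As $t$ grows, the mean moves from $X_{0}$ towards $\mu$ and the variance approaches 1, which is the scalar version of the limit $\mathcal{N}(\mu,\Sigma)$.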

2.2. Reverse diffusion

The reverse SDE proposed by Song et al. (2021) is:
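$dX_{t}=\left(\frac{1}{2}\Sigma^{-1}(\mu-X_{t})-\nabla\log p_{t}(X_{t})\right)\beta_{t}dt+\sqrt{\beta_{t}}d\widetilde{W}_{t},\qquad(8)$
where $\widetilde{W}_{t}$ is a reverse-time Brownian motion and $p_{t}$ is the probability density of $X_{t}$; the equation is solved backwards in time from $T$ to $0$. The corresponding ODE, which has the same marginal densities $p_{t}$ as the forward diffusion (2), is: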
$dX_{t}=\frac{1}{2}\left(\Sigma^{-1}\left(\mu-X_{t}\right)-\nabla\log p_{t}\left(X_{t}\right)\right)\beta_{t} dt.\qquad(9)$

By estimating $\nabla\log p_{t}(X_{t})$ with a neural network $s_{\theta}$, we can model the data distribution via reverse-time sampling.
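Reverse-time sampling can be tried on a toy problem where the score is known in closed form, so no network is needed. The sketch below integrates the ODE (9) backwards with the explicit Euler method. Everything specific here is an assumption for illustration: scalar data with $X_{0}\sim\mathcal{N}(m_{0},s_{0}^{2})$, $\Sigma=1$, a linear schedule with `BETA0 = 0.05` and `BETA1 = 20.0` on $[0,1]$, and the helper names `make_gaussian_score` and `reverse_ode`. In a trained model the analytic score would be replaced by $s_{\theta}$.

```python
import math

BETA0, BETA1 = 0.05, 20.0  # assumed linear schedule on [0, 1], chosen for illustration


def beta(t):
    return BETA0 + (BETA1 - BETA0) * t


def beta_integral(t):
    return BETA0 * t + 0.5 * (BETA1 - BETA0) * t * t


def make_gaussian_score(mu, m0, s0):
    """Exact score of p_t for scalar data X_0 ~ N(m0, s0^2) and Sigma = 1.

    This analytic function stands in for the neural network s_theta.
    """
    def score(x, t):
        decay = math.exp(-0.5 * beta_integral(t))
        mean = (1.0 - decay) * mu + decay * m0
        var = decay ** 2 * s0 ** 2 + (1.0 - decay ** 2)
        return -(x - mean) / var
    return score


def reverse_ode(x_terminal, score, mu, n_steps):
    """Integrate equation (9) with Sigma = 1 from t = 1 down to t = 0 (explicit Euler)."""
    h = 1.0 / n_steps
    x = x_terminal
    for k in range(n_steps, 0, -1):
        t = k * h
        drift = 0.5 * (mu - x - score(x, t)) * beta(t)
        x -= h * drift  # minus sign: we step backwards in time
    return x


if __name__ == "__main__":
    score = make_gaussian_score(mu=2.0, m0=-1.0, s0=0.5)
    for n_steps in (10, 100, 1000):
        print(n_steps, reverse_ode(2.0, score, 2.0, n_steps))
```

Starting points drawn from $\mathcal{N}(\mu,1)$ are mapped to points that are approximately distributed like the data. The number of Euler steps `n_steps` is the knob behind the quality versus speed trade-off mentioned in the introduction: fewer steps mean faster but less accurate integration.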

2.3. Loss function

The score matching loss is defined as:
$\mathcal{L}_{t}\left(X_{0}\right)=E_{\epsilon_{t}}\left[\left\|s_{\theta}\left(X_{t}, t\right)+\lambda(\Sigma, t)^{-1}\epsilon_{t}\right\|_{2}^{2}\right],\quad(12)$
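where $\epsilon_{t}\sim\mathcal{N}(0,\lambda(\Sigma,t))$ is the noise added to the conditional mean of $X_{t}$ given $X_{0}$ to obtain $X_{t}$. For $\Sigma=I$ the covariance reduces to $\lambda_{t}I$ with $\lambda_{t}=1-e^{-\int_{0}^{t}\beta_{s}ds}$.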
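A Monte Carlo estimate of the loss (12) is straightforward to write down for $\Sigma=I$. The sketch below is an illustration only: `score_fn` is a hypothetical callable standing in for $s_{\theta}$, the linear schedule with `BETA0 = 0.05` and `BETA1 = 20.0` is an assumption, and vectors are plain Python lists to keep the example free of dependencies.

```python
import math
import random

BETA0, BETA1 = 0.05, 20.0  # assumed linear schedule on [0, 1], chosen for illustration


def beta_integral(t):
    return BETA0 * t + 0.5 * (BETA1 - BETA0) * t * t


def score_matching_loss(score_fn, x0, mu, t, n_samples, rng):
    """Monte Carlo estimate of L_t(X_0) in equation (12) for Sigma = I."""
    decay = math.exp(-0.5 * beta_integral(t))
    lam = 1.0 - math.exp(-beta_integral(t))  # lambda_t, variance of the added noise
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, math.sqrt(lam)) for _ in x0]
        # X_t = conditional mean + noise
        x_t = [(1.0 - decay) * m + decay * x + e for x, m, e in zip(x0, mu, eps)]
        s = score_fn(x_t, t)
        total += sum((s_i + e_i / lam) ** 2 for s_i, e_i in zip(s, eps))
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    zero_score = lambda x_t, t: [0.0] * len(x_t)
    print(score_matching_loss(zero_score, [0.3, -1.2, 2.0], [0.0, 0.5, 1.0], 0.5, 5000, rng))
```

For a single fixed $X_{0}$ the loss vanishes when `score_fn` returns the conditional score $-(X_{t}-\rho)/\lambda_{t}$, with $\rho$ the conditional mean. A score function that always returns zero gives a loss close to $d/\lambda_{t}$, where $d$ is the data dimension.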
