
JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

Dan Lim, Sunghee Jung, Eesung Kim

Kakao Enterprise Corporation, Seongnam, Republic of Korea

{satoshi.2020, [Link], [Link]}@[Link]

arXiv:2203.16852v2 [[Link]] 1 Jul 2022

Abstract

In neural text-to-speech (TTS), two-stage systems, i.e., cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text to a mel-spectrogram and then HiFi-GAN generates a raw waveform from the mel-spectrogram, where they are called an acoustic feature generator and a neural vocoder respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model is a jointly trained FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove the dependency on an external speech-text alignment tool by adopting an alignment learning objective in our joint training framework. Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS on subjective evaluation (MOS) and some objective evaluations.

Index Terms: end to end text to speech, joint training, espnet

1. Introduction

Text-to-speech (TTS) based on neural networks has significantly improved synthesized speech quality over the past years. Generally, the task of neural TTS is divided into more manageable sub-tasks using an acoustic feature generator and a neural vocoder. In this two-stage system, an acoustic feature generator first generates an acoustic feature from an input text, and then a neural vocoder synthesizes a raw waveform from the acoustic feature. The models are trained separately and then joined for inference. An acoustic feature generator can be autoregressive and attention-based for implicit speech-text alignment [1], [2], or it can be non-autoregressive for efficient parallel inference and duration-informed for robustness against synthesis errors [3], [4], [5]. There is a large body of research on neural vocoders as well; some famous, widely used examples include [6], [7], a normalizing-flow-based model [8], and generative adversarial network (GAN) based ones [9], [10], [11], [12].

Although the two-stage system is the dominant approach for TTS, training two models separately may degrade synthesis quality due to an acoustic feature mismatch: a neural vocoder takes the ground-truth acoustic features for training but the predicted ones from an acoustic feature generator for inference. For optimal performance, we can further train a pre-trained neural vocoder with predicted acoustic features, which is called fine-tuning [12], [13], or we can train a neural vocoder with predicted acoustic features from the beginning [1]. However, both methods make the training pipeline somewhat complicated, in that the former requires additional training steps and the latter requires the training of an acoustic feature generator to be completed before the vocoder training stage.

On the other hand, end-to-end text-to-speech (E2E-TTS) [5], [13], [14], [15], [16] is a recent research trend in which a speech waveform is directly generated from an input text in a single stage, without distinction between an acoustic feature generator and a neural vocoder. Although there is no intermediate conversion to human-designed acoustic features such as mel-spectrograms, it has shown performance comparable to two-stage TTS systems. Since E2E-TTS does not have the acoustic feature mismatch problem, it usually does not require fine-tuning or sequential training. Moreover, some works [13], [14] further simplify the training pipeline by incorporating an alignment learning module so that the model can be trained without dependency on external speech-text alignment tools.

In this work, we propose an E2E-TTS model with a simplified training pipeline and high-quality speech synthesis. Our work is similar to [17] in that joint training of an acoustic feature generator and a neural vocoder is studied and the experiments are based on the ESPNet2 toolkit. However, our proposed model directly synthesizes a raw waveform from an input text without an intermediate mel-spectrogram. Moreover, we incorporate an alignment learning objective so that the proposed model can be trained in a single stage without dependency on external alignment models. The contributions of our work can be summarized as follows.

• We build the E2E-TTS model by jointly training an acoustic feature generator and a neural vocoder, which are FastSpeech2 and HiFi-GAN respectively. It does not require pre-training or fine-tuning, and it synthesizes high-quality speech without an intermediate mel-spectrogram.

• We leverage an alignment learning framework [18] to obtain token durations on the fly during training. Thus the training of our proposed model does not require external speech-text alignment models.

• The proposed model outperforms state-of-the-art implementations of ESPNet2-TTS [17] on both subjective and objective evaluations.

2. Related work

There are several E2E-TTS studies that directly generate a speech waveform from an input text. For example, FastSpeech2s [5] is similar to our work in that it uses FastSpeech2 and a GAN-based vocoder, Parallel WaveGAN [10]. However, it requires an auxiliary mel-spectrogram decoder and a preparation of speech-text alignments to train the model.

Figure 1: An architecture of the proposed model (discriminators are omitted for brevity)

Figure 2: Variance adaptor

Although LiteTTS [19] also combines an acoustic feature generator with

HiFi-GAN, it still depends on external alignment models and focuses more on lightweight structures for on-device use. On the other hand, EATS [14] integrates alignment learning into its adversarial training framework and improves alignment learning stability by applying soft dynamic time warping to the spectrogram prediction loss. VITS [13] also learns alignments during training in the process of maximizing the likelihood of the data, and it improves expressiveness by utilizing variational inference and normalizing flows in an adversarial training framework. In EFTS-Wav [15], the authors adopt MelGAN and devise a novel monotonic alignment strategy with a mel-spectrogram decoder for alignment learning. Wave-Tacotron [16] combines the attention-based Tacotron [1] with a normalizing flow and is optimized to simply maximize the likelihood of the training data.

In [17], joint training of an acoustic feature generator and a neural vocoder was conducted, and it proved its effectiveness at solving the acoustic feature mismatch problem by showing significant improvement over the separately learned model. However, the performance of the jointly trained model could not match that of a separately learned, fine-tuned model.

3. Model description

The proposed model is an E2E-TTS model in which FastSpeech2 and HiFi-GAN are jointly trained with an alignment module. In this section, we describe each component in order.

3.1. FastSpeech2

We adopt FastSpeech2 [5] as one of the components of the proposed model. It is a non-autoregressive acoustic feature generator with fast and high-quality speech synthesis. By explicitly modeling token durations with a duration predictor, it improves robustness against synthesis errors such as phoneme repetitions and skips. Compared to its predecessor, FastSpeech [3], it achieves a significant improvement in speech quality by employing additional variance information, namely pitch and energy. For our proposed model, we follow the structure of [5], which is a feed-forward Transformer-based [20] encoder, decoder, and 1D convolution-based variance adaptor. Figure 1 depicts each module in the proposed model. Specifically, the encoder encodes an input text as text embeddings h, and the variance adaptor adds variance information to the text embeddings and expands them according to each token duration for the decoder.

Figure 2 depicts the structure of the variance adaptor, which consists of pitch, energy, and duration predictors. The pitch and energy predictors are trained to predict token-wise pitch and energy respectively, following the FastSpeech2 implementation of ESPNet2-TTS [17] and FastPitch [21], instead of frame-wise values as in [5]. During training, the required token-wise pitch and energy p, e are computed on the fly by averaging the frame-wise ground-truth pitch and energy according to the token durations d. A token duration is defined as the number of mel-frames assigned to each input text token and is obtained from the alignment module, which will be explained later. After the text embeddings are added with pitch and energy, they are expanded by a length regulator (LR) according to the token durations. We use Gaussian upsampling with a fixed temperature, also known as a softmax-based aligner [14], instead of vanilla upsampling by repetition [3].

Note that although we adopt FastSpeech2 for our joint training, we exclude its mel-spectrogram loss so that the proposed model is trained to synthesize a raw waveform directly from an input text without an intermediate mel-spectrogram. What remains is a variance loss that minimizes each variance prediction error with an L2 loss:

Lvar = ||d − d̂||² + ||p − p̂||² + ||e − ê||²   (1)

where d, p, e are the ground-truth duration, pitch, and energy feature sequences respectively, and d̂, p̂, ê are the ones predicted by the model.

3.2. HiFi-GAN

HiFi-GAN [11] is one of the most famous GAN-based neural vocoders, with fast and efficient parallel synthesis. In the GAN training framework, a model is trained by adversarial feedback, where a generator is trained to fool a discriminator, and a discriminator is trained to discriminate between the ground-truth sample and the sample predicted by the generator, alternately. The discriminators of HiFi-GAN are designed to improve fidelity by considering properties of the speech waveform; they are the multi-period discriminator (MPD) and the multi-scale discriminator (MSD). The MPD handles diverse periodic patterns of the speech waveform, whereas the MSD operates on the consecutive waveform at different scales with a wide receptive field.

As depicted in Figure 1, we adopt the HiFi-GAN generator for synthesizing the raw waveform from the output of the decoder. The HiFi-GAN generator upsamples the output of the decoder through transposed convolutions to match the length of the raw waveform, where the output of the decoder has the same length as the mel-spectrogram of the ground-truth waveform. It has not only an adversarial loss but also auxiliary losses, namely a feature matching loss [9] and a mel-spectrogram loss, for improving speech quality and training stability. Note that the auxiliary mel-spectrogram loss here is an L1 loss between the mel-spectrogram of the synthesized waveform and that of the ground-truth waveform, which was devised and used for training HiFi-GAN [11]; it is different from the mel-spectrogram loss of FastSpeech2 [5]. The training objective of HiFi-GAN follows LSGAN [22], and the generator loss consists of an adversarial loss and the auxiliary losses as follows:

Lg = Lg,adv + λfm Lfm + λmel Lmel   (2)

where Lg,adv is the adversarial loss based on the least-squares loss function, and λfm, λmel are the scaling factors for the auxiliary feature matching and mel-spectrogram losses respectively.

3.3. Alignment Learning Framework

Speech-text alignment is crucial in duration-informed networks [3], [4], [5], where the TTS model has a separate duration model and requires explicit durations for training, as in FastSpeech2. In our proposed model, each token duration d is used for training the duration predictor, for computing token-averaged pitch and energy from the frame-wise values, and for upsampling the text embeddings. The token durations can be obtained from a pre-trained autoregressive TTS model [2] as in [3], or from a speech-text alignment tool such as the Montreal Forced Aligner (MFA) as in [4], [5]. Moreover, the training pipeline can be simplified further by incorporating alignment learning so that the required token durations are obtained on the fly during model training [15], [18], [23], [24].

In this work, we incorporate an alignment learning framework [18] into our joint training framework to obtain the required token durations d on the fly during training. The alignment learning framework has shown improved speech quality as well as fast alignment convergence by devising an alignment learning objective, which can be applied to both autoregressive and non-autoregressive TTS models. The alignment learning objective can be computed efficiently using a forward-sum algorithm. The alignment module in Figure 1 represents the module proposed in the alignment learning framework [18], from which the alignment learning objective as well as each token duration are obtained.

Specifically, the alignment module encodes the text embeddings h and the mel-spectrogram m as h^enc, m^enc with 2 and 3 1D convolution layers respectively. After that, it computes the soft alignment distribution A_soft, which is softmax-normalized across the text domain, based on the learned pairwise affinity between all text tokens and mel-frames:

D_{i,j} = dist_{L2}(h_i^enc, m_j^enc)   (3)

A_soft = softmax(−D, dim = 0)   (4)

where h_i^enc, m_j^enc are the encoded text embedding and mel-spectrogram frame at timesteps i, j respectively.

From the soft alignment distribution A_soft, we can compute the likelihood of all valid monotonic alignments, which is the alignment learning objective to be maximized:

P(S(h)|m) = Σ_{s∈S(h)} Π_{t=1}^{T} P(s_t|m_t)   (5)

where s is a specific alignment between a text and a mel-spectrogram (e.g., s_1 = h_1, s_2 = h_2, ..., s_T = h_N), S(h) is the set of all valid monotonic alignments, and T, N are the lengths of the mel-spectrogram and the text token sequence respectively. A forward-sum algorithm is used to compute the alignment learning objective, and we define its negative as the forward-sum loss L_forward_sum. Notably, it can be trained efficiently with an off-the-shelf CTC [25] loss implementation.

To obtain the token durations d, monotonic alignment search (MAS) [24] is used to convert the soft alignment A_soft to a monotonic, binarized hard alignment A_hard, wherein Σ_{j=1}^{T} A_hard,i,j represents each token duration. Thus each token duration is the number of mel-frames assigned to each input text token, and the sum of the durations equals the length of the mel-spectrogram. There is an additional binarization loss L_bin, which encourages A_soft to match A_hard by minimizing their KL divergence. Note that we also apply a beta-binomial alignment prior as in [18], [26], which multiplies a 2D static prior into A_soft to accelerate the alignment learning by making the near-diagonal paths more probable.

L_bin = −A_hard ⊙ log A_soft   (6)

L_align = L_forward_sum + L_bin   (7)

where ⊙ is the Hadamard product and L_align is the final loss for alignment.

3.4. Final Loss

As depicted in Figure 1, the proposed model consists of the encoder, variance adaptor, decoder, HiFi-GAN generator, and alignment module, where the alignment module is used for training only. The model is trained to directly synthesize a raw waveform from an input text, without an intermediate mel-spectrogram loss, in the GAN training framework. Note that we use the discriminators of HiFi-GAN for training the proposed model, although they are omitted from Figure 1. Consequently, the loss of the proposed model is the GAN training loss integrated with the variance loss and the alignment loss as follows:

L = Lg + λvar Lvar + λalign Lalign   (8)

where we used 1 for λvar and 2 for λalign as the scaling factors of the variance and alignment losses respectively.

4. Experiments

For reproducible research, we conducted all experiments, including data preparation, model training, and evaluation, using the ESPNet2-TTS [17] toolkit. ESPNet2-TTS is a famous, open-sourced speech processing toolkit, and it provides various recipes for reproducing state-of-the-art TTS results.
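The token-averaging of frame-wise pitch and energy described for the variance adaptor can be sketched as follows; this is a minimal NumPy illustration, not the authors' implementation, and `frame_pitch` is an illustrative name:

```python
import numpy as np

def token_average(frame_values, durations):
    """Average frame-wise values (e.g. pitch) over each token's frames.

    frame_values: 1D array of per-mel-frame values, length T.
    durations: per-token frame counts d, summing to T.
    Returns one averaged value per token (0 for zero-duration tokens).
    """
    out = np.zeros(len(durations))
    start = 0
    for i, d in enumerate(durations):
        if d > 0:
            out[i] = frame_values[start:start + d].mean()
        start += d
    return out

frame_pitch = np.array([100.0, 110.0, 200.0, 210.0, 220.0, 230.0])
print(token_average(frame_pitch, [2, 4]))  # [105. 215.]
```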
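The variance loss of Eq. (1) is just a sum of three squared-error terms; a minimal NumPy sketch with illustrative tensors, not the training code:

```python
import numpy as np

def variance_loss(d, d_hat, p, p_hat, e, e_hat):
    """L_var = ||d - d_hat||^2 + ||p - p_hat||^2 + ||e - e_hat||^2 (Eq. 1)."""
    sq = lambda a, b: float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
    return sq(d, d_hat) + sq(p, p_hat) + sq(e, e_hat)

# Only the pitch prediction is off by 1 on one token here.
print(variance_loss([2, 3], [2, 3], [1.0, 1.0], [1.0, 0.0], [0.5], [0.5]))  # 1.0
```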
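A toy NumPy transcription of Eqs. (3)-(4), using random encoded features purely for shape bookkeeping (the softmax runs over the text dimension, dim 0):

```python
import numpy as np

def soft_alignment(h_enc, m_enc):
    """Pairwise L2 distances D (N_text, T_mel), then softmax over text dim.

    h_enc: (N, C) encoded text embeddings; m_enc: (T, C) encoded mel frames.
    """
    diff = h_enc[:, None, :] - m_enc[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))        # D[i, j] = ||h_i - m_j||_2
    A = np.exp(-D - (-D).max(axis=0, keepdims=True))
    return A / A.sum(axis=0, keepdims=True)      # softmax(-D, dim=0), Eq. (4)

rng = np.random.default_rng(0)
A_soft = soft_alignment(rng.normal(size=(5, 8)), rng.normal(size=(20, 8)))
print(A_soft.shape)   # (5, 20): a distribution over tokens per mel frame
```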
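The forward-sum objective of Eq. (5) can be computed with a CTC-style dynamic program; below is a toy log-space NumPy sketch of that recursion (in practice, as the text notes, an off-the-shelf CTC loss implementation is typically reused instead):

```python
import numpy as np

def forward_sum_log_likelihood(log_A):
    """Log-likelihood of all valid monotonic alignments (Eq. 5).

    log_A: (N_text, T_mel) log of the soft alignment distribution.
    A valid path starts at token 0, ends at token N-1, and advances by
    0 or 1 token per mel frame.
    """
    N, T = log_A.shape
    alpha = np.full((T, N), -np.inf)
    alpha[0, 0] = log_A[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = alpha[t - 1, n]
            move = alpha[t - 1, n - 1] if n > 0 else -np.inf
            alpha[t, n] = np.logaddexp(stay, move) + log_A[n, t]
    return alpha[T - 1, N - 1]   # negate this for L_forward_sum

# Two frames, two tokens: the only valid path is (token 0, token 1).
log_A = np.log(np.array([[0.9, 0.4], [0.1, 0.6]]))
print(np.isclose(forward_sum_log_likelihood(log_A), np.log(0.9 * 0.6)))  # True
```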
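The composition of the generator loss in Eq. (2) can be sketched as below with NumPy toy tensors. The default λ values shown are the ones published in the HiFi-GAN paper (λ_fm = 2, λ_mel = 45), which this paper does not restate, so treat them as an assumption here:

```python
import numpy as np

def generator_loss(disc_fake, fm_real, fm_fake, mel_real, mel_fake,
                   lam_fm=2.0, lam_mel=45.0):
    """L_g = L_g,adv + lam_fm * L_fm + lam_mel * L_mel (Eq. 2).

    disc_fake: discriminator scores on generated audio (LSGAN target = 1).
    fm_*: lists of intermediate discriminator feature maps (L1 matching).
    mel_*: mel-spectrograms of real/generated audio (L1 loss).
    """
    l_adv = float(np.mean((np.asarray(disc_fake) - 1.0) ** 2))
    l_fm = sum(float(np.mean(np.abs(r - f))) for r, f in zip(fm_real, fm_fake))
    l_mel = float(np.mean(np.abs(mel_real - mel_fake)))
    return l_adv + lam_fm * l_fm + lam_mel * l_mel

# A generator that perfectly fools the discriminator and matches the mel.
print(generator_loss(np.array([1.0, 1.0]), [np.zeros(3)], [np.zeros(3)],
                     np.zeros((2, 2)), np.zeros((2, 2))))  # 0.0
```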
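Monotonic alignment search and the binarization loss of Eq. (6) can be sketched as follows. This is an illustrative NumPy version of MAS (the same dynamic program as the forward-sum, with max in place of logsumexp, plus backtracking), not the Glow-TTS implementation:

```python
import numpy as np

def mas_durations(log_A):
    """Hard-align via the best monotonic path through log_A (N_text, T_mel).

    Returns (A_hard, durations): durations[i] = sum_j A_hard[i, j].
    """
    N, T = log_A.shape
    val = np.full((T, N), -np.inf)
    val[0, 0] = log_A[0, 0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        for n in range(N):
            stay = val[t - 1, n]
            move = val[t - 1, n - 1] if n > 0 else -np.inf
            back[t, n] = int(move > stay)        # 1 means we advanced a token
            val[t, n] = max(stay, move) + log_A[n, t]
    A_hard = np.zeros_like(log_A)
    n = N - 1
    for t in range(T - 1, -1, -1):               # backtrack the best path
        A_hard[n, t] = 1.0
        if t > 0:
            n -= back[t, n]
    durations = A_hard.sum(axis=1).astype(int)
    return A_hard, durations

def binarization_loss(A_hard, A_soft, eps=1e-8):
    """L_bin = -sum(A_hard * log(A_soft)) (Eq. 6)."""
    return float(-(A_hard * np.log(A_soft + eps)).sum())

A_soft = np.array([[0.8, 0.7, 0.2], [0.2, 0.3, 0.8]])  # 2 tokens, 3 frames
A_hard, d = mas_durations(np.log(A_soft))
print(d, d.sum())   # per-token durations; their sum equals T = 3
```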
4.1. Dataset

We experimented with the LJSpeech corpus [27], which is an English single-female-speaker dataset. It consists of 24 hours of speech recorded with a 22.05 kHz sampling rate and 16 bits. Following the recipe in egs2/ljspeech/tts1 in the toolkit, we used 12,600 utterances for training, 250 for validation, and 250 for evaluation.

The mel-spectrogram, which is used for an auxiliary loss and as the input to the alignment module in the proposed model, was computed with 80 dimensions, a 1024 FFT size, and a 256 hop size. For a fair comparison, g2p-en¹ without word separators was used as the G2P function, which is the same configuration as the baseline models of ESPNet2-TTS that will be explained later.

4.2. Model configuration

We implemented the proposed model using the ESPNet2-TTS toolkit, following the configurations and training methods of train_joint_conformer_fastspeech2_hifigan in the same recipe of the toolkit used for data preparation. The differences are that a Transformer was used for the encoder and decoder type instead of a Conformer, and we used 256 for the attention dimension and 1024 for the number of encoder and decoder feed-forward units. For the alignment module, we simply followed the structure proposed in [18]. Note that, generally, a neural vocoder is trained to generate only part of the speech waveform from the corresponding portion of an input sequence for training efficiency. The related hyper-parameter in the toolkit is called segment size, which determines the length of the randomly sliced output sequence of the decoder; we used 64 for this hyper-parameter.

For the comparative experiment, we prepared a conventional two-stage, cascaded TTS model as well as another E2E-TTS model. Specifically, we compared the proposed model with state-of-the-art implementations of ESPNet2-TTS, which provides pre-trained models for public use including CF2 (+joint-ft), CF2 (+joint-tr), and VITS. CF2 (+joint-ft) is a Conformer-based [28] FastSpeech2 with a HiFi-GAN vocoder, which are separately trained and then jointly fine-tuned. CF2 (+joint-tr) is also a Conformer-based FastSpeech2 with HiFi-GAN, but it is jointly trained from scratch. VITS is the E2E-TTS implementation of the paper [13].

4.3. Evaluation

We evaluated the performance of the TTS models with objective and subjective metrics. For objective evaluation, mel-cepstral distortion (MCD), log-F0 root mean square error (F0 RMSE), and character error rate (CER) were computed using evaluation scripts provided by the ESPNet2-TTS toolkit. We computed CER using the same pre-trained ESPNet2-ASR model² that was used in [17]. For subjective evaluation, we conducted a crowdsourced Mean Opinion Score (MOS) test via Amazon Mechanical Turk, where each participant, located in the United States, scored each audio sample from the different models (including the ground-truth audio samples) for naturalness on a 5-point scale: 5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad. Twenty randomly selected utterances from the evaluation set were used for the MOS test, and each utterance was listened to by 20 different participants. Audio samples are available online³.

Table 1: Results on the LJSpeech corpus, where "STD" represents standard deviation and "CI" represents 95% confidence intervals.

Method          | MCD ± STD   | F0 RMSE ± STD | CER | MOS ± CI
GT              | N/A         | N/A           | 1.0 | 4.08 ± 0.07
CF2 (+joint-ft) | 6.73 ± 0.62 | 0.219 ± 0.034 | 1.5 | 3.96 ± 0.08
CF2 (+joint-tr) | 6.80 ± 0.54 | 0.218 ± 0.035 | 1.5 | 3.93 ± 0.08
VITS            | 6.99 ± 0.63 | 0.234 ± 0.037 | 3.6 | 3.82 ± 0.09
Proposed model  | 7.16 ± 0.55 | 0.215 ± 0.034 | 1.3 | 4.02 ± 0.07

Table 1 shows the results for GT (ground-truth recordings), the baseline models, and the proposed model. We obtained outcomes consistent with the previous work [17] in that the baseline models achieved high MOS values in the order of CF2 (+joint-ft), CF2 (+joint-tr), and VITS. Interestingly, our proposed model outperformed all of the baselines on MOS as well as on the objective metrics F0 RMSE and CER.

When it comes to the acoustic feature mismatch, the proposed model addresses the problem through the E2E approach, which trains the model to generate a raw waveform directly from an input text without an intermediate mel-spectrogram, whereas CF2 (+joint-ft) and CF2 (+joint-tr) address the problem by joint fine-tuning and joint training from scratch respectively. Thus we conjecture that the E2E approach was more effective for improvement than joint fine-tuning or simple joint training of an acoustic feature generator with a vocoder. Another difference compared to CF2 (+joint-ft) and CF2 (+joint-tr) is that the proposed model incorporates alignment learning in its joint training framework. It seems that those factors not only simplified the training pipeline but also may have improved the synthesized speech quality, although we did not investigate thoroughly in this paper how they relate to model performance. In the case of VITS, which is also an E2E model with alignment learning capability, it achieved the worst results in our experiment. One of the reasons, other than the weakness to g2p errors reported in [17], could be its training difficulty due to its somewhat complicated model structure compared to our proposed model; note that VITS utilizes a variational autoencoder and normalizing flows [13].

5. Conclusions

In this paper, we proposed an end-to-end text-to-speech model which is a jointly trained FastSpeech2 and HiFi-GAN with an alignment module. The proposed model directly generates a speech waveform from an input text without intermediate conversion to an explicit, human-designed acoustic feature. The training of the proposed model does not involve the fine-tuning that is required in two-stage, separately learned text-to-speech models due to the acoustic feature mismatch problem. Moreover, we adopt an alignment learning framework so that the proposed model does not depend on external alignment tools for training. Consequently, the proposed model has a simplified training pipeline and is jointly trained in a single stage. For evaluation, we compared the proposed model with publicly available implementations of the ESPNet2-TTS toolkit on the English LJSpeech corpus, and the proposed model achieved state-of-the-art results. It would be interesting for future work to investigate combinations of joint training other than FastSpeech2 and HiFi-GAN, or to evaluate on multi-speaker datasets.

¹ [Link]
² [Link]
³ [Link]
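For concreteness, the feature configuration above (80 mel bins, FFT size 1024, hop size 256 at 22.05 kHz) corresponds to a standard mel-spectrogram extraction. Below is a minimal NumPy sketch of such an extractor (HTK-style mel scale, magnitude spectrogram, no centering or log scaling); it illustrates the shapes only and is not the toolkit's exact feature pipeline:

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular mel filters mapping |STFT| bins to n_mels bands."""
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        up = (fft_freqs - lo) / (ctr - lo)
        down = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(up, down))
    return fb

def mel_spectrogram(y, n_fft=N_FFT, hop=HOP):
    """(n_mels, n_frames) mel magnitudes with the paper's 80/1024/256 setup."""
    win = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * win for i in range(0, len(y) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (n_fft//2+1, T)
    return mel_filterbank() @ mag

y = np.sin(2 * np.pi * 440 * np.arange(SR) / SR)   # 1 s of A4 at 22.05 kHz
print(mel_spectrogram(y).shape)                     # (80, n_frames)
```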
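A 95% confidence interval of the kind reported alongside the MOS values in Table 1 is conventionally obtained from the normal approximation, mean ± 1.96·s/√n; the exact procedure used by the toolkit is not stated here, so the following NumPy sketch with synthetic ratings is an assumption:

```python
import numpy as np

def mos_with_ci(scores):
    """Mean opinion score with a 95% CI half-width (normal approximation)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, half

rng = np.random.default_rng(0)
ratings = rng.integers(3, 6, size=400)   # toy: 20 utterances x 20 raters
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```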
6. References

[1] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., "Natural tts synthesis by conditioning wavenet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.

[2] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, "Neural speech synthesis with transformer network," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 6706–6713.

[3] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "Fastspeech: Fast, robust and controllable text to speech," Advances in Neural Information Processing Systems, vol. 32, 2019.

[4] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., "Durian: Duration informed attention network for speech synthesis," in INTERSPEECH, 2020, pp. 2027–2031.

[5] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, "Fastspeech 2: Fast and high-quality end-to-end text to speech," in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.

[6] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[7] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in International Conference on Machine Learning. PMLR, 2018, pp. 2410–2419.

[8] R. Prenger, R. Valle, and B. Catanzaro, "Waveglow: A flow-based generative network for speech synthesis," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.

[9] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "Melgan: Generative adversarial networks for conditional waveform synthesis," Advances in Neural Information Processing Systems, vol. 32, 2019.

[10] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.

[11] J. Kong, J. Kim, and J. Bae, "Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.

[12] W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim, "UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation," in Proc. Interspeech 2021, 2021, pp. 2207–2211.

[13] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 139. PMLR, 18–24 Jul 2021, pp. 5530–5540.

[14] J. Donahue, S. Dieleman, M. Binkowski, E. Elsen, and K. Simonyan, "End-to-end adversarial text-to-speech," in International Conference on Learning Representations, 2021.

[15] C. Miao, L. Shuang, Z. Liu, C. Minchuan, J. Ma, S. Wang, and J. Xiao, "Efficienttts: An efficient and high-quality text-to-speech architecture," in International Conference on Machine Learning. PMLR, 2021, pp. 7700–7709.

[16] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma, "Wave-tacotron: Spectrogram-free end-to-end text-to-speech synthesis," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5679–5683.

[17] T. Hayashi, R. Yamamoto, T. Yoshimura, P. Wu, J. Shi, T. Saeki, Y. Ju, Y. Yasuda, S. Takamichi, and S. Watanabe, "Espnet2-tts: Extending the edge of tts research," arXiv preprint arXiv:2110.07840, 2021.

[18] R. Badlani, A. Łańcucki, K. J. Shih, R. Valle, W. Ping, and B. Catanzaro, "One tts alignment to rule them all," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6092–6096.

[19] H.-K. Nguyen, K. Jeong, S. Um, M.-J. Hwang, E. Song, and H.-G. Kang, "LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks," in Proc. Interspeech 2021, 2021, pp. 3595–3599.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.

[21] A. Łańcucki, "Fastpitch: Parallel text-to-speech with pitch prediction," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6588–6592.

[22] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2813–2821.

[23] D. Lim, W. Jang, G. O, H. Park, B. Kim, and J. Yoon, "JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment," in Proc. Interspeech 2020, 2020, pp. 4004–4008.

[24] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-tts: A generative flow for text-to-speech via monotonic alignment search," in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 8067–8077.

[25] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML '06, 2006, pp. 369–376.

[26] K. J. Shih, R. Valle, R. Badlani, A. Lancucki, W. Ping, and B. Catanzaro, "RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis," in ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.

[27] K. Ito and L. Johnson, "The lj speech dataset," [Link]/LJ-Speech-Dataset/, 2017.

[28] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented Transformer for Speech Recognition," in Proc. Interspeech 2020, 2020, pp. 5036–5040.
