
JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

Dan Lim, Sunghee Jung, Eesung Kim

Kakao Enterprise Corporation, Seongnam, Republic of Korea

{satoshi.2020, [Link], [Link]}@[Link]

arXiv:2203.16852v2 [[Link]] 1 Jul 2022

Abstract

In neural text-to-speech (TTS), two-stage systems, i.e., cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text to a mel-spectrogram and then HiFi-GAN generates a raw waveform from the mel-spectrogram, where they are called an acoustic feature generator and a neural vocoder respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model is a jointly trained FastSpeech2 and HiFi-GAN with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove the dependency on an external speech-text alignment tool by adopting an alignment learning objective in our joint training framework. Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS on subjective evaluation (MOS) and some objective evaluations.

Index Terms: end to end text to speech, joint training, espnet

1. Introduction

Text-to-speech (TTS) based on neural networks has significantly improved synthesized speech quality over the past years. Generally, the task of neural TTS is divided into more manageable sub-tasks using an acoustic feature generator and a neural vocoder. In this two-stage system, an acoustic feature generator first generates an acoustic feature from an input text, and then a neural vocoder synthesizes a raw waveform from the acoustic feature. The models are trained separately and then joined for inference. An acoustic feature generator can be autoregressive and attention-based for implicit speech-text alignment [1], [2], or it can be non-autoregressive for efficient parallel inference and duration-informed for robustness against synthesis errors [3], [4], [5]. There is a large body of research on neural vocoders as well; some famous, widely used examples include [6], [7], a normalizing-flow-based model [8], and generative adversarial network (GAN) based ones [9], [10], [11], [12].

Although the two-stage system is the dominant approach for TTS, training two models separately may degrade synthesis quality due to an acoustic feature mismatch: a neural vocoder takes the ground-truth acoustic features for training but the predicted ones from an acoustic feature generator for inference. For optimal performance, we can further train a pre-trained neural vocoder with predicted acoustic features, which is called fine-tuning [12], [13], or we can train a neural vocoder with predicted acoustic features from the beginning [1]. However, both methods make the training pipeline somewhat complicated, in that the former requires additional training steps and the latter requires the training of an acoustic feature generator to be completed before the vocoder training stage.

On the other hand, end-to-end text-to-speech (E2E-TTS) [5], [13], [14], [15], [16] is a recent research trend in which a speech waveform is directly generated from an input text in a single stage, without distinction between an acoustic feature generator and a neural vocoder. Although there is no intermediate conversion to human-designed acoustic features such as mel-spectrograms, it has shown performance comparable to two-stage TTS systems. Since E2E-TTS does not have the acoustic feature mismatch problem, it usually does not require fine-tuning or sequential training. Moreover, some works [13], [14] further simplify the training pipeline by incorporating an alignment learning module so that the model can be trained without dependency on external speech-text alignment tools.

In this work, we propose an E2E-TTS model with a simplified training pipeline and high-quality speech synthesis. Our work is similar to [17] in that joint training of an acoustic feature generator and a neural vocoder is studied and the experiments are based on the ESPNet2 toolkit. However, our proposed model directly synthesizes a raw waveform from an input text without an intermediate mel-spectrogram. Moreover, we incorporate an alignment learning objective so that the proposed model can be trained in a single stage without dependency on external alignment models. The contributions of our work can be summarized as follows.

• We build the E2E-TTS model by jointly training an acoustic feature generator and a neural vocoder, which are FastSpeech2 and HiFi-GAN respectively. It does not require pre-training or fine-tuning, and it synthesizes high-quality speech without an intermediate mel-spectrogram.

• We leverage an alignment learning framework [18] to obtain token durations on the fly during training. Thus the training of our proposed model does not require external speech-text alignment models.

• The proposed model outperforms state-of-the-art implementations of ESPNet2-TTS [17] on both subjective and objective evaluations.

2. Related work

There are several E2E-TTS studies that directly generate a speech waveform from an input text. For example, FastSpeech2s [5] is similar to our work in that it uses FastSpeech2 and a GAN-based vocoder, Parallel WaveGAN [10]. However, it requires an auxiliary mel-spectrogram decoder and a preparation of speech-text alignments to train the model.

Figure 1: An architecture of the proposed model (discriminators are omitted for brevity)

Figure 2: Variance adaptor

Although LiteTTS [19] also combines an acoustic feature generator with

HiFi-GAN, it still depends on external alignment models and focuses more on lightweight structures for on-device use. On the other hand, EATS [14] integrates alignment learning into its adversarial training framework and improves alignment learning stability by applying soft dynamic time warping to the spectrogram prediction loss. VITS [13] also learns alignments during training in the process of maximizing the likelihood of the data, and it improves expressiveness by utilizing variational inference and normalizing flows in an adversarial training framework. In EFTS-Wav [15], the authors adopt MelGAN and devise a novel monotonic alignment strategy with a mel-spectrogram decoder for alignment learning. Wave-Tacotron [16] combines the attention-based Tacotron [1] with a normalizing flow and is optimized to simply maximize the likelihood of the training data.

In [17], joint training of an acoustic feature generator and a neural vocoder was conducted, and it proved its effectiveness at solving the acoustic feature mismatch problem by showing significant improvement over the separately learned model. However, the performance of the jointly trained model could not match that of a separately learned, fine-tuned model.

3. Model description

The proposed model is an E2E-TTS model in which FastSpeech2 and HiFi-GAN are jointly trained with an alignment module. In this section, we describe each component in order.

3.1. FastSpeech2

We adopt FastSpeech2 [5] as one of the components of the proposed model. It is a non-autoregressive acoustic feature generator with fast and high-quality speech synthesis. By explicitly modeling token durations with a duration predictor, it improves robustness against synthesis errors such as phoneme repetitions and skips. Compared to its predecessor, FastSpeech [3], it achieves a significant improvement in speech quality by employing additional variance information, namely pitch and energy. For our proposed model, we follow the structure of [5], which is a feed-forward Transformer-based [20] encoder, decoder, and 1D convolution-based variance adaptor. Figure 1 depicts each module in the proposed model. Specifically, the encoder encodes an input text as text embeddings h, and the variance adaptor adds variance information to the text embeddings and expands them according to each token duration for the decoder.

Figure 2 depicts the structure of the variance adaptor, which consists of pitch, energy, and duration predictors. The pitch and energy predictors are trained to predict token-wise pitch and energy respectively, following the FastSpeech2 implementation of ESPNet2-TTS [17] and FastPitch [21], instead of frame-wise values as in [5]. During training, the required token-wise pitch and energy p, e are computed on the fly by averaging the frame-wise ground-truth pitch and energy according to the token durations d. A token duration is defined as the number of mel-frames assigned to each input text token and is obtained from the alignment module, which will be explained later. After the text embeddings are added with pitch and energy, they are expanded by a length regulator (LR) according to the token durations. We use Gaussian upsampling with a fixed temperature, also known as a softmax-based aligner [14], instead of vanilla upsampling by repetition [3].

Note that although we adopt FastSpeech2 for our joint training, we exclude its mel-spectrogram loss so that the proposed model is trained to synthesize a raw waveform directly from an input text without an intermediate mel-spectrogram. What remains is a variance loss that minimizes each variance prediction error with an L2 loss:

Lvar = ||d − d̂||² + ||p − p̂||² + ||e − ê||²   (1)

where d, p, e are the ground-truth duration, pitch, and energy feature sequences respectively, and d̂, p̂, ê are the ones predicted by the model.

3.2. HiFi-GAN

HiFi-GAN [11] is one of the most famous GAN-based neural vocoders, with fast and efficient parallel synthesis. In the GAN training framework, a model is trained by adversarial feedback, where a generator is trained to fool a discriminator, and a discriminator is trained to discriminate between the ground-truth sample and the sample predicted by the generator, alternately. The discriminators of HiFi-GAN are designed to improve fidelity by considering properties of the speech waveform; they are the multi-period discriminator (MPD) and the multi-scale discriminator (MSD). The MPD handles diverse periodic patterns of the speech waveform, whereas the MSD operates on the consecutive waveform at different scales with a wide receptive field.

As depicted in Figure 1, we adopt the HiFi-GAN generator for synthesizing the raw waveform from the output of the decoder. The HiFi-GAN generator upsamples the output of the decoder through transposed convolutions to match the length of the raw waveform, where the output of the decoder has the same length as the mel-spectrogram of the ground-truth waveform. It has not only an adversarial loss but also auxiliary losses, namely a feature matching loss [9] and a mel-spectrogram loss, for improving speech quality and training stability. Note that the auxiliary mel-spectrogram loss here is an L1 loss between the mel-spectrogram of the synthesized waveform and that of the ground-truth waveform, which was devised and used for training HiFi-GAN [11]; it is different from the mel-spectrogram loss of FastSpeech2 [5]. The training objective of HiFi-GAN follows LSGAN [22], and the generator loss consists of an adversarial loss and the auxiliary losses as follows:

Lg = Lg,adv + λfm Lfm + λmel Lmel   (2)

where Lg,adv is the adversarial loss based on the least-squares loss function, and λfm, λmel are the scaling factors for the auxiliary feature matching and mel-spectrogram losses respectively.

3.3. Alignment Learning Framework

Speech-text alignment is crucial in duration-informed networks [3], [4], [5], where the TTS model has a separate duration model and requires explicit durations for training, as in FastSpeech2. In our proposed model, each token duration d is used for training the duration predictor, for computing token-averaged pitch and energy from the frame-wise values, and for upsampling the text embeddings. The token durations can be obtained from a pre-trained autoregressive TTS model [2] as in [3], or from a speech-text alignment tool such as the Montreal Forced Aligner (MFA) as in [4], [5]. Moreover, the training pipeline can be simplified further by incorporating alignment learning so that the required token durations are obtained on the fly during model training [15], [18], [23], [24].

In this work, we incorporate an alignment learning framework [18] into our joint training framework to obtain the required token durations d on the fly during training. The alignment learning framework has shown improved speech quality as well as fast alignment convergence by devising an alignment learning objective, which can be applied to both autoregressive and non-autoregressive TTS models. The alignment learning objective can be computed efficiently using a forward-sum algorithm. The alignment module in Figure 1 represents the module proposed in the alignment learning framework [18], from which the alignment learning objective as well as each token duration are obtained.

Specifically, the alignment module encodes the text embeddings h and the mel-spectrogram m as h^enc, m^enc with 2 and 3 1D convolution layers respectively. After that, it computes the soft alignment distribution A_soft, which is softmax-normalized across the text domain, based on the learned pairwise affinity between all text tokens and mel-frames:

D_{i,j} = dist_{L2}(h_i^enc, m_j^enc)   (3)

A_soft = softmax(−D, dim = 0)   (4)

where h_i^enc, m_j^enc are the encoded text embedding and mel-spectrogram frame at timesteps i, j respectively.

From the soft alignment distribution A_soft, we can compute the likelihood of all valid monotonic alignments, which is the alignment learning objective to be maximized:

P(S(h)|m) = Σ_{s∈S(h)} Π_{t=1}^{T} P(s_t|m_t)   (5)

where s is a specific alignment between a text and a mel-spectrogram (e.g., s_1 = h_1, s_2 = h_2, ..., s_T = h_N), S(h) is the set of all valid monotonic alignments, and T, N are the lengths of the mel-spectrogram and the text token sequence respectively. A forward-sum algorithm is used to compute the alignment learning objective, and we define its negative as the forward-sum loss L_forward_sum. Notably, it can be trained efficiently with an off-the-shelf CTC [25] loss implementation.

To obtain the token durations d, monotonic alignment search (MAS) [24] is used to convert the soft alignment A_soft to a monotonic, binarized hard alignment A_hard, wherein Σ_{j=1}^{T} A_hard,i,j represents each token duration. Thus each token duration is the number of mel-frames assigned to each input text token, and the sum of the durations equals the length of the mel-spectrogram. There is an additional binarization loss L_bin, which encourages A_soft to match A_hard by minimizing their KL divergence. Note that we also apply a beta-binomial alignment prior as in [18], [26], which multiplies a 2D static prior into A_soft to accelerate the alignment learning by making the near-diagonal paths more probable.

L_bin = −A_hard ⊙ log A_soft   (6)

L_align = L_forward_sum + L_bin   (7)

where ⊙ is the Hadamard product and L_align is the final loss for alignment.

3.4. Final Loss

As depicted in Figure 1, the proposed model consists of the encoder, variance adaptor, decoder, HiFi-GAN generator, and alignment module, where the alignment module is used for training only. The model is trained to directly synthesize a raw waveform from an input text, without an intermediate mel-spectrogram loss, in the GAN training framework. Note that we use the discriminators of HiFi-GAN for training the proposed model, although they are omitted from Figure 1. Consequently, the loss of the proposed model is the GAN training loss integrated with the variance loss and the alignment loss as follows:

L = Lg + λvar Lvar + λalign Lalign   (8)

where we used 1 for λvar and 2 for λalign as the scaling factors of the variance and alignment losses respectively.

4. Experiments

For reproducible research, we conducted all experiments, including data preparation, model training, and evaluation, using the ESPNet2-TTS [17] toolkit. ESPNet2-TTS is a famous, open-sourced speech processing toolkit, and it provides various recipes for reproducing state-of-the-art TTS results.
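The token-averaging of frame-wise pitch and energy described for the variance adaptor can be sketched as follows; this is a minimal NumPy illustration, not the authors' implementation, and `frame_pitch` is an illustrative name:

```python
import numpy as np

def token_average(frame_values, durations):
    """Average frame-wise values (e.g. pitch) over each token's frames.

    frame_values: 1D array of per-mel-frame values, length T.
    durations: per-token frame counts d, summing to T.
    Returns one averaged value per token (0 for zero-duration tokens).
    """
    out = np.zeros(len(durations))
    start = 0
    for i, d in enumerate(durations):
        if d > 0:
            out[i] = frame_values[start:start + d].mean()
        start += d
    return out

frame_pitch = np.array([100.0, 110.0, 200.0, 210.0, 220.0, 230.0])
print(token_average(frame_pitch, [2, 4]))  # [105. 215.]
```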
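The variance loss of Eq. (1) is just a sum of three squared-error terms; a minimal NumPy sketch with illustrative tensors, not the training code:

```python
import numpy as np

def variance_loss(d, d_hat, p, p_hat, e, e_hat):
    """L_var = ||d - d_hat||^2 + ||p - p_hat||^2 + ||e - e_hat||^2 (Eq. 1)."""
    sq = lambda a, b: float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
    return sq(d, d_hat) + sq(p, p_hat) + sq(e, e_hat)

# Only the pitch prediction is off by 1 on one token here.
print(variance_loss([2, 3], [2, 3], [1.0, 1.0], [1.0, 0.0], [0.5], [0.5]))  # 1.0
```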
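A toy NumPy transcription of Eqs. (3)-(4), using random encoded features purely for shape bookkeeping (the softmax runs over the text dimension, dim 0):

```python
import numpy as np

def soft_alignment(h_enc, m_enc):
    """Pairwise L2 distances D (N_text, T_mel), then softmax over text dim.

    h_enc: (N, C) encoded text embeddings; m_enc: (T, C) encoded mel frames.
    """
    diff = h_enc[:, None, :] - m_enc[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))        # D[i, j] = ||h_i - m_j||_2
    A = np.exp(-D - (-D).max(axis=0, keepdims=True))
    return A / A.sum(axis=0, keepdims=True)      # softmax(-D, dim=0), Eq. (4)

rng = np.random.default_rng(0)
A_soft = soft_alignment(rng.normal(size=(5, 8)), rng.normal(size=(20, 8)))
print(A_soft.shape)   # (5, 20): a distribution over tokens per mel frame
```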
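The forward-sum objective of Eq. (5) can be computed with a CTC-style dynamic program; below is a toy log-space NumPy sketch of that recursion (in practice, as the text notes, an off-the-shelf CTC loss implementation is typically reused instead):

```python
import numpy as np

def forward_sum_log_likelihood(log_A):
    """Log-likelihood of all valid monotonic alignments (Eq. 5).

    log_A: (N_text, T_mel) log of the soft alignment distribution.
    A valid path starts at token 0, ends at token N-1, and advances by
    0 or 1 token per mel frame.
    """
    N, T = log_A.shape
    alpha = np.full((T, N), -np.inf)
    alpha[0, 0] = log_A[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = alpha[t - 1, n]
            move = alpha[t - 1, n - 1] if n > 0 else -np.inf
            alpha[t, n] = np.logaddexp(stay, move) + log_A[n, t]
    return alpha[T - 1, N - 1]   # negate this for L_forward_sum

# Two frames, two tokens: the only valid path is (token 0, token 1).
log_A = np.log(np.array([[0.9, 0.4], [0.1, 0.6]]))
print(np.isclose(forward_sum_log_likelihood(log_A), np.log(0.9 * 0.6)))  # True
```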
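The composition of the generator loss in Eq. (2) can be sketched as below with NumPy toy tensors. The default λ values shown are the ones published in the HiFi-GAN paper (λ_fm = 2, λ_mel = 45), which this paper does not restate, so treat them as an assumption here:

```python
import numpy as np

def generator_loss(disc_fake, fm_real, fm_fake, mel_real, mel_fake,
                   lam_fm=2.0, lam_mel=45.0):
    """L_g = L_g,adv + lam_fm * L_fm + lam_mel * L_mel (Eq. 2).

    disc_fake: discriminator scores on generated audio (LSGAN target = 1).
    fm_*: lists of intermediate discriminator feature maps (L1 matching).
    mel_*: mel-spectrograms of real/generated audio (L1 loss).
    """
    l_adv = float(np.mean((np.asarray(disc_fake) - 1.0) ** 2))
    l_fm = sum(float(np.mean(np.abs(r - f))) for r, f in zip(fm_real, fm_fake))
    l_mel = float(np.mean(np.abs(mel_real - mel_fake)))
    return l_adv + lam_fm * l_fm + lam_mel * l_mel

# A generator that perfectly fools the discriminator and matches the mel.
print(generator_loss(np.array([1.0, 1.0]), [np.zeros(3)], [np.zeros(3)],
                     np.zeros((2, 2)), np.zeros((2, 2))))  # 0.0
```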
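Monotonic alignment search and the binarization loss of Eq. (6) can be sketched as follows. This is an illustrative NumPy version of MAS (the same dynamic program as the forward-sum, with max in place of logsumexp, plus backtracking), not the Glow-TTS implementation:

```python
import numpy as np

def mas_durations(log_A):
    """Hard-align via the best monotonic path through log_A (N_text, T_mel).

    Returns (A_hard, durations): durations[i] = sum_j A_hard[i, j].
    """
    N, T = log_A.shape
    val = np.full((T, N), -np.inf)
    val[0, 0] = log_A[0, 0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        for n in range(N):
            stay = val[t - 1, n]
            move = val[t - 1, n - 1] if n > 0 else -np.inf
            back[t, n] = int(move > stay)        # 1 means we advanced a token
            val[t, n] = max(stay, move) + log_A[n, t]
    A_hard = np.zeros_like(log_A)
    n = N - 1
    for t in range(T - 1, -1, -1):               # backtrack the best path
        A_hard[n, t] = 1.0
        if t > 0:
            n -= back[t, n]
    durations = A_hard.sum(axis=1).astype(int)
    return A_hard, durations

def binarization_loss(A_hard, A_soft, eps=1e-8):
    """L_bin = -sum(A_hard * log(A_soft)) (Eq. 6)."""
    return float(-(A_hard * np.log(A_soft + eps)).sum())

A_soft = np.array([[0.8, 0.7, 0.2], [0.2, 0.3, 0.8]])  # 2 tokens, 3 frames
A_hard, d = mas_durations(np.log(A_soft))
print(d, d.sum())   # per-token durations; their sum equals T = 3
```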
4.1. Dataset

We experimented with the LJSpeech corpus [27], which is an English single-female-speaker dataset. It consists of 24 hours of speech recorded with a 22.05 kHz sampling rate and 16 bits. Following the recipe in egs2/ljspeech/tts1 in the toolkit, we used 12,600 utterances for training, 250 for validation, and 250 for evaluation.

The mel-spectrogram, which is used for an auxiliary loss and as the input to the alignment module in the proposed model, was computed with 80 dimensions, a 1024 FFT size, and a 256 hop size. For a fair comparison, g2p-en¹ without word separators was used as the G2P function, which is the same configuration as the baseline models of ESPNet2-TTS that will be explained later.

4.2. Model configuration

We implemented the proposed model using the ESPNet2-TTS toolkit, following the configurations and training methods of train_joint_conformer_fastspeech2_hifigan in the same recipe of the toolkit used for data preparation. The differences are that a Transformer was used for the encoder and decoder type instead of a Conformer, and we used 256 for the attention dimension and 1024 for the number of encoder and decoder feed-forward units. For the alignment module, we simply followed the structure proposed in [18]. Note that, generally, a neural vocoder is trained to generate only part of the speech waveform from the corresponding portion of an input sequence for training efficiency. The related hyper-parameter in the toolkit is called segment size, which determines the length of the randomly sliced output sequence of the decoder; we used 64 for this hyper-parameter.

For the comparative experiment, we prepared a conventional two-stage, cascaded TTS model as well as another E2E-TTS model. Specifically, we compared the proposed model with state-of-the-art implementations of ESPNet2-TTS, which provides pre-trained models for public use including CF2 (+joint-ft), CF2 (+joint-tr), and VITS. CF2 (+joint-ft) is a Conformer-based [28] FastSpeech2 with a HiFi-GAN vocoder, which are separately trained and then jointly fine-tuned. CF2 (+joint-tr) is also a Conformer-based FastSpeech2 with HiFi-GAN, but it is jointly trained from scratch. VITS is the E2E-TTS implementation of the paper [13].

4.3. Evaluation

We evaluated the performance of the TTS models with objective and subjective metrics. For objective evaluation, mel-cepstral distortion (MCD), log-F0 root mean square error (F0 RMSE), and character error rate (CER) were computed using evaluation scripts provided by the ESPNet2-TTS toolkit. We computed CER using the same pre-trained ESPNet2-ASR model² that was used in [17]. For subjective evaluation, we conducted a crowdsourced Mean Opinion Score (MOS) test via Amazon Mechanical Turk, where each participant, located in the United States, scored each audio sample from the different models (including the ground-truth audio samples) for naturalness on a 5-point scale: 5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad. Twenty randomly selected utterances from the evaluation set were used for the MOS test, and each utterance was listened to by 20 different participants. Audio samples are available online³.

Table 1: Results on the LJSpeech corpus, where "STD" represents standard deviation and "CI" represents 95% confidence intervals.

Method          | MCD ± STD   | F0 RMSE ± STD | CER | MOS ± CI
GT              | N/A         | N/A           | 1.0 | 4.08 ± 0.07
CF2 (+joint-ft) | 6.73 ± 0.62 | 0.219 ± 0.034 | 1.5 | 3.96 ± 0.08
CF2 (+joint-tr) | 6.80 ± 0.54 | 0.218 ± 0.035 | 1.5 | 3.93 ± 0.08
VITS            | 6.99 ± 0.63 | 0.234 ± 0.037 | 3.6 | 3.82 ± 0.09
Proposed model  | 7.16 ± 0.55 | 0.215 ± 0.034 | 1.3 | 4.02 ± 0.07

Table 1 shows the results for GT (ground-truth recordings), the baseline models, and the proposed model. We obtained outcomes consistent with the previous work [17] in that the baseline models achieved high MOS values in the order of CF2 (+joint-ft), CF2 (+joint-tr), and VITS. Interestingly, our proposed model outperformed all of the baselines on MOS as well as on the objective metrics F0 RMSE and CER.

When it comes to the acoustic feature mismatch, the proposed model addresses the problem through the E2E approach, which trains the model to generate a raw waveform directly from an input text without an intermediate mel-spectrogram, whereas CF2 (+joint-ft) and CF2 (+joint-tr) address the problem by joint fine-tuning and joint training from scratch respectively. Thus we conjecture that the E2E approach was more effective for improvement than joint fine-tuning or simple joint training of an acoustic feature generator with a vocoder. Another difference compared to CF2 (+joint-ft) and CF2 (+joint-tr) is that the proposed model incorporates alignment learning in its joint training framework. It seems that those factors not only simplified the training pipeline but also may have improved the synthesized speech quality, although we did not investigate thoroughly in this paper how they relate to model performance. In the case of VITS, which is also an E2E model with alignment learning capability, it achieved the worst results in our experiment. One of the reasons, other than the weakness to g2p errors reported in [17], could be its training difficulty due to its somewhat complicated model structure compared to our proposed model; note that VITS utilizes a variational autoencoder and normalizing flows [13].

5. Conclusions

In this paper, we proposed an end-to-end text-to-speech model which is a jointly trained FastSpeech2 and HiFi-GAN with an alignment module. The proposed model directly generates a speech waveform from an input text without intermediate conversion to an explicit, human-designed acoustic feature. The training of the proposed model does not involve the fine-tuning that is required in two-stage, separately learned text-to-speech models due to the acoustic feature mismatch problem. Moreover, we adopt an alignment learning framework so that the proposed model does not depend on external alignment tools for training. Consequently, the proposed model has a simplified training pipeline and is jointly trained in a single stage. For evaluation, we compared the proposed model with publicly available implementations of the ESPNet2-TTS toolkit on the English LJSpeech corpus, and the proposed model achieved state-of-the-art results. It would be interesting for future work to investigate combinations of joint training other than FastSpeech2 and HiFi-GAN, or to evaluate on multi-speaker datasets.

¹ [Link]
² [Link]
³ [Link]
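For concreteness, the feature configuration above (80 mel bins, FFT size 1024, hop size 256 at 22.05 kHz) corresponds to a standard mel-spectrogram extraction. Below is a minimal NumPy sketch of such an extractor (HTK-style mel scale, magnitude spectrogram, no centering or log scaling); it illustrates the shapes only and is not the toolkit's exact feature pipeline:

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular mel filters mapping |STFT| bins to n_mels bands."""
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        up = (fft_freqs - lo) / (ctr - lo)
        down = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(up, down))
    return fb

def mel_spectrogram(y, n_fft=N_FFT, hop=HOP):
    """(n_mels, n_frames) mel magnitudes with the paper's 80/1024/256 setup."""
    win = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * win for i in range(0, len(y) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (n_fft//2+1, T)
    return mel_filterbank() @ mag

y = np.sin(2 * np.pi * 440 * np.arange(SR) / SR)   # 1 s of A4 at 22.05 kHz
print(mel_spectrogram(y).shape)                     # (80, n_frames)
```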
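A 95% confidence interval of the kind reported alongside the MOS values in Table 1 is conventionally obtained from the normal approximation, mean ± 1.96·s/√n; the exact procedure used by the toolkit is not stated here, so the following NumPy sketch with synthetic ratings is an assumption:

```python
import numpy as np

def mos_with_ci(scores):
    """Mean opinion score with a 95% CI half-width (normal approximation)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, half

rng = np.random.default_rng(0)
ratings = rng.integers(3, 6, size=400)   # toy: 20 utterances x 20 raters
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```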
6. References

[1] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., "Natural tts synthesis by conditioning wavenet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.

[2] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, "Neural speech synthesis with transformer network," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 6706–6713.

[3] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "Fastspeech: Fast, robust and controllable text to speech," Advances in Neural Information Processing Systems, vol. 32, 2019.

[4] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., "Durian: Duration informed attention network for speech synthesis," in INTERSPEECH, 2020, pp. 2027–2031.

[5] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, "Fastspeech 2: Fast and high-quality end-to-end text to speech," in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.

[6] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[7] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in International Conference on Machine Learning. PMLR, 2018, pp. 2410–2419.

[8] R. Prenger, R. Valle, and B. Catanzaro, "Waveglow: A flow-based generative network for speech synthesis," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.

[9] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "Melgan: Generative adversarial networks for conditional waveform synthesis," Advances in Neural Information Processing Systems, vol. 32, 2019.

[10] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.

[11] J. Kong, J. Kim, and J. Bae, "Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.

[12] W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim, "UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation," in Proc. Interspeech 2021, 2021, pp. 2207–2211.

[13] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 139. PMLR, 18–24 Jul 2021, pp. 5530–5540.

[14] J. Donahue, S. Dieleman, M. Binkowski, E. Elsen, and K. Simonyan, "End-to-end adversarial text-to-speech," in International Conference on Learning Representations, 2021.

[15] C. Miao, L. Shuang, Z. Liu, C. Minchuan, J. Ma, S. Wang, and J. Xiao, "Efficienttts: An efficient and high-quality text-to-speech architecture," in International Conference on Machine Learning. PMLR, 2021, pp. 7700–7709.

[16] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma, "Wave-tacotron: Spectrogram-free end-to-end text-to-speech synthesis," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5679–5683.

[17] T. Hayashi, R. Yamamoto, T. Yoshimura, P. Wu, J. Shi, T. Saeki, Y. Ju, Y. Yasuda, S. Takamichi, and S. Watanabe, "Espnet2-tts: Extending the edge of tts research," arXiv preprint arXiv:2110.07840, 2021.

[18] R. Badlani, A. Łańcucki, K. J. Shih, R. Valle, W. Ping, and B. Catanzaro, "One tts alignment to rule them all," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6092–6096.

[19] H.-K. Nguyen, K. Jeong, S. Um, M.-J. Hwang, E. Song, and H.-G. Kang, "LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks," in Proc. Interspeech 2021, 2021, pp. 3595–3599.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.

[21] A. Łańcucki, "Fastpitch: Parallel text-to-speech with pitch prediction," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6588–6592.

[22] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2813–2821.

[23] D. Lim, W. Jang, G. O, H. Park, B. Kim, and J. Yoon, "JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment," in Proc. Interspeech 2020, 2020, pp. 4004–4008.

[24] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-tts: A generative flow for text-to-speech via monotonic alignment search," in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 8067–8077.

[25] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML '06, 2006, pp. 369–376.

[26] K. J. Shih, R. Valle, R. Badlani, A. Lancucki, W. Ping, and B. Catanzaro, "RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis," in ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.

[27] K. Ito and L. Johnson, "The lj speech dataset," [Link]/LJ-Speech-Dataset/, 2017.

[28] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented Transformer for Speech Recognition," in Proc. Interspeech 2020, 2020, pp. 5036–5040.
