The Text-to-Speech in the Wild (TITW) Database

Jee-weon Jung†1, Wangyou Zhang2, Soumi Maiti1,3, Yihan Wu4, Xin Wang5, Ji-Hoon Kim6, Yuta Matsunaga7, Seyun Um8, Jinchuan Tian1, Hye-jin Shim1, Nicholas Evans9, Joon Son Chung6, Shinnosuke Takamichi10, Shinji Watanabe1

1 Carnegie Mellon University, USA   2 Shanghai Jiao Tong University, China   3 Meta, USA   4 Renmin University of China, China   5 National Institute of Informatics, Japan   6 Korea Advanced Institute of Science and Technology, South Korea   7 University of Tokyo, Japan   8 Yonsei University, South Korea   9 EURECOM, France   10 Keio University, Japan

jeeweonj@[Link]

† Currently at Apple.
Abstract

Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as "noisy-TTS training" has emerged, aiming to utilize in-the-wild data. However, the lack of dedicated datasets has been a significant limitation. We introduce the TTS In the Wild (TITW) dataset, which is publicly available,1 created through a fully automated pipeline applied to the VoxCeleb1 dataset. It comprises two training sets: TITW-Hard, derived from the transcription, segmentation, and selection of raw VoxCeleb1 data, and TITW-Easy, which incorporates additional enhancement and data selection based on DNSMOS. State-of-the-art TTS models achieve a UTMOS score above 3.0 with TITW-Easy, while TITW-Hard remains difficult, with UTMOS below 2.8. Beyond TTS, TITW's unique design, leveraging an automatic speaker recognition dataset, strengthens ethical efforts to counteract malicious use of TTS models by supporting tasks such as speech deepfake detection.

1 [Link]

Index Terms: text-to-speech synthesis, in the wild, dataset

Figure 1: Fully automated Text-To-Speech Synthesis In The Wild (TITW) processing pipeline. The pipeline incorporates transcription, segmentation, data selection, enhancement, and filtering based on DNSMOS. The TITW dataset comprises two editions: TITW-Easy, which can be used to successfully train the latest TTS systems, and TITW-Hard, where the quality is too low to train TTS systems with current technology but which aims to serve the training of more advanced future TTS systems.

1. Introduction

Generative speech technology is evolving rapidly, driven in part by advances in diffusion models, speech codecs, and speech-language modeling methodologies [1–5]. Among these advancements, Text-to-Speech (TTS) systems have made remarkable progress, with recent models capable of generating speech that is nearly indistinguishable from human speech in terms of intelligibility and naturalness. Notably, while traditional TTS systems required minutes of studio-recorded target speaker data, modern systems can now operate effectively with just a few seconds of such data [6–8].

In terms of TTS training, however, studio-quality data remains the de facto standard, despite its limitations. Studio recordings, while superior in audio quality, lack diversity, variability, and scalability. In-the-wild speech data, on the other hand, offers exposure to real-world variability, greater speaker diversity, and nearly unlimited scalability. These benefits are particularly valuable for underrepresented languages, as studio-based data is often limited, hindering the democratization of speech technology. Being able to utilize publicly available sources like YouTube for TTS training could revolutionize the field, enabling the curation of target language data from accessible, diverse, and abundant resources.

Efforts to train TTS systems with lower-quality, in-the-wild data – often termed "noisy-TTS training" – have shown promise. While earlier works have reported that TTS models cannot be trained with low-quality data [9, 10], more recent studies such as DenoiSpeech [11] and MQTTS [12] demonstrate that competitive performance can be achieved using non-studio-quality data, such as podcasts and YouTube audio. The ASVspoof 5 challenge [13] further supports this shift, demonstrating the potential of real-world sources like audiobooks for high-quality synthetic speech. However, the field of noisy-TTS training lacks standardized datasets, protocols, and benchmarks, with most studies relying on private data or artificially noised studio recordings. This gap hinders progress and reproducibility in the field.

To this end, we introduce the Text-To-Speech Synthesis In The Wild (TITW) database, which is publicly available. It is constructed using the VoxCeleb1 database [14], a large collection of YouTube speech data, and processed through an automated pipeline involving transcription, segmentation, and enhancement. This pipeline eliminates the need for manual processing, making the dataset scalable and accessible. TITW includes two training sets: TITW-Hard, comprising 189 hours of speech derived from raw VoxCeleb1 data with minimal processing, and TITW-Easy, a refined subset of 173 hours enhanced using DNSMOS-based selection and additional processing. We also propose standardized evaluation protocols and benchmarks to facilitate reproducible research. Our experiments demonstrate that four contemporary TTS models can be successfully trained using TITW, showcasing its practical utility.

The selection of VoxCeleb1 as a source dataset, originally developed for automatic speaker recognition, offers unique and novel benefits to TITW. Recent TTS systems, capable of producing speech nearly indistinguishable from human voices, have raised concerns about malicious use and its potential to damage society [15]. By training TTS models on TITW, the resulting synthetic speech can be paired with human speech in TITW to support research in speech deepfake detection and spoofing-robust automatic speaker verification [16]. This application of TITW not only advances TTS research but also contributes to developing robust countermeasures against the misuse of synthetic voices. This dual benefit stems from the dataset's guarantee that all speech samples feature only single-speaker audio. Consequently, our work underscores a broader impact, fostering both innovation and ethical considerations in the field of generative speech technology.
2. Related works

Numerous databases have been used for training TTS models. Legacy databases such as CMU ARCTIC [17] and VCTK [18] were carefully designed and curated. They contain phonetically-balanced utterances, all recorded in highly controlled acoustic environments. Due to the high cost of recording, these and similar databases typically include data from a single speaker or a small number of speakers. The speech data they contain is generally neutral in terms of emotions and expressiveness. These databases were widely used for training speaker-dependent and multiple-speaker legacy TTS systems (e.g., unit-selection [19] and HMM-based [20]).

The revolution in deep-learning-based TTS systems called for larger-scale datasets. Datasets like LJSpeech [21], Multilingual LibriSpeech [22] and LibriTTS [23], which are sourced from LibriVox audiobooks, are not recorded in studio-quality environments. LJSpeech contains twenty-four hours of audiobook recordings but from a single speaker, while the other two feature a greater number of speakers. They have been widely used to train deep-learning-based TTS systems [24, 25]. Their adoption marks a shift towards using training data collected in less controlled conditions. Even so, this data still falls short of capturing the diversity in speaker style and acoustic conditions found in truly "in-the-wild" scenarios; the signal-to-noise ratio remains high and utterances are generally well-enunciated.2

2 See LibriVox documentation and guidelines for recording an audiobook [Link]

The TITW database introduced in this paper aims to support research in overcoming data constraints, often referred to as noisy-TTS training. We envisage TTS systems that can be trained successfully using speech data collected in uncontrolled conditions. We see two avenues for such research. The first, most challenging direction involves the use of training data collected in the wild without manual human intervention, relying solely on automatic transcription, segmentation, and data selection based on heuristics. The second direction involves the use of a subset of data after applying additional speech enhancement and data selection based on speech quality. TITW contains in-the-wild recordings of interviews, podcasts, and more, all posted to social media, making it, to our best knowledge, one of the first of its kind.3

3 We recognize EMILIA [26] as the most similar, parallel work. However, the goals of the two works differ. EMILIA focuses on developing a data processing pipeline that yields high-quality data from in-the-wild data. Therefore, it strives to provide the highest achievable quality. TITW is designed to foster research in the training of TTS systems using more noisy and real-world data. Hence we provide not only TITW-Easy, which can be used for the training of contemporary TTS systems, but also TITW-Hard to challenge the development of future systems.

3. TITW

For several reasons, we selected VoxCeleb1 [14], which contains speech from 1,251 speakers, as the source data for TITW. Firstly, VoxCeleb1 is itself sourced from the wild, specifically YouTube, spanning diverse acoustic environments. Secondly, as a dataset for automatic speaker recognition, each utterance is from a single speaker. Lastly, by selecting VoxCeleb1 as a source dataset, TTS systems trained on TITW can contribute to future research in speech deepfake detection and spoofing-robust automatic speaker verification, topics that increasingly require TTS researchers' attention to safeguard the rapidly advancing speech generation technology from malicious usage. SpoofCeleb [27] exemplifies this effort, where synthesized (spoofed) utterances generated by 23 TTS systems trained on TITW-Easy are used to create a dataset for speech deepfake detection and spoofing-robust automatic speaker verification.

3.1. Transcription and segmentation

We first transcribe and segment the utterances using pretrained models and empirically derived heuristics, ensuring the process is fully automatic without human intervention. Figure 2 displays an example of a VoxCeleb1 utterance, transcribed at the word level and then segmented into two.

Figure 2: An example of the transcription and segmentation in the TITW automatic pipeline. A randomly selected utterance from VoxCeleb1 goes through our transcription and segmentation pipeline, deriving two segments. A segment in the middle is deleted because it is a non-speech segment over 500 ms.

Transcription. Since TTS training typically requires paired speech and text data, we generated transcriptions for the entire VoxCeleb1 corpus. For the sake of scalability and reproducibility in future projects in various languages, we generated transcriptions automatically using pretrained automatic speech recognition (ASR) models. We used the WhisperX [28] toolkit to generate transcriptions with word-level timestamps. WhisperX incorporates the OpenAI Whisper Large v2 model [29] for transcription and another phoneme-based ASR model for word-level alignment. We additionally employed the OWSMv3 speech foundation model [30] and transcribed the data in parallel. The transcriptions from the OWSMv3 model served to verify transcription accuracy.

Segmentation. We divide each sample into isolated segments using the Voice Activity Detection (VAD) embedded in WhisperX [28]. Practically, whenever a non-speech period exceeds 500 ms, we trim the silence and split the sample into two segments. This segmentation rule was developed through empirical, iterative investigations. Initial attempts to train TTS models with unsegmented data failed, revealing that excessively long silences within training samples were a significant issue. This procedure results in approximately 280k transcribed speech segments.
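To make the segmentation rule concrete, the following is a minimal sketch in Python. It assumes WhisperX-style word-level alignments, i.e., a list of dictionaries with "word", "start", and "end" fields in seconds; these field names and the helper function are illustrative assumptions, and only the 500 ms threshold comes from the rule described above. This is not the released TITW implementation.

# Sketch of the gap-based segmentation rule (Sec. 3.1).
# Assumes WhisperX-style word-level alignments: a list of dicts with
# "word", "start", and "end" keys in seconds. Illustrative only.

MAX_GAP = 0.5  # split whenever non-speech between words exceeds 500 ms


def segment_words(words, max_gap=MAX_GAP):
    """Group aligned words into segments, splitting on long silences."""
    groups, current = [], []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_gap:
            groups.append(current)  # close the segment before the long gap
            current = []
        current.append(w)
    if current:
        groups.append(current)
    # Each segment keeps its own text and trimmed time boundaries.
    return [
        {
            "text": " ".join(w["word"].strip() for w in seg),
            "start": seg[0]["start"],
            "end": seg[-1]["end"],
        }
        for seg in groups
    ]


if __name__ == "__main__":
    words = [
        {"word": "hello", "start": 0.10, "end": 0.45},
        {"word": "there", "start": 0.50, "end": 0.90},
        {"word": "again", "start": 1.60, "end": 2.00},  # 0.7 s gap -> new segment
    ]
    for seg in segment_words(words):
        print(seg)

Segments produced this way are then passed to the data selection stage described in the next subsection.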
Table 1: Text-To-Speech Synthesis In The Wild (TITW) statistics. Both sets involve 1,251 speakers.

            # samples   Avg dur (s)   Tot dur (h)   Avg # words
TITW-Hard     282,606        2.42          189          10.84
TITW-Easy     248,024        2.51          173          10.55
Table 2: Speech quality of the TITW-Easy and -Hard sets. WER is calculated using OWSMv3 [30]. TITW remains significantly more challenging than VCTK or even EMILIA, with DNSMOS of 3.20 and 3.26, respectively.

            UTMOS   DNSMOS   WER (%)
TITW-Hard    3.00     2.38     9.30
TITW-Easy    3.32     2.78     9.10
3.2. Data selection

In order to maximize the "wildness" of the dataset, our initial investigations did not apply any data selection mechanism when composing the training set for TTS. However, we empirically found that attempts to train TTS models were unsuccessful due to the data being excessively noisy, including issues like mistranscription. Consequently, we developed four heuristically defined rules for data selection. These heuristics emerged from iterative efforts to train TTS models with filtered data. If any of the following conditions are met, the data is removed and discarded from further consideration:

• The language is not the target language, in this case English. To simplify TTS training, we use Whisper's language recognition capability to detect and remove utterances in languages other than English. Multilingual extensions are left for future work.
• The segment duration is shorter than 1 second or longer than 8 seconds. Empirical evidence suggests that using utterances with such a semi-consistent duration benefits training stability.
• The per-word duration is longer than 500 ms. The typical speaking rate is in the order of 2 words per second. Outliers often correspond to emotional or pathological speech, or long intervals of non-speech, all of which can destabilize TTS training and are hence removed.
• The automatic transcription is empty. Such cases indicate a non-speech segment or ASR failure. In either case, they cannot be used for TTS training and are removed.4

4 Note that despite the application of these data selection heuristics, TITW retains a higher level of variability and naturalness compared to most existing corpora.

The application of transcription, segmentation and data selection results in the "TITW-Hard" database. Since the raw data is collected from videos posted to social media, utterances in the TITW-Hard database still contain background noise or low-quality speech. Preliminary experiments have revealed that the training of TTS models with TITW-Hard data is extremely challenging; most attempts failed to converge even after applying the aforementioned data selection heuristics.

3.3. Enhancement and DNSMOS-based further data selection

Given the challenges of training TTS models using the TITW-Hard database, we created a second, relatively less challenging dataset named "TITW-Easy."5 First, we apply a pretrained speech enhancement model, DEMUCS [31],6 to reduce additive background noise.7 We then apply a second round of data selection, this time to the enhanced data. This is done by estimating DNSMOS scores [32, 33] for each utterance. Then, all utterances for which the DNSMOS score is below a threshold of 3.0 are removed. An exception is made for segments from speakers included in the generation protocol (Section 4). A sketch of the combined heuristic and DNSMOS-based filtering is given at the end of this section.

5 We believe that future TTS models and training schemes will enable successful training with TITW-Hard. Nonetheless, we introduce TITW-Easy, which contemporary state-of-the-art TTS architectures and training schemes can effectively utilize, as a stepping stone towards research in noisy-TTS training.
6 [Link]
7 The application of denoising does not compromise our objective to train TTS models with automatically collected data from the wild – the entire pipeline remains automated, reproducible, and scalable.

Figure 3: Histograms of samples in the TITW-Easy and -Hard sets using the DNSMOS overall score shown on the x-axis. Even with the data selection heuristics discussed in Section 3.2, training with TITW-Hard remains extremely challenging.

Figure 3 presents histograms of DNSMOS scores for both the TITW-Hard and TITW-Easy databases. It is clearly shown that most of the low-quality samples with low DNSMOS scores have been filtered out. Table 1 presents statistics, and Table 2 further details the UTMOS, DNSMOS overall, and Word Error Rate (WER) of TITW-Easy and -Hard, providing a comprehensive measure of the overall quality and intelligibility. WER was calculated by comparing TITW's transcripts with OWSMv3 outputs. The low DNSMOS scores confirm that both TITW training sets retain their challenging nature.
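As a concrete illustration of the two selection stages, below is a minimal sketch in Python. It assumes each candidate segment carries its Whisper-detected language, duration, automatic transcript, a DNSMOS overall score estimated after enhancement, and a flag for speakers reserved for the generation protocol; the Segment fields and helper names are hypothetical, while the thresholds (1–8 s duration, 500 ms per word, DNSMOS 3.0) follow Sections 3.2 and 3.3. This is a sketch of the described logic, not the released pipeline code.

# Hedged sketch of the heuristic (Sec. 3.2) and DNSMOS-based (Sec. 3.3)
# selection stages. Field and function names are illustrative assumptions;
# thresholds follow the paper.
from dataclasses import dataclass


@dataclass
class Segment:
    language: str           # language predicted by Whisper
    duration: float         # seconds
    transcript: str         # automatic transcription
    dnsmos_overall: float   # DNSMOS overall score after enhancement
    protocol_speaker: bool = False  # speaker reserved for the evaluation protocol


def passes_heuristics(seg: Segment, target_lang: str = "en") -> bool:
    """Sec. 3.2: discard a segment if any heuristic rule is violated."""
    words = seg.transcript.split()
    if seg.language != target_lang:        # rule 1: not the target language
        return False
    if not (1.0 <= seg.duration <= 8.0):   # rule 2: duration outside 1-8 s
        return False
    if not words:                          # rule 4: empty transcription (checked
        return False                       # first to avoid division by zero)
    if seg.duration / len(words) > 0.5:    # rule 3: per-word duration > 500 ms
        return False
    return True


def passes_dnsmos(seg: Segment, threshold: float = 3.0) -> bool:
    """Sec. 3.3: keep enhanced segments with DNSMOS >= 3.0, except those
    from speakers used in the generation protocol."""
    return seg.protocol_speaker or seg.dnsmos_overall >= threshold


def build_sets(segments):
    """TITW-Hard keeps everything passing the heuristics; TITW-Easy
    additionally requires the DNSMOS criterion on enhanced audio."""
    hard = [s for s in segments if passes_heuristics(s)]
    easy = [s for s in hard if passes_dnsmos(s)]
    return hard, easy

In this sketch, applying the heuristic filter alone corresponds to TITW-Hard, while the additional DNSMOS criterion on the enhanced audio yields the TITW-Easy subset.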
4. Evaluation and benchmarking

Once a TTS model is trained using the TITW database, it can be evaluated with one of the two protocols for generating new synthetic speech.

TITW-KSKT (Known Speaker, Known Text) is designed to generate synthetic speech for speakers and text that are both present in the TITW-Hard and TITW-Easy training sets. Both sets of speakers and text are randomly extracted from those used in the VoxCeleb1-O automatic speaker verification evaluation protocol. Consequently, the number of speakers here matches that of the VoxCeleb1-O protocol, at 40. However, due to our data preparation processes outlined in Section 3, the number of segments has increased from 4,708 to 9,113.

TITW-KSUT (Known Speaker, Unknown Text) aims to generate synthetic speech with text that is unseen in both the TITW-Hard and TITW-Easy datasets. We employ two text sources for this: the first is the Rainbow Passage [34], which covers many English sounds and their combinations. It has been used widely in other data collection efforts, for example the VCTK corpus [18]. The second is a set of Semantically Unpredictable Sentences (SUS) [35] selected from past Blizzard Challenges [36]. In total, there are 200 different text samples (31 from the Rainbow Passage and 169 from the set of SUS). With the same set of 40 speakers, the protocol requires the generation of 8,000 (= 40 × 200) synthetic utterances.

5. Experiments

5.1. Metrics

We adopt four metrics to assess the quality of generated synthetic speech: (1) Mel Cepstral Distortion (MCD) [37] measures the spectral similarity between synthesized and natural speech; (2) UTMOS [38] estimates the overall speech quality; (3) DNSMOS [33] also estimates the overall quality, including aspects such as noise reduction; (4) the ASR WER, measured using the OpenAI Whisper-Large model [39], quantifies the intelligibility of speech by measuring transcription errors. We use all four metrics as different proxies for speech quality. In practice, we use the VERSA toolkit to compute all four metrics [40].

5.2. TTS training data

To provide a reference, we first compared the two TITW datasets with others commonly used for TTS training. Results presented in Table 2 indicate that the TITW-Easy dataset surpasses the TITW-Hard dataset in terms of quality, as intended. As expected, speech samples in both TITW datasets remain more challenging than those typically used for TTS training. DNSMOS scores of TITW-Easy and -Hard are 2.78 and 2.38, while those of VCTK [41], MLS [22], and EMILIA [26] are 3.20, 3.33 and 3.22, respectively.

5.3. Baseline TTS benchmarks

We present the performance of four different TTS systems, all trained with TITW datasets: (i) TransformerTTS [42] with ParallelWaveGAN [43]; (ii) GradTTS-DiffWave [44, 45]; (iii) VITS [46]; (iv) MQTTS [12]. The choice of these baseline models aims to offer a diverse representation of contemporary TTS technologies. All models were trained with open-source recipes for reproducibility; detailed recipes are available at [27].

Table 3: Speech quality of the segments generated by the baselines on the TITW-KSKT and -KSUT protocols. All models were trained using the TITW-Easy data. MCD is only applicable for TITW-KSKT, where reference samples are available.

                                          TITW-KSKT                            TITW-KSUT
System                          MCD↓  UTMOS↑  DNSMOS↑  WER (%)↓     MCD↓  UTMOS↑  DNSMOS↑  WER (%)↓
TransformerTTS-ParallelWaveGAN  11.68   2.06    2.50     24.90       N/A    1.79    2.54    107.90
GradTTS-DiffWave                 6.76   2.18    2.39     11.90       N/A    2.30    2.54     54.00
VITS                             8.61   2.77    2.74     53.00       N/A    2.78    2.81    120.50
MQTTS                            6.99   3.08    2.83     23.30       N/A    3.20    2.94     67.10

Table 4: Comparative results of identical TTS systems trained with TITW-Easy and TITW-Hard data. Metrics are reported using the TITW-KSKT protocol. "GTmel" uses ground-truth mel-spectrograms in place of GradTTS to measure only the vocoder's performance.

System              Train   MCD   UTMOS   DNSMOS   WER (%)
GTmel-DiffWave      Easy    5.05   2.63     2.68    11.90
GTmel-DiffWave      Hard    4.97   2.24     2.30    12.20
GradTTS-DiffWave    Easy    6.76   2.18     2.39    11.90
GradTTS-DiffWave    Hard    8.23   1.29     1.47    26.20
VITS                Easy    8.61   2.77     2.74    53.00
VITS                Hard    9.06   2.48     2.69    59.50

Table 3 displays the results for these four TTS models when trained with the TITW-Easy dataset, evaluated under both protocols. The UTMOS and DNSMOS scores of the synthesized speech are comparable with those in Table 2, showing that it matches the quality of the training data. This indicates that the systems were successfully trained. Yet, they struggle with the inherent challenges of the TITW data; the WER increases significantly in most cases. These baseline performances are further challenged by the TITW-KSUT protocol results, where all performances degrade compared to the TITW-KSKT protocol.

Table 4 compares the performance of models trained on the TITW-Easy and TITW-Hard datasets. Here, we focus on two systems, GradTTS-DiffWave and VITS, as the other two baseline systems failed to converge when trained with TITW-Hard. We also present the result of replacing GradTTS with a mel-spectrogram extracted from the original speech file (i.e., copy synthesis), which serves as the upper bound for the waveform model, DiffWave. The results consistently confirm that in all cases models trained on TITW-Hard produce speech of lower quality, highlighting the challenging nature of TITW-Hard.

6. Conclusion and remarks

We introduce TITW, a new dataset tailored for the training, evaluation, and benchmarking of TTS systems using real-world, in-the-wild speech data. TITW responds to the growing trend in TTS research toward noisy-TTS training by leveraging uncontrolled environments. Through a fully automated processing pipeline applied to VoxCeleb1, chosen for its diverse, YouTube-sourced speech, we ensure scalability and broad accessibility. Our results demonstrate that four state-of-the-art TTS systems, when trained on TITW-Easy, produce synthetic speech that closely rivals the quality of the training data. However, our analysis reveals that only modern deep-learning-based TTS systems can effectively utilize TITW, while older statistical or early neural-network-based systems struggle. Training is also sensitive to data preparation, due to variability in noise, accents, or recording conditions, which might explain why the noisy-TTS training field has emerged only recently.

Beyond technical advancements, TITW's design carries significant ethical potential. By using an automatic speaker verification dataset as a source, it supports research into speech deepfake detection, a crucial task for combating the malicious use of synthetic voices. Consequently, TITW not only enhances TTS development for underrepresented languages lacking high-quality datasets but also bolsters safeguards against generative speech misuse. We hope that making TITW publicly available will spark further exploration of noisy-TTS training, driving both innovation and ethical responsibility in synthetic speech technology.
7. References

[1] M. Le, A. Vyas et al., "Voicebox: Text-guided multilingual universal speech generation at scale," in Proc. NeurIPS, 2024.
[2] S. Kim, K. Shih et al., "P-Flow: A fast and data-efficient zero-shot TTS through speech prompting," in Proc. NeurIPS, 2024.
[3] T. D. Nguyen, J.-H. Kim et al., "FreGrad: Lightweight and fast frequency-aware diffusion vocoder," in Proc. IEEE ICASSP, 2024.
[4] X. Zhang, D. Zhang et al., "SpeechTokenizer: Unified speech tokenizer for speech large language models," in Proc. ICLR, 2024.
[5] D. Yang, J. Tian et al., "UniAudio: An audio foundation model toward universal audio generation," in Proc. ICML, 2024.
[6] J.-H. Lee, S.-H. Lee et al., "PVAE-TTS: Adaptive text-to-speech via progressive style adaptation," in Proc. IEEE ICASSP, 2022.
[7] E. Kharitonov, D. Vincent et al., "Speak, read and prompt: High-fidelity text-to-speech with minimal supervision," Transactions of the Association for Computational Linguistics, 2023.
[8] J. Kim, K. Lee et al., "CLaM-TTS: Improving neural codec language model for zero-shot text-to-speech," in Proc. ICLR, 2024.
[9] J. Yamagishi, B. Usabaev et al., "Thousands of voices for HMM-based speech synthesis – Analysis and application of TTS systems built on various ASR corpora," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 984–1004, 2010.
[10] R. Karhila, U. Remes, and M. Kurimo, "Noise in HMM-based speech synthesis adaptation: Analysis, evaluation methods and experiments," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 285–295, 2014.
[11] C. Zhang, Y. Ren et al., "DenoiSpeech: Denoising text to speech with frame-level noise modeling," in Proc. IEEE ICASSP, 2021, pp. 7063–7067.
[12] L.-W. Chen, S. Watanabe, and A. Rudnicky, "A vector quantized approach for text to speech synthesis on real-world spontaneous speech," in Proc. AAAI, 2023, pp. 12644–12652.
[13] X. Wang, H. Delgado et al., "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," in Proc. Interspeech, 2024.
[14] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech, 2017.
[15] K. T. Mai, S. Bray et al., "Warning: Humans cannot reliably detect speech deepfakes," PLOS One, vol. 18, no. 8, pp. 1–20, 2023.
[16] J.-w. Jung, H. Tak et al., "SASV 2022: The first spoofing-aware speaker verification challenge," in Proc. Interspeech, 2022, pp. 2893–2897.
[17] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. Interspeech, 2004.
[18] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)," 2019.
[19] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proc. IEEE ICASSP, 1996.
[20] K. Tokuda, Y. Nankaku et al., "Speech synthesis based on hidden Markov models," Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
[21] K. Ito and L. Johnson, "The LJ Speech dataset," [Link]com/LJ-Speech-Dataset/, 2017.
[22] V. Pratap, Q. Xu et al., "MLS: A large-scale multilingual dataset for speech research," in Proc. Interspeech, 2020.
[23] H. Zen, V. Dang et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. Interspeech, 2019.
[24] C. Wang, S. Chen et al., "Neural codec language models are zero-shot text to speech synthesizers," arXiv preprint arXiv:2301.02111, 2023.
[25] K. Shen, Z. Ju et al., "NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers," in Proc. ICLR, 2024.
[26] H. He, Z. Shang et al., "Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation," arXiv preprint arXiv:2407.05361, 2024.
[27] J.-w. Jung, Y. Wu et al., "SpoofCeleb: Speech deepfake detection and SASV in the wild," IEEE Open Journal of Signal Processing, 2025.
[28] M. Bain, J. Huh et al., "WhisperX: Time-accurate speech transcription of long-form audio," in Proc. Interspeech, 2023, pp. 4489–4493.
[29] A. Radford, J. W. Kim et al., "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023, pp. 28492–28518.
[30] Y. Peng, J. Tian et al., "OWSM v3.1: Better and faster open Whisper-style speech models based on E-Branchformer," in Proc. Interspeech, 2024, pp. 352–356.
[31] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," in Proc. Interspeech, 2020, pp. 3291–3295.
[32] C. K. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. IEEE ICASSP, 2021, pp. 6493–6497.
[33] ——, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. IEEE ICASSP, 2022, pp. 886–890.
[34] G. Fairbanks, "Voice and articulation drillbook," 1960.
[35] C. Benoît, M. Grice, and V. Hazan, "The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences," Speech Communication, vol. 18, no. 4, pp. 381–392, 1996.
[36] S. King, "Measuring a decade of progress in text-to-speech," Loquens, vol. 1, no. 1, pp. e006–e006, 2014.
[37] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1993.
[38] T. Saeki, D. Xin et al., "UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022," in Proc. Interspeech, 2022.
[39] A. Radford, J. W. Kim et al., "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023.
[40] J. Shi, J. Tian et al., "ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech," in Proc. SLT, 2024.
[41] C. Veaux, J. Yamagishi, and K. MacDonald, "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit," The Centre for Speech Technology Research (CSTR), University of Edinburgh, 2019.
[42] N. Li, S. Liu et al., "Neural speech synthesis with transformer network," in Proc. AAAI, 2019, pp. 6706–6713.
[43] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. IEEE ICASSP, 2020, pp. 6199–6203.
[44] V. Popov, I. Vovk et al., "Grad-TTS: A diffusion probabilistic model for text-to-speech," in Proc. ICML, 2021, pp. 8599–8608.
[45] Z. Kong, W. Ping et al., "DiffWave: A versatile diffusion model for audio synthesis," in Proc. ICLR, 2021.
[46] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. ICML, 2021, pp. 5530–5540.