The Text-to-Speech in the Wild (TITW) Database

Jee-weon Jung†1, Wangyou Zhang2, Soumi Maiti1,3, Yihan Wu4, Xin Wang5, Ji-Hoon Kim6, Yuta Matsunaga7, Seyun Um8, Jinchuan Tian1, Hye-jin Shim1, Nicholas Evans9, Joon Son Chung6, Shinnosuke Takamichi10, Shinji Watanabe1

1 Carnegie Mellon University, USA   2 Shanghai Jiao Tong University, China   3 Meta, USA   4 Renmin University of China, China   5 National Institute of Informatics, Japan   6 Korea Advanced Institute of Science and Technology, South Korea   7 University of Tokyo, Japan   8 Yonsei University, South Korea   9 EURECOM, France   10 Keio University, Japan

jeeweonj@[Link]

† Currently at Apple.
Abstract

Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as "noisy-TTS training" has emerged, aiming to utilize in-the-wild data. However, the lack of dedicated datasets has been a significant limitation. We introduce the TTS In the Wild (TITW) dataset, which is publicly available,1 created through a fully automated pipeline applied to the VoxCeleb1 dataset. It comprises two training sets: TITW-Hard, derived from the transcription, segmentation, and selection of raw VoxCeleb1 data, and TITW-Easy, which incorporates additional enhancement and data selection based on DNSMOS. State-of-the-art TTS models achieve a UTMOS score above 3.0 with TITW-Easy, while TITW-Hard remains difficult, with UTMOS below 2.8. Beyond TTS, TITW's unique design, leveraging an automatic speaker recognition dataset, strengthens ethical efforts to counteract malicious use of TTS models by supporting tasks such as speech deepfake detection.

1 [Link]

Index Terms: text-to-speech synthesis, in the wild, dataset

Figure 1: Fully automated Text-To-Speech Synthesis In The Wild (TITW) processing pipeline. The pipeline incorporates transcription, segmentation, data selection, enhancement, and filtering based on DNSMOS. The TITW dataset comprises two editions: TITW-Easy, which can be used to successfully train the latest TTS systems, and TITW-Hard, where the quality is too low to train TTS systems with current technology but which aims to serve the training of more advanced future TTS systems.

1. Introduction

Generative speech technology is evolving rapidly, driven in part by advances in diffusion models, speech codecs, and speech-language modeling methodologies [1–5]. Among these advancements, Text-to-Speech (TTS) systems have made remarkable progress, with recent models capable of generating speech that is nearly indistinguishable from human speech in terms of intelligibility and naturalness. Notably, while traditional TTS systems required minutes of studio-recorded target speaker data, modern systems can now operate effectively with just a few seconds of such data [6–8].

In terms of TTS training, however, studio-quality data remains the de facto standard, despite its limitations. Studio recordings, while superior in audio quality, lack diversity, variability, and scalability. In-the-wild speech data, on the other hand, offers exposure to real-world variability, greater speaker diversity, and nearly unlimited scalability. These benefits are particularly valuable for underrepresented languages, as studio-based data is often limited, hindering the democratization of speech technology. Being able to utilize publicly available sources like YouTube for TTS training could revolutionize the field, enabling the curation of target language data from accessible, diverse, and abundant resources.

Efforts to train TTS systems with lower-quality, in-the-wild data – often termed "noisy-TTS training" – have shown promise. While earlier works have reported that TTS models cannot be trained with low-quality data [9, 10], more recent studies such as DenoiSpeech [11] and MQTTS [12] demonstrate that competitive performance can be achieved using non-studio-quality data, such as podcasts and YouTube audio. The ASVspoof 5 challenge [13] further supports this shift, demonstrating the potential of real-world sources like audiobooks for high-quality synthetic speech. However, the field of noisy-TTS training lacks standardized datasets, protocols, and benchmarks, with most studies relying on private data or artificially noised studio recordings. This gap hinders progress and reproducibility in the field.

To this end, we introduce the Text-To-Speech Synthesis In The Wild (TITW) database, which is publicly available. It is constructed using the VoxCeleb1 database [14], a large collection of YouTube speech data, and processed through an automated pipeline involving transcription, segmentation, and enhancement. This pipeline eliminates the need for manual processing, making the dataset scalable and accessible. TITW includes two training sets: TITW-Hard, comprising 189 hours of speech derived from raw VoxCeleb1 data with minimal processing, and TITW-Easy, a refined subset of 173 hours enhanced using DNSMOS-based selection and additional processing. We also propose standardized evaluation protocols and benchmarks to facilitate reproducible research. Our experiments demonstrate that four contemporary TTS models can be successfully trained using TITW, showcasing its practical utility.

The selection of VoxCeleb1 as a source dataset, originally developed for automatic speaker recognition, offers unique and novel benefits to TITW. Recent TTS systems, capable of producing speech nearly indistinguishable from human voices, have raised concerns about malicious use and its potential to damage society [15]. By training TTS models on TITW, the resulting synthetic speech can be paired with human speech in TITW to support research in speech deepfake detection and spoofing-robust automatic speaker verification [16]. This application of TITW not only advances TTS research but also contributes to developing robust countermeasures against the misuse of synthetic voices. This dual benefit stems from the dataset's guarantee that all speech samples feature only single-speaker audio. Consequently, our work underscores a broader impact, fostering both innovation and ethical considerations in the field of generative speech technology.
2. Related works

Numerous databases have been used for training TTS models. Legacy databases such as CMU ARCTIC [17] and VCTK [18] were carefully designed and curated. They contain phonetically-balanced utterances, all recorded in highly controlled acoustic environments. Due to the high cost of recording, these and similar databases typically include data from a single speaker or a small number of speakers. The speech data they contain is generally neutral in terms of emotions and expressiveness. These databases were widely used for training speaker-dependent and multiple-speaker legacy TTS systems (e.g., unit-selection [19] and HMM-based [20]).

The revolution in deep-learning-based TTS systems called for larger-scale datasets. Datasets like LJSpeech [21], Multilingual LibriSpeech [22] and LibriTTS [23], which are sourced from LibriVox audiobooks, are not recorded in studio-quality environments. LJSpeech contains twenty-four hours of audiobook recordings but from a single speaker, while the other two feature a greater number of speakers. They have been widely used to train deep-learning-based TTS systems [24, 25]. Their adoption marks a shift towards using training data collected in less controlled conditions. Even so, this data still falls short of capturing the diversity in speaker style and acoustic conditions found in truly "in-the-wild" scenarios; the signal-to-noise ratio remains high and utterances are generally well-enunciated.2

2 See LibriVox documentation and guidelines for recording an audiobook [Link]

The TITW database introduced in this paper aims to support research in overcoming data constraints, often referred to as noisy-TTS training. We envisage TTS systems that can be trained successfully using speech data collected in uncontrolled conditions. We see two avenues for such research. The first, most challenging direction involves the use of training data collected in the wild without manual human intervention, relying solely on automatic transcription, segmentation, and data selection based on heuristics. The second direction involves the use of a subset of data after applying additional speech enhancement and data selection based on speech quality. TITW contains in-the-wild recordings of interviews, podcasts, and more, all posted to social media, making it, to our best knowledge, one of the first of its kind.3

3 We recognize EMILIA [26] as the most similar, parallel work. However, the goals of the two works differ. EMILIA focuses on developing a data processing pipeline that yields high-quality data from in-the-wild data. Therefore, it strives to provide the highest achievable quality. TITW is designed to foster research in the training of TTS systems using more noisy and real-world data. Hence we provide not only TITW-Easy, which can be used for the training of contemporary TTS systems, but also TITW-Hard to challenge the development of future systems.

3. TITW

For several reasons, we selected VoxCeleb1 [14], which contains speech from 1,251 speakers, as the source data for TITW. Firstly, VoxCeleb1 is itself sourced from the wild, specifically YouTube, spanning diverse acoustic environments. Secondly, as a dataset for automatic speaker recognition, each utterance is from a single speaker. Lastly, by selecting VoxCeleb1 as a source dataset, TTS systems trained on TITW can contribute to future research in speech deepfake detection and spoofing-robust automatic speaker verification, topics that increasingly require TTS researchers' attention to safeguard the rapidly advancing speech generation technology from malicious usage. SpoofCeleb [27] exemplifies this effort, where synthesized (spoofed) utterances generated by 23 TTS systems trained on TITW-Easy are used to create a dataset for speech deepfake detection and spoofing-robust automatic speaker verification.

3.1. Transcription and segmentation

We first transcribe and segment the utterances using pretrained models and empirically derived heuristics, ensuring the process is fully automatic without human intervention. Figure 2 displays an example of a VoxCeleb1 utterance, transcribed at the word level and then segmented into two.

Figure 2: An example of the transcription and segmentation in the TITW automatic pipeline. A randomly selected utterance from VoxCeleb1 goes through our transcription and segmentation pipeline, deriving two segments. A segment in the middle is deleted because it is a non-speech segment over 500 ms.

Transcription. Since TTS training typically requires paired speech and text data, we generated transcriptions for the entire VoxCeleb1 corpus. For the sake of scalability and reproducibility in future projects in various languages, we generated transcriptions automatically using pretrained automatic speech recognition (ASR) models. We used the WhisperX [28] toolkit to generate transcriptions with word-level timestamps. WhisperX incorporates the OpenAI Whisper Large v2 model [29] for transcription and another phoneme-based ASR model for word-level alignment. We additionally employed the OWSMv3 speech foundation model [30] and transcribed the data in parallel. The transcriptions from the OWSMv3 model served to verify transcription accuracy.

Segmentation. We divide each sample into isolated segments using the Voice Activity Detection (VAD) embedded in WhisperX [28]. Practically, whenever a non-speech period exceeds 500 ms, we trim the silence and split the sample into two segments. This segmentation rule was developed through empirical, iterative investigations. Initial attempts to train TTS models with unsegmented data failed, revealing that excessively long silences within training samples were a significant issue. This procedure results in approximately 280k transcribed speech segments.
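To make the segmentation rule concrete, the following is a minimal sketch in Python. It assumes WhisperX-style word-level alignments, i.e., a list of dictionaries with "word", "start", and "end" fields in seconds; these field names and the helper function are illustrative assumptions, and only the 500 ms threshold comes from the rule described above. This is not the released TITW implementation.

# Sketch of the gap-based segmentation rule (Sec. 3.1).
# Assumes WhisperX-style word-level alignments: a list of dicts with
# "word", "start", and "end" keys in seconds. Illustrative only.

MAX_GAP = 0.5  # split whenever non-speech between words exceeds 500 ms


def segment_words(words, max_gap=MAX_GAP):
    """Group aligned words into segments, splitting on long silences."""
    groups, current = [], []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_gap:
            groups.append(current)  # close the segment before the long gap
            current = []
        current.append(w)
    if current:
        groups.append(current)
    # Each segment keeps its own text and trimmed time boundaries.
    return [
        {
            "text": " ".join(w["word"].strip() for w in seg),
            "start": seg[0]["start"],
            "end": seg[-1]["end"],
        }
        for seg in groups
    ]


if __name__ == "__main__":
    words = [
        {"word": "hello", "start": 0.10, "end": 0.45},
        {"word": "there", "start": 0.50, "end": 0.90},
        {"word": "again", "start": 1.60, "end": 2.00},  # 0.7 s gap -> new segment
    ]
    for seg in segment_words(words):
        print(seg)

Segments produced this way are then passed to the data selection stage described in the next subsection.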
Table 1: Text-To-Speech Synthesis In The Wild (TITW) statistics. Both sets involve 1,251 speakers.

            # samples   Avg dur (s)   Tot dur (h)   Avg # words
TITW-Hard     282,606        2.42          189          10.84
TITW-Easy     248,024        2.51          173          10.55
Table 2: Speech quality of the TITW-Easy and -Hard sets. WER is calculated using OWSMv3 [30]. TITW remains significantly more challenging than VCTK or even EMILIA, with DNSMOS of 3.20 and 3.26, respectively.

            UTMOS   DNSMOS   WER (%)
TITW-Hard    3.00     2.38     9.30
TITW-Easy    3.32     2.78     9.10
3.2. Data selection

In order to maximize the "wildness" of the dataset, our initial investigations did not apply any data selection mechanism when composing the training set for TTS. However, we empirically found that attempts to train TTS models were unsuccessful due to the data being excessively noisy, including issues like mistranscription. Consequently, we developed four heuristically defined rules for data selection. These heuristics emerged from iterative efforts to train TTS models with filtered data. If any of the following conditions are met, the data is removed and discarded from further consideration:

• The language is not the target language, in this case English. To simplify TTS training, we use Whisper's language recognition capability to detect and remove utterances in languages other than English. Multilingual extensions are left for future work.
• The segment duration is shorter than 1 second or longer than 8 seconds. Empirical evidence suggests that using utterances with such a semi-consistent duration benefits training stability.
• The per-word duration is longer than 500 ms. The typical speaking rate is in the order of 2 words per second. Outliers often correspond to emotional or pathological speech, or long intervals of non-speech, all of which can destabilize TTS training and are hence removed.
• The automatic transcription is empty. Such cases indicate a non-speech segment or ASR failure. In either case, they cannot be used for TTS training and are removed.4

4 Note that despite the application of these data selection heuristics, TITW retains a higher level of variability and naturalness compared to most existing corpora.

The application of transcription, segmentation and data selection results in the "TITW-Hard" database. Since the raw data is collected from videos posted to social media, utterances in the TITW-Hard database still contain background noise or low-quality speech. Preliminary experiments have revealed that the training of TTS models with TITW-Hard data is extremely challenging; most attempts failed to converge even after applying the aforementioned data selection heuristics.

3.3. Enhancement and DNSMOS-based further data selection

Given the challenges of training TTS models using the TITW-Hard database, we created a second, relatively less challenging dataset named "TITW-Easy."5 First, we apply a pretrained speech enhancement model, DEMUCS [31],6 to reduce additive background noise.7 We then apply a second round of data selection, this time to the enhanced data. This is done by estimating DNSMOS scores [32, 33] for each utterance. Then, all utterances for which the DNSMOS score is below a threshold of 3.0 are removed. An exception is made for segments from speakers included in the generation protocol (Section 4). A sketch of the combined heuristic and DNSMOS-based filtering is given at the end of this section.

5 We believe that future TTS models and training schemes will enable successful training with TITW-Hard. Nonetheless, we introduce TITW-Easy, which contemporary state-of-the-art TTS architectures and training schemes can effectively utilize, as a stepping stone towards research in noisy-TTS training.
6 [Link]
7 The application of denoising does not compromise our objective to train TTS models with automatically collected data from the wild – the entire pipeline remains automated, reproducible, and scalable.

Figure 3: Histograms of samples in the TITW-Easy and -Hard sets using the DNSMOS overall score shown on the x-axis. Even with the data selection heuristics discussed in Section 3.2, training with TITW-Hard remains extremely challenging.

Figure 3 presents histograms of DNSMOS scores for both the TITW-Hard and TITW-Easy databases. It is clearly shown that most of the low-quality samples with low DNSMOS scores have been filtered out. Table 1 presents statistics, and Table 2 further details the UTMOS, DNSMOS overall, and Word Error Rate (WER) of TITW-Easy and -Hard, providing a comprehensive measure of the overall quality and intelligibility. WER was calculated by comparing TITW's transcripts with OWSMv3 outputs. The low DNSMOS scores confirm that both TITW training sets retain their challenging nature.
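As a concrete illustration of the two selection stages, below is a minimal sketch in Python. It assumes each candidate segment carries its Whisper-detected language, duration, automatic transcript, a DNSMOS overall score estimated after enhancement, and a flag for speakers reserved for the generation protocol; the Segment fields and helper names are hypothetical, while the thresholds (1–8 s duration, 500 ms per word, DNSMOS 3.0) follow Sections 3.2 and 3.3. This is a sketch of the described logic, not the released pipeline code.

# Hedged sketch of the heuristic (Sec. 3.2) and DNSMOS-based (Sec. 3.3)
# selection stages. Field and function names are illustrative assumptions;
# thresholds follow the paper.
from dataclasses import dataclass


@dataclass
class Segment:
    language: str           # language predicted by Whisper
    duration: float         # seconds
    transcript: str         # automatic transcription
    dnsmos_overall: float   # DNSMOS overall score after enhancement
    protocol_speaker: bool = False  # speaker reserved for the evaluation protocol


def passes_heuristics(seg: Segment, target_lang: str = "en") -> bool:
    """Sec. 3.2: discard a segment if any heuristic rule is violated."""
    words = seg.transcript.split()
    if seg.language != target_lang:        # rule 1: not the target language
        return False
    if not (1.0 <= seg.duration <= 8.0):   # rule 2: duration outside 1-8 s
        return False
    if not words:                          # rule 4: empty transcription (checked
        return False                       # first to avoid division by zero)
    if seg.duration / len(words) > 0.5:    # rule 3: per-word duration > 500 ms
        return False
    return True


def passes_dnsmos(seg: Segment, threshold: float = 3.0) -> bool:
    """Sec. 3.3: keep enhanced segments with DNSMOS >= 3.0, except those
    from speakers used in the generation protocol."""
    return seg.protocol_speaker or seg.dnsmos_overall >= threshold


def build_sets(segments):
    """TITW-Hard keeps everything passing the heuristics; TITW-Easy
    additionally requires the DNSMOS criterion on enhanced audio."""
    hard = [s for s in segments if passes_heuristics(s)]
    easy = [s for s in hard if passes_dnsmos(s)]
    return hard, easy

In this sketch, applying the heuristic filter alone corresponds to TITW-Hard, while the additional DNSMOS criterion on the enhanced audio yields the TITW-Easy subset.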
4. Evaluation and benchmarking

Once a TTS model is trained using the TITW database, it can be evaluated with one of the two protocols for generating new synthetic speech.

TITW-KSKT (Known Speaker, Known Text) is designed to generate synthetic speech for speakers and text that are both present in the TITW-Hard and TITW-Easy training sets. Both sets of speakers and text are randomly extracted from those used in the VoxCeleb1-O automatic speaker verification evaluation protocol. Consequently, the number of speakers here matches that of the VoxCeleb1-O protocol, at 40. However, due to our data preparation processes outlined in Section 3, the number of segments has increased from 4,708 to 9,113.

TITW-KSUT (Known Speaker, Unknown Text) aims to generate synthetic speech with text that is unseen in both the TITW-Hard and TITW-Easy datasets. We employ two text sources for this: the first is the Rainbow Passage [34], which covers many English sounds and their combinations. It has been used widely in other data collection efforts, for example the VCTK corpus [18]. The second is a set of Semantically Unpredictable Sentences (SUS) [35] selected from past Blizzard Challenges [36]. In total, there are 200 different text samples (31 from the Rainbow Passage and 169 from the set of SUS). With the same set of 40 speakers, the protocol requires the generation of 8,000 (= 40 × 200) synthetic utterances.

5. Experiments

5.1. Metrics

We adopt four metrics to assess the quality of generated synthetic speech: (1) Mel Cepstral Distortion (MCD) [37] measures the spectral similarity between synthesized and natural speech; (2) UTMOS [38] estimates the overall speech quality; (3) DNSMOS [33] also estimates the overall quality, including aspects such as noise reduction; (4) the ASR WER, measured using the OpenAI Whisper-Large model [39], quantifies the intelligibility of speech by measuring transcription errors. We use all four metrics as different proxies for speech quality. In practice, we use the VERSA toolkit to compute all four metrics [40].

5.2. TTS training data

To provide a reference, we first compared the two TITW datasets with others commonly used for TTS training. Results presented in Table 2 indicate that the TITW-Easy dataset surpasses the TITW-Hard dataset in terms of quality, as intended. As expected, speech samples in both TITW datasets remain more challenging than those typically used for TTS training. DNSMOS scores of TITW-Easy and -Hard are 2.78 and 2.38, while those of VCTK [41], MLS [22], and EMILIA [26] are 3.20, 3.33 and 3.22, respectively.

5.3. Baseline TTS benchmarks

We present the performance of four different TTS systems, all trained with TITW datasets: (i) TransformerTTS [42] with ParallelWaveGAN [43]; (ii) GradTTS-DiffWave [44, 45]; (iii) VITS [46]; (iv) MQTTS [12]. The choice of these baseline models aims to offer a diverse representation of contemporary TTS technologies. All models were trained with open-source recipes for reproducibility; detailed recipes are available at [27].

Table 3: Speech quality of the segments generated by the baselines on the TITW-KSKT and -KSUT protocols. All models were trained using the TITW-Easy data. MCD is only applicable for TITW-KSKT, where reference samples are available.

                                          TITW-KSKT                            TITW-KSUT
System                          MCD↓  UTMOS↑  DNSMOS↑  WER (%)↓     MCD↓  UTMOS↑  DNSMOS↑  WER (%)↓
TransformerTTS-ParallelWaveGAN  11.68   2.06    2.50     24.90       N/A    1.79    2.54    107.90
GradTTS-DiffWave                 6.76   2.18    2.39     11.90       N/A    2.30    2.54     54.00
VITS                             8.61   2.77    2.74     53.00       N/A    2.78    2.81    120.50
MQTTS                            6.99   3.08    2.83     23.30       N/A    3.20    2.94     67.10

Table 4: Comparative results of identical TTS systems trained with TITW-Easy and TITW-Hard data. Metrics are reported using the TITW-KSKT protocol. "GTmel" uses ground-truth mel-spectrograms in place of GradTTS to measure only the vocoder's performance.

System              Train   MCD   UTMOS   DNSMOS   WER (%)
GTmel-DiffWave      Easy    5.05   2.63     2.68    11.90
GTmel-DiffWave      Hard    4.97   2.24     2.30    12.20
GradTTS-DiffWave    Easy    6.76   2.18     2.39    11.90
GradTTS-DiffWave    Hard    8.23   1.29     1.47    26.20
VITS                Easy    8.61   2.77     2.74    53.00
VITS                Hard    9.06   2.48     2.69    59.50

Table 3 displays the results for these four TTS models when trained with the TITW-Easy dataset, evaluated under both protocols. The UTMOS and DNSMOS scores of the synthesized speech are comparable with those in Table 2, showing that it matches the quality of the training data. This indicates that the systems were successfully trained. Yet, they struggle with the inherent challenges of the TITW data; the WER increases significantly in most cases. These baseline performances are further challenged by the TITW-KSUT protocol results, where all performances degrade compared to the TITW-KSKT protocol.

Table 4 compares the performance of models trained on the TITW-Easy and TITW-Hard datasets. Here, we focus on two systems, GradTTS-DiffWave and VITS, as the other two baseline systems failed to converge when trained with TITW-Hard. We also present the result of replacing GradTTS with a mel-spectrogram extracted from the original speech file (i.e., copy synthesis), which serves as the upper bound for the waveform model, DiffWave. The results consistently confirm that in all cases models trained on TITW-Hard produce speech of lower quality, highlighting the challenging nature of TITW-Hard.

6. Conclusion and remarks

We introduce TITW, a new dataset tailored for the training, evaluation, and benchmarking of TTS systems using real-world, in-the-wild speech data. TITW responds to the growing trend in TTS research toward noisy-TTS training by leveraging uncontrolled environments. Through a fully automated processing pipeline applied to VoxCeleb1, chosen for its diverse, YouTube-sourced speech, we ensure scalability and broad accessibility. Our results demonstrate that four state-of-the-art TTS systems, when trained on TITW-Easy, produce synthetic speech that closely rivals the quality of the training data. However, our analysis reveals that only modern deep-learning-based TTS systems can effectively utilize TITW, while older statistical or early neural-network-based systems struggle. Training is also sensitive to data preparation, due to variability in noise, accents, or recording conditions, which might explain why the noisy-TTS training field has emerged only recently.

Beyond technical advancements, TITW's design carries significant ethical potential. By using an automatic speaker verification dataset as a source, it supports research into speech deepfake detection, a crucial task for combating the malicious use of synthetic voices. Consequently, TITW not only enhances TTS development for underrepresented languages lacking high-quality datasets but also bolsters safeguards against generative speech misuse. We hope that making TITW publicly available will spark further exploration of noisy-TTS training, driving both innovation and ethical responsibility in synthetic speech technology.
7. References

[1] M. Le, A. Vyas et al., "Voicebox: Text-guided multilingual universal speech generation at scale," in Proc. NeurIPS, 2024.
[2] S. Kim, K. Shih et al., "P-Flow: A fast and data-efficient zero-shot TTS through speech prompting," in Proc. NeurIPS, 2024.
[3] T. D. Nguyen, J.-H. Kim et al., "FreGrad: Lightweight and fast frequency-aware diffusion vocoder," in Proc. IEEE ICASSP, 2024.
[4] X. Zhang, D. Zhang et al., "SpeechTokenizer: Unified speech tokenizer for speech large language models," in Proc. ICLR, 2024.
[5] D. Yang, J. Tian et al., "UniAudio: An audio foundation model toward universal audio generation," in Proc. ICML, 2024.
[6] J.-H. Lee, S.-H. Lee et al., "PVAE-TTS: Adaptive text-to-speech via progressive style adaptation," in Proc. IEEE ICASSP, 2022.
[7] E. Kharitonov, D. Vincent et al., "Speak, read and prompt: High-fidelity text-to-speech with minimal supervision," Transactions of the Association for Computational Linguistics, 2023.
[8] J. Kim, K. Lee et al., "CLaM-TTS: Improving neural codec language model for zero-shot text-to-speech," in Proc. ICLR, 2024.
[9] J. Yamagishi, B. Usabaev et al., "Thousands of voices for HMM-based speech synthesis – Analysis and application of TTS systems built on various ASR corpora," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 984–1004, 2010.
[10] R. Karhila, U. Remes, and M. Kurimo, "Noise in HMM-based speech synthesis adaptation: Analysis, evaluation methods and experiments," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 285–295, 2014.
[11] C. Zhang, Y. Ren et al., "DenoiSpeech: Denoising text to speech with frame-level noise modeling," in Proc. IEEE ICASSP, 2021, pp. 7063–7067.
[12] L.-W. Chen, S. Watanabe, and A. Rudnicky, "A vector quantized approach for text to speech synthesis on real-world spontaneous speech," in Proc. AAAI, 2023, pp. 12644–12652.
[13] X. Wang, H. Delgado et al., "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," in Proc. Interspeech, 2024.
[14] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech, 2017.
[15] K. T. Mai, S. Bray et al., "Warning: Humans cannot reliably detect speech deepfakes," PLOS One, vol. 18, no. 8, pp. 1–20, 2023.
[16] J.-w. Jung, H. Tak et al., "SASV 2022: The first spoofing-aware speaker verification challenge," in Proc. Interspeech, 2022, pp. 2893–2897.
[17] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. Interspeech, 2004.
[18] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)," 2019.
[19] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proc. IEEE ICASSP, 1996.
[20] K. Tokuda, Y. Nankaku et al., "Speech synthesis based on hidden Markov models," Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
[21] K. Ito and L. Johnson, "The LJ Speech dataset," [Link]com/LJ-Speech-Dataset/, 2017.
[22] V. Pratap, Q. Xu et al., "MLS: A large-scale multilingual dataset for speech research," in Proc. Interspeech, 2020.
[23] H. Zen, V. Dang et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. Interspeech, 2019.
[24] C. Wang, S. Chen et al., "Neural codec language models are zero-shot text to speech synthesizers," arXiv preprint arXiv:2301.02111, 2023.
[25] K. Shen, Z. Ju et al., "NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers," in Proc. ICLR, 2024.
[26] H. He, Z. Shang et al., "Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation," arXiv preprint arXiv:2407.05361, 2024.
[27] J.-w. Jung, Y. Wu et al., "SpoofCeleb: Speech deepfake detection and SASV in the wild," IEEE Open Journal of Signal Processing, 2025.
[28] M. Bain, J. Huh et al., "WhisperX: Time-accurate speech transcription of long-form audio," in Proc. Interspeech, 2023, pp. 4489–4493.
[29] A. Radford, J. W. Kim et al., "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023, pp. 28492–28518.
[30] Y. Peng, J. Tian et al., "OWSM v3.1: Better and faster open Whisper-style speech models based on E-Branchformer," in Proc. Interspeech, 2024, pp. 352–356.
[31] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," in Proc. Interspeech, 2020, pp. 3291–3295.
[32] C. K. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. IEEE ICASSP, 2021, pp. 6493–6497.
[33] ——, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. IEEE ICASSP, 2022, pp. 886–890.
[34] G. Fairbanks, "Voice and articulation drillbook," 1960.
[35] C. Benoît, M. Grice, and V. Hazan, "The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences," Speech Communication, vol. 18, no. 4, pp. 381–392, 1996.
[36] S. King, "Measuring a decade of progress in text-to-speech," Loquens, vol. 1, no. 1, pp. e006–e006, 2014.
[37] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1993.
[38] T. Saeki, D. Xin et al., "UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022," in Proc. Interspeech, 2022.
[39] A. Radford, J. W. Kim et al., "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023.
[40] J. Shi, J. Tian et al., "ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech," in Proc. SLT, 2024.
[41] C. Veaux, J. Yamagishi, and K. MacDonald, "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit," The Centre for Speech Technology Research (CSTR), University of Edinburgh, 2019.
[42] N. Li, S. Liu et al., "Neural speech synthesis with transformer network," in Proc. AAAI, 2019, pp. 6706–6713.
[43] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. IEEE ICASSP, 2020, pp. 6199–6203.
[44] V. Popov, I. Vovk et al., "Grad-TTS: A diffusion probabilistic model for text-to-speech," in Proc. ICML, 2021, pp. 8599–8608.
[45] Z. Kong, W. Ping et al., "DiffWave: A versatile diffusion model for audio synthesis," in Proc. ICLR, 2021.
[46] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. ICML, 2021, pp. 5530–5540.