Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
Anonymous Author(s)
Affiliation
Address
email
Abstract
While self-supervised learning (SSL) in speech has greatly reduced the reliance of speech processing systems on annotated corpora, the success of SSL still hinges on the availability of a large-scale unannotated corpus, which is often impractical for low-resource languages or under privacy constraints. In this paper, we investigate whether existing SSL methods have been underutilizing the information in pretraining and explore ways to improve their information efficiency. Motivated by the recent success of diffusion models in capturing the abundant information in data, we propose DiffS4L, a synthetic speech SSL algorithm based on diffusion models. DiffS4L introduces a diffusion model, which learns from a given small pretraining dataset and expands it into a much larger synthetic dataset with different levels of variation. The synthetic dataset is then used to pretrain SSL models. Our experiments show that DiffS4L can significantly improve the performance of SSL models, for example reducing the WER of the HuBERT pretrained model by 6.26 percentage points in the English ASR task. Notably, even the nonsensical babbles generated by the diffusion model account for a significant portion of the performance improvement, which indicates the strong capability of diffusion models in capturing coherent information in speech that has been overlooked by SSL methods.
1 Introduction
Self-supervised learning (SSL) in speech has greatly reduced the reliance of speech processing systems on large-scale annotated corpora. By pretraining a speech representation network on a large-scale unannotated dataset, SSL models only require a relatively small annotated dataset for finetuning, which has significantly improved the efficiency and feasibility of speech processing, particularly for low-resource languages. However, the success of such methods still hinges on the availability of a large-scale unannotated corpus. For example, the training of HuBERT [Hsu et al., 2021], one of the most widely used speech pretraining models, typically requires that the unannotated corpus contain at least 1,000 hours of speech. If the dataset size drops to 100 hours, it tends to perform significantly worse. Yet, in many scenarios, obtaining such a large-scale dataset is impractical due to various constraints, e.g., low-resource languages or privacy concerns.
Such limitations have prompted us to re-examine SSL from an information efficiency perspective. Essentially, if we consider the pretraining dataset as a source of information about speech data at various levels (from phonetics to semantics), then SSL can be seen as a way to extract that information. In situations where the pretraining dataset is limited, it becomes crucial to maximize the amount of information captured from the dataset to achieve the best performance in downstream tasks. This raises the question: do existing SSL techniques have a high enough information efficiency? Could there be additional information that SSL models fail to capture, which would otherwise contribute to better performance in downstream tasks?
On the other hand, generative models are also often regarded as models that capture distributional information about data. Recently, diffusion models [Ho et al., 2020, Song et al., 2021], with their superior performance in computer vision, have quickly attracted wide research attention. Researchers have found that, compared to other generative models, diffusion models can generate samples with much better global coherence [Li et al., 2022b] and local detail [Dhariwal and Nichol, 2021], an indication that diffusion models may be able to capture more complete information from a limited dataset, information that could complement what existing SSL methods learn.
Motivated by this, in this paper, we conduct an extensive exploration of using synthetic data generated by diffusion models to improve the performance of existing SSL methods in a low-resource setting. In particular, we propose a Synthetic Speech Self-Supervised Learning algorithm called DiffS4L. DiffS4L introduces a diffusion model, which learns from a given small pretraining dataset and then expands it into a much larger synthetic dataset. The new dataset contains synthetic speech utterances with different levels of variation, ranging from utterances identical to those in the original small dataset to near-complete babbles. Finally, the synthetic dataset is used to pretrain SSL models using existing algorithms. Since the diffusion model only has access to the information in the original real dataset, the entire process can be viewed as restructuring and recreating the information in the original pretraining dataset into a form that is more digestible for existing SSL methods.
Our experiments on DiffS4L reveal many interesting findings. DiffS4L can significantly improve the performance of existing SSL algorithms over models pretrained on the real data alone, across both low-resource and high-resource scenarios. In English ASR, for example, with 100 hours of real data, DiffS4L can reduce the WER by 6.26 percentage points for HuBERT pretrained models, a 26.4% relative improvement. Notably, the babbles generated by diffusion models, which are complete nonsense to humans, can account for a significant portion of the performance improvement, whereas babbles generated by other generative models, such as WaveNet [van den Oord et al., 2016], only deteriorate the performance. These findings suggest that the information in pretraining datasets has been under-utilized, and that diffusion models are very effective in capturing the information that has been overlooked by existing SSL training methods and other generative models.
2 Related Work
Data Augmentation with Synthetic Data Training neural networks with synthetic data to improve performance has been extensively studied in various computer vision tasks, such as visual representation learning [Baradad Jurjo et al., 2021, Jahanian et al., 2021, Wu et al., 2022, Kataoka et al., 2022], image classification [Gan et al., 2021, Mikami et al., 2021], object detection [Peng et al., 2015, Prakash et al., 2019, Chattopadhyay et al., 2022], anomaly detection [Tsai and Wang, 2022], semantic segmentation [Ros et al., 2016, Wang et al., 2020], action recognition [De Souza et al., 2017, Varol et al., 2021], visual reasoning [Johnson et al., 2017], and embodied perception [Kolve et al., 2017, Savva et al., 2019, Xia et al., 2018]. Recently, this direction has also been studied in NLP tasks such as machine translation [Downey et al., 2022], language model pretraining [Yao et al., 2022], and finetuning [Steinert-Threlkeld et al., 2022].
Augmenting datasets with synthetic data has been shown effective in improving speech processing systems. One research direction modifies speech waveforms by adding random noise [Amodei et al., 2015], warping the spectrogram, masking blocks of the spectrogram in the frequency and time domains [Park et al., 2019], modifying pitch and adding reverberation [Kharitonov et al., 2020], and disentangling speaker information from speech content [Qian et al., 2022].
Another line of research augments the dataset using speech data generated from speech synthesizers and reports improvements on speech translation [Zhao et al., 2022], fake audio detection [Li et al., 2022a], and speech recognition [Hayashi et al., 2018, Mimura et al., 2018, Li et al., 2018, Rossenbach et al., 2020, Violeta et al., 2022, Jin et al., 2022, Krug et al., 2022, Zevallos et al., 2022], among others. Zheng et al. [2021] use synthetic data to improve the recognition of out-of-vocabulary words in ASR systems. Zhao et al. [2022] generate synthetic training data by retrieving and stitching clips from a spoken vocabulary bank. Li et al. [2018] train a Tacotron-2 [Shen et al., 2018] conditioned on Global Style Tokens [Wang et al., 2018] to generate speech with different speaking styles. Jin et al. [2022] use a GAN-based generator conditioned on dysarthric speech characteristics to generate synthetic speech for dysarthric ASR. Krug et al. [2022] generate articulatory speech for phoneme recognition. These works improve traditional task-specific speech systems by generating additional paired speech and text data, whereas our work aims to improve general-purpose self-supervised speech representations, without additional text data, to benefit downstream ASR and other speech-related tasks.

Figure 1: The algorithm overview. Solid arrows represent the data flow that generates the synthetic dataset. Dashed arrows mark the dataset on which each network is trained. [Diagram labels recovered from the extracted text: Small Real Dataset; Initial Speech Representation Network f0; Primitive Speech Representation R0; Diffusion Model Synthesizer g; Speaker Identity I; Modify; Large Synthetic Dataset; Final Speech Representation Network; Train.]
Denoising Diffusion Probabilistic Models for Speech Denoising diffusion probabilistic models (DDPMs) have recently demonstrated great power in image synthesis [Ho et al., 2020, Dhariwal and Nichol, 2021] and image inpainting [Lugmayr et al., 2022]. Recently, various DDPM-based vocoders and text-to-speech (TTS) synthesizers have been proposed [Chen et al., 2021a,b, Kong et al., 2020b, Lam et al., 2022, Huang et al., 2022a,b] and have achieved high quality. WaveGrad [Chen et al., 2021a] and DiffWave [Kong et al., 2020b] are two concurrent works that study DDPM-based vocoders that synthesize audio waveforms from spectrograms; WaveGrad uses a neural architecture inspired by GAN-TTS [Bińkowski et al., 2019], and DiffWave one inspired by WaveNet. FastDiff [Huang et al., 2022a] and ProDiff [Huang et al., 2022b] are end-to-end TTS systems that use FastSpeech [Ren et al., 2020], a transformer-based TTS encoder, to extract text features to condition the DDPM, and adopt the noise scheduling algorithm proposed in BDDM [Lam et al., 2022] to shorten the sampling steps for fast speech synthesis.
3 Method

Denote a speech utterance as X. Given an unannotated speech dataset, denoted as D0, DiffS4L aims to synthesize a much larger dataset Dsyn, which is then used to pretrain the speech representation network. As shown in Figure 1, the algorithm consists of four steps.
Step 1: Use D0 to train an initial speech representation network f0(·), which produces a primitive speech representation, denoted as R0 = f0(X).

Step 2: Use D0 to train a diffusion-model-based speech synthesizer g(·), which generates speech X̃ conditioned on the (partially masked) primitive speech representation R0 and a speaker identity, denoted as I, i.e., X̃ = g(R0, I).

Step 3: For each utterance X in D0, manipulate its speech representation R0 and speaker identity I, and then feed them to the speech synthesizer to generate utterances with different levels of variation. Denote the resulting dataset as Dsyn.

Step 4: Use Dsyn to train a new speech representation network.
It is worth noting that the diffusion model only has access to the original pretraining dataset D0 during training and generation, so the synthetic dataset Dsyn contains no more information than D0, but may restructure and recreate it in a way that is more beneficial for SSL with existing methods. The following subsections provide more details on Steps 1-3, respectively.
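To make the four steps concrete, below is a minimal Python sketch of the pipeline. Every argument is a caller-supplied callable and every name is a hypothetical placeholder; the actual implementation is built on fairseq and ProDiff rather than on these interfaces.

```python
def diffs4l_pipeline(d0, train_ssl, quantize, train_synthesizer, expand_dataset):
    """Hypothetical sketch of the four DiffS4L steps (not the released code).

    d0                -- iterable of utterance identifiers (assumed hashable) for the small real dataset
    train_ssl         -- callable: dataset -> speech representation network
    quantize          -- callable: frame features -> discrete units (e.g., k-means ids)
    train_synthesizer -- callable: (dataset, units) -> diffusion-model synthesizer g
    expand_dataset    -- callable: (dataset, units, g) -> large synthetic dataset Dsyn
    """
    f0 = train_ssl(d0)                            # Step 1: primitive representation network
    r0 = {utt: quantize(f0(utt)) for utt in d0}   # primitive representation R0 per utterance
    g = train_synthesizer(d0, r0)                 # Step 2: diffusion-model speech synthesizer
    dsyn = expand_dataset(d0, r0, g)              # Step 3: O/SS/NS/NC synthetic data
    return train_ssl(dsyn)                        # Step 4: pretrain the final SSL model on Dsyn
```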
[Figure (spectrogram panels not recoverable from the extracted text). Recovered labels: t = 10, t = 5, t = 3, t = 0; Copy; Fully-Conditional Model; Replaced I; Masked R0. Recovered caption fragment: "... original utterance is 'There were no fairies and hobgoblins about'. The yellow dashed lines on the spectrogram in (d) mark the boundaries of the masks on R0."]

In our setting, the size of D0 is very small. We adopt wav2vec 2.0 [Baevski et al., 2020] for our primitive speech representation learning because it has stable performance in low-resource scenarios. Note that the algorithm used to train the final speech representation network (Step 4) need not be the same as the one used for the primitive speech representation learning. After the wav2vec 2.0 model is trained, we elicit the 5th-layer feature and quantize it into 500 classes using k-means, which becomes the primitive speech representation R0 for the subsequent steps. A discussion on choosing the number of clusters is provided in Appendix E.
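As a concrete illustration, the quantization step can be sketched with scikit-learn as below. The feature extraction is assumed to have already produced the 5th-layer wav2vec 2.0 features; only the 500-cluster k-means itself is shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_unit_codebook(layer5_features, n_units=500, seed=0):
    """Fit a k-means codebook on 5th-layer wav2vec 2.0 features pooled over D0.

    layer5_features -- list of (T_i, D) numpy arrays, one per utterance
    Returns the fitted KMeans model; assigning each frame to its nearest
    centroid yields the primitive representation R0 (a unit-id sequence).
    """
    stacked = np.concatenate(layer5_features, axis=0)   # (sum_i T_i, D)
    return KMeans(n_clusters=n_units, random_state=seed, n_init=10).fit(stacked)

# Usage sketch: r0 = [codebook.predict(feats) for feats in layer5_features]
```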
Diffusion models refer to a family of generative models that denoise from noise signals into clean signals through multiple denoising steps. In this work, we adopt the canonical denoising diffusion probabilistic model (DDPM) [Ho et al., 2020] to generate a speech spectrogram. Specifically, DDPM introduces a set of intermediate variables forming a Markov process, denoted as X_{0:T}, where X_0 is the original speech spectrogram, and X_t is corrupted from X_{t-1} with Gaussian noise:

    q(X_t \mid X_{t-1}) = \mathcal{N}\big(X_t;\ \sqrt{1-\beta_t}\, X_{t-1},\ \beta_t I\big),    (1)

where β_t is a hyperparameter. It can be shown that with a proper β_t schedule, X_T is very close to standard Gaussian noise. To generate X_0, we randomly sample X_T from the standard Gaussian distribution and sequentially recover X_{T-1} through X_0, as visualized in Figure 2, via the following denoising process:

    p_\theta(X_{t-1} \mid X_t, C) = \mathcal{N}\big(X_{t-1};\ \mu_\theta(X_t, t, C),\ \sigma_t^2 I\big),    (2)

where μ_θ is produced by a (reparameterized) denoising network, and σ_t can be computed from β_t.
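For completeness, the following standard DDPM identities (from Ho et al. [2020]; they are implied by Eq. (1) but not written out in the text) give the closed-form marginal that makes training and sampling tractable:

```latex
% Standard DDPM identities implied by Eq. (1); notation follows Ho et al. [2020].
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad
q(X_t \mid X_0) = \mathcal{N}\!\left(X_t;\ \sqrt{\bar{\alpha}_t}\, X_0,\ (1-\bar{\alpha}_t)\, I \right).
```

With a schedule for which \bar{\alpha}_T ≈ 0, X_T is approximately standard Gaussian, which is why sampling can start from pure noise and denoise step by step.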
C represents the conditioning information for the denoising network. In this paper, we introduce two models with different levels of conditioning: a fully-conditional model and a partially-conditional model. For the fully-conditional model, the denoising network is conditioned upon the entire primitive speech representation R0, so that the diffusion model will generate speech that follows the content depicted in R0. For the partially-conditional model, the denoising network is still conditioned upon R0, but with a consecutive span of 80% of the frames masked out. In this case, the diffusion model will follow the content in R0 only where it is unmasked, and will try to generate novel content that fits the given context at the remaining frames. These two models are both crucial in generating synthetic data with different levels of variation.
Besides R0, both models are also conditioned on speaker labels I, which can be either one-hot vectors or speaker embeddings produced by a pre-trained speaker embedding network, depending on whether D0 comes with speaker labels. We will compare different conditioning settings in Section 4.

To convert the spectrograms into speech waveforms, we adopt a HiFi-GAN vocoder [Kong et al., 2020a], which is also trained only on the small dataset D0.
The synthetic speech generation uses the original speech dataset D0 as seeds. Specifically, we first draw a speech utterance from D0 as the seed speech, elicit its primitive speech representation R0 and speaker identity I, and then generate a synthetic utterance by feeding a modified version of these conditioning variables to the diffusion model synthesizer. When designing the modification schemes for the conditioning variables, we primarily consider the tradeoff between novelty and naturalness: if the generated speech is identical to the original utterance, we achieve maximum naturalness but introduce no new information into the dataset; if the generated speech is a complete babble, we introduce maximum novelty but may significantly compromise naturalness. Therefore, we introduce the following four levels of novelty, as shown in Figure 3:
• Original Speech (O): The seed speech is directly copied to the synthetic dataset without modification, as shown in Figure 3(a). No resynthesis is involved for this level.

• Same Speaker (SS): R0 and I are fed as is to the fully-conditional diffusion model. The resulting synthetic speech is almost the same as the seed speech. However, since R0 tends to obscure the pitch information, the synthetic speech will have a different intonation, as shown in Figure 3(b).

• Novel Speaker (NS): R0 is still fed as is to the fully-conditional diffusion model, but I is replaced with a different speaker ID. As a result, the synthetic speech still has the same content, but in a different voice and intonation, as shown in Figure 3(c).

• Novel Content (NC): We mask out a consecutive span of 80% of the frames in R0 and replace I before feeding them to the partially-conditional diffusion model. As shown in Figure 3(d), the synthetic speech is almost completely different from the seed speech in terms of content, speaker, and prosody, except for the content information in the 20% unmasked frames. The utterances are almost nonsensical babbles to human listeners. We are thus interested in seeing whether utterances at this high level of randomness can still contribute to SSL.
As we will show, speech at all four levels is beneficial for the subsequent speech pretraining and should therefore be included in Dsyn with appropriate ratios. We have included additional spectrograms in Appendix B, as well as some generated audio files in the supplemental materials.
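The modification schemes above can be summarized with the following minimal sketch. The synthesizer calls and the MASK token are hypothetical stand-ins for the fully- and partially-conditional diffusion models described earlier, not the actual interfaces of the implementation.

```python
import random

MASK = -1  # hypothetical id marking a masked frame of R0

def make_variant(level, wav, r0, speaker, all_speakers,
                 synthesize_full, synthesize_partial, mask_ratio=0.8, rng=random):
    """Produce one synthetic utterance at a given novelty level (illustrative sketch).

    level              -- one of "O", "SS", "NS", "NC"
    wav, r0, speaker   -- seed waveform, its primitive representation, its speaker id
    synthesize_full    -- callable (units, speaker) -> waveform, fully-conditional model
    synthesize_partial -- callable (units, speaker) -> waveform, partially-conditional model
    """
    if level == "O":                        # Original Speech: copy, no resynthesis
        return wav
    if level == "SS":                       # Same Speaker: resynthesize R0 and I as is
        return synthesize_full(r0, speaker)
    if level == "NS":                       # Novel Speaker: swap in another speaker id
        return synthesize_full(r0, rng.choice(all_speakers))
    if level == "NC":                       # Novel Content: mask a consecutive span of R0
        units = list(r0)
        span = int(mask_ratio * len(units))
        start = rng.randrange(len(units) - span + 1)
        units[start:start + span] = [MASK] * span
        return synthesize_partial(units, rng.choice(all_speakers))
    raise ValueError(f"unknown level: {level}")
```

In the dataset generation, the replacement speaker is drawn uniformly from all speakers in D0, so it may coincide with the original speaker, in which case NS reduces to SS.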
4 Experiments
In this section, we present our experimental results on training different SSL models with DiffS4L. Some additional experimental results are presented in the appendix.
Pretraining Dataset For the experiments in English, the methods to be evaluated are pretrained on the LibriSpeech-960 dataset [Panayotov et al., 2015]. We consider two settings, a low-resource setting and a high-resource setting. For the low-resource setting, the seed dataset D0 for Steps 1 and 2 contains only 100 hours of real speech from the train-clean-100 subset. The synthetic dataset Dsyn contains 1) 100 hours of real speech; 2) 430 hours of SS/NS speech, which is generated by replacing the speaker ID with one uniformly randomly chosen from all the speakers in D0 (and which can therefore be the same as the original speaker); and 3) 430 hours of NC speech. We deliberately make the total hours of speech in Dsyn equal to 960 so that we can compare to the common setting with 960 hours of real speech. In the following, we use the x + y + z notation to represent the hours of real speech (x), SS/NS speech (y), and NC speech (z), respectively. The above Dsyn composition is thus represented as 100+430+430. For the high-resource setting, D0 contains all 960 hours of real speech from LibriSpeech-960, and the dataset composition of Dsyn is 960+960+480. We will explore other dataset compositions in Section 4.4.
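For quick reference, the x + y + z compositions used in the two settings can be written down directly; the dictionary below is only a notational illustration, not a configuration format used by the code.

```python
# (real, SS/NS, NC) hours, following the paper's x + y + z notation.
COMPOSITIONS = {
    "low_resource":  (100, 430, 430),   # D0 = train-clean-100, total 960 h
    "high_resource": (960, 960, 480),   # D0 = full LibriSpeech-960
}

def total_hours(composition):
    real, ss_ns, nc = composition
    return real + ss_ns + nc

assert total_hours(COMPOSITIONS["low_resource"]) == 960
```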
Evaluation Tasks We consider two sets of tasks: automatic speech recognition (ASR) and the SUPERB benchmark [Yang et al., 2021]. For ASR, we use the 'base_10h' configuration file in fairseq for wav2vec 2.0 and HuBERT finetuning on a 10-hour limited supervision dataset. We follow the same finetuning procedure as in Baevski et al. [2020] and Hsu et al. [2021], where we add a linear projection layer on top and finetune with the CTC loss. For SUPERB, which is a collection of speech-processing tasks, we evaluate our models on KS (keyword spotting), IC (intent classification), SID (speaker identification), ER (emotion recognition), QbE (query-by-example spoken term detection), SF (slot filling), ASV (automatic speaker verification), and SD (speaker diarization). We did not include ASR and PR (phoneme recognition) because they overlap with the ASR task above.
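The finetuning recipe (pretrained encoder plus a linear projection trained with CTC) can be sketched in plain PyTorch as below. This is a simplified stand-in for the fairseq 'base_10h' configuration: the encoder is any pretrained wav2vec 2.0 / HuBERT-style module returning frame-level features, and the vocabulary size and length handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CTCFinetuner(nn.Module):
    """Pretrained SSL encoder + linear projection, trained with CTC (sketch only)."""

    def __init__(self, encoder, feat_dim=768, vocab_size=32):  # 32 = characters + blank (assumed)
        super().__init__()
        self.encoder = encoder                     # pretrained wav2vec 2.0 / HuBERT-style module
        self.proj = nn.Linear(feat_dim, vocab_size)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, wav, targets, frame_lens, target_lens):
        feats = self.encoder(wav)                  # (B, T, feat_dim) frame-level features
        log_probs = self.proj(feats).log_softmax(dim=-1)
        log_probs = log_probs.transpose(0, 1)      # CTC expects (T, B, vocab_size)
        return self.ctc(log_probs, targets, frame_lens, target_lens)
```

During finetuning, the character targets come from the 10-hour labeled subset, while the encoder weights are initialized from the (real or DiffS4L) pretraining run.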
Evaluation Models For both the high-resource (960 h real) and low-resource (100 h real) settings, we compare the following four models:

• wav2vec2-DiffS4L / HuBERT-DiffS4L: wav2vec 2.0 [Baevski et al., 2020] and HuBERT [Hsu et al., 2021] pretrained on the synthetic dataset produced by the proposed DiffS4L procedure;

• wav2vec2-Real / HuBERT-Real: wav2vec 2.0 and HuBERT pretrained on real speech only.
In addition, for the low-resource setting, we add three models for better comparison:

• wav2vec2-OneHot / HuBERT-OneHot: In wav2vec2-DiffS4L / HuBERT-DiffS4L, we use the pretrained GE2E speaker embedding [Wan et al., 2018]. To study whether this could leak information from additional real speech data, we replace it with a one-hot speaker embedding.

• wav2vec2-AUG: wav2vec 2.0 pretrained on the 100-hour real data augmented by adding reverberation and Gaussian noise and by modifying the pitch of the speech samples [Sriram et al., 2022].
Implementation Details The entire training pipeline is built on two existing code repositories: fairseq [Ott et al., 2019] and ProDiff [Huang et al., 2022b]. The code and configuration files are uploaded to an anonymous GitHub repository.1 We follow the same procedure as in Baevski et al. [2020] and Hsu et al. [2021] to pretrain all the wav2vec 2.0 and HuBERT models using fairseq. We use the base models of wav2vec 2.0 and HuBERT, which contain 12 Transformer layers and 95M parameters. For HuBERT, we adopt two rounds of training; the first round uses a k-means teacher of 500 clusters on the 80-bin mel-spectrogram, and the second round uses a k-means teacher of 500 clusters on the HuBERT features from the first round.
The speech synthesizer is based on the code of the ProDiff-TTS model implemented in ProDiff, which consists of a FastSpeech 2 encoder and a DDPM. We remove the Energy Predictor and Pitch Predictor in the FastSpeech 2 encoder, and replace the Duration Predictor, which aligns the text with the mel-spectrogram, with an upsampling network that resamples the HuBERT units from 50 Hz to 62.5 Hz to match the length of the mel-spectrogram. The DDPM models a 20-time-step forward and reverse Gaussian diffusion process on the mel-spectrogram, conditioned on the FastSpeech 2 encoder outputs. Both the fully- and partially-conditional diffusion models are trained for 200k iterations. To convert the mel-spectrogram into a speech waveform, we apply the HiFi-GAN vocoder [Kong et al., 2020a], which is trained on the same real dataset D0 for 1M iterations.
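The 50 Hz to 62.5 Hz length matching can be illustrated with a small resampling sketch. The paper replaces the Duration Predictor with an upsampling network; the linear interpolation below is only a stand-in that shows the required change of frame rate, not that network.

```python
import torch
import torch.nn.functional as F

def upsample_units(unit_embeddings, src_rate=50.0, tgt_rate=62.5):
    """Resample unit embeddings from 50 Hz to 62.5 Hz (illustrative only).

    unit_embeddings -- tensor of shape (B, T, D) at src_rate frames per second
    """
    t = unit_embeddings.size(1)
    tgt_len = int(round(t * tgt_rate / src_rate))
    x = unit_embeddings.transpose(1, 2)                          # (B, D, T) for interpolate
    x = F.interpolate(x, size=tgt_len, mode="linear", align_corners=False)
    return x.transpose(1, 2)                                     # (B, tgt_len, D)

# Example: 100 unit frames (2 s at 50 Hz) -> 125 mel frames (2 s at 62.5 Hz).
print(upsample_units(torch.randn(1, 100, 256)).shape)            # torch.Size([1, 125, 256])
```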
More implementation details are included in Appendix A.
Table 1 reports the character error rate (CER) and word error rate (WER) of the ASR task and the performance on the SUPERB tasks. There are three key observations. First, in both low-resource and high-resource scenarios, pretraining on DiffS4L-synthetic data consistently improves the performance on almost all downstream tasks, compared to pretraining on the real speech portion only. Second, the performance improvement is particularly significant in low-resource scenarios. This confirms that DiffS4L more thoroughly utilizes the information in the same real speech dataset that is otherwise overlooked by SSL models. Third, wav2vec 2.0-based systems perform slightly better on ASR tasks, whereas HuBERT-based systems do better on SUPERB tasks. Additional results in Appendices C and F further verify that the performance advantage is consistent with and without language models, across different sizes of the finetuning dataset, and with different choices of diffusion models.
Furthermore, models using one-hot speaker embeddings demonstrate performance similar to those using GE2E embeddings, confirming that the performance advantage of DiffS4L does not come from leakage of the GE2E pretraining dataset. To be fair, DiffS4L does need speaker information or labels, whereas the baseline pretraining methods do not. It is worth emphasizing, though, that DiffS4L only converts the speaker identity to speakers seen in the original real dataset; it does not introduce new speakers.
Finally, DiffS4L systems consistently outperform wav2vec2-AUG, suggesting that DiffS4L better captures the information and variations in the real speech than signal-processing-based augmentation.
1 https://anonymous.4open.science/r/DiffS4L-CE41
Table 1: Main results on (a) English automatic speech recognition and (b) the SUPERB benchmark. The bolded results show the best performance among all but the topline models.

Task/Metric         ASR-CER↓ ASR-WER↓ | KS-Acc↑ IC-Acc↑ SID-Acc↑ ER-Acc↑ QbE-MTWV↑ SF-F1↑ SF-CER↓ ASV-EER↓ SD-DER↓

High-Resource Setting (960-hour real speech)
wav2vec2-Real       3.18     10.49    | 96.23   92.35   66.20    60.55   0.0233    87.64  25.37   6.67     6.65
HuBERT-Real         3.03     10.30    | 96.30   98.26   66.27    60.74   0.0736    88.53  25.20   5.80     6.30
wav2vec2-DiffS4L    2.98     9.93     | 96.17   94.73   65.79    61.29   0.0630    88.50  24.71   6.60     6.63
HuBERT-DiffS4L      2.95     9.87     | 96.47   98.50   64.36    61.40   0.0766    88.93  24.03   5.78     6.26

Low-Resource Setting (100-hour real speech)
wav2vec2-Real       7.37     23.48    | 91.92   88.64   47.68    58.99   0.0311    81.31  37.06   8.78     8.45
HuBERT-Real         7.43     23.71    | 91.82   78.43   57.53    61.84   0.0419    78.87  40.69   8.91     8.53
wav2vec2-AUG        6.92     22.06    | 92.18   92.83   48.65    58.34   0.0377    81.99  36.39   8.37     8.84
wav2vec2-DiffS4L    5.19     16.67    | 93.57   91.01   45.41    59.86   0.0331    83.13  33.60   8.02     7.14
wav2vec2-OneHot     5.19     16.65    | 93.23   91.41   48.94    61.64   0.0364    83.00  34.64   8.14     7.28
HuBERT-DiffS4L      5.33     17.45    | 94.68   95.94   44.22    62.02   0.0469    84.61  32.68   7.42     7.09
HuBERT-OneHot       5.36     17.47    | 94.26   95.89   44.25    62.60   0.0445    83.98  32.33   7.64     7.44
Table 2: ASR results (CER/WER) on selected languages from MLS and CommonVoice. The languages are (from left to right, top to bottom) English, German, Spanish, French, Italian, Dutch, Polish, Portuguese, Bashkir, Central Kurdish, Welsh, Meadow Mari, Swahili, and Tamil.

Languages           EN        DE        ES        FR        IT        NL        PL
wav2vec-100R        7.4/23.5  8.3/30.4  7.1/27.2  16.2/45.5 8.3/35.1  17.8/50.9 11.4/44.2
wav2vec-DiffS4L     5.2/16.8  6.4/23.3  4.5/16.7  11.9/34.8 6.2/27.2  14.7/44.8 7.1/31.0

Languages           PO        BA        CKB       CY        MHR       SW        TA
wav2vec-100R        13.8/45.8 10.2/43.8 7.2/39.0  20.6/62.1 10.7/45.4 8.8/31.5  9.2/47.2
wav2vec-DiffS4L     8.9/37.1  8.9/37.1  6.7/29.7  16.7/52.3 9.4/37.5  7.0/25.9  7.5/41.0
To test whether the performance improvement of DiffS4L generalizes to other languages, we select all seven non-English languages from the Multilingual LibriSpeech (MLS) dataset [Pratap et al., 2020] and six languages from the CommonVoice dataset [Ardila et al., 2019]. The languages in the CommonVoice dataset are chosen based on the criterion that they have just over 100 hours of validated data in the dataset. For each language in MLS, we sample 100 hours from the training split for pretraining and use the limited supervision subset for finetuning; the provided dev and test splits are used for validation and testing. For each language in CommonVoice, we create a 100-hour split for pretraining and a 10-hour split for finetuning, and the provided dev and test splits are used for validation and testing, respectively. We only evaluate the wav2vec 2.0 systems due to the substantial time cost of pretraining and due to our observation that the relative improvements of wav2vec 2.0 and HuBERT are similar. Also, since most of these languages do not have 960 hours of data in the dataset, we cannot compute the topline results, so we show only the baseline and DiffS4L models.
Table 2 demonstrates a consistent performance advantage of DiffS4L across all the languages. In particular, DiffS4L reduces the CER by an average of 2.6 percentage points and the WER by an average of 8.3 percentage points, which is a significant improvement for ASR. Notice that these languages come from different language families and each has distinct phonetic, lexical, and syntactic structures, so these results show that the diffusion models can successfully capture the relevant structures in all these languages. Additional results in Appendix D show that the performance gain is consistent across different dataset partitions and compositions.
In the low-resource setting, the dataset composition is fixed to 100+430+430 (recall that the three numbers are the hours of real speech, SS/NS speech, and NC speech, respectively). To better understand the contribution of each component, we perform an ablation study where we change the dataset composition. To keep our computation tractable, we only perform experiments with wav2vec 2.0 and on the English ASR task in all the remaining ablation studies.

Figure 4: CER/WER across data compositions by varying the ratios of SS/NS and NC speech, ranging from 100+860+0 to 100+0+860.

Figure 5: Performance over different synthetic dataset sizes, 100+x+0, where x ranges from 0 to 1820.
In our first experiment, we fix the total hours of the dataset to 960 and the hours of real data to 100, but vary the ratio of SS/NS to NC speech from 100+860+0 to 100+0+860. The results are shown in Figure 4. There are two important observations. First, the performance curve exhibits a U-shape, with the lowest CER and WER achieved when the SS/NS and NC portions are of comparable size. This indicates that both the recombination of speaker information and the innovation of content play a crucial role in improving the performance of SSL models. In particular, note that NC data is essentially nonsensical babble reflecting the limited knowledge of phone transitions learned by the diffusion models from the small real dataset, and that one of the purposes of SSL models is also to learn the phone transition structures. The fact that the nonsensical babble can still help SSL performance implies that the existing SSL algorithms cannot effectively utilize all the phone transition information in the original real dataset.
Our second observation from Figure 4 is that, comparing the two extreme cases, the performance without the SS/NS data (the left endpoint) is worse than that without the NC data (the right endpoint). Recall that SS/NS data are generated conditional on the true content information and are therefore of high quality, whereas NC data generally sound messier and noisier. This observation may thus be ascribed to the quality differences in the synthetic data.
Now that we have verified the contribution of synthesizing novel content, we investigate the effect of synthesizing novel speaker combinations in the next experiment. In particular, we start with the standard dataset composition, i.e., 100+430+430, but do not replace the speaker in any of the synthesis types; hence there is no longer any NS data, and the NC data has reduced speaker variation. The result, shown in Table 4 (wav2vec-SS), shows a marked performance degradation (1.7 percentage points in CER and 5.0 percentage points in WER) compared to the standard dataset composition, which verifies that the novel speaker combinations are crucial to the performance.
Finally, to test the contribution of including the original dataset, we remove the real data and expand the synthetic data proportionally to 960 hours, i.e., 0+480+480. The result, shown in Table 4 (wav2vec-NoReal), shows an even larger performance degradation. In fact, we find that without the real data, the SSL training hardly converges. This shows that including the real data is essential for successful SSL training with synthetic data.
Since we have verified that synthetic data improves SSL training, a natural follow-up question is whether more synthetic data is always better. To answer this question, we fix the real data to 100 hours and the NC data to 0 hours but vary the hours of SS/NS data, i.e., 100+x+0, with x ranging from 0 to 1820. Figure 5 shows the corresponding wav2vec results on English ASR. As shown, the performance does not always improve as the amount of synthetic data increases. When the amount of synthetic data is small, increasing it can drastically improve performance. However, as the synthetic data continues to increase, the performance gradually saturates and then starts to degrade, with the optimal performance achieved at around 630 hours. Combining the previous results, we conclude that although adding synthetic data can inject new knowledge and variations, adding too much can dilute the contribution of the real data, which has been shown to be essential for the training, and hence will negatively impact the performance.

Table 3: English ASR performance of wav2vec pretrained on DiffS4L-generated data versus that on WaveNet-generated data.

Table 4: Performance over different masking ratios when synthesizing NC speech.
Recall that the NC data is generated by conditioning on R0 with 80% of the frames masked out, as shown in Figure 3(d). We would like to investigate whether the masking length has an impact on the performance. We thus retrain two partially-conditional diffusion models, one with a 50% masking length and the other with 100% (which becomes completely unconditional on content). We then generate two synthetic datasets, whose compositions are both 100+430+430, but whose NC data are generated with 50% and 100% masking lengths, respectively. The corresponding wav2vec English ASR results are shown in Figure 6. As shown, there are only slight differences in performance, with the optimum achieved by the 80% masking length. We conjecture that two factors influence the performance when changing the mask length. One is the amount of novel content, which increases as the masking length increases; the other is the quality of the generated speech, which tends to decrease as the masking length increases. Therefore, pushing the mask length to either extreme negatively impacts the performance.
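The variable-ratio masking used in this experiment is the same consecutive-span operation sketched in Section 3, just with the ratio exposed as a parameter; the helper below (hypothetical names, as before) makes the three settings explicit.

```python
import random

def mask_consecutive_span(units, mask_ratio, mask_id=-1, rng=random):
    """Mask one consecutive span covering mask_ratio of the frames in R0 (sketch)."""
    units = list(units)
    span = int(mask_ratio * len(units))
    if span == 0:
        return units
    start = rng.randrange(len(units) - span + 1)
    units[start:start + span] = [mask_id] * span
    return units

# The three settings compared here: 50%, 80% (default), and 100% masking.
for ratio in (0.5, 0.8, 1.0):
    masked = mask_consecutive_span(range(100), ratio)
    print(ratio, sum(u == -1 for u in masked), "of", len(masked), "frames masked")
```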
5 Conclusion
In this study, we examined SSL from an information efficiency perspective and found that performance can be greatly improved by more fully utilizing the information present in the pretraining dataset, particularly in low-resource settings. We discovered that synthetic data is an effective way to extract this information and enhance SSL performance. Specifically, diffusion models were found to be particularly capable of capturing complex structures in speech that traditional pretraining methods cannot; thus even synthetic babbles contain valuable information for SSL training. DiffS4L opens the door to a new approach to speech SSL. One limitation of DiffS4L is that it is time-consuming, as it involves training multiple networks sequentially. As a next step, we plan to investigate more efficient methods of information sharing between diffusion models and SSL models to reduce the need for synthetic data generation and prolonged pretraining.
References

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jin Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Gregory Frederick Diamos, Erich Elsen, Jesse Engel, Linxi (Jim) Fan, Christopher Fougner, Awni Y. Hannun, Billy Jun, Tony Xiao Han, Patrick LeGresley, Xiangang Li, Libby Lin, Sharan Narang, A. Ng, Sherjil Ozair, Ryan J. Prenger, Sheng Qian, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Anuroop Sriram, Chong-Jun Wang, Yi Wang, Zhiqian Wang, Bo Xiao, Yan Xie, Dani Yogatama, Junni Zhan, and Zhenyao Zhu. Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, 2015.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. ArXiv, abs/2006.11477, 2020.

Manel Baradad Jurjo, Jonas Wulff, Tongzhou Wang, Phillip Isola, and Antonio Torralba. Learning to see by looking at noise. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2556–2569. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/14f2ebeab937ca128186e7ba876faef9-Paper.pdf.

Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C Cobo, and Karen Simonyan. High fidelity speech synthesis with adversarial networks. In International Conference on Learning Representations, 2019.

Prithvijit Chattopadhyay, Kartik Sarangmath, Vivek Vijaykumar, and Judy Hoffman. Pasta: Proportional amplitude spectrum training augmentation for syn-to-real domain generalization. arXiv preprint arXiv:2212.00979, 2022.

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=NsMLjcFaO8O.

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, and William Chan. Wavegrad 2: Iterative refinement for text-to-speech synthesis. In Interspeech, 2021b.

Cesar Roberto De Souza, Adrien Gaidon, Yohann Cabon, and Antonio Manuel Lopez. Procedural generation of videos to train deep action recognition networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2594–2604. IEEE Computer Society, 2017.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780–8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf.

C.M. Downey, L. Liu, Xuhui Zhou, and Shane Steinert-Threlkeld. Learning to translate by learning to communicate. ArXiv, abs/2207.07025, 2022.

Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, and Emmanuel Dupoux. The zero resource speech challenge 2020: Discovering discrete subword and word units. In Interspeech, 2020.

Chuang Gan, Jeremy Schwartz, Seth Alter, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, Abhishek Bhandwaldar, Nick Haber, Megumi Sano, et al. Threedworld: A platform for interactive multi-modal physical simulation. In Annual Conference on Neural Information Processing Systems, 2021.
Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, and Kazuya Takeda. Back-translation-style data augmentation for end-to-end asr. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 426–433. IEEE, 2018.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.

Rongjie Huang, Max W. Y. Lam, J. Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. Fastdiff: A fast conditional diffusion model for high-quality speech synthesis. In International Joint Conference on Artificial Intelligence, 2022a.

Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. Prodiff: Progressive fast diffusion model for high-quality text-to-speech. Proceedings of the 30th ACM International Conference on Multimedia, 2022b.

Ali Jahanian, Xavier Puig, Yonglong Tian, and Phillip Isola. Generative models as a data source for multiview representation learning. In International Conference on Learning Representations, 2021.

Zengrui Jin, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, and Xunying Liu. Adversarial data augmentation using vae-gan for disordered speech recognition. arXiv preprint arXiv:2211.01646, 2022.

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.

Hirokatsu Kataoka, Ryo Hayamizu, Ryosuke Yamada, Kodai Nakashima, Sora Takashima, Xinyu Zhang, Edgar Josafat Martinez-Noriega, Nakamasa Inoue, and Rio Yokota. Replacing labeled real-image datasets with auto-generated contours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21232–21241, 2022.

Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. Data augmenting contrastive learning of speech representations in the time domain. 2021 IEEE Spoken Language Technology Workshop (SLT), pages 215–222, 2020.

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033, 2020a.

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. ArXiv, abs/2009.09761, 2020b.

Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, and Yi Xu. Articulatory Synthesis for Data Augmentation in Phoneme Recognition. In Proc. Interspeech 2022, pages 1228–1232, 2022. doi: 10.21437/Interspeech.2022-10874.

Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.
Max W. Y. Lam, Jun Wang, Dan Su, and Dong Yu. BDDM: bilateral denoising diffusion models for fast and high-quality speech synthesis. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=L7wzpQttNO.

Jason Li, Ravi Teja Gadde, Boris Ginsburg, and Vitaly Lavrukhin. Training neural speech recognition systems with synthetic speech augmentation. ArXiv, abs/1811.00707, 2018.

Kai Li, Sheng Li, Xugang Lu, Masato Akagi, Meng Liu, Lin Zhang, Chang Zeng, Longbiao Wang, Jianwu Dang, and Masashi Unoki. Data augmentation using mcadams-coefficient-based speaker anonymization for fake audio detection. In Proc. INTERSPEECH, 2022a.

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217, 2022b.

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.

Hiroaki Mikami, Kenji Fukumizu, Shogo Murai, Shuji Suzuki, Yuta Kikuchi, Taiji Suzuki, Shin-ichi Maeda, and Kohei Hayashi. A scaling law for synthetic-to-real transfer: How much is your pre-training effective? arXiv preprint arXiv:2108.11018, 2021.

Masato Mimura, Sei Ueno, Hirofumi Inaguma, Shinsuke Sakai, and Tatsuya Kawahara. Leveraging sequence-to-sequence speech synthesis for enhancing acoustic-to-word speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 477–484. IEEE, 2018.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin Dogus Cubuk, and Quoc V. Le. Specaugment: A simple data augmentation method for automatic speech recognition. ArXiv, abs/1904.08779, 2019.

Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning deep object detectors from 3d models. In Proceedings of the IEEE international conference on computer vision, pages 1278–1286, 2015.

Aayush Prakash, Shaad Boochoon, Mark Brophy, David Acuna, Eric Cameracci, Gavriel State, Omer Shapira, and Stan Birchfield. Structured domain randomization: Bridging the reality gap by context-aware synthetic data. In 2019 International Conference on Robotics and Automation (ICRA), pages 7249–7255. IEEE, 2019.

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: A large-scale multilingual dataset for speech research. ArXiv, abs/2012.03411, 2020.

Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark A. Hasegawa-Johnson, and Shiyu Chang. Contentvec: An improved self-supervised speech representation by disentangling speakers. In International Conference on Machine Learning, 2022.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. In International Conference on Learning Representations, 2020.

German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.
Nick Rossenbach, Albert Zeyer, Ralf Schlüter, and Hermann Ney. Generating synthetic audio data for attention-based speech recognition systems. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7069–7073. IEEE, 2020.

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019.

Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4779–4783. IEEE, 2018.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP.

Anuroop Sriram, Michael Auli, and Alexei Baevski. Wav2vec-aug: Improved self-supervised training with limited data. ArXiv, abs/2206.13654, 2022.

Shane Steinert-Threlkeld, Xuhui Zhou, Zeyu Liu, and CM Downey. Emergent communication fine-tuning (ec-ft) for pretrained language models. In Emergent Communication Workshop at ICLR 2022, 2022.

Min-Chun Tsai and Sheng-De Wang. Self-supervised image anomaly detection and localization with synthetic anomalies. Available at SSRN 4264542, 2022.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, pages 125–125, 2016.

Gül Varol, Ivan Laptev, Cordelia Schmid, and Andrew Zisserman. Synthetic humans for action recognition from unseen viewpoints. Int. J. Comput. Vision, 129(7):2264–2287, jul 2021. ISSN 0920-5691. doi: 10.1007/s11263-021-01467-7. URL https://doi.org/10.1007/s11263-021-01467-7.

Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, and Tomoki Toda. Intermediate fine-tuning using imperfect synthetic speech for improving electrolaryngeal speech recognition. arXiv preprint arXiv:2211.01079, 2022.

Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883. IEEE, 2018.

Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning, pages 5180–5189. PMLR, 2018.

Zhonghao Wang, Mo Yu, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-mei Hwu, Thomas S Huang, and Honghui Shi. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12635–12644, 2020.

Yawen Wu, Zhepeng Wang, Dewen Zeng, Yiyu Shi, and Jingtong Hu. Synthetic data can also teach: Synthesizing effective data for unsupervised visual representation learning, 2022. URL https://arxiv.org/abs/2202.06464.

Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9068–9079, 2018.
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. SUPERB: Speech Processing Universal PERformance Benchmark. In Proc. Interspeech 2021, pages 1194–1198, 2021. doi: 10.21437/Interspeech.2021-1775.

Shunyu Yao, Mo Yu, Yang Zhang, Karthik R Narasimhan, Joshua B. Tenenbaum, and Chuang Gan. Linking emergent and natural languages via corpus transfer. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=49A1Y6tRhaq.

Rodolfo Zevallos, Nuria Bel, Guillermo Cámbara, Mireia Farrús, and Jordi Luque. Data augmentation for low-resource quechua asr improvement. arXiv preprint arXiv:2207.06872, 2022.

Jinming Zhao, Gholamreza Haffari, and Ehsan Shareghi. Generating synthetic speech from spokenvocab for speech translation. arXiv preprint arXiv:2210.08174, 2022.

Xianrui Zheng, Yulan Liu, Deniz Gunceler, and Daniel Willett. Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end asr systems. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5674–5678. IEEE, 2021.