BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting
Mohammad Jahid Ibna Basher1, Md Kowsher2, Md Saiful Islam1, Rabindra Nath Nandi1,
Nusrat Jahan Prottasha2, Mehadi Hasan Menon1, Tareq Al Muntasir1,
Shammur Absar Chowdhury3, Firoj Alam3, Niloofar Yousefi2, Ozlem Ozmen Garibay2
1 Hishab Singapore Pte. Ltd, Singapore
2 University of Central Florida, USA
3 Qatar Computing Research Institute, Qatar
Abstract

This paper introduces BnTTS (Bangla Text-To-Speech), the first framework for Bangla speaker adaptation-based TTS, designed to bridge the gap in Bangla speech synthesis using minimal training data. Building upon the XTTS architecture, our approach integrates Bangla into a multilingual TTS pipeline, with modifications to account for the phonetic and linguistic characteristics of the language. We pretrain BnTTS on 3.85k hours of Bangla speech data with corresponding text labels and evaluate performance in both zero-shot and few-shot settings on our proposed test dataset. Empirical evaluations in few-shot settings show that BnTTS significantly improves the naturalness, intelligibility, and speaker fidelity of synthesized Bangla speech. Compared to state-of-the-art Bangla TTS systems, BnTTS exhibits superior performance in Subjective Mean Opinion Score (SMOS), Naturalness, and Clarity metrics.

1 Introduction

Speaker adaptation in Text-to-Speech (TTS) technology has seen substantial advancements in recent years, particularly with speaker-adaptive models enhancing the naturalness and intelligibility of synthesized speech (Eren and Demiroglu, 2023). Notably, recent innovations have emphasized zero-shot and one-shot adaptation approaches (Kodirov et al., 2015). Zero-shot TTS models eliminate the need for speaker-specific training by generating speech from unseen speakers using reference audio samples (Min et al., 2021). Despite this progress, zero-shot models often require large datasets and face challenges with out-of-distribution (OOD) voices, as they struggle to adapt effectively to novel speaker traits (Le et al., 2023; Ju et al., 2024). Alternatively, one-shot adaptation fine-tunes pre-trained models using a single data instance, offering improved adaptability with reduced data and computational demands (Yan et al., 2021; Wang et al., 2023); however, the pretraining stage still necessitates substantial datasets (Zhang et al., 2021).

Recent works such as YourTTS (Bai et al., 2022) and VALL-E X (Xu et al., 2022) have made strides in cross-lingual zero-shot TTS, with YourTTS exploring English, French, and Portuguese, and VALL-E X incorporating language identification to extend support to a broader range of languages (Xu et al., 2022). These advancements highlight the potential for multilingual TTS systems to achieve cross-lingual speech synthesis. Furthermore, the XTTS model (Casanova et al., 2024) represents a significant leap by expanding zero-shot TTS capabilities across 16 languages. Based on the Tortoise model (Casanova et al., 2024), XTTS enhances voice-cloning accuracy and naturalness, but it remains focused on high- and medium-resource languages, leaving low-resource languages such as Bangla underserved (Zhang et al., 2022; Xu et al., 2023).

The scarcity of extensive datasets has hindered the adaptation of state-of-the-art (SOTA) TTS models for low-resource languages. Models like YourTTS (Bai et al., 2022), VALL-E X (Baevski et al., 2022a), and Voicebox (Baevski et al., 2022b) have demonstrated success in multilingual settings, yet their primary focus remains on resource-rich languages such as English, Spanish, French, and Chinese. While a few Bangla TTS systems exist (Gutkin et al., 2016), they often produce robotic tones (Hossain et al., 2018) or are limited to a small set of static speakers (Gong et al., 2024), lacking instant speaker adaptation capabilities and typically not being open-source.

To address these challenges, we propose the first framework for few-shot speaker adaptation in Bangla TTS. Our approach integrates Bangla into the XTTS training pipeline, with architectural modifications to accommodate Bangla's unique phonetic and linguistic features. Our model is optimized for effective few-shot voice cloning, addressing the needs of low-resource language settings.
Our contributions are summarized as follows: (i) we present the first speaker-adapted Bangla TTS system; (ii) we integrate Bangla into a multilingual XTTS pipeline, optimizing the framework to accommodate the unique challenges of low-resource languages; (iii) we make the developed BnTTSTextEval evaluation dataset public.

2 BnTTS

Figure 1: Overview of BnTTS Model.

Preliminaries: Given a text sequence with N tokens, T = {t_1, t_2, ..., t_N}, and a speaker's mel-spectrogram S = {s_1, s_2, ..., s_L}, the objective is to generate speech Ŷ that matches the speaker's characteristics. The ground-truth mel-spectrogram frames for the target speech are denoted as Y = {y_1, y_2, ..., y_M}. The synthesis process can be described as:

Ŷ = F(S, T)

where F produces speech conditioned on both the text and the speaker's spectrogram.

Audio Encoder: A Vector Quantized-Variational AutoEncoder (VQ-VAE) (Betker, 2023) encodes the mel-spectrogram frames Y into M discrete tokens drawn from a codebook C. An embedding layer then transforms these tokens into d-dimensional vectors, yielding Y_e ∈ R^{M×d}.

Conditioning Encoder & Perceiver Resampler: The Conditioning Encoder (Casanova et al., 2024) consists of l layers of k-head Scaled Dot-Product Attention, followed by a Perceiver Resampler. The speaker spectrogram S is transformed into an intermediate representation S_z ∈ R^{L×d}, where each attention layer applies a scaled dot-product attention mechanism. The Perceiver Resampler then maps this variable-length input (length L) to a fixed-size output R ∈ R^{P×d}.

Text Encoder: The text tokens T = {t_1, t_2, ..., t_N} are projected into a continuous embedding space, yielding T_e ∈ R^{N×d}.

Large Language Model (LLM): The transformer-based LLM (Radford et al., 2019) uses only the decoder portion. Speaker embeddings S_p, text embeddings T_e, and ground-truth spectrogram embeddings Y_e are concatenated to form the input:

X = S_p ⊕ T_e ⊕ Y_e ∈ R^{(N+P+M)×d}

The LLM processes X, producing an output H with hidden states for the speaker, text, and spectrogram embeddings. During inference, only the text and speaker embeddings are concatenated, and the model generates the spectrogram embeddings {h^Y_1, h^Y_2, ..., h^Y_P} as output.

HiFi-GAN Decoder: The HiFi-GAN Decoder (Kong et al., 2020) converts the LLM's output into realistic speech, preserving the speaker's characteristics. Specifically, it takes the LLM's speech-head output H^Y = {h^Y_1, h^Y_2, ..., h^Y_P}. The speaker embedding S is resized to match H^Y, resulting in S′ ∈ R^{P×d}. The final audio waveform W is then generated by:

W = g_HiFi(H^Y + S′)

Thus, the HiFi-GAN decoder produces speech that reflects the input text while maintaining the speaker's unique qualities.
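For illustration, the following is a minimal PyTorch-style sketch of the conditioning and decoding path described above. It is not the BnTTS/XTTS implementation: the module names (ConditioningEncoder, gpt_like, speech_head), the stand-in transformer stack, and all shapes are illustrative assumptions; only the overall flow (speaker prompt, Perceiver-style resampling, concatenation with text and audio embeddings, and autoregressive prediction of audio tokens) follows the description in this section.

```python
# Minimal sketch (not the official implementation): how the speaker prompt,
# text tokens, and audio tokens are combined before autoregressive decoding.
import torch
import torch.nn as nn

d_model = 1024          # hidden size kept from GPT-2 (see Section 3)
P = 32                  # fixed number of speaker latents after the Perceiver Resampler
vocab_text, vocab_audio = 256, 1024   # illustrative vocabulary / codebook sizes

class ConditioningEncoder(nn.Module):
    """Attention layers followed by a Perceiver-style resampler (hypothetical)."""
    def __init__(self, n_mels=80, d=d_model, n_layers=6, n_heads=32, n_latents=P):
        super().__init__()
        self.proj = nn.Linear(n_mels, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, n_layers)
        self.latents = nn.Parameter(torch.randn(n_latents, d))
        self.resample = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, mel):                      # mel: (B, L, n_mels)
        s_z = self.attn(self.proj(mel))          # (B, L, d): intermediate representation S_z
        q = self.latents.unsqueeze(0).expand(mel.size(0), -1, -1)
        s_p, _ = self.resample(q, s_z, s_z)      # (B, P, d): fixed-length speaker prompt
        return s_p

text_emb = nn.Embedding(vocab_text, d_model)
audio_emb = nn.Embedding(vocab_audio, d_model)   # embeds VQ-VAE audio tokens
cond_enc = ConditioningEncoder()
gpt_like = nn.TransformerEncoder(                # stand-in for the GPT-2 decoder stack
    nn.TransformerEncoderLayer(d_model, 16, dim_feedforward=3072, batch_first=True), 2)
speech_head = nn.Linear(d_model, vocab_audio)    # predicts the next audio token

def training_step(mel_prompt, text_ids, audio_ids):
    s_p = cond_enc(mel_prompt)                               # (B, P, d)
    x = torch.cat([s_p, text_emb(text_ids), audio_emb(audio_ids)], dim=1)
    h = gpt_like(x)                                          # causal mask omitted for brevity
    logits = speech_head(h[:, -audio_ids.size(1):])          # only the audio positions
    return logits                                            # trained with cross-entropy (Appendix D)

# toy shapes: 1 utterance, 120 mel frames, 20 text tokens, 50 audio tokens
logits = training_step(torch.randn(1, 120, 80),
                       torch.randint(0, vocab_text, (1, 20)),
                       torch.randint(0, vocab_audio, (1, 50)))
print(logits.shape)   # torch.Size([1, 50, 1024])
```

At inference time, only the speaker and text positions are fed in, and the audio tokens predicted by the speech head are passed to the HiFi-GAN decoder as described above.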
3 Experiments

BnTTS model: BnTTS employs the pretrained XTTS checkpoint (Casanova et al., 2024) as its base model, chosen for resource efficiency. The Conditioning Encoder has six attention blocks with 32 heads, capturing contextual information. The Perceiver Resampler reduces the sequence to a fixed length of 32. The model maintains GPT-2's dimensionality, with a hidden size of 1024 and an intermediate layer size of 3072, handling sequences of up to 400 tokens (details in Appendix D).
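For reference, the hyperparameters listed above can be collected into a small configuration object; the class and field names below are illustrative and do not correspond to actual XTTS configuration keys.

```python
# Hypothetical configuration mirroring the hyperparameters reported in Section 3.
from dataclasses import dataclass

@dataclass
class BnTTSConfig:
    cond_encoder_blocks: int = 6      # attention blocks in the Conditioning Encoder
    cond_encoder_heads: int = 32      # attention heads per block
    perceiver_latents: int = 32       # fixed speaker-prompt length after resampling
    hidden_size: int = 1024           # GPT-2 hidden dimensionality
    intermediate_size: int = 3072     # GPT-2 feed-forward size
    max_sequence_tokens: int = 400    # maximum combined sequence length

config = BnTTSConfig()
print(config)
```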
Dataset: We continually pretrained the BnTTS model (initialized from the XTTS checkpoint) on 3.85k hours of Bengali speech data, sourced from open-source datasets, pseudo-labeled data, and synthetic datasets. The pseudo-labeled data were collected using an in-house automated TTS Data Acquisition Framework, which segments speech into 0.5- to 11-second chunks with time-aligned transcripts. These segments were further refined using neural speech models and custom algorithms to enhance quality and accuracy. For speaker adaptation, we incorporated 4.22 hours of high-quality studio recordings from four speakers, referred to as In-House HQ Data.

For evaluation, we propose two datasets: (1) BnStudioEval, derived from our In-House HQ Data, to assess high-fidelity speech generation and speaker adaptation, and (2) BnTTSTextEval, a text-only dataset consisting of three subsets: BengaliStimuli53 (assessing phonetic diversity), BengaliNamedEntity1000 (evaluating named-entity pronunciation), and ShortText200 (measuring conversational fluency in short sentences, filler words, and common phrases used in everyday dialogue). Further details are provided in Appendices A, B, and C.

Training Setup: We initialized the BnTTS model from the XTTS checkpoint and performed continual pretraining using the AdamW optimizer with betas of 0.9 and 0.96, a weight decay of 0.01, and an initial learning rate of 2e-05. The batch size was 12, with gradient accumulation over 24 steps per GPU, and a learning-rate decay factor of 0.66 was applied using MultiStepLR. All experiments were run on a single NVIDIA A100 GPU with 80 GB of VRAM. The pretraining process consists of two stages:

a) Partial Audio Prompting: In this stage, a random segment of the ground-truth audio is used as the speaker prompt. Training in this phase lasted for 5 epochs.

b) Complete Audio Prompting: Here, the full duration of the audio is used as the speaker prompt. This stage continues from the checkpoint and optimizer state of the first phase and lasts for 1 epoch.

Additionally, the HiFi-GAN vocoder was fine-tuned separately using GPT-2 embeddings derived from the model in stage b. The vocoder was fine-tuned for three days to ensure optimal performance. The audio encoder and speaker encoder remain frozen across all experiments.
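The optimization settings above translate into a short PyTorch setup, sketched below. The model, dataloader, and milestone epochs are placeholders; the paper does not specify the MultiStepLR milestones, so those values are assumptions.

```python
# Sketch of the reported optimization settings (AdamW + MultiStepLR + gradient accumulation).
import torch

model = torch.nn.Linear(10, 10)            # placeholder for the BnTTS GPT module
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5,
                              betas=(0.9, 0.96), weight_decay=0.01)
# gamma=0.66 multiplies the learning rate at each milestone; the milestones are illustrative.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2, 4], gamma=0.66)

ACCUM_STEPS = 24                            # gradient accumulation per GPU (batch size 12)

def train_one_epoch(dataloader):
    optimizer.zero_grad()
    for step, (x, y) in enumerate(dataloader):
        loss = torch.nn.functional.mse_loss(model(x), y)   # stand-in for the LM loss
        (loss / ACCUM_STEPS).backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()                        # decay applied per epoch at the milestones
```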
Few-shot Speaker Adaptation: For few-shot speaker adaptation, we fine-tuned the BnTTS model using our In-House HQ dataset, which comprises studio recordings from four speakers. We randomly selected 20 minutes of audio for each speaker and fine-tuned the model in a multi-speaker setting for 10 epochs. This fine-tuning approach is most meaningful with an XTTS-like architecture pretrained on large-scale datasets. The evaluation results are presented in Section 4.
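A minimal sketch of how such an adaptation subset could be assembled (roughly 20 minutes per speaker across four speakers); the directory layout, utterance durations, and helper function are hypothetical.

```python
# Hypothetical selection of a 20-minute-per-speaker adaptation subset.
import random

TARGET_SECONDS = 20 * 60   # 20 minutes per speaker

def select_adaptation_subset(utterances_by_speaker, seed=0):
    """utterances_by_speaker: {speaker_id: [(wav_path, duration_seconds), ...]}"""
    rng = random.Random(seed)
    subset = {}
    for speaker, utts in utterances_by_speaker.items():
        pool = list(utts)
        rng.shuffle(pool)
        chosen, total = [], 0.0
        for path, dur in pool:
            if total >= TARGET_SECONDS:
                break
            chosen.append(path)
            total += dur
        subset[speaker] = chosen
    return subset

# Example with fabricated 6-second utterances for four speakers:
toy = {f"spk{i}": [(f"spk{i}/utt{j}.wav", 6.0) for j in range(400)] for i in range(4)}
print({k: len(v) for k, v in select_adaptation_subset(toy).items()})  # ~200 clips each
```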
Evaluation Metric: We evaluate the BnTTS system using six criteria. The Subjective Mean Opinion Score (SMOS) (Streijl et al., 2016), together with its Naturalness and Clarity components, evaluates perceived audio quality; the ASR-based Character Error Rate (CER) (Nandi et al., 2023) measures transcription accuracy; SpeechBERTScore assesses similarity to reference speech; and Speaker Encoder Cosine Similarity (SECS) evaluates speaker-identity fidelity (Saeki et al., 2024; Casanova et al., 2021; Thienpondt and Demuynck, 2024). See Appendix E for details.

4 Results

We evaluated the pretrained BnTTS (BnTTS-0) and speaker-adapted BnTTS (BnTTS-n) alongside IndicTTS (Kumar et al., 2023) and two commercial systems: Google Cloud TTS (GTTS) and Azure TTS (AzureTTS). The evaluation was conducted on both the BnStudioEval and BnTTSTextEval datasets. For a time-efficient subjective evaluation, we randomly selected 200 sentences from the BengaliNamedEntity1000 subset, which originally contains 1000 samples, maintaining a comprehensive assessment while reducing evaluation overhead.

Reference-aware Evaluation: Table 1 shows the performance of various TTS systems on the BnStudioEval dataset. GTTS outperforms other methods in the CER metric, even surpassing the Ground Truth (GT) in transcription accuracy. As for the subjective measures, the proposed BnTTS-n closely follows the GT, with competitive scores in SMOS (4.624 vs 4.809), Naturalness (4.600 vs 4.798), and Clarity (4.869 vs 4.913). Meanwhile, BnTTS-0 achieves SMOS, Naturalness, and Clarity scores of 4.456, 4.447, and 4.577, respectively. IndicTTS, AzureTTS, and GTTS perform poorly in the subjective metrics.

In the speaker similarity evaluation, GT attains a perfect SECS (reference) score and a high SECS (prompt) score. BnTTS-n outperforms BnTTS-0 in both SECS (reference) (0.548 vs 0.529) and SECS (prompt) (0.586 vs 0.576). Additionally, BnTTS-n achieves a SpeechBERTScore of 0.791, slightly higher than BnTTS-0 at 0.789, while GT retains a perfect score of 1.0. IndicTTS, GTTS, and AzureTTS do not support speaker adaptation, so SECS and SpeechBERTScore were not evaluated for these systems.

Method          | GT    | IndicTTS | GTTS  | AzureTTS | BnTTS-0 | BnTTS-n
CER             | 0.030 | 0.058    | 0.020 | 0.021    | 0.052   | 0.034
SMOS            | 4.809 | 3.475    | 4.017 | 4.154    | 4.456   | 4.624
Naturalness     | 4.798 | 3.406    | 3.949 | 4.100    | 4.447   | 4.600
Clarity         | 4.913 | 4.160    | 4.700 | 4.686    | 4.577   | 4.869
SECS (Ref.)     | 1.0   | -        | -     | -        | 0.529   | 0.548
SECS (Prompt)   | 0.641 | -        | -     | -        | 0.576   | 0.586
SpeechBERTScore | 1.0   | -        | -     | -        | 0.789   | 0.791

Table 1: Comparative average performance on the reference-aware BnStudioEval dataset. SECS and SpeechBERTScore are not reported for IndicTTS, GTTS, and AzureTTS, as these systems do not support speaker adaptation.

Reference-independent Evaluation: Table 2 presents the comparative performance of various TTS systems evaluated on the BnTTSTextEval dataset. AzureTTS and GTTS consistently achieve lower CER scores, with BnTTS-n and BnTTS-0 following closely in third and fourth place, respectively, and IndicTTS trailing behind. BnTTS-n performs strongly in the subjective evaluations, excelling in SMOS, Naturalness, and Clarity scores across the BengaliStimuli53, BengaliNamedEntity1000, and ShortText200 subsets. Overall, BnTTS-n achieves the highest scores in SMOS (4.601), Naturalness (4.578), and Clarity (4.832). Meanwhile, AzureTTS performs competitively, surpassing other commercial and open-source models and achieving scores comparable to BnTTS-0.

Dataset                      | Method   | CER   | SMOS  | Naturalness | Clarity
BengaliStimuli53             | IndicTTS | 0.110 | 3.445 | 3.403       | 3.857
                             | GTTS     | 0.063 | 4.006 | 3.937       | 4.688
                             | AzureTTS | 0.060 | 4.108 | 4.064       | 4.542
                             | BnTTS-0  | 0.092 | 4.622 | 4.613       | 4.719
                             | BnTTS-n  | 0.086 | 4.654 | 4.634       | 4.854
BengaliNamedEntity1000 (200) | IndicTTS | 0.049 | 3.527 | 3.462       | 4.179
                             | GTTS     | 0.037 | 4.037 | 3.969       | 4.712
                             | AzureTTS | 0.032 | 4.182 | 4.135       | 4.654
                             | BnTTS-0  | 0.043 | 4.585 | 4.613       | 4.698
                             | BnTTS-n  | 0.040 | 4.635 | 4.614       | 4.841
ShortText200                 | IndicTTS | 0.204 | 3.233 | 3.325       | 3.893
                             | GTTS     | 0.043 | 4.058 | 3.993       | 4.705
                             | AzureTTS | 0.050 | 4.294 | 4.256       | 4.675
                             | BnTTS-0  | 0.116 | 4.297 | 4.271       | 4.556
                             | BnTTS-n  | 0.092 | 4.554 | 4.528       | 4.816
Overall                      | IndicTTS | 0.125 | 3.388 | 3.325       | 4.017
                             | GTTS     | 0.049 | 4.042 | 3.976       | 4.706
                             | AzureTTS | 0.045 | 4.223 | 4.180       | 4.650
                             | BnTTS-0  | 0.081 | 4.463 | 4.445       | 4.639
                             | BnTTS-n  | 0.069 | 4.601 | 4.578       | 4.832

Table 2: Comparative average performance analysis on the reference-independent BnTTSTextEval dataset.

Zero-shot vs. Few-shot BnTTS: BnTTS-0 consistently falls short of BnTTS-n across all metrics in both reference-aware and reference-independent evaluations. The BnTTS-n model produces more natural and intelligible speech with high speaker fidelity, leading to improved SMOS, CER, and SECS scores. This performance gap is particularly evident in the ShortText200 dataset, which assesses conversational fluency in short, everyday phrases. The results affirm that fine-tuning can significantly improve the XTTS-based model for generating natural, fluent, and speaker-adapted speech.

High CER in Text Generation: Both BnTTS models exhibited higher CER than AzureTTS and GTTS on both the BnStudioEval and BnTTSTextEval datasets. AzureTTS and GTTS also achieved a lower CER score than the GT. BnTTS generates speech with more conversational prosody and expressiveness, which, while improving perceived quality, may negatively impact CER. The ASR systems used for CER evaluation are often better suited to transcribing standardized speech patterns, as produced by AzureTTS and GTTS. The consistent loudness and simplified prosody of these systems create clearer phonetic boundaries, making them more easily transcribed by the ASR model (Choi et al., 2022; Wagner et al., 2019).

Effect of Sampling and Prompt Length on Short Speech Generation: The generation of short audio sequences presents challenges for the BnTTS models, particularly for texts containing fewer than 30 characters when using the default generation settings (temperature T = 0.85 and TopK = 50). The issues observed are twofold: (1) the generated speech often lacks intelligibility, and (2) the output speech tends to be longer than expected. To investigate this, we extracted a subset of 23 short text-speech pairs from the BnStudioEval dataset, which we call the ShortBnStudioEval dataset. For evaluation, we use the CER metric to assess intelligibility and DurationEquality (Appendix E) to quantify duration discrepancies in the BnTTS-n model.

Exp. | T and TopK      | Short Prompt | Duration Equality | CER
1    | T=0.85, TopK=50 | N            | 0.699             | 0.081
2    | T=0.85, TopK=50 | Y            | 0.820             | 0.029
3    | T=1.0, TopK=2   | N            | 0.701             | 0.023
4    | T=1.0, TopK=2   | Y            | 0.827             | 0.015

Table 3: Impact of prompt duration, temperature (T), and Top-K on BnTTS-n performance on the ShortBnStudioEval dataset.

Under the default settings (Exp. 1 in Table 3), the model achieves a CER of 0.081 and a DurationEquality score of 0.699. We hypothesize that this issue stems from its training process.
During training, the model becomes accustomed to short audio prompts for short sequences. By aligning inference with this training strategy and using short prompts, the generation performance improves substantially, as evidenced by a higher DurationEquality score of 0.820 and a lower CER of 0.029 (Exp. 2). Further, by adjusting the temperature to T = 1.0 and reducing the top-K value to 2, we observed an improvement in the DurationEquality score from 0.699 to 0.701, accompanied by a substantial reduction in CER from 0.081 to 0.023 (Exp. 3). Combining the short prompt with the adjusted temperature and top-K values yielded the best results. In this configuration, the DurationEquality score improved to 0.827, with a CER of 0.015, demonstrating that both factors are crucial for accurate short speech generation.
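The settings explored in Table 3 can be expressed as a small inference-time configuration, sketched below; the synthesize wrapper, the model.generate call, and the 3-second prompt truncation are hypothetical stand-ins rather than an actual released API.

```python
# Hypothetical inference settings contrasting the default and the short-text configuration
# that performed best in Table 3 (short speaker prompt, T=1.0, top-k=2).
DEFAULT_SAMPLING = {"temperature": 0.85, "top_k": 50, "short_prompt": False}
SHORT_TEXT_SAMPLING = {"temperature": 1.0, "top_k": 2, "short_prompt": True}

def pick_sampling_config(text: str, short_threshold: int = 30) -> dict:
    """Use the short-text configuration for inputs under roughly 30 characters."""
    return SHORT_TEXT_SAMPLING if len(text) < short_threshold else DEFAULT_SAMPLING

def synthesize(model, text, speaker_wav, sample_rate=16000):
    cfg = pick_sampling_config(text)
    # Shorten the speaker prompt for short texts; the 3-second cut is illustrative only.
    prompt = speaker_wav[: 3 * sample_rate] if cfg["short_prompt"] else speaker_wav
    return model.generate(text=text, speaker_prompt=prompt,      # hypothetical method
                          temperature=cfg["temperature"], top_k=cfg["top_k"])
```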
5 Related Works

The development of Bangla TTS technology presents unique challenges due to the language's rich morphology and phonetic diversity. The first Bangla TTS system, Katha (Alam et al., 2007), was developed using diphone concatenation within the Festival framework. However, this approach struggled with natural prosody and efficient runtime. Later advancements, such as Subhachan (Naser et al., 2010), aimed to improve these aspects but still faced similar limitations. The introduction of LSTM-based models (Gutkin et al., 2016) showed promising results in Bangla speech synthesis. Beyond Bangla-specific TTS, broader efforts on Indian-language synthesis have contributed to Indic-TTS systems. Prakash and Murthy (2020) employed Tacotron2 for text-to-mel-spectrogram conversion and WaveGlow as a vocoder. Another study (Kumar et al., 2023) demonstrated that monolingual models utilizing FastPitch and HiFi-GAN V1, trained on both male and female voices, outperformed previous approaches. However, these works supported a limited number of speakers and lacked speaker adaptability. To address this gap, we explore the LLM-based XTTS model for Bangla, developing the first Bangla TTS system designed for low-resource speaker adaptation.

6 Conclusion

In this work, we introduced BnTTS, the first speaker-adaptive TTS system for Bangla, capable of generating natural and clear speech with minimal training data. Built on the XTTS pipeline, BnTTS effectively supports zero-shot and few-shot speaker adaptation, outperforming existing Bangla TTS systems in sound quality, naturalness, and clarity. Despite its strengths, BnTTS faces challenges in handling diverse dialects and short-sequence generation. Future work will focus on training BnTTS from scratch, developing medium and small model variants, and exploring knowledge distillation to optimize inference speed for real-time applications.

7 Limitations

Despite the strong performance of BnTTS, the system has several limitations. It struggles to adapt to speakers with unique vocal traits, especially without prior training on their voices, limiting its effectiveness in speaker adaptation tasks. We found poor performance on short text due to pre-existing issues in the XTTS foundation model. Although we improved performance by modifying generation settings and incorporating additional training with Complete Audio Prompting, the model still fails to generate sequences under two words or 20 characters in some cases. We did not investigate the performance of the XTTS model when trained from scratch; instead, we used continual pretraining due to resource constraints, although training from scratch may have yielded better results.

8 Acknowledgments

We are grateful to HISHAB (https://www.verbex.ai/) for providing us with all the necessary working facilities, computational resources, and an appropriate environment throughout our entire work.

9 Ethical Considerations

The development of BnTTS raises ethical concerns, particularly regarding the potential misuse for unauthorized voice impersonation, which could impact privacy and consent. Protections, such as requiring speaker approval and embedding markers in synthetic speech, are essential. Diverse training data is also crucial to reduce bias and reflect Bangla's dialectal variety. Additionally, synthesized voices risk diminishing dialectal diversity. As an open-source tool, BnTTS requires clear guidelines for responsible use, ensuring adherence to ethical standards and a positive community impact.
References

Firoj Alam, Promila Kanti Nath, and Mumit Khan. 2007. Text to speech for Bangla language using Festival. Technical report, BRAC University.

A. Baevski et al. 2022a. VALL-E: A generative neural audio codec for zero-shot TTS. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 234-245. Association for Computational Linguistics.

A. Baevski et al. 2022b. Voicebox: A generalist neural speech synthesizer. In Proceedings of the 2022 Conference on Neural Information Processing Systems, pages 3001-3011. NeurIPS.

Y. Bai et al. 2022. YourTTS: Towards zero-shot multilingual text-to-speech. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 123-132. Association for Computational Linguistics.

James Betker. 2023. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243.

Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Jr., Anderson da Silva Soares, Sandra Maria Aluisio, and Moacir Antonelli Ponti. 2021. SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model. In Proc. Interspeech 2021, pages 3645-3649.

P. Casanova et al. 2024. XTTS: Extending zero-shot TTS to multilingual domains. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 300-310. Association for Computational Linguistics.

Yeunju Choi, Youngmoon Jung, Youngjoo Suh, and Hoirin Kim. 2022. Learning to maximize speech quality directly using MOS prediction for neural text-to-speech. IEEE Access, 10:52621-52629.

Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. 2019. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254.

Eray Eren and Cenk Demiroglu. 2023. Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems. Computer Speech & Language, 81:101520.

Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, et al. 2024. An initial investigation of language adaptation for TTS systems under low-resource scenarios. arXiv preprint arXiv:2406.08911.

Alexander Gutkin, Linne Ha, Martin Jansche, Knot Pipatsrisawat, and Richard Sproat. 2016. TTS for low resource languages: A Bangla synthesizer. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2005-2010.

Md Jakir Hossain, Sayed Mahmud Al Amin, Md Saiful Islam, et al. 2018. Development of robotic voice conversion for Ribo using text-to-speech synthesis. In 2018 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT), pages 422-425. IEEE.

H. Ju et al. 2024. NaturalSpeech 3: Speech generation with naturalness and flexibility. In Proceedings of the 2024 Conference on Neural Information Processing Systems, pages 456-467. NeurIPS.

Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. 2015. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2452-2460.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022-17033.

Gokul Karthik Kumar, Praveen S V, Pratyush Kumar, Mitesh M. Khapra, and Karthik Nandakumar. 2023. Towards building text-to-speech systems for the next billion users. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1-5.

Q. Le et al. 2023. Voicebox: A versatile neural speech synthesis system. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 120-130. Association for Computational Linguistics.

Haohe Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, and Yuxuan Wang. 2021. VoiceFixer: Toward general speech restoration with neural vocoder. Preprint, arXiv:2109.13731.

Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang. 2021. Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation. In International Conference on Machine Learning, pages 7748-7759. PMLR.

Rabindra Nath Nandi, Mehadi Menon, Tareq Muntasir, Sagor Sarker, Quazi Sarwar Muhtaseem, Md. Tariqul Islam, Shammur Chowdhury, and Firoj Alam. 2023. Pseudo-labeling for domain-agnostic Bangla automatic speech recognition. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023), pages 152-162, Singapore. Association for Computational Linguistics.

Abu Naser, Devojyoti Aich, and Md Ruhul Amin. 2010. Implementation of Subachan: Bengali text to speech synthesis software. In International Conference on Electrical & Computer Engineering (ICECE), pages 574-577.

OpenAI et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Alexis Plaquet and Hervé Bredin. 2023. Powerset multi-class cross entropy loss for neural speaker diarization. In Proc. INTERSPEECH 2023.

A. Prakash and H. A. Murthy. 2020. Generic Indic text-to-speech synthesisers with rapid adaptation in an end-to-end framework. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, pages 2962-2966. ISCA.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, and Hiroshi Saruwatari. 2024. SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging NLP evaluation metrics. arXiv preprint arXiv:2401.16812.

Abhayjeet Singh, Amala Nagireddi, Deekshitha G, Jesuraja Bandekar, Roopa R, Sandhya Badiger, Sathvik Udupa, Prasanta Kumar Ghosh, Hema A Murthy, Pranaw Kumar, Keiichi Tokuda, Mark Hasegawa-Johnson, and Philipp Olbrich. 2024. Limmits'24: Multi-speaker, multi-lingual Indic TTS with voice cloning. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), pages 61-62.

Keshan Sodimana, Knot Pipatsrisawat, Linne Ha, Martin Jansche, Oddur Kjartansson, Pasindu De Silva, and Supheakmungkol Sarin. 2018. A step-by-step process for building TTS voices using open source data and framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese. In Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), pages 66-70, Gurugram, India.

Nimisha Srivastava, Rudrabha Mukhopadhyay, Prajwal K R, and C V Jawahar. 2020. IndicSpeech: Text-to-speech corpus for Indian languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6417-6422, Marseille, France. European Language Resources Association.

Robert C Streijl, Stefan Winkler, and David S Hands. 2016. Mean opinion score (MOS) revisited: Methods and applications, limitations and alternatives. Multimedia Systems, 22(2):213-227.

Jenthe Thienpondt and Kris Demuynck. 2024. ECAPA2: A hybrid neural network architecture and training strategy for robust speaker embeddings. arXiv preprint arXiv:2401.08342.

Petra Wagner, Jonas Beskow, Simon Betz, Jens Edlund, Joakim Gustafson, Gustav Eje Henter, Sébastien Le Maguer, Zofia Malisz, Éva Székely, Christina Tånnander, et al. 2019. Speech synthesis evaluation: State-of-the-art assessment and suggestion for a novel research program. In Proceedings of the 10th Speech Synthesis Workshop (SSW10).

Z. Wang et al. 2023. Neural speech synthesis: One-shot voice cloning techniques. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 200-210. Association for Computational Linguistics.

Y. Xu et al. 2022. VALL-E X: A generative speech model for zero-shot TTS and speech-to-speech translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 400-410. Association for Computational Linguistics.

Y. Xu et al. 2023. Cross-lingual transfer for low-resource text-to-speech. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 510-520. Association for Computational Linguistics.

Y. Yan et al. 2021. AdaSpeech 2: Adaptive text-to-speech with one-shot voice cloning. In Proceedings of the 2021 Conference on Neural Information Processing Systems, pages 122-134. NeurIPS.

Mingyang Zhang, Yi Zhou, Li Zhao, and Haizhou Li. 2021. Transfer learning from speech synthesis to voice conversion with non-parallel training data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:1290-1302.

W. Zhang et al. 2022. Universal text-to-speech for low-resource languages. Journal of Speech Technology, 14(3):210-220.
A TTS Data Acquisition Framework

Figure 2: Overview of our TTS Data Acquisition Framework. The acquisition process involves using a Speech-to-Text model to obtain transcriptions, an LLM to restore the transcription's punctuation, a noise suppression model to remove unwanted noise, and finally an audio super-resolution model to enhance audio quality and loudness.

Bangla is a low-resource language, and large-scale, high-quality TTS speech data are particularly scarce. To address this gap, we developed a TTS Data Acquisition Framework (Figure 2) designed to collect high-quality speech data with aligned transcripts. This framework leverages advanced speech processing models and carefully designed algorithms to process raw audio inputs and generate refined audio outputs with word-aligned transcripts. Below, we provide a detailed breakdown of the key components of the framework.

1. Speech-to-Text (STT): The audio files are first processed through our in-house STT system, which transcribes the spoken content into text. The STT system used here is an enhanced version of the model proposed in (Nandi et al., 2023).

2. Punctuation Restoration Using LLM: Following transcription, an LLM is employed to restore appropriate punctuation (OpenAI et al., 2023). This step is crucial for improving grammatical accuracy and ensuring that the text is clear and coherent, aiding further processing.

3. Audio and Transcription Segmentation: The audio and transcription are segmented based on terminal punctuation (full stop, question mark, exclamation mark, comma). This ensures that each audio segment aligns with a complete sentence, maintaining the speaker's prosody throughout.

4. Noise and Music Suppression: To improve audio quality, noise and music suppression techniques (Défossez et al., 2019) are applied. This step ensures that the resulting audio is free of background disturbances, which could degrade TTS performance.

5. Audio Super-Resolution: After noise suppression, the audio files undergo super-resolution processing to enhance audio fidelity (Liu et al., 2021). This ensures high-quality audio, crucial for producing natural-sounding TTS outputs.

This pipeline effectively enhances the raw audio and the corresponding transcription, resulting in a high-quality pseudo-labeled dataset. By combining ASR, LLM-based punctuation restoration, noise suppression, and super-resolution, the framework can generate very high-quality speech data suitable for training speech synthesis models.

A.1 Dataset Filtering Criteria

The pseudo-labeled data are further refined using the following criteria (a sketch of these filters follows the list):

• Diarization: Pyannote's Speaker Diarization v3.1 is employed to filter audio files by separating multi-speaker audios, ensuring that each instance contains only one speaker (Plaquet and Bredin, 2023), which is essential for effective TTS model training.

• Audio Duration: Audio segments shorter than 0.5 seconds are discarded, as they provide insufficient information for our model. Similarly, segments longer than 11 seconds are excluded to match the model's sequence length.

• Text Length: Segments with transcriptions exceeding 200 characters are removed to ensure a manageable input size during training.

• Silence-based Filtering: Audio files where over 35% of the duration consists of silence are discarded, as they negatively impact model performance.

• Text-to-Audio Ratio: Based on our analysis, audio segments where the text-to-audio duration ratio falls outside the range of 6 to 25 (Figure 3b) are excluded (Figure 3c), ensuring alignment with the natural speech patterns observed in the pseudo-labeled data from Phase A (Figure 3a).
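These criteria can be expressed as a single predicate over candidate segments, as sketched below; the precomputed inputs (duration, transcript, silence ratio, speaker count) are assumed to come from the framework's earlier stages, and the function itself is illustrative.

```python
# Illustrative filter implementing the A.1 criteria for one candidate segment.
# `silence_ratio` and `n_speakers` are assumed to be precomputed (e.g., via VAD and diarization).
def keep_segment(duration_s: float, text: str, silence_ratio: float, n_speakers: int) -> bool:
    if n_speakers != 1:                       # diarization: single-speaker segments only
        return False
    if not (0.5 <= duration_s <= 11.0):       # audio duration bounds
        return False
    if len(text) > 200:                       # transcription length cap
        return False
    if silence_ratio > 0.35:                  # discard mostly-silent audio
        return False
    ratio = len(text) / duration_s            # characters per second of audio
    return 6.0 <= ratio <= 25.0               # text-to-audio ratio window

print(keep_segment(5.0, "x" * 60, 0.10, 1))   # True: a 5-second clip with a 60-character transcript
```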
Figure 3: These figures demonstrate how the ratio of text length to audio duration changes before and after processing the data. (a) The linear relationship between audio duration and character length in the manually reviewed Pseudo-Labeled Data - Phase A. (b) The relationship between audio duration and character length in Pseudo-Labeled Data - Phase B. (c) Audio duration vs. character length in Pseudo-Labeled Data - Phase B after filtering.

B Human Guided Data Preparation

We curated approximately 82.39 hours of speech data through human-level observation, which we refer to as Pseudo-Labeled Data - Phase A (Table 4). The audio samples, averaging 10 minutes in duration, are sourced from copyright-free audiobooks and podcasts, preferably featuring a single speaker in most cases.

Annotators were tasked with identifying prosodic sentences by segmenting the audio into meaningful chunks while simultaneously correcting ASR-generated transcriptions and restoring proper punctuation in the provided text. If a selected audio chunk contained multiple speakers, it was discarded to maintain dataset consistency. Additionally, background noise, mispronunciations, and unnatural speech patterns were carefully reviewed and eliminated to ensure the highest-quality TTS training data.

C Dataset

Table 4 summarizes the statistics and metadata of the datasets used in this study. We utilized four open-source datasets: the OpenSLR Bangla TTS Dataset (Sodimana et al., 2018), Limmits (Singh et al., 2024), the Comprehensive Bangla TTS Dataset (Srivastava et al., 2020), and the CRBLP TTS Dataset (Alam et al., 2007), amounting to a total of 117 hours of training data. To further enhance our dataset, we synthesized 16.44 hours of speech using Google's TTS API, ensuring high-quality transcriptions. Additionally, 4.22 hours of professionally recorded studio speech from four speakers were collected for fine-tuning.

The majority of our dataset originates from Pseudo-Labeled Data - Phase A and Phase B. Phase A, containing 82.39 hours of speech, underwent thorough evaluation, with insights from this phase informing the refinement of the large-scale data acquisition process used in Phase B. In contrast, Phase B was generated through our TTS Data Acquisition Framework and was not manually reviewed.

Dataset                          | Duration (Hours) | Remarks
Pseudo-Labeled Data - Phase A    | 82.39            | Manually reviewed
Pseudo-Labeled Data - Phase B    | 3636.47          | Not reviewed
Synthetic (GTTS)                 | 16.44            | Synthetic
Comprehensive Bangla TTS Dataset | 20.08            | Open-source data
OpenSLR Bangla TTS Dataset       | 3.82             | Open-source data
Limmits                          | 79               | Open-source data
CRBLP TTS Dataset                | 13.59            | Open-source data
In-House HQ Data                 | 4.22             | Studio quality, manually reviewed
Total Duration                   | 3856.01          |

Table 4: Dataset information.

C.1 Evaluation Dataset

For evaluating the performance of our TTS system, we curated two datasets, BnStudioEval and BnTTSTextEval, each serving distinct evaluation purposes.
BnStudioEval: This dataset comprises 80 high-quality instances (text and audio pairs) taken from our in-house studio recordings. This dataset was selected to assess the model's capability in replicating high-fidelity speech output with speaker impersonation.

BnTTSTextEval: The BnTTSTextEval dataset encompasses three subsets:

• BengaliStimuli53: A linguist-curated set of 53 instances, created to cover a comprehensive range of Bengali phonetic elements. This subset ensures that the model handles diverse phonemes.

• BengaliNamedEntity1000: A set of 1,000 instances focusing on proper nouns such as person, place, and organization names. This subset tests the model's handling of named entities, which is crucial for real-world conversational accuracy.

• ShortText200: Composed of 200 instances, this subset includes short sentences, filler words, and common conversational phrases (less than three words) to evaluate the model's performance in natural, day-to-day dialogue scenarios.

The BnStudioEval dataset, with reference audio for each text, is used for reference-aware evaluation, while BnTTSTextEval supports reference-independent evaluation. Together, these datasets provide a comprehensive basis for evaluating various aspects of our TTS performance, including phonetic diversity, named-entity pronunciation, and conversational fluency.

D Training Objectives

Our BnTTS model is composed of two primary modules (GPT-2 and HiFi-GAN), which are trained separately. The GPT-2 module is trained using a language modeling objective, while the HiFi-GAN module is optimized using the HiFi-GAN loss objective. This section provides an overview of the loss functions applied during training.

D.1 Language Modeling Loss

1. Text Generation Loss: Denoted as L_text, it quantifies the difference between predicted logits and ground-truth labels using cross-entropy. Let ŷ_text represent the predicted logits and y_text the ground-truth target labels. For a sequence with N text tokens, the Text Generation Loss is calculated as:

L_text = (1/N) Σ_{i=1}^{N} CE(ŷ_text^(i), y_text^(i))    (1)

2. Audio Generation Loss: Denoted as L_audio, it evaluates the accuracy of generated acoustic tokens against the target VQ-VAE codes using cross-entropy loss:

L_audio = (1/N) Σ_{i=1}^{N} CE(ŷ_audio^(i), y_audio^(i))    (2)

where ŷ_audio represents the predicted logits for the audio tokens, y_audio are the corresponding target VQ-VAE tokens, and N is the number of audio tokens in the sequence.

The total loss combines the text generation and audio generation losses with weighted factors:

L_total = α L_text + β L_audio, with α = 0.01 and β = 1.0    (3)

where α and β are scaling factors that control the relative importance of each loss term.
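A minimal PyTorch sketch of this weighted objective (Eq. 3) is given below; the logits, targets, and vocabulary sizes are illustrative.

```python
# Sketch of the combined language-modeling objective in Eq. (3): alpha = 0.01, beta = 1.0.
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.01, 1.0

def lm_loss(text_logits, text_targets, audio_logits, audio_targets):
    """Cross-entropy over text tokens (Eq. 1) and VQ-VAE audio tokens (Eq. 2), combined via Eq. (3)."""
    l_text = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    l_audio = F.cross_entropy(audio_logits.transpose(1, 2), audio_targets)
    return ALPHA * l_text + BETA * l_audio

# toy example: batch of 2, 20 text tokens over a 256-token vocab, 50 audio tokens over 1024 codes
loss = lm_loss(torch.randn(2, 20, 256), torch.randint(0, 256, (2, 20)),
               torch.randn(2, 50, 1024), torch.randint(0, 1024, (2, 50)))
print(loss.item())
```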
D.2 HiFi-GAN Loss

We used a HiFi-GAN-based vocoder (Kong et al., 2020) that comprises multiple discriminators: the Multi-Period Discriminator and the Multi-Scale Discriminator. For the sake of clarity, we refer to these discriminators as a single entity. The HiFi-GAN module is trained using the losses described below.

1. Adversarial Loss: The adversarial losses for the generator G and the discriminator D are defined as follows:

L_Adv(D; G) = E_(x,s)[ (D(x) − 1)^2 + D(G(s))^2 ]    (4)

L_Adv(G; D) = E_s[ (D(G(s)) − 1)^2 ]    (5)

where x represents the real audio samples and s denotes the input conditions.

2. Mel-Spectrogram Loss: This loss calculates the L1 distance between the mel-spectrograms of the real and generated audio. It is formulated as:

L_Mel(G) = E_(x,s)[ ||φ(x) − φ(G(s))||_1 ]    (6)

where φ represents the transformation function that maps a waveform to its corresponding mel-spectrogram.

3. Feature Matching Loss: The feature matching loss calculates the L1 distance between the intermediate features of the real and generated audio, as extracted from multiple layers of the discriminator. It is defined as:

L_FM(G; D) = E_(x,s)[ Σ_{i=1}^{T} (1/N_i) ||D_i(x) − D_i(G(s))||_1 ]    (7)

where T denotes the number of discriminator layers, and D_i and N_i represent the features and the number of features at the i-th layer, respectively.

Final Loss: Given that the discriminator is composed of multiple sub-discriminators, the final objectives for training the generator and the discriminator are defined as follows:

L_G = Σ_{k=1}^{K} [ L_Adv(G; D_k) + λ_FM L_FM(G; D_k) ] + λ_Mel L_Mel(G)    (8)

L_D = Σ_{k=1}^{K} L_Adv(D_k; G)    (9)

where D_k denotes the k-th sub-discriminator, λ_FM = 2, and λ_Mel = 45.

E Evaluation Metrics

We employed a combination of subjective and objective metrics to rigorously evaluate the performance of our TTS system, focusing on intelligibility, naturalness, speaker similarity, and transcription accuracy.

Subjective Mean Opinion Score (SMOS): SMOS is a perceptual evaluation where listeners rate synthesized speech on a Likert scale from 1 (poor) to 5 (excellent). It considers naturalness, clarity, and fluency, providing an absolute score for each sample. A higher SMOS indicates better overall speech quality.

SpeechBERTScore: SpeechBERTScore adapts BERTScore for speech, using self-supervised learning (SSL) models to compare dense representations of generated and reference speech. For a generated speech waveform X̂ and a reference waveform X, the feature representations Ẑ and Z are extracted using a pretrained model. SpeechBERTScore is defined as the average maximum cosine similarity between feature vectors:

SpeechBERTScore = (1/N_gen) Σ_{i=1}^{N_gen} max_j cos(ẑ_i, z_j)

where ẑ_i and z_j represent the SSL embeddings of the generated and reference speech, respectively.

Character Error Rate (CER): CER measures transcription accuracy by calculating the ratio of errors (substitutions S, deletions D, and insertions I) in automatic speech recognition (ASR) transcriptions:

CER = (S + D + I) / N

where N is the total number of characters in the reference transcription. A lower CER indicates better transcription accuracy.

Speaker Encoder Cosine Similarity (SECS): SECS evaluates speaker similarity by calculating the cosine similarity between speaker embeddings of the reference and synthesized speech:

SECS = (e_ref · e_syn) / (||e_ref|| ||e_syn||)

where e_ref and e_syn are the speaker embeddings of the reference and synthesized speech, respectively. SECS ranges from -1 (low similarity) to 1 (high similarity).

Duration Equality Score: This metric quantifies how closely the durations of the reference (a) and synthesized (b) speech match, with a score of 1 indicating identical durations:

DurationEquality(a, b) = 1 / max(a/b, b/a)

This score helps in assessing duration similarity between reference and generated audio, ensuring consistency in pacing.

Each metric provides a different perspective, allowing a holistic evaluation of the synthesized speech quality.
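For concreteness, the objective metrics defined above can be computed as sketched below; the speaker/SSL embeddings and ASR transcripts are assumed to come from external models, and only the formulas themselves are taken from this appendix.

```python
# Sketch of the objective metrics: CER, SECS, and DurationEquality.
import numpy as np

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate via Levenshtein distance: (S + D + I) / N."""
    m, n = len(reference), len(hypothesis)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)
    dp[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1, dp[i - 1, j - 1] + cost)
    return dp[m, n] / max(m, 1)

def secs(e_ref: np.ndarray, e_syn: np.ndarray) -> float:
    """Cosine similarity between reference and synthesized speaker embeddings."""
    return float(np.dot(e_ref, e_syn) / (np.linalg.norm(e_ref) * np.linalg.norm(e_syn)))

def duration_equality(a: float, b: float) -> float:
    """1 when durations match exactly, decreasing toward 0 as they diverge."""
    return 1.0 / max(a / b, b / a)

print(cer("bangla tts", "bangla tts"), secs(np.ones(4), np.ones(4)), duration_equality(3.0, 4.0))
# -> 0.0 1.0 0.75
```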
F Subjective Evaluation

For the subjective evaluation of our system, we employ the Mean Opinion Score (MOS), a widely recognized metric primarily focused on assessing the perceptual quality of audio outputs. To ensure the reliability and accuracy of our evaluations, we carefully selected a panel of ten experts who are thoroughly trained in the intricacies of MOS scoring. These experts are equipped with the necessary skills and knowledge to critically assess and score the system, providing invaluable insights that help guide the refinement and enhancement of our technology. This structured approach guarantees that our evaluations are both comprehensive and precise, reflecting the true quality of the audio outputs under review.

F.1 Evaluation Guideline

For calculating MOS, we consider five essential evaluation criteria:

• Naturalness: Evaluates how closely the TTS output resembles natural human speech.

• Clarity: Assesses the intelligibility and clear articulation of the spoken words.

• Fluency: Examines the smoothness of speech, including appropriate pacing, pausing, and intonation.

• Consistency: Checks the uniformity of voice quality across different texts.

• Emotional Expressiveness: Measures the ability of the TTS system to convey the intended emotion or tone.

In the evaluation, we employ a five-point rating scale to meticulously assess performance based on these criteria. The scale ranges from 1, denoting 'Bad', where the output has significant distortions, to 5, representing 'Excellent', where the output nearly replicates natural human speech and excels in all evaluation aspects. To capture more subtle nuances in the TTS output that might not perfectly fit into these whole-number categories, we also recommend using fractional scores. For example, a 1.5 indicates quality between 'Bad' and 'Poor', a 2.5 signifies improvement over 'Poor' but not quite reaching 'Fair', a 3.5 suggests better than 'Fair' but not up to 'Good', and a 4.5 reflects performance that surpasses 'Good' but falls short of 'Excellent'. This fractional scoring allows for a more precise and detailed reflection of the system's quality, enhancing the accuracy and depth of the MOS evaluation.

F.2 Evaluation Process

We have developed an evaluation platform specifically designed for the subjective assessment of Text-to-Speech (TTS) systems. This platform features several key attributes that enhance the effectiveness and reliability of the evaluation process. Key features include anonymity of audio sources, ensuring that evaluators are unaware of whether the audio is synthetically generated or recorded in a studio environment, or which TTS model, if any, was used. This promotes unbiased assessments based purely on audio quality. Comprehensive evaluation criteria allow evaluators to rate each audio sample on naturalness, clarity, fluency, consistency, and emotional expressiveness, ensuring a holistic review of speech synthesis quality. The user-centric interface is streamlined for ease of use, enabling efficient playback of audio samples and score entry, which reduces evaluator fatigue and maintains focus on the task. Finally, the structured data collection method systematically captures all ratings, facilitating precise analysis and enabling targeted improvements to TTS technologies. This platform is a vital tool for developers and researchers aiming to refine the effectiveness and naturalness of speech outputs in TTS systems.

F.3 Evaluator Statistics

For our evaluation process, we carefully selected 10 expert native speakers, achieving a balanced representation with 5 males and 5 females. The age range of these evaluators is between 20 and 28 years, ensuring a youthful perspective that aligns well with our target demographic. All evaluators are either currently enrolled as graduate students or have already completed their graduate studies. They hail from a variety of academic backgrounds, including economics, engineering, computer science, and the social sciences, which provides a diverse range of insights and expertise. This careful selection of qualified individuals ensures a comprehensive and informed assessment process, suitable for our needs in evaluating advanced systems or processes where diverse, educated opinions are crucial.
F.4 Subjective Evaluation Data Preparation

For the reference-aware evaluation, we selected 20 audio samples from each of the four speakers, resulting in 80 Ground Truth (GT) audios. To facilitate comparison, we generated 400 synthetic samples (80 × 5) using the TTS systems examined in this study. Including the GT samples, the total dataset for this evaluation amounts to 480 audio files (400 + 80).

For the reference-independent evaluation, we utilized 453 text samples from BnTTSTextEval, comprising BengaliStimuli53 (53), BengaliNamedEntity1000 (200), and ShortText200 (200). Given the four speakers in both BnTTS-0 and BnTTS-n, this resulted in 3,624 audio samples (4 × 453 × 2). Additionally, IndicTTS, GTTS, and AzureTTS contributed 1,359 samples (3 × 453). IndicTTS samples were evenly distributed between two male and female speakers, while GTTS and AzureTTS used the "bn-IN-Wavenet-C" and "bn-IN-TanishaaNeural" voices, respectively.

In total, the reference-independent evaluation dataset comprised 5,436 audio samples. When combined with the 480 samples from the reference-aware evaluation, the overall dataset for subjective evaluation amounted to 5,916 audio files. These samples were randomly mixed and distributed to the reviewer team to ensure unbiased evaluations.

G Use of AI assistant

We used AI assistants such as GPT-4o for spelling and grammar checking of the text of the paper.

H Symbols and Notations

Variable | Description
T        | Text sequence with N tokens
N        | Number of tokens in the text sequence
S        | Speaker's mel-spectrogram with L frames
Ŷ        | Generated speech
Y        | Ground-truth mel-spectrogram
F        | LLM-based synthesis model
z        | Discrete codes
C        | Codebook of discrete codes
l        | Number of layers
k        | Number of attention heads
S_z      | Speaker spectrogram embedding in R^{L×d}
d        | Embedding dimension
Q, K, V  | Query, Key, Value
P        | Perceiver Resampler output length
R        | Fixed-size output in R^{P×d}
T_e      | Continuous text embedding in R^{N×d}
S_p      | Speaker embeddings
Y_e      | Ground-truth spectrogram embeddings
X        | Input to the LLM
⊕        | Concatenation operation
H        | Output from the LLM
H^Y      | Spectrogram embeddings
S′       | Resized speaker embedding matched to H^Y
W        | Final audio waveform
g_HiFi   | HiFi-GAN function

Table 5: Variables and descriptions.