ConWST: Non-native multi-source knowledge distillation for low resource speech translation

W Zhu, H Jin, JW Chen, L Luo, J Wang, Q Lu, A Li
International Conference on Cognitive Systems and Signal Processing, 2021, Springer
Abstract
In the absence of source speech information, most end-to-end speech translation (ST) models show unsatisfactory results. For low-resource non-native speech translation, we therefore propose a self-supervised bidirectional distillation processing system. It improves ST performance by using a large amount of untagged speech and text in a complementary way, without adding source information. The framework is based on an attentional Seq2seq model, which uses wav2vec2.0 pre-training to guide the Conformer encoder in reconstructing the acoustic representation. The decoder generates target tokens by fusing out-of-domain embeddings. We investigate the use of byte pair encoding (BPE) and compare it with several fusion techniques. Under the ConWST framework, we conducted experiments on transcription from Swahili to English. The experimental results show that transcription under this framework outperforms the baseline model and appears to be one of the best-performing transcription methods.
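The core idea described in the abstract, a frozen pre-trained wav2vec 2.0 model guiding a Conformer encoder to reconstruct acoustic representations without source-language labels, can be sketched roughly as below. This is a minimal illustration, not the paper's actual system: it assumes the HuggingFace transformers checkpoint facebook/wav2vec2-base as the teacher and torchaudio's Conformer as the student, and the model sizes, projection layer, and crude frame truncation are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Frozen, pre-trained wav2vec 2.0 acting as the teacher (checkpoint name is an assumption).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
teacher = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Small Conformer student encoder over log-Mel features; all sizes are illustrative.
student = Conformer(
    input_dim=80,
    num_heads=4,
    ffn_dim=256,
    num_layers=4,
    depthwise_conv_kernel_size=31,
)
proj = nn.Linear(80, teacher.config.hidden_size)  # map student features to teacher dim


def acoustic_distillation_loss(fbank, fbank_lengths, waveform):
    """MSE between frozen teacher representations and projected student outputs.

    fbank: (1, T, 80) log-Mel features; waveform: (num_samples,) 16 kHz audio.
    The simple length truncation below stands in for a proper frame alignment.
    """
    with torch.no_grad():
        inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
        target = teacher(inputs.input_values).last_hidden_state  # (1, T_teacher, 768)
    out, _ = student(fbank, fbank_lengths)                        # (1, T_student, 80)
    T = min(target.shape[1], out.shape[1])
    return nn.functional.mse_loss(proj(out[:, :T]), target[:, :T])


# Example call with dummy data: one second of audio and 100 feature frames.
loss = acoustic_distillation_loss(
    torch.randn(1, 100, 80), torch.tensor([100]), torch.randn(16000)
)
loss.backward()  # gradients flow only into the Conformer encoder and the projection
```

The BPE targets mentioned in the abstract would typically come from a subword tokenizer (for example SentencePiece) trained on the English side; that step is omitted here.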