AACL Machine Translation Tutorial 2023
https://github.com/AI4Bharat/aacl23-mnmt-tutorial
(under construction)
Self Introduction: Jay Gala (jaygala24.github.io)
● Experience
○ 2022 - present: AI Resident, AI4Bharat (IIT Madras)
○ 2021 - 2022: Research Intern, UCSD
● Research
○ Multilingual NLP
■ Translation, Language Modeling: 2022 - present
○ Efficient Deep Learning
■ Data Pruning: 2022 - present
■ Neural Architecture Search: 2021 - 2022
○ Federated Learning: 2021 - 2022
Self Introduction: Pranjal A. Chitale ([email protected])
● Experience
○ 2021-present: MS Student at IIT Madras (AI4Bharat)
○ 2017-2021: BE Computer Engineering, University of Mumbai
● Research
○ Multilingual NLP
■ Translation, Language Modeling.
○ Efficient Deep Learning
Self Introduction: Raj Dabre ([email protected])
● Experience
○ 2018-present: Researcher at NICT, Japan
■ Visiting researcher at AI4Bharat, IIT Madras (and perhaps more soon 🤫)
○ 2014-2018: MEXT Ph.D. scholar at Kyoto University, Japan
○ 2011-2014: M.Tech. Government RA at IIT Bombay, India
● Research
○ Low-Resource Natural Language Processing
■ Multilingual Machine Translation: 2012-present
■ Document Level Machine Translation: 2021-
■ Large Scale Pre-training for Generation: 2021-
○ Efficient Deep Learning:
■ Compact, flexible and fast models (2018-present)
Table of Contents
● Q/A
Why is Machine Translation still an important task?
● Inclusivity and Accessibility
● Data Augmentation for Multilingual Performance Enhancement
Evolution of Machine Translation
Rule-Based Machine Translation (RBMT) → Example-Based Machine Translation (EBMT) → Statistical Machine Translation (SMT) → Neural Machine Translation (NMT)
This tutorial focuses on Neural Machine Translation (NMT).
Neural MT Basics: Encoder-Decoder Paradigm
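Below is a minimal sketch of the encoder-decoder paradigm for NMT using PyTorch's built-in nn.Transformer. The vocabulary sizes, dimensions, and toy batch are illustrative assumptions (not values used in the tutorial), and positional encodings are omitted for brevity.

```python
# A minimal encoder-decoder sketch for NMT (illustrative only).
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.generator = nn.Linear(d_model, tgt_vocab)  # project to target vocabulary

    def forward(self, src_ids, tgt_ids):
        # Causal mask: each target position only attends to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.generator(hidden)  # (batch, tgt_len, tgt_vocab) logits

model = Seq2SeqTransformer(src_vocab=32000, tgt_vocab=32000)
src = torch.randint(0, 32000, (2, 10))  # toy source token ids
tgt = torch.randint(0, 32000, (2, 9))   # toy target prefix (teacher forcing)
print(model(src, tgt).shape)            # torch.Size([2, 9, 32000])
```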
Neural MT Basics: Subword MT
frequency   word
5           low
2           lower
6           newest
3           wildest

Initial character vocabulary: {l, o, w, e, r, n, w, s, t, i, d}

After merging "e s" → "es":
6   newest    n e w es t
3   wildest   w i l d es t
Vocabulary: {l, o, w, e, r, n, w, s, t, i, d, es}

After the next merge "es t" → "est":
6   newest    n e w est
3   wildest   w i l d est
Byte-level BPE (BBPE)
● Inclusion of all Unicode characters increases the base vocabulary.
● Use 256 byte-level base tokens to overcome the above and ensure coverage with effectively no UNK token.
● GPT-2 used BBPE, with 256 base tokens and 50K merges.

WordPiece
● Outlined in Schuster et al. 2012, popularized by BERT (Devlin et al., 2018).
● Initialize a character-level vocabulary, similar to BPE.
● Instead of including the most frequent symbol, choose the symbol that maximizes the likelihood of the training data after adding it to the vocabulary.

Unigram LM
● Initialize a large vocabulary and trim it based on training-data likelihood while minimizing the loss increase.
● Prune until the desired size is reached, retaining base characters.
● Store tokenization options with corpus probabilities, defaulting to the most likely choice.
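For concreteness, here is a minimal sketch of the BPE merge-learning loop on the toy corpus above. It illustrates the algorithm only; real systems use subword-nmt or SentencePiece, and the two-merge limit is an illustrative assumption.

```python
# Minimal BPE merge learning (illustrative sketch).
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge: replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Words as tuples of characters, with the corpus frequencies from the slide.
vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("wildest"): 3}
for step in range(2):
    pair_counts = get_pair_counts(vocab)
    best = max(pair_counts, key=pair_counts.get)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
# merge 1: ('e', 's') -> es
# merge 2: ('es', 't') -> est
```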
Google’s Multilingual NMT
● First approach to train a single encoder-decoder model for multilingual NMT, using a target-language tag to indicate the desired output language
Arivazhagan et al. 2019
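A hedged sketch of the target-language-tag trick used by single-model multilingual NMT systems: a token indicating the desired target language is prepended to the source sentence. The "<2xx>" tag format and the sample sentences are illustrative assumptions, not the exact convention of any specific system.

```python
# Prepend a target-language tag to each source sentence (illustrative sketch).
def add_target_tag(src_sentence: str, tgt_lang: str) -> str:
    return f"<2{tgt_lang}> {src_sentence}"

corpus = [
    ("How are you?", "hi"),  # English -> Hindi
    ("How are you?", "ta"),  # English -> Tamil
]
for src, tgt_lang in corpus:
    print(add_target_tag(src, tgt_lang))
# <2hi> How are you?
# <2ta> How are you?
```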
M2M-100: Beyond English-Centric NMT
● First MNMT model trained with large-scale non-English centric mined data.
Denoising Pretraining
● Similar to MLM models such as BERT, RoBERTa, etc., but for encoder-decoder LMs
● BART: English-only pretraining
● Fine-tune the models on specific tasks → transfer learning
● mBART-25 (Liu et al. 2020) and mBART-50 (Tang et al. 2020) extend the same idea to multilingual models
Ma et al. 2021
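A minimal sketch of BART/mBART-style text-infilling noise: mask contiguous spans whose lengths are drawn from a Poisson distribution and replace each span with a single mask token. The masking ratio, Poisson lambda, and mask symbol are illustrative assumptions, not the exact mBART configuration.

```python
# Text-infilling noise for denoising pretraining (illustrative sketch).
import numpy as np

def text_infill(tokens, mask_ratio=0.35, poisson_lambda=3.5, mask="<mask>", seed=0):
    """Mask roughly `mask_ratio` of the tokens in Poisson-length spans."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    num_to_mask = int(round(len(tokens) * mask_ratio))
    masked = 0
    while masked < num_to_mask and len(tokens) > 1:
        span = max(1, int(rng.poisson(poisson_lambda)))          # span length
        start = int(rng.integers(0, max(1, len(tokens) - span)))  # span start
        tokens[start:start + span] = [mask]                       # one mask per span
        masked += span
    return tokens

sent = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(text_infill(sent)))
```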
NLLB-200: No Language Left Behind
(Overview figure: MoE, dense, and distilled model variants; XSTS human evaluation; LID-200 language identification; the Stopes data-mining library; low-resource languages introduced at later stages.)
Costa-jussà et al. 2022
NLLB-200: Recipe to Scale up to 200 languages
● Mixture-of-Experts Model
● Curriculum learning
● Self-supervised learning
● Diversified back-translation
○ Leverage BT data from various sources, including bilingual SMT models and existing MNMT models (BT data diversity)
○ Translation trained with the standard cross-entropy (CE) objective
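A hedged sketch of back-translation data generation with HuggingFace Transformers, assuming a reverse-direction (target→source) model such as a Helsinki-NLP OPUS-MT checkpoint. The model name and sentences are illustrative; any reverse model can be substituted.

```python
# Generate synthetic source sentences from target-side monolingual data.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

reverse_model_name = "Helsinki-NLP/opus-mt-hi-en"  # assumed target->source model
tokenizer = AutoTokenizer.from_pretrained(reverse_model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(reverse_model_name)

target_monolingual = ["यह एक उदाहरण वाक्य है।"]  # monolingual target-side data
batch = tokenizer(target_monolingual, return_tensors="pt", padding=True)
generated = model.generate(**batch, max_new_tokens=64)
synthetic_sources = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Pair synthetic sources with the original targets as extra training data.
synthetic_parallel = list(zip(synthetic_sources, target_monolingual))
print(synthetic_parallel)
```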
Monolingual Data: Introduction & Need
How is monolingual data curated?
Sentence Embedding: LaBSE
● Supports 109+ languages
Schwenk et al. 2017, Artetxe et al. 2018, Heffernan et al. 2022, Tan et al. 2023
Sentence Embedding: MuSR
○ Speech-text alignment
○ Text-text alignment
https://news.un.org/sw/
Example: mining parallel sentences between English (429M sentences) and Hindi (473M sentences) monolingual corpora.
● FAISS index for efficient indexing, clustering, semantic matching and retrieval of dense vectors.
● Brute-force search (429M × 473M) is infeasible (~1000 sent/sec).
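A minimal sketch of embedding-based bitext mining with FAISS: embed sentences from both languages with a multilingual encoder and retrieve nearest neighbours by cosine similarity. The encoder name and sentences are illustrative assumptions; real pipelines (e.g., Stopes) add margin-based scoring and operate at a far larger scale.

```python
# Nearest-neighbour bitext mining with FAISS (illustrative sketch).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # assumed encoder
en_sents = ["How are you?", "The weather is nice today."]
hi_sents = ["आप कैसे हैं?", "मुझे किताबें पढ़ना पसंद है।"]

en_vecs = encoder.encode(en_sents, normalize_embeddings=True)
hi_vecs = encoder.encode(hi_sents, normalize_embeddings=True)

index = faiss.IndexFlatIP(en_vecs.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(hi_vecs, dtype="float32"))
scores, ids = index.search(np.asarray(en_vecs, dtype="float32"), k=1)

for en, score, idx in zip(en_sents, scores[:, 0], ids[:, 0]):
    print(f"{en}  <->  {hi_sents[idx]}   (cos={score:.2f})")
```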
Data Quality vs. Scale Tradeoff
● Embedding cosine-similarity thresholds
● Reference-less COMET thresholds
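A hedged sketch of threshold-based filtering of mined bitext: keep a pair only if the cosine similarity between its two sides exceeds a chosen threshold. The encoder, the 0.8 threshold, and the sentence pairs are illustrative assumptions; in practice the threshold trades data quality against scale.

```python
# Cosine-similarity filtering of candidate sentence pairs (illustrative sketch).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # assumed encoder
pairs = [
    ("How are you?", "आप कैसे हैं?"),        # likely a good pair
    ("How are you?", "मुझे चाय पसंद है।"),   # likely a noisy pair
]
threshold = 0.8
kept = []
for src, tgt in pairs:
    emb = encoder.encode([src, tgt], normalize_embeddings=True)
    cos = float(util.cos_sim(emb[0], emb[1]))
    if cos >= threshold:
        kept.append((src, tgt))
print(f"kept {len(kept)} of {len(pairs)} pairs")
```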
Multilingual efforts:
● NLLB-Seed (Maillard et al. 2023): multi-domain, includes low-resource languages
Benchmarks
Existing benchmarks
Evaluation Set Leakage Elimination Strategies for NMT Data
Strategies, in increasing order of strictness:
● Eliminate (X, Y) pairs from the training data if they are present in the benchmark.
● Eliminate all pairs whose monolingual side (source or target) matches a sentence from the benchmark.
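A minimal sketch of the stricter strategy above: drop any training pair whose source or target side also appears in the benchmark. The normalization here is just lowercasing and whitespace collapsing; real pipelines often use more aggressive or fuzzy matching.

```python
# Remove training pairs that leak benchmark sentences (illustrative sketch).
def normalize(s: str) -> str:
    return " ".join(s.lower().split())

train = [("I like tea.", "मुझे चाय पसंद है।"), ("Good morning.", "सुप्रभात।")]
benchmark = [("Good morning.", "शुभ प्रभात।")]

bench_sides = {normalize(x) for pair in benchmark for x in pair}
clean_train = [
    (src, tgt) for src, tgt in train
    if normalize(src) not in bench_sides and normalize(tgt) not in bench_sides
]
print(clean_train)  # the "Good morning." pair is removed
```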
Core considerations
● Vocabulary
● Architecture
● Training
On Vocabulary
○ Higher fertilities (see the sketch after this list)
○ Longer sequences
○ Original distribution:
○ Modified distribution:
● Special case: Creoles (Dabre et al. 2014 & 2022, Lent et al. 2022 & 2023)
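A small sketch of computing tokenizer fertility (average subword tokens per whitespace-separated word) for different languages, a quick way to spot vocabulary under-representation. The tokenizer name and sentences are illustrative assumptions.

```python
# Measure tokenizer fertility per language (illustrative sketch).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")  # assumed

samples = {
    "eng": "The committee approved the new irrigation project yesterday.",
    "hin": "समिति ने कल नई सिंचाई परियोजना को मंजूरी दी।",
}
for lang, sent in samples.items():
    n_words = len(sent.split())
    n_subwords = len(tokenizer.tokenize(sent))
    print(f"{lang}: fertility = {n_subwords / n_words:.2f}")
```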
Universal Romanization
● Bhojpuri, Chhattisgarhi
○ Extremely low-resource
○ Very similar to Hindi
■ Many spelling variations
● Noise
○ Character span noise
○ Unigram noise
● (Force?) Share vocabulary
● Ensure balance
● Reorder
● Noise is your friend
Core considerations
● Vocabulary
● Architecture
● Training
Architecture Variants
(Figure: architecture variants for NLU and NLG; image credits.)
Architecture Variants: Block Choices
● Dense
○ Most commonly used
○ Standard transformers
● Sparse
○ Recent interest
○ Mixtures-of-experts
● Hybrid
○ Partially explored (M2M)
○ Extra hyperparameter
(Sparsely Gated) Mixtures Of Experts
● Route each token to 1 or more experts
○ Load balancing is important
○ Lepikhin et al. 2020
● Difficult to train
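A minimal sketch of a sparsely gated MoE feed-forward layer with top-1 routing, to illustrate the idea only: no load-balancing loss and no capacity limits. Dimensions and expert count are illustrative assumptions.

```python
# Top-1 routed Mixture-of-Experts feed-forward layer (illustrative sketch).
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (tokens, d_model)
        gate_probs = self.gate(x).softmax(dim=-1)
        top_prob, top_idx = gate_probs.max(dim=-1)       # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                          # tokens routed to expert e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(6, 512)
print(Top1MoE()(tokens).shape)  # torch.Size([6, 512])
```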
Core considerations
● Vocabulary
● Architecture
● Training
All At Once (Joint) or Stage-wise (Incremental)?
Training Schedule: Joint Training
Are Language Family Specific Models Better?
Yes They Are!
● From my Ph.D. thesis (2018)
● Goyal et al. 2020
● Jointly training HRL and LRL
○ Similar HRL and LRL is best
● Training joint multilingual models
○ FS = Family Specific
○ FA = Family Agnostic
○ Family specific

Δ(FS - FA) by child language and parent language:

Child Language     Turkish (135M)   Arabic (134M)   Hindi (60M)   Multilingual
Hausa (1.6M)       -0.05            +0.85           +0.01         +0.66
Uzbek (8M)         +1.33            +0.79           +1.1          +2.79
Marathi (7.3M)     +1.64            +1.35           +1.87         +2.88
Malayalam (4M)     +2.27            +1.53           +2.27         +0.44
Punjabi (5.7M)     +1.15            +0.34           +1.88         +3.75
Somali (3.5M)      +2.28            +2.68           +2.55         +0.96
Visualization Of MNMT Representations
● Predetermined language families
● Empirically determined language families via embedding clustering
Is There An Optimal Number of Languages?
● Does empirical clustering help? (Upper table)
○ Mostly yes
● Next steps
Joint Denoising and Adversarial Approaches For Alignment
Forcing representations of related languages to be similar helps!
Importance Of Temperature-Based Sampling
● Naive approaches:
○ Ignore corpora size distributions
○ Sample from all corpora equally
● New approach: temperature-based sampling, i.e., sample each language with probability proportional to pL^(1/i)
○ where pL is the probability of sampling a sentence from a corpus
○ i is the sampling temperature
○ Strongly benefits low-resource pairs
Arivazhagan et al. 2019
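A small sketch of temperature-based sampling over corpora of very different sizes: raise each corpus's size-proportional probability to 1/T and renormalize (T stands for the slide's temperature i). T = 1 reproduces proportional sampling; large T approaches uniform. Corpus sizes below are illustrative assumptions.

```python
# Temperature-based sampling probabilities (illustrative sketch).
def temperature_sampling_probs(sizes, T):
    p = [s / sum(sizes) for s in sizes]            # proportional to corpus size
    unnorm = [pi ** (1 / T) for pi in p]           # flatten with temperature T
    return [u / sum(unnorm) for u in unnorm]

sizes = {"en-hi": 10_000_000, "en-ta": 1_000_000, "en-brx": 50_000}
for T in (1, 5, 100):
    probs = temperature_sampling_probs(list(sizes.values()), T)
    print(f"T={T}: " + ", ".join(f"{k}={p:.3f}" for k, p in zip(sizes, probs)))
```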
Stage-wise Training: A Bit at a Time!
Why Stage-wise/Incremental Training?
● All at once can't be learned effectively
○ Missing languages
○ Data skew
● Benefits
■ New tokens
■ New layers
■ Adapters
Incorporating New Languages
Family Specific Adapters
● Definite advantage over language agnosticism
● Gains regardless of supervised or unsupervised directions!
● Strong gains for distant languages
○ Balto-Slavic languages are more similar to English
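A minimal sketch of a bottleneck adapter of the kind inserted into a frozen multilingual model for incremental or family-specific adaptation. Dimensions are illustrative assumptions; real recipes insert one adapter per layer and train only the adapter parameters.

```python
# Bottleneck adapter module (illustrative sketch).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden):
        # Residual connection preserves the frozen model's behaviour when the
        # adapter output is small.
        return hidden + self.up(torch.relu(self.down(hidden)))

hidden_states = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
adapter = Adapter()
print(adapter(hidden_states).shape)       # torch.Size([2, 10, 512])
```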
Multi-stage training (Curriculum?)
(Figure: auxiliary and downstream tasks across training stages, with Stage 2 highlighted.)
Summary: Wise Choices Maketh a Good Model!
● Build incrementally, in stages, with curricula
● Freeze for efficiency and not forgetting
● Add vocabulary as applicable
● Leverage noise, paraphrases, adversaries wisely
● Group-specific models and adapters are promising
● Careful about balance of languages
Model Compression: Light as a Feather!
Why Compression?
Example: "I am a boy" → 「私は男の子です」
Koishekenov et al. 2022
Dettmers et al. 2022, Dettmers et al. 2022
Quantization
7B params = 28 GB in FP32 = 3.5 GB in FP4
Go low!
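A small sketch of the arithmetic behind the slide, plus a naive symmetric 8-bit quantization of a weight matrix. Real systems (e.g., LLM.int8 or 4-bit NormalFloat) use more careful per-block schemes; this is illustrative only.

```python
# Memory footprint arithmetic and naive symmetric INT8 quantization (sketch).
import numpy as np

params = 7_000_000_000
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("FP4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")   # 28.0, 14.0, 7.0, 3.5

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                      # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - q.astype(np.float32) * scale).max())
```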
Distillation
● Interpolation (hybrid)
○ Use both losses
Pipeline: Train large model (NLLB 54B MoE) → Adapt for domain (Wiki corpus, optional) → Word-level distillation (online) → Quantize
Offline sequence distillation is too expensive for large NLLB models!
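A minimal sketch of word-level (online) knowledge distillation: the student is trained to match the teacher's per-token output distribution via KL divergence, optionally interpolated with the usual cross-entropy loss ("use both losses"). The shapes, temperature, and mixing weight are illustrative assumptions.

```python
# Word-level knowledge distillation loss (illustrative sketch).
import torch
import torch.nn.functional as F

vocab, batch, seq_len = 1000, 2, 7
teacher_logits = torch.randn(batch, seq_len, vocab)       # from the frozen teacher
student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
gold = torch.randint(0, vocab, (batch, seq_len))           # reference target tokens

T, alpha = 2.0, 0.5                                         # temperature, mixing weight
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)
ce_loss = F.cross_entropy(student_logits.view(-1, vocab), gold.view(-1))
loss = alpha * kd_loss + (1 - alpha) * ce_loss
loss.backward()
print(float(loss))
```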
Implementations and Toolkits
Toolkits: The Big 4
● Fairseq (v1/v2) by Meta
○ PyTorch
○ Comprehensive for MT pre-training and fine-tuning
○ All-rounded and most popular among researchers
● Transformers by HuggingFace
○ PyTorch/TensorFlow
○ Most popular among fine-tuners
○ Has a hub for all models
● Tensor2Tensor by Google
○ TensorFlow
○ Deprecated in favor of Trax
● OpenNMT by various researchers
○ One of the earliest
○ Both PyTorch and TensorFlow
Large code bases. Overwhelming for beginners!
More Toolkits
● MarianMT by several researchers
○ Written in C++ with minimal dependencies
● JoeyNMT by Amsterdam and Heidelberg universities
○ Minimal
○ For beginners and learners
● Sockeye by Amazon
○ Pytorch
○ Distributed training and efficient inference
● YANMTT by NICT (actually mainly ME)
○ Pytorch
○ Distributed multilingual pre-training and fine-tuning (of lightweight
models) at scale
■ Started out as a pre-training script
○ My hobby/pet project :-)
On Evaluation
We built a model but how good is it?
Taxonomy of MT Metrics
(Figure: COMET takes the source and translation (and a reference, where available) and outputs a single Direct Assessment score; COMET-DA / COMET-MQM variants are fine-tuned on DA / MQM judgments, with IndicCOMET counterparts fine-tuned on Indic DA / Indic MQM data.)
● Improved human correlations!
● Excellent zero-shot capability.
● Influence of related languages?
LLMs For Evaluation (GEMBA)
● String-based
● Model-based
○ IndicCOMET2 - WIP
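A hedged sketch of string-based evaluation with sacrebleu (BLEU and chrF). The hypothesis/reference pairs are toy examples; model-based metrics such as COMET would additionally take the source sentence as input.

```python
# Corpus-level BLEU and chrF with sacrebleu (illustrative sketch).
import sacrebleu

hypotheses = ["The cat sat on the mat.", "He is reading a book."]
references = [["The cat sat on the mat.", "A cat was sitting on the mat."],
              ["He reads a book.", "He is reading a book."]]

# sacrebleu expects references as reference *streams*: one list per reference
# set, each aligned with the hypotheses.
ref_streams = list(map(list, zip(*references)))
bleu = sacrebleu.corpus_bleu(hypotheses, ref_streams)
chrf = sacrebleu.corpus_chrf(hypotheses, ref_streams)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")
```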
Human Evaluation: For Humans By Humans!
Human Evaluation
Human Evaluation: Relative Ranking
Direct Assessment
DA-Adequacy
DA-Fluency
Annotators rate, on a scale of 0-100, how much they agree that a given translation is fluent target-language text.
Reference-free evaluation.
(Figure: per-segment DA scores for competing systems are aggregated by averaging or max-voting to produce relative rankings.)
Direct Assessment
Multidimensional Quality Metrics - MQM
More focus on Adequacy than Fluency! Why?
● Fluency is subjective.
● Modern use-cases like social media lack fluency by design.
Solution? Calibration!
● Annotators perform the task on actual data + a calibration set (English output + reference) for cross-lingual consistency, as evaluation is English-centric.
● Compute average scores per annotator on the calibration set.
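A small sketch of annotator calibration: compute each annotator's mean score on the shared calibration set and use it to offset their scores on the real data. This is a simple mean-centering variant with toy numbers; per-annotator z-normalization is also common.

```python
# Per-annotator calibration via mean offsets (illustrative sketch).
calibration = {                 # annotator -> scores on the shared calibration set
    "annot_1": [78, 82, 90],
    "annot_2": [55, 60, 65],
}
real_scores = {"annot_1": [85, 70], "annot_2": [58, 40]}

global_mean = sum(s for v in calibration.values() for s in v) / sum(
    len(v) for v in calibration.values()
)
for annot, scores in real_scores.items():
    offset = global_mean - sum(calibration[annot]) / len(calibration[annot])
    adjusted = [s + offset for s in scores]   # shift towards the common scale
    print(annot, [round(a, 1) for a in adjusted])
```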
What is still lacking?
● Low-resource performance can still be improved.
○ Still many extremely low-resource languages
○ Still many relatedness ideas to explore and exploit
● Idiomatic usage needs to be covered.
● Non-English numerals expressed in words might result in hallucinations.
○ Many specific cases
● Coverage of dialects and more languages
● Extension to speech-translation
○ Speech to speech is the dream
● Improve LLMs for MT
○ Will Decoders replace Encoder-Decoders? (I think not)
● Document-level translation underexplored
○ Improve long-context handling
Bringing in dialects
● Dialects in India
● Big Picture
● Data Creation
● Modeling
○ Models at scale
○ Compactness
● Evaluation
○ Automatic
○ Human
Q&A
Thank You