AACL Machine Translation Tutorial 2023

The document discusses the development of advanced multilingual machine translation (MNMT) systems, focusing on the importance of inclusivity and accessibility in bridging gaps between high-resource and low-resource languages. It outlines various architectures and techniques, including the evolution of machine translation methods, the significance of data quality, and the role of tokenization strategies in enhancing translation performance. The tutorial also highlights prominent MNMT models and their contributions to improving translation capabilities across multiple languages.


Developing State-Of-The-Art Massively Multilingual Machine Translation Systems for Related Languages

Jay Gala (AI4Bharat, IIT Madras, India)
Pranjal A. Chitale (AI4Bharat, IIT Madras, India)
Raj Dabre (NICT, Kyoto, Japan)
1
Get access to the slides here

https://github.com/AI4Bharat/aacl23-mnmt-tutorial
(under construction)
2
Self Introduction: Jay Gala (jaygala24.github.io)

● Experience
○ 2022 - present: AI Resident, AI4Bharat (IIT Madras)
○ 2021 - 2022: Research Intern, UCSD

● Research
○ Multilingual NLP
■ Translation, Language Modeling: 2022 - present
○ Efficient Deep Learning
■ Data Pruning: 2022 - present
■ Neural Architecture Search: 2021 - 2022
○ Federated Learning: 2021 - 2022

3
Self Introduction: Pranjal A. Chitale ([email protected])

● Experience
○ 2021-present: MS Student at IIT Madras (AI4Bharat)
○ 2017-2021: BE Computer Engineering, University of Mumbai
● Research
○ Multilingual NLP
■ Translation, Language Modeling.
○ Efficient Deep Learning

4
Self Introduction: Raj Dabre ([email protected])

● Experience
○ 2018-present: Researcher at NICT, Japan
■ Visiting researcher at AI4Bharat, IIT Madras (and perhaps more soon 🤫)
○ 2014-2018: MEXT Ph.D. scholar at Kyoto University, Japan
○ 2011-2014: M.Tech. Government RA at IIT Bombay, India

● Research
○ Low-Resource Natural Language Processing
■ Multilingual Machine Translation: 2012-present
■ Document Level Machine Translation: 2021-
■ Large Scale Pre-training for Generation: 2021-
○ Efficient Deep Learning:
■ Compact, flexible and fast models (2018-present)
5
Table of Contents

● Introduction + Prominent MNMT (35 mins)

● Data + Benchmark (40 mins)

● Vocabulary (20 mins)

● Break & Q/A (15-20 mins)

● Architecture + Training (70 mins)

● Automatic Evaluation (10 mins)

● Human Evaluation (10 mins)

● Future work (10 mins)

● Q/A
6
Why is Machine Translation still an important task?

● Inclusivity and Accessibility
  ○ Bridge the gap between low-resource languages (LRL) and high-resource languages (HRL)
  ○ Improve language coverage (MT currently covers only ~1K of the ~7K languages in the world)
● Data Augmentation for Multilingual Performance Enhancement
● Transfer Learning via Translation
● Unlocking Multilingual Capabilities of LLMs

7
Evolution of Machine Translation

● Rule-Based Machine Translation (RBMT), 1950 - 1980
  ○ Direct MT, Transfer-based MT, Interlingua MT
● Example-Based Machine Translation (EBMT), 1980 - 1990
● Statistical Machine Translation (SMT), 1990 - 2015
  ○ Word-based, Syntax-based, Phrase-based
● Neural Machine Translation (NMT), 2015 -
  ○ RNNs, LSTMs, Transformers

8

Evolution of Machine Translation

This tutorial focuses on the Neural Machine Translation (NMT) era of the timeline above.

9
Neural MT Basics: Encoder-Decoder Paradigm

image credits Sutskever et al. 2014 10


Neural MT Basics: Encoder-Decoder with Attention

image credits Bahdanau et al. 2015 11


Neural MT Basics: Transformer Architecture

image credits Vaswani et al. 2017 12


Neural MT Basics: Tokenization

● Word-level tokenization
  ○ Split on whitespace.
  ○ Drawbacks:
    1. Cannot handle OOV cases.
    2. Large vocabulary with redundant entries.
● Character-level tokenization
  ○ Split on characters.
  ○ No OOV. Small vocab size.
  ○ Drawbacks:
    1. Longer token sequences.
    2. Lacks the semantic meaning that is present at word level.
● Sub-word level tokenization
  ○ Intermediate solution between word-level and character-level.
  ○ Very frequent words stay at word level; rare words are represented at character level.
  ○ Best of both worlds.

13
Neural MT Basics: Subword MT
frequency word

5 low

2 lower

6 newest

3 wildest

{l, o, w, e, r, n, s, t, i, d}

Sennrich et al. 2016 14


Neural MT Basics: Subword MT - Sennrich et al. 2016
frequency word word

5 low low

2 lower lower

6 newest n e w es t

3 wildest w i l d es t

{l, o, w, e, r, n, s, t, i, d, es}

Sennrich et al. 2016 15


Neural MT Basics: Subword MT - Sennrich et al. 2016
frequency word word word

5 low low low

2 lower lower lower

6 newest n e w es t n e w est

3 wildest w i l d es t w i l d est

{l, o, w, e, r, n, s, t, i, d, es, est}

Sennrich et al. 2016 16


Neural MT Basics: Subword MT - Sennrich et al. 2016
frequency word word word word

5 low low low lo w

2 lower lower lower lo w e r

6 newest n e w es t n e w est n e w est

3 wildest w i l d es t w i l d est w i l d est

{l, o, w, e, r, n, s, t, i, d, es, est, lo}

Sennrich et al. 2016 17


Neural MT Basics: Subword MT - Sennrich et al. 2016
frequency word word word word word

5 low low low lo w low

2 lower lower lower lo w e r low e r

6 newest n e w es t n e w est n e w est n e w est

3 wildest w i l d es t w i l d est w i l d est w i l d est

{l, o, w, e, r, n, s, t, i, d, es, est, lo, low}

Sennrich et al. 2016 18
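The merge process walked through above can be reproduced in a few lines of Python. This is a minimal sketch of the BPE merge loop on the toy corpus from these slides (the helper names are ours; for real systems you would use subword-nmt or sentencepiece):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the chosen symbol pair into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are space-separated symbols, initially single characters.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i l d e s t": 3}

for step in range(5):
    stats = get_pair_stats(vocab)
    best = max(stats, key=stats.get)          # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
# On this corpus the first merges are: es, est, lo, low, ... (as on the slides)
```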


Neural MT Basics: Advances in Subword-level Tokenization

● Byte-level BPE (BBPE)
  ○ Including all Unicode characters inflates the base vocabulary.
  ○ Instead, use 256 byte-level base tokens to overcome this and ensure coverage with effectively no UNK token.
  ○ GPT-2 used BBPE, with 256 base tokens and 50K merges.
● WordPiece
  ○ Outlined in Schuster et al. 2012, popularized by BERT (Devlin et al. 2018).
  ○ Initialize a character-level vocab similar to BPE.
  ○ Instead of merging the most frequent symbol pair, choose the symbol that maximizes the likelihood of the training data once added to the vocabulary.
● Unigram (SentencePiece)
  ○ Initialize a large vocabulary and trim it based on training-data likelihood while minimizing the loss increase.
  ○ Prune until the desired size is reached, retaining base characters.
  ○ Store tokenization options with corpus probabilities, defaulting to the most likely choice.

Wang et al. 2019, Devlin et al. 2018, Kudo et al. 2018 19
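In practice these algorithms are rarely implemented by hand; SentencePiece covers both BPE and Unigram. A minimal training sketch is below; the corpus path, vocab size and character coverage are placeholder choices, not values from this tutorial:

```python
import sentencepiece as spm

# Train a Unigram-LM tokenizer on raw text (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",            # placeholder path to your monolingual corpus
    model_prefix="mnmt_unigram",
    vocab_size=32000,
    model_type="unigram",          # "bpe" is also supported
    character_coverage=0.9995,     # <1.0 helps for scripts with very large charsets
)

sp = spm.SentencePieceProcessor(model_file="mnmt_unigram.model")
print(sp.encode("Multilingual machine translation for related languages", out_type=str))
```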


Prominent Massively Multilingual NMT systems

20
Google’s Multilingual NMT

● First approach to train a single enc-dec based model for multilingual NMT

● Shared vocabulary + Prepend target language token (<2en> / <2es>).

● Improves performance on low-resource languages (transfer-learning).

● Enables zero-shot translation.

● Data and compute efficient compared to bilingual NMT models.

Johnson et al. 2016 21
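A hedged sketch of the key data-side trick from Johnson et al.: prepend a target-language token to every source sentence so that one shared model knows which language to produce. The tag format and helper below are illustrative, not the authors' exact preprocessing script:

```python
def tag_for_target(src: str, tgt_lang: str) -> str:
    # The tag is just one extra token (from the shared vocabulary) at the start of the source.
    return f"<2{tgt_lang}> {src}"

training_examples = [
    ("How are you?", "¿Cómo estás?", "es"),
    ("How are you?", "आप कैसे हैं?", "hi"),
]
for src, tgt, lang in training_examples:
    print(tag_for_target(src, lang), "=>", tgt)
# Zero-shot translation then amounts to requesting a tag/source-language
# combination that was never seen together during training.
```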


Google’s Massively MNMT

● 59-lingual low-resource models (6-layer transformer base)

● One to many models better than many-to-many for non-English targets

○ 2-3 BLEU improvement

○ Relative over-representation of English

● Many to one models worse than many-to-many for English targets

○ 2-3 BLEU drop

○ Relative under-representation of English

Aharoni et al. 2019 22


Google’s Massively MNMT

● 103-lingual model (6-layer variation of transformer-big)

● Many-to-one and one-to-many are both better than many-to-many

○ N-way corpora is detrimental to many-to-one performance

● Supervised NMT quality degrades with more languages

● Zero-shot NMT quality increases with more languages

Ukrainian to Russian Zero-Shot

Aharoni et al. 2019 23


Google’s Massively MNMT Model In the Wild
● Main contributions

○ Temperature based data sampling for transfer-interference balance

○ Pushing number of supported language pairs to limit

○ Pushing MNMT performance with 1+ billion parameters

● Starting point: Quality is directly proportional to data size

Arivazhagan et al. 2019 24


tl;dr
One-to-many and many-to-one
models better than many-to-many
model

25
Arivazhagan et al. 2019
M2M-100: Beyond English-Centric NMT
● First MNMT model trained with large-scale non-English centric mined data.

● It outperforms English-centric systems by 10 points on the widely used BLEU


metric for evaluating machine translations
● M2M-100 is trained on a total of 2,200 language directions — or 10x more
than previous best, English-centric multilingual models.
● LASER2 model used for mining and fastText for language identification.

Fan et al. 2020 26


● M2M-100 Model : 15B parameter model.
● 12B dense and ~3B sparse language-specific (family) parameters.
● 1.2 BLEU point improvement on an average over 1B model (24E-24D).

Fan et al. 2020 27


BART

● Denoising pretraining for enc-dec LMs, similar in spirit to MLM models such as BERT, RoBERTa, etc.
● BART: English-only pretraining
● Fine-tune the models on specific tasks → transfer learning
● mBART-25 (Liu et al. 2020) and mBART-50 (Tang et al. 2020) extend the same idea to multilingual models.

Denoising Pretraining

Lewis et al. 2020 28


DeltaLM

● Initialize the encoder and decoder with pretrained multilingual encoders (InfoXLM).
● Trained with monolingual + bilingual data using MLM + TLM objectives.
● Interleaved decoder:
  ○ Leverages the complete weights of the pretrained multilingual encoder (PME), whereas XLM-style initialization randomly initializes the cross-attention (CA).
  ○ Self-attention (SA) and bottom FFNs are initialized from odd layers; CA and top FFNs from even layers.

Ma et al. 2021 29
DeltaLM

Ma et al. 2021 30
NLLB-200 : No Language Left Behind

● Pipeline components (figure): LID-200 language identification, the Stopes library for data processing and mining, MoE / dense / distilled model variants, low-resource pairs introduced at later training stages, and XSTS for human evaluation.

Costa-jussà et al. 2022 31
NLLB-200: Recipe to Scale up to 200 languages
● Mixture-of-Experts Model

○ Better sharing by routing low-resource through shared-weights.

○ Prevents overfitting on low-resource.

● Curriculum learning

○ Train on high-resource first, then introduce low-resource, to prevent overfitting.

● Self-supervised learning

○ Self-supervised learning on monolingual data for low-resource and linguistically


similar high-resource languages for improved performance.

● Diversified back-translation

○ Leverage BT data from various sources including Bilingual SMT models and
existing MNMT models (BT data diversity).

Costa-jussà et al. 2022 32


Results (figure): average BLEU on FLORES-101 and FLORES-200 (English-centric); baselines M2M-100 and DeltaLM compared with a 3.3B dense Transformer (base, with SSL, with BT) and a 54B MoE (SSL + BT).

Costa-jussà et al. 2022 33


MADLAD-400

● Rigorous filtering of monolingual data from CC across 419


languages.
● Monolingual data -> 3T tokens.
● Joint training (MASS + CE) -> 3 Enc-Dec variants (3B, 7.2B, 10.7B).
● UL2 - 8B-Decoder-only (monolingual data).

Kudugunta et al. 2023 34


Towards the Next 1000 Languages in Multilingual MT

● Scaling to 1000 languages by


leveraging supervised and
unsupervised objectives.
● Leveraging all HRL monolingual
and parallel data available to
enable transfer to LRL.
● Single stage joint training

○ denoising - MASS objective

○ translation - standard CE
objective

Siddhant et al. 2022 35


Towards the Next 1000 Languages in Multilingual MT :
Findings
● High-quality data leads to significant performance improvements on related
zero-resource pairs.
● Num (Supervised Directions) > Num (Self-supervised / Zero-resource
Directions) for maintaining performance.
● Scaling parallel data is more important than scaling monolingual data.

● Self-supervised pre-training leads to domain robustness in NMT.

● Joint-pretraining + NMT better than 2-stage training (like BART).

Siddhant et al. 2022 36


Models for related languages / Demography-specific 37
Main considerations to build SOTA models

Robust SOTA MT models follow the standard recipe for deep learning:

● Data: high-quality data, domain diversity, language relatedness
● Modeling: deeper architectures, training objectives, language relatedness
● Benchmark: multi-domain, demography-specific, formality levels

38
Data: Parallel Corpus Creation

39
Monolingual data : Introduction & Need

● Abundant on the Internet and in electronic format books


● Primarily English and available in document-level format
● Data from books not standardized, difficult to crawl and use
○ needs sophisticated extraction techniques
● Regional language websites - Valuable sources for Low-resource languages
● Collected using large-scale web-crawling efforts
○ example: C4 (English) and mC4 (multilingual)
● Web data crucial for training language models and other purposes

40
How is monolingual data curated?

Pipeline: Data Preparation → Filtering → Deduplication

● URL Filtering: remove offensive, copyrighted content and spam.
● Text Extraction: tools include boilerpipe, warcio, trafilatura, etc.
● Language Identification: script-based (Unicode) and model-based filtering (fastText, cld3, etc.).
● Line-wise Filtering: remove undesirable lines, repetitions, toxicity filters, etc.
● Document-wise Filtering: in-document repetition removal, toxicity filters, etc.
● Deduplication: fuzzy / exact substring deduplication.

Penedo et al. 2023 41
Monolingual Data Curation Efforts

● Large-scale: CommonCrawl (C4), mC4, Pile, RedPajama, RefinedWeb
  ○ General-purpose, one-size-fits-all heuristics at scale; might yield noisy corpora for some languages.
● Language-group focused: IndicCorp (v1, v2), Varta (Indian); IndoNLG (Indonesian); IndCorpus (Indigenous); WebCrawl African corpora (African); CreoleMT/CreoleEval
  ○ Fine-grained, language-specific heuristics designed to ensure high-quality corpora, even though the scale of curated data might be lower.

42
Sentence Embedding: LABSE
● Supports 109+ languages

● Dual-encoder approach for training

○ Stage 1: Continual pre-training


on MLM + TLM
○ Stage 2: Translation-ranking +
in-batch negative sampling
● Additive margin softmax similar to
SVM to discriminate good and bad
translation pairs
● One-for-all approach

Feng et al. 2020 43


Sentence Embedding: LEALLA
● A distilled LaBSE with increased
inference efficiency and low-dimensional
sentence embedding (128, 192 or 256)
● Performs comparably with LaBSE for
109+ languages

Mao and Nakagawa, 2023 44


Sentence Embedding: LASERx
● LASER1 Supports 93 languages

● LASER - encoder of BiLSTM enc-dec


NMT model
● LASER2 - SPM instead of BPE +
upsampling of low-resource
● LASER3 - distilled LASER2 into
language-specific encoders
● LASER3 competitive with LABSE

● Additional support for low-resource


languages (147 encoders in total)

Schwenk et al. 2017, Artetxe et al. 2018, Heffernan et al. 2022, Tan et al. 2023 45
Sentence Embedding: MuSR

Gao et al. 2023 46


Sentence Embedding: SONAR
● Supports 200 text languages and
37 speech.
● Dual-encoder approach for training

○ Speech-text alignment

○ Text-text alignment

Duquenne et al. 2023 47


Sentence Embedding: SONAR

Duquenne et al. 2023 48


Mining parallel data at scale: Basics

Embed sentences from comparable sources (e.g. https://news.un.org/sw/ and https://news.un.org/en/) into a shared multilingual vector space using LaBSE / LASER3, then match nearest neighbours across languages.

49
Issue: Infeasible at scale due to a very large search space

● Example: English (429M sentences) vs. Hindi (473M sentences).
● Brute-force search over 429M x 473M pairs is infeasible (~1000 sent/sec).
● FAISS index for efficient indexing, clustering, semantic matching and retrieval of dense vectors.

Johnson et al. 2019 50
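A toy sketch of the FAISS-based retrieval step (random vectors stand in for LaBSE / LASER3 embeddings, and an exact flat index is shown for clarity; web-scale mining uses trained approximate indexes plus a margin criterion over the retrieved neighbours):

```python
import faiss
import numpy as np

d = 768                                             # e.g. LaBSE embedding size
en = np.random.rand(20_000, d).astype("float32")    # stand-ins for real sentence embeddings
hi = np.random.rand(20_000, d).astype("float32")
faiss.normalize_L2(en)                              # with normalised vectors, inner product == cosine
faiss.normalize_L2(hi)

index = faiss.IndexFlatIP(d)                        # exact search; at scale use a trained
index.add(en)                                       # approximate index such as
                                                    # faiss.index_factory(d, "OPQ64,IVF65536,PQ64")

scores, neighbours = index.search(hi, 4)            # top-4 English candidates per Hindi sentence
print(scores.shape, neighbours.shape)               # (20000, 4) each
```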


CC-Matrix: Global Monolingual Mining

● Global comparison of all unique sentences between source and target


languages.
● High computational cost, but eliminates the need for manual intervention or
heuristics in identifying document-aligned data.
● FAISS-index used for efficient indexing and retrieval.

● Yields large-scale data, although compromising on quality due to global-level


comparison.
● Maximizes Recall, Precision might be compromised.

Schwenk et al. 2021 51


CC-Aligned: Comparable Corpora Mining

● Heuristic-based document-level alignment identification. (human effort).

● Sentence extraction from aligned documents.

● Local search only on sentences between aligned documents.

● Fast, scalable, more chance of mining high-quality bitext.

● Lower computational requirements compared to CC-Matrix style global


mining.
● Rich resource for non-English pairs (same English document translated to
multiple languages).
● Emphasis more on precision.

El-Kishky et al. 2020 52


Choosing the appropriate sentence embedding model

Check correlation with


human STS for the set
of languages you wish
to consider.

Gala et al. 2023 53


Parallel Corpus filtering
Q: Does noise affect model training ?

A: Depends.

If you operate at very large data and model scales.

⇒ Data at scale matters, noise has minimal impact.

[Gordon et al. 2021, Bansal et al. 2022]

If you are operating at lower data and model scales ?

=> It does. Eliminating noisy data improves NMT performance.

[Gala et al. 2023, Batheja et al. 2023]

54
Data Quality v/s Scale Tradeoff

55

Data Quality matters over scale

Gala et al. 2023
Parallel Corpus Filtering

● Embedding cosine-similarity thresholds
  ○ A 0.80 threshold is optimal for LaBSE (Ramesh et al. 2022, Gala et al. 2023).
  ○ Embedding models might vary in performance with length.
  ○ Token-count threshold heuristics might be useful to reduce false positives.
● COMET referenceless (QE) thresholds
  ○ Check COMET calibration for your desired language set.
  ○ Empirically determine language-specific QE thresholds (no prior work).

56
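A small filtering sketch, assuming the sentence-transformers port of LaBSE; the 0.80 cut-off follows the threshold cited above, and the example sentences are made up:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

src = ["The weather is nice today.", "This line is unrelated noise."]
tgt = ["आज मौसम अच्छा है।", "यह वाक्य अनुवाद नहीं है।"]

src_emb = model.encode(src, convert_to_tensor=True, normalize_embeddings=True)
tgt_emb = model.encode(tgt, convert_to_tensor=True, normalize_embeddings=True)
sims = util.cos_sim(src_emb, tgt_emb).diagonal()     # similarity of each aligned pair

THRESHOLD = 0.80                                     # LaBSE threshold reported above
kept = [(s, t) for s, t, sim in zip(src, tgt, sims) if sim.item() >= THRESHOLD]
print(f"kept {len(kept)} / {len(src)} pairs")
```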
Seed Data: High-quality human-annotated data

Language-family-specific: ILCI (Jha et al. 2010) (Tourism).

Multilingual efforts:
● NLLB-seed (Maillard et al. 2023) - multi-domain, inclusion of low-resource.

● MASSIVE (FitzGerald et al. 2022) - spoken content.

BPCC-Human (Gala et al. 2023)


● Enhanced domain coverage, emphasizing diverse domains and demographic
representation to improve performance in India-centric use cases.
● Inclusion of content not typically covered in web crawls, such as informal or spoken
content, to make models more robust.
57
Benchmark

58
Benchmarks

● Benchmarks are indicators of the performance of systems on the task across different settings.
● Benchmarks are harder to create, limited in quantity as well as demographic diversity, and also variable in quality.
● NMT systems have reached a fair point; multi-domain, demography-specific benchmarks are important to drive progress further in terms of quality and suitability for production usage.
● Creation of a benchmark for tasks like NMT is much harder than it seems.
  ○ Why? With so many systems out there, biases are bound to come in. What QC procedure to use?

59
Existing benchmarks

● WMT Shared Task Benchmarks
  + Direction-specific
  + Human-generated
  - Limited language coverage
  - Limited domain coverage (usually news)
  - Only prose or formal style of text
● FLORES-x
  + Largest coverage (200+ languages)
  + Human-generated
  + Multi-domain
  - Limited demographic coverage
● NTREX-128
  + 128 languages
  + Human-generated
  - Limited demographic coverage
  - Only news (WMT)
  - Only prose or formal style of text

60
Wishlist for creating an NMT benchmark

● Multi-domain
● Diverse sources
● Demographic coverage
● Formal + informal styles
● Quality controlled
● Deduplicated with existing benchmarks

61
Benchmark Creation Procedure

Pipeline: Data Preparation → Translation → Quality Check

● Identifying Sources: multi-domain content from diverse sources of the said demographics.
● Sampling sentences: sample source sentences, keeping length and domain diversity in mind.
● Sentence Verification: get content from the sources, conduct sentence verification and domain classification.
● Translation: get the sentences translated into the target language.
● Automatic QC: automated plagiarism detection mechanism to avoid biases towards external NMT systems.
● Review (Manual QC): human verification and correction to ensure high quality.

62
Deduplication with Benchmarks

● Most benchmarks being created from Wikimedia entities / data on web.


● Most data crawls include entire internet, so susceptible to leakages.
● Apply strict deduplication technique with benchmarks, to ensure minimal
data leakages and ensure robust evaluation.

● Pre-hoc deduplication (existing benchmarks): eliminate data leakages from training / pre-training data to obtain an unbiased estimation of translation quality.
● Post-hoc deduplication (new benchmarks): only evaluate on those sentences from the benchmark which did not have overlaps with the training data.

63
Evaluation Set Leakage Elimination Strategies for NMT Data

(in increasing order of strictness)

● Parallel / Exact Dedup: eliminate (X, Y) pairs from the training data if the pair is present in the benchmark. Only eliminates exact pair matches.
● Exact Monolingual Dedup: eliminate all pairs whose monolingual side matches a benchmark sentence. Eliminates exact matches as well as paraphrases.
● Fuzzy Monolingual Dedup: eliminate all pairs whose monolingual side fuzzily matches a benchmark sentence. Here the matching is fuzzy, n-gram based and hence stricter (Gopher and Megatron use 13-gram fuzzy dedup).

64
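A minimal sketch of the fuzzy (n-gram) monolingual dedup idea: index the benchmark's word 13-grams and flag any training sentence whose monolingual side shares one. This is an illustrative implementation, not the exact Gopher/Megatron pipeline:

```python
def word_ngrams(text, n=13):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_sentences, n=13):
    index = set()
    for sent in benchmark_sentences:
        index |= word_ngrams(sent, n)
    return index

def leaks(train_sentence, benchmark_index, n=13):
    """True if the sentence shares at least one n-gram with the benchmark."""
    return bool(word_ngrams(train_sentence, n) & benchmark_index)

benchmark = ["this held out test sentence must never be seen during training of the model at all"]
index = build_benchmark_index(benchmark)
print(leaks("web crawl line containing this held out test sentence must never be "
            "seen during training of the model at all", index))   # True -> drop the pair
```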
Modeling
Decision choices affecting massively multilingual MT

65
Core considerations

● Vocabulary

● Architecture

● Training

66
Core considerations
● Vocabulary

● Architecture

● Training

67
On Vocabulary

● Goal: Fair representation across languages


● Problem: Imbalance of data

○ Lower-resource languages get smaller share of vocab space

○ More character level representation

○ Higher fertilities

○ Longer sequences

○ Training time impacted

○ Downstream performance impacted

Arivazhagan et al. 2019 68


Solving The Vocabulary Skew

● Same number of tokens per language
  ○ Wasteful, since there is not enough training data anyway
● Balance the distribution with temperature
  ○ Original distribution: p_l, the empirical probability of sampling language l
  ○ Modified distribution: proportional to p_l^S
  ○ S = T^(-1), where T is the temperature
  ○ T = 1 recovers the original distribution; T = 100 approaches an equal distribution
  ○ Minor impact??

Arivazhagan et al. 2019 69


When does skew really hurt?

● Study by Zhang et al. 2022

○ Just don't go to character level or end up with high UNK rates

○ Training data balancing >> Vocab data balancing

Zhang et al. 2022 70


Vocabulary For Related Languages

● Shared scripts enable smaller vocabularies without going to character level

○ Opportunity to have compact models


○ Smaller softmaxes means faster training and decoding
● Transliteration to boost cognates: Boost transfer

○ Into English - Universal Romanization (Hermjakob et al. 2018,


Sennrich et al. 2016)
○ Into Kanji/Hanzi: Different languages - Similar scripts (
Song et al. 2020)
○ Into a common related language - IndicBERT (Kakwani et al. 2021),
IndicBART (Dabre et al. 2022), IndicTrans2 (Gala et al. 2023)
○ Boosting overlaps (NLU): Overlap BPE (Patil et al. 2022) 72
Vocabulary For Related Languages (2)

● Special case: Creoles (Dabre et al. 2014 & 2022, Lent et al. 2022 & 2023)

○ Free ride due to high similarity with parent languages

● Multiple segmentations as related languages

○ Kambhatla et al. 2022

○ Multiple segmentations of same sentences boost translation

■ Potato_, Po tat o_, Potat o_

■ BPE-dropout (Provilkov et al. 2020)

● Subword regularization boosts low-resource performance

73
Universal Romanization

Hermjakob et al. 2018 74


Family Specific Transliteration

● Hindi to Tamil (Indic NLP Library)


○ input_text = राजस्थान (rajasthan)
○ output_text = ராஜஸ்தாந
● Typical solution: Map to the highest resource or related language
○ IndicBART (Dabre et al. 2022)
○ IndicTrans2 (Gala et al. 2023)
○ RelateLM (Khemchandani et al. 2021)
■ Xlingual noising
○ Khatari et al. 2021
● Other solutions: Subgroup?
○ Indo-Aryan vs Dravidian
Kunchukuttan et al. 2017, Goyal et al. 2020 75
Zero-Shot Translation Capabilities of IndicTrans2

tl;dr. Zero-shot (into-English) performance of IndicTrans2 close to


NLLB-54B (N54) (explicitly tuned for those languages).
Potential for light-weight and rapid adaptation to extremely low-
resource languages. (Neubig et al. 2018)
Gala et al. 2023 76
Artificial Noise: Induce Vocabulary And Zero Shot Translation

● Bhojpuri, Chhattisgarhi
○ Extremely low-resource
○ Very similar to Hindi
■ Many spelling
variations

● Noise
○ Character span noise
○ Unigram noise

● Improved zero shot


translation
○ Up to 6 BLEU gains

● Linguistic noise ineffective
  ○ Scope for investigation

Aepli+, 2022, Maurya+, 2023 77
Orthogonal: Leveraging Ordering Information

Reorder parent-source sentence to match child-source sentence


Ensures better alignment of encoder contextual embeddings

● Significant improvements over baseline finetuning: needs a parser and re-ordering


system
○ Popovic et al. 2016, Mao et al., 2020, Jones et al. 2021, and Mao et al., 2022 also
focus on syntactic differences
○ Puduppully et al. 2023 show that monotone word order helps related language MT
○ Philippy et al. 2023 highlight more considerations for crosslingual transfer (survey)

Murthy et al., 2019 78


Summary: Make friends, not enemies!

● Share vocabulary (force it if needed?)
● Ensure balance
● Reorder
● Noise is your friend
● Leverage dictionaries
● Size impacts minimally

79
Core considerations

● Vocabulary

● Architecture

● Training

80
Architecture Variants

(figure: NLU vs. NLG model families)

image credits 81
Architecture Variants: Block Choices

● Dense
○ Most commonly used
○ Standard transformers

● Sparse
○ Recent interest
○ Mixtures-of-experts

● Hybrid
○ Partially explored (M2M)
○ Extra hyperparameter

82
(Sparsely Gated) Mixtures Of Experts

● 1 FFN becomes N FFNs

○ Insert every Kth layer

● Route to 1 or more

○ Load balancing
important
○ Lepikhin et al. 2020

● Explosively increase params

● Difficult to train

○ ST-MOE to the rescue

○ Zoph et al. 2022


83
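A toy top-1 routed MoE feed-forward layer in PyTorch, just to make the routing idea concrete; real systems (GShard, ST-MoE, NLLB) add expert capacity limits, load-balancing losses and router z-losses that this sketch omits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoELayer(nn.Module):
    """Sparsely gated FFN: every token is routed to exactly one expert."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)   # routing distribution per token
        probs, expert_ids = gate.max(dim=-1)       # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():                         # scale each expert output by its gate prob
                out[mask] = probs[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = Top1MoELayer(d_model=512, d_ff=2048, num_experts=4)
tokens = torch.randn(10, 512)
print(layer(tokens).shape)                         # torch.Size([10, 512])
```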
Routing Decisions Lead To Language Clusters

● Costa-jussà et al. 2022 analyze gating


vectors (NLLB)

● Plot of cosine similarity between


language level gating vectors
● Language families have similar
gating behavior
● Kudugunta et al. 2021 manually route
tasks or languages to experts
○ Suitable for related languages

○ MoLEs by Gu et al. 2018


85
Note On LLM Only MT Modeling

Zhang et al. 2022 87


Decoder Only MT Performance

Depth > Width at smaller scales!


Encoder-Decoder is still the best at scale! 88
But!

Better zero-shot performance of prefix-LMs is intriguing!


(Cool implications for related languages) 89
Summary: Wise Choices Maketh a Model!

● Use MoEs for scale
● Design language-group-specific modules
● LMs are decent MT models for zero-shot
● Dense models are not ideal
● Careful about PEs and normalization

90
Core considerations

● Vocabulary

● Architecture

● Training

91
All At Once (Joint) or Stage-wise (Incremental)?

● Joint: train one model on all corpora at once.
● Incremental: train on a corpora subset, modify the model (Stage 1), then continue training on further subsets with further modifications (Stage 2, Stage 3, ...).

92
Joint Training: All At Once!

93
Training Schedule: Joint Training

● Mixed language pairs batch (Johnson et al. 2017)

○ Mix all corpora, shuffle and then choose batches

● Useful for fully shared models

● For models with separate language encoders/decoders

○ Shard batch and feed to appropriate components

○ Special encoders for language families

95
Are Language Family Specific Models Better?

96
Yes They Are!

● From my Ph.D. thesis (2018)
● Goyal et al. 2020
● Jointly training HRL and LRL
  ○ A similar HRL paired with the LRL is best
● Training joint multilingual models
  ○ FS = Family Specific
  ○ FA = Family Agnostic
  ○ Family specific comes out ahead

Child language vs. parent language results (and multilingual Δ(FS - FA)):

Child language   | Turkish (135M) | Arabic (134M) | Hindi (60M) | Multilingual Δ(FS - FA)
Hausa (1.6M)     | -0.05          | +0.85         | +0.01       | +0.66
Uzbek (8M)       | +1.33          | +0.79         | +1.1        | +2.79
Marathi (7.3M)   | +1.64          | +1.35         | +1.87       | +2.88
Malayalam (4M)   | +2.27          | +1.53         | +2.27       | +0.44
Punjabi (5.7M)   | +1.15          | +0.34         | +1.88       | +3.75
Somali (3.5M)    | +2.28          | +2.68         | +2.55       | +0.96

97
Visualization Of MNMT Representations

● SVCCA similarity between representations (


Kuduganta et al. 2019)

○ Also see Dabre et al. 2017 and


Johnson et al. 2017

● Encoder representations cluster sentences into


language families

○ Regardless of script sharing

○ Script sharing for stronger clustering

● High resource languages cause partition

○ Low-resource languages ride the wave

● Evidence of representation invariance when fine-


tuning

○ Explains poor zero shot quality between


distant pairs 98
Empirically Determined Language Families

● Train a many-to-many model with language tokens
● Hierarchical clustering of the language tokens
  ○ Set the number of clusters by elbow-sampling
● Tan et al. 2019
● Also see Oncevay et al. 2020
● LangRank: find similar languages (Zhou et al. 2021)

Figure: predetermined language families vs. empirically determined language families via embedding clustering.
99
Is There An Optimal Number of Languages
● Does empirical clustering help? (Upper table)

○ Mostly yes

○ Random clustering gives poorer results

○ Predetermined clustering is equally good

● Language family specific models (Bottom table)

○ Universal model < Individual models

○ Family specific model > Individual models

○ Related to observations by Dabre et al. 2018

● Next steps

○ Family specific adapter layers (Bapna et al. 2019)

○ Family specific vocabulary and decoder separation

○ Behavior in extremely low-resource settings (<20k


pairs; Dabre et al, 2019) 100
Leveraging Noise and Paraphrasing For Related Languages (Aly+ 2021)

101
Joint Denoising and Adversarial Approaches For Alignment

Forcing representations of
related languages to be similar
helps!

Ko et al. 2021 102


Training Schedule: Addressing Language Equality

● Source of inequality: Corpora size skew

● Solutions: Oversampling smaller corpora

● Oversampling before training or during training?


○ Matter of implementation choice

○ Oversampling prior to training creates large duplicated corpora

103
Importance Of Temperature Based Sampling

● Naive approaches:
  ○ Ignore corpora size distributions
  ○ Sample from all corpora equally
● New approach: temperature based sampling, p_L^(1/i)
  ○ where p_L is the probability of sampling a sentence from a corpus
  ○ and i is the sampling temperature
  ○ Strongly benefits low-resource pairs

104
Arivazhagan et al. 2019
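A small sketch of temperature-based sampling as described above; the corpus sizes are invented, and values around T = 5 are commonly reported as a good transfer/interference balance:

```python
def temperature_sampling_probs(sizes, T):
    """p_l proportional to (D_l / sum_k D_k) ** (1 / T).
    T = 1 keeps the raw data distribution; large T approaches uniform sampling."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: round(w / z, 3) for lang, w in weights.items()}

corpus_sizes = {"en-fr": 40_000_000, "en-hi": 1_500_000, "en-ha": 60_000}
for T in (1, 5, 100):
    print(T, temperature_sampling_probs(corpus_sizes, T))
```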
Stage-wise Training: A Bit at a Time!

105
Why Stage-wise/Incremental Training?
● All at once can't be learned effectively

○ Missing languages

○ Data skew

○ Difficulty in handling language group phenomena explicitly

● Benefits

○ Incorporating new languages


○ Expanding capacity

■ New tokens

■ New layers

■ Adapters
106
Incorporating New Languages

● Language specific transfer
  ○ Replace vocabulary
  ○ Fine-tune on new data
  ○ Similar to Zoph et al. 2016; WECHSEL for initialization
● Expanding to new languages
  ○ Expand vocabulary
  ○ Fine-tune on old + new data
  ○ Increase computational capacity?
  ○ Surafel et al. 2018

108
Capacity Expansion Of Existing Models

● Add new components while freezing existing components


○ Lightweight training BUT
○ Previous components may not be aware of new languages
■ Poor transfer learning
○ Potential zero-shot learning
■ Will it work for distant languages
■ Similar recipe for multimodality (Duquenne et al. 2022)

● Lessons from Sachan et al. 2018; Firat et al. 2016a/b;


Bapna et al. 2019
○ Deepen encoders and decoders
○ Only train new components with old and/or new data
■ Vocabulary expansion by Surafel et al. 2018 will help
Escolano et al. 2019 109
Adapting Previously
Trained Models
● Feed forward layers to refine
outputs
○ Bapna et al. 2019
○ Partial solution to
bottleneck
○ Language pair specific
■ Zero-shot
performance?
● 13.5% larger models
● Improved high-resource pair
performance
○ Low-resource
performance kept
110
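A minimal bottleneck adapter in PyTorch in the style of Bapna et al. (2019): a small layer-normalised down/up projection with a residual connection, inserted after frozen transformer sub-layers so only these few parameters are trained; the dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: LayerNorm -> down-project -> ReLU -> up-project -> residual."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden):
        residual = hidden
        hidden = self.layer_norm(hidden)
        hidden = self.up(torch.relu(self.down(hidden)))
        return residual + hidden      # residual keeps the frozen base model's behaviour intact

adapter = Adapter(d_model=512)
x = torch.randn(8, 20, 512)           # (batch, seq_len, d_model)
print(adapter(x).shape)
```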
Adapters For Related Languages
● Language family specific adapters
○ Gumma et al. 2023
○ Chronopoulou et al. 2023
● Mixtures of adapters
○ Wang et al. 2023
○ Automatic adapter clustering
○ Language branches (Sun et al. 2022)
● Adapter ensembling
○ Use similar languages
○ Wang et al. 2021
● Hyperadapters (Baziotis et al. 2022)
○ Base parameters to create specific adapter instances
■ Contextual parameters (Platanios et al. 2018)
○ Similar languages support each other
○ Networks encode relatedness

111
Family Specific Adapters

● Definite advantage over language agnosticism
● Gains regardless of supervised or unsupervised directions!
● Strong gains for distant languages
  ○ Balto-Slavic languages are more similar to English

Chronopoulou et al. 2022 112


Vocabulary Expansion Not Needed For Related Languages

● Creoles and dialects can be well segmented by related language tokenizers


○ Selected tokenizer impacts performance (choose wisely!)
○ Chen et al. 2023 (mBART50 for South American indigenous languages
[Spanish based])
○ Dabre et al. 2022 (mBART50 for Mauritian Creole [French Creole])

113
Multi-stage training (Curriculum?)

● Auxiliary model (trained from scratch)
  ○ Stage 1: train the model on the BPCC
  ○ Stage 2: fine-tune the model on the seed corpora
● Downstream model (fine-tuned from the Stage 1 model)
  ○ Stage 1: train the model on the BPCC + BPCC-BT
  ○ Stage 2: fine-tune the model on the seed corpora
● Data augmentation using back translation: BPCC-BT (~400M bitext pairs)
● Experiment with forward as well as back translations!

114
Importance of Tuning on Clean Corpora

En-Indic did better with


forward translated data!
(Relatively smaller
monolingual corpora to
blame?)

Gala et al. 2023 115


Importance of Curricula
● Dabre et al. 2019: Introduce low-resource later
○ Jointly training HRLs and LRLs was suboptimal
○ Best recipe: HRL --> HRL+LRL --> LRL
■ Noisy --> Noisy+Clean --> Clean
○ Add complexity then reduce!
● Mohiuddin et al. 2022: Score and select subsets
○ Start with General Domain model
○ Score instances:
■ Use external scorer
■ Use model itself
○ Related to: Wang et al. 2017
● Faster convergence
● Better performance
● Low resource languages:
○ Scoring using related HRL?

116
Summary: Wise Choices Maketh a Good Model!

● Build incrementally, in stages, with curricula
● Freeze for efficiency and to avoid forgetting
● Add vocabulary as applicable
● Leverage noise, paraphrases and adversaries wisely
● Group-specific models and adapters are promising
● Careful about the balance of languages

117
Model Compression: Light as a Feather!

118
Why Compression?

Example: "I am a boy" → 「私は男の子です」

Behavior | #Params | Translation Quality | Translation Speed
Default  | Many    | High                | Slow
         | Few     | Low                 | Fast
Desired  | Few     | High                | Fast

The desired behavior is achieved using sequence distillation (Kim and Rush, 2016).

Kim and Rush, 2016 119


Compression Approach: Pruning

● Structured Pruning: Remove specific layers or groups


● Unstructured Pruning: Remove least important weights or neurons
Han et al. 2015 120
Memory-efficient NLLB-200
● Determine statistics: Expert importance and routing probabilities
● Prune: Per layer or Globally

Koishekenov et al. 2022 121
Quantization

7B params = 28 GB in FP32 = 3.5 GB in FP4

Go low!

Dettmers et al. 2022 122
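The arithmetic behind the slide, as a tiny helper (parameter memory only; activations, KV caches and optimizer state are ignored):

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Rough parameter-memory footprint in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B params @ {bits:>2}-bit: {model_memory_gb(7e9, bits):5.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```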
Distillation

● Word level distillation (online)


○ Use parent distributions
○ Not always good

● Sequence level distillation (offline)


○ Translate to get parents token distributions
○ Mostly enough

● Interpolation (hybrid)
○ Use both losses

Kim and Rush, 2016 123
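A hedged sketch of the offline (sequence-level) recipe: decode the training sources with a strong teacher and use its outputs as the student's targets. The Hugging Face NLLB checkpoint below is only an example teacher, not the setup from Kim and Rush:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"            # example teacher checkpoint
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(name)

sources = ["The committee approved the new policy yesterday."]
batch = tok(sources, return_tensors="pt", padding=True)
out = teacher.generate(
    **batch,
    forced_bos_token_id=tok.convert_tokens_to_ids("hin_Deva"),   # decode into Hindi
    num_beams=5,
    max_new_tokens=64,
)
synthetic_targets = tok.batch_decode(out, skip_special_tokens=True)
# (source, synthetic_target) pairs then form the student's training corpus,
# trained with ordinary cross-entropy: that is sequence-level distillation.
print(synthetic_targets)
```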


Distilling Large Models

● Train the large model: NLLB 54B MoE.
● Optionally adapt it to the target domain (e.g. a Wiki corpus).
● Apply word-level (online) distillation on the corpus to obtain 1.3B and 600M dense student models.
● Offline sequence-level distillation is too expensive for large NLLB models!

Costa-jussà et al. 2022 125


Summary: The Holy Trinity

● Prune and Tune
● Distill (online?)
● Quantize

128
Implementations and Toolkits

129
Toolkits: The Big 4
● Fairseq (v1/v2) by Meta
○ Pytorch
○ Comprehensive for MT pre-training and fine-tuning
○ All rounded and most popular among researchers
● Transformers by HuggingFace
○ Pytorch/Tensorflow
(Side note: these are large code bases, overwhelming for beginners!)
○ Most popular for fine-tuners
○ Has a hub for all models
● Tensor2tensor by google
○ Tensorflow
○ Deprecated in favor of TRAX
● OpenNMT by various researchers
○ One of the earliest
○ Both pytorch and tensorflow
130
More Toolkits
● MarianMT by several researchers
○ Written in C++ and minimal dependencies
● JoeyNMT by Amsterdam and Heidelberg universities
○ Minimal
○ For beginners and learners
● Sockeye by Amazon
○ Pytorch
○ Distributed training and efficient inference
● YANMTT by NICT (actually mainly ME)
○ Pytorch
○ Distributed multilingual pre-training and fine-tuning (of lightweight
models) at scale
■ Started out as a pre-training script
○ My hobby/pet project :-)
131
On Evaluation
We built a model but how good is it?

132
Taxonomy of MT Metrics

Lee et al. 2023, Sai et al. 2020 133


Which String-based Metric Is Reliable These Days?

chrF/chrF++ most reliable.


Do significance testing.
Especially for Indic languages.
End of BLEU?

Sai et al. 2023 134


COMET

● Inputs: source, translation, and (optionally) a reference.
● COMET-DA: trained on Direct Assessment annotations (a single score).
● COMET-MQM: trained on MQM annotations, converted to a score via a formula.

Rei et al. 2020, Freitag et al. 2021 135
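Scoring with a learned metric takes a few lines with the unbabel-comet package; the checkpoint name and the example triplet below are illustrative:

```python
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")   # example DA-trained checkpoint
model = load_from_checkpoint(model_path)

data = [{
    "src": "आज मौसम अच्छा है।",
    "mt":  "The weather is good today.",
    "ref": "The weather is nice today.",
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)       # corpus-level score; output.scores has per-segment scores
```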
IndicCOMET

● COMET-DA fine-tuned on Indic DA data → IndicCOMET-DA
● COMET-MQM fine-tuned on Indic MQM data → IndicCOMET-MQM
● Improved human correlations!
● Excellent zero-shot capability. Influence of related languages?

Sai et al. 2023 136


Limitations of Learned Metrics
● Amrhein et al. 2022: COMET makes mistakes
○ Not sensitive to number and named entities discrepancies
○ Hard to fix biases via fine-tuning
● Moghe et al. 2023: Poor quality estimation - downstream performance
correlation
○ Metrics have negligible correlations with the extrinsic evaluation
of downstream outcomes
○ Scores provided by neural metrics are not interpretable
○ Diverse references and metrics to produce labels instead of scores
(MQM?)

137
LLMs For Evaluation (GEMBA)

Large PaLM models prompted for MQM gave better correlation with human annotations!

Kocmi et al. 2023, Fernandes et al. 2023 138


Evaluation Implementations

● String-based

○ SacreBLEU (Post et al. 2018)

● Model-based

○ COMET (Rei et al. 2020, Freitag et al. 2021)

○ BLEURT (Sellam et al. 2020, Pu et al. 2021)

○ IndicCOMET2 - WIP

always perform significance testing

139
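A minimal sketch with sacreBLEU for the string-based metrics discussed above (the hypotheses and references are toy data; a paired significance test is sketched via the CLI in the trailing comment):

```python
import sacrebleu

hyps = ["The cat sits on the mat .", "He did not go to school today ."]
refs = [["The cat sat on the mat .", "He didn't go to school today ."]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs, word_order=2)   # word_order=2 gives chrF++
print(f"BLEU   = {bleu.score:.2f}")
print(f"chrF++ = {chrf.score:.2f}")

# For comparing systems, prefer paired significance testing, e.g. with the CLI:
#   sacrebleu refs.txt -i sysA.txt sysB.txt -m bleu chrf --paired-bs
```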
Human Evaluation: For Humans By Humans!

140
Human Evaluation

● Adequacy :- Faithfulness of the translation to its source.

○ Errors like omission, unwarranted additions / mistranslations.

○ Needs bilingual evaluators

● Fluency :- How fluent is the translation as a standalone sentence in the


target language.
○ Factors include naturalness, grammatical / spelling errors.

○ Can be conducted by monolingual evaluators

● Most critical aspect of Human evaluation : Inter-Annotator-Agreement

○ Humans can be subjective, sometimes too strict or too lenient, we want


consistency.

141
Human Evaluation : Relative Ranking

● Annotators shown outputs of 2 or more systems and have to rank the


systems.
● Relative measure.

● Not an indicator of how-good or how-bad a system is.

● Usually poor Inter-Annotator-Agreement (IAA) observed for relative ranking.

● Lacks interpretability and offers limited insights about aspects to improve


upon (Adequacy / Fluency).

142
Direct Assessment

Absolute 0–100 rating.

DA-Adequacy

Annotators rate how adequately the candidate expresses the meaning


of the corresponding reference translation on a scale of 0-100.

DA fluency

Annotators rate on a scale of 0-100 about how much they agree that a
given translation is fluent target language text.

Reference-free evaluation.

Graham et al. (2013, 2014) 143


Relative Ranking v/s Direct Assessment

Figure: an input text is translated by multiple MT systems; human evaluators either rank the system outputs against each other (relative ranking) or score each output on an absolute scale (direct assessment). Aggregation strategies include averaging and max-voting.

144
Multidimensional Quality Metrics - MQM

Lommel et al. 2014; Freitag et al. 2021 145


(Example of an MQM annotation)

Annotators assign scores based on the quality of the translation and the identified errors.

MQM provides more informed judgements.

Sai et al. 2023 146


Semantic Textual Similarity - STS

More focus on Adequacy than Fluency!

Why? Fluency is subjective, and modern use-cases like social media lack fluency by design.

Agirre et al. 2016 147


Cross-lingual STS - XSTS

● Large degree of variance in STS ratings across different language pairs as


pool of annotators and outputs are different.

Solution ? Calibration!
● Annotators perform the task on actual data + calibration set (English output
+ Reference) for cross-lingual consistency, as evaluation is English-centric.
● Compute average scores per annotator for calibration set.

● Check annotator-wise deviations on the calibration set and apply this


correction factor on the actual data as well to normalize it and impose some
cross-lingual consistency.

Licht et al. 2022 148
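A simple sketch of the calibration idea: estimate each annotator's offset on the shared calibration set and subtract it from their real ratings. The additive correction and the toy numbers are our own simplification of the procedure described above:

```python
import statistics

def calibrate_xsts(raw_scores, calibration_scores):
    """Per-annotator calibration in the spirit of Licht et al. (2022).

    raw_scores / calibration_scores: {annotator: [scores]}. Each annotator's
    deviation from the global calibration mean is subtracted from their real
    ratings; the exact normalisation used in practice may differ.
    """
    global_mean = statistics.mean(
        s for scores in calibration_scores.values() for s in scores
    )
    corrected = {}
    for annotator, scores in raw_scores.items():
        offset = statistics.mean(calibration_scores[annotator]) - global_mean
        corrected[annotator] = [s - offset for s in scores]
    return corrected

raw = {"a1": [4.0, 3.5, 5.0], "a2": [2.0, 3.0, 2.5]}    # lenient vs. strict annotator
calib = {"a1": [4.5, 4.0], "a2": [3.0, 2.5]}            # ratings on the same calibration items
print(calibrate_xsts(raw, calib))
```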


The future
Deepening the niche

150
What is still lacking?
● Low-resource performance can still be improved.
○ Still many extremely low-resource languages
○ Still many relatedness ideas to explore and exploit
● Idiomatic usage needs to be covered.
● Non-English numerals expressed in words might result in hallucinations.
○ Many specific cases
● Coverage of dialects and more languages
● Extension to speech-translation
○ Speech to speech is the dream
● Improve LLMs for MT
○ Will Decoders replace Encoder-Decoders? (I think not)
● Document-level translation underexplored
○ Improve long context handling 151
Bringing in dialects

● Dialects in India

○ Dialects of Marathi (42 dialects)

○ I speak only 3! :-(


(Language resource papers should not be looked down upon!)

● Dialects in Japan
  ○ I can handle the Osaka and Kyoto dialects, but there's MORE!

● Dialects is what people use but mostly spoken

○ Spoken language data collection needs focus

○ Creoles are mostly spoken

○ Case for direct speech-speech MT?

● Let us collect data aggressively!


152
Summary

● Big Picture

○ The current state of MT

● Data Creation

○ Manual and mining

● Modeling

○ Models at scale

○ Compactness

● Evaluation

○ Automatic

○ Human 153
Q&A
Thank You

154
