Research Synopsis

This research aims to develop a high-quality Neural Machine Translation (NMT) system for Tigrigna-English by addressing the challenges posed by the lack of parallel corpora and the language's morphological complexity. The study will utilize transfer learning, synthetic data generation, and fine-tuning of pre-trained multilingual models to enhance translation quality. The outcomes are expected to improve language accessibility for Tigrigna speakers and contribute to the development of a robust parallel corpus.


Research Title: Developing a High-Quality Tigrigna Neural Machine Translation (NMT) System


1. Introduction

1.1. Background

Neural Machine Translation (NMT) has significantly improved the quality of automatic
translation for various languages by leveraging deep learning models such as Transformer,
BERT, and mBART. However, low-resource languages like Tigrigna (spoken primarily in
Ethiopia and Eritrea) lack sufficient parallel corpora, which limits the performance of
machine translation systems.

Traditional Statistical Machine Translation (SMT) and Rule-Based Translation approaches have proven ineffective for morphologically rich and underrepresented languages like Tigrigna due to lexical, syntactic, and grammatical complexities. This research aims to develop a high-quality Tigrigna-English NMT system by applying transfer learning, synthetic data generation, and fine-tuning of pre-trained multilingual models.

1.2. Problem Statement

The central problem is the lack of a large-scale parallel corpus for Tigrigna-English translation. In addition, existing machine translation systems perform poorly on Tigrigna because of this data scarcity, and the language's morphological richness and syntactic complexity make translation even more challenging.

1.3. Research Questions

1. How can transfer learning improve the translation quality of Tigrigna-English NMT
models?

2. What data augmentation techniques (such as synthetic data generation) can help
overcome parallel corpus limitations?

3. How effective are pre-trained multilingual models (such as mBART and mT5) for
Tigrigna NMT?

1.4. Research Significance

This research will:

 Enhance machine translation capabilities for low-resource languages like Tigrigna.

 Develop a high-quality parallel corpus for Tigrigna-English translation.

 Contribute to AI-powered language accessibility for Tigrigna speakers in education, communication, and information retrieval.

2. Objective of the Research


2.1. General Objective

To develop a high-quality Neural Machine Translation (NMT) system for Tigrigna-English using deep learning-based models such as Transformer, BERT, and mBART.

2.2. Specific Objectives

 To collect and construct a large-scale Tigrigna-English parallel corpus using web crawling, manual annotation, and synthetic data generation techniques.

 To explore the effectiveness of transfer learning by fine-tuning multilingual models (e.g., mBART, mT5, and XLM-RoBERTa) for Tigrigna-English translation.

 To evaluate different model architectures (Transformer, mBART, and BERT-based models) for improving BLEU, METEOR, and TER scores in Tigrigna NMT.

 To deploy and test the developed model in real-world applications such as Tigrigna AI chatbots, multilingual search engines, and speech-to-text translation services.

3. Literature Review
3.1. Overview of Neural Machine Translation (NMT)

Machine translation has evolved from Rule-Based Machine Translation (RBMT) through Statistical Machine Translation (SMT) to Neural Machine Translation (NMT). A key strength of NMT models is their ability to learn contextual relationships between words.

3.2. Challenges of Tigrigna Machine Translation

 Rich Morphology: Tigrigna has complex inflectional and derivational structures.

 Limited Parallel Data: Unlike English or French, Tigrigna lacks a large, high-quality
bilingual dataset.

 Syntax and Grammar Complexity: Word order and subject-object relationships differ significantly from English.

3.3. Related Work on Low-Resource Language NMT

 Studies on Amharic-English and Swahili-English NMT models using transfer learning.

 The role of pre-trained multilingual models (mBART, mT5, and XLM-R) in improving translation for low-resource languages.

 Data augmentation techniques (such as back-translation and monolingual data
augmentation) to address data scarcity in NLP.

3.4. Research Gap

There is little research on Tigrigna-English NMT using deep learning-based approaches. No standard Tigrigna parallel corpus is publicly available for training state-of-the-art translation models. In addition, studies on using mBART and mT5 for Tigrigna NMT remain limited.

4. Methodology of the Research

4.1 Research Approach

This research will follow an experimental approach, combining data collection, model training, and
evaluation to optimize the Tigrigna-English NMT system.

4.2 Data Collection and Preprocessing

 Parallel Corpus Construction:

o Web Crawling: Extract Tigrigna-English text from government websites, news portals, and religious texts.

o Manual Annotation: Collaborate with linguists and translators to create a gold-standard bilingual dataset.

o Back-Translation: Generate synthetic parallel data by translating monolingual Tigrigna text to English and vice versa.

 Data Preprocessing Techniques:

o Sentence segmentation, tokenization, and subword encoding (Byte Pair Encoding, BPE).

o Data Cleaning: Removing noise, duplicated sentences, and translation errors.
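The subword-encoding step above can be illustrated with a minimal, self-contained sketch of BPE merge learning. A real pipeline would use a library such as SentencePiece or subword-nmt on Ge'ez-script text; the toy Latin-transliterated words and the merge count below are purely illustrative assumptions.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict.
    Each word is a sequence of symbols; '</w>' marks the word end."""
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge to every word in the vocabulary.
        new_vocab = {}
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i < len(sym) - 1 and (sym[i], sym[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Hypothetical transliterated corpus (real data would be Ge'ez-script Tigrigna).
corpus = {"selam": 5, "selamat": 3, "salam": 2}
rules = learn_bpe(corpus, 4)
print(rules)
```

Frequent character sequences ("selam" here) end up as single subword units, which helps an NMT model cope with Tigrigna's rich inflectional morphology by sharing subwords across word forms.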

4.3 Model Development

 Baseline Model: Train a standard Transformer-based NMT model for comparison.

 Advanced NMT Models:

o Fine-Tuning Pre-Trained Multilingual Models: Train mBART, mT5, and XLM-R on the collected dataset.

o Hybrid Approach: Combine Transformer-based models with attention mechanisms for better context understanding.
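As a concrete starting point, the fine-tuning setup above could be captured in a configuration sketch. Every value below is an illustrative assumption typical of low-resource NMT fine-tuning (the checkpoint name, learning rate, and other hyperparameters are not fixed by this proposal and would be tuned experimentally).

```python
# Illustrative fine-tuning configuration for an mBART-style model.
# All values are assumed defaults for low-resource NMT, not settings
# prescribed by this research plan.
finetune_config = {
    "pretrained_model": "facebook/mbart-large-50",  # assumed checkpoint
    "source_lang": "ti",        # Tigrigna
    "target_lang": "en",        # English
    "learning_rate": 3e-5,      # small LR to avoid catastrophic forgetting
    "warmup_steps": 2500,
    "label_smoothing": 0.2,     # common regularizer for low-resource NMT
    "dropout": 0.3,             # stronger dropout for small datasets
    "batch_size_tokens": 4096,  # dynamic batching by token count
    "max_updates": 40000,
    "beam_size": 5,             # beam search width at decoding time
}
```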

4.4 Model Training and Evaluation

 Training Process:
o Use GPU-based training with TensorFlow and PyTorch.

o Implement hyperparameter tuning to optimize model performance.

 Evaluation Metrics:

o BLEU (Bilingual Evaluation Understudy) Score

o METEOR (Metric for Evaluation of Translation with Explicit ORdering)

o TER (Translation Edit Rate)
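To make the primary metric concrete, here is a minimal sketch of corpus-level BLEU (modified n-gram precision up to 4-grams with a brevity penalty), assuming one reference per hypothesis. A real evaluation would use a standard implementation such as sacreBLEU so that scores are comparable across papers.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty. Inputs are lists of token lists, paired
    one reference per hypothesis for simplicity."""
    match = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n   # hypothesis n-gram counts per order
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(match) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

ref = ["selam", "how", "are", "you", "today"]
hyp = ["selam", "how", "are", "you", "today"]
print(round(corpus_bleu([ref], [hyp]), 2))  # a perfect match scores 1.0
```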

4.5 Deployment and Testing

 Develop an API-based Tigrigna NMT service for real-world testing.

 Collect feedback from linguists and native Tigrigna speakers for manual evaluation.
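The API-based service could be sketched as follows, using only the Python standard library. The `translate` function here is a hypothetical stub standing in for the trained NMT model, and the route and port are assumptions; a deployed service would load the fine-tuned model and run beam-search decoding instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def translate(text: str) -> str:
    """Placeholder for the trained Tigrigna-English NMT model.
    A real service would run the fine-tuned model's decoder here."""
    return f"[EN] {text}"  # hypothetical stub output

class TranslateHandler(BaseHTTPRequestHandler):
    """Accepts POST requests with a JSON body {"text": "..."} and
    responds with {"translation": "..."}."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = {"translation": translate(payload.get("text", ""))}
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve locally (blocking call):
#   HTTPServer(("127.0.0.1", 8080), TranslateHandler).serve_forever()
```

Keeping the model behind a plain JSON-over-HTTP endpoint makes it easy to plug the same service into chatbots, search engines, and the manual-evaluation workflow with native speakers.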
