aacl23-mnmt-tutorial

Additional resources from our AACL 2023 tutorial.

📘 Slides | ▶️ Recording

Reading List

Fundamental concepts

Architecture

  • Sequence to Sequence Learning with Neural Networks
    Paper

  • Neural Machine Translation by Jointly Learning to Align and Translate
    Paper

  • Attention Is All You Need
    Paper

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Paper

  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
    Paper Code

Vocabulary

  • Neural Machine Translation of Rare Words with Subword Units
    Paper Code

  • Neural Machine Translation with Byte-Level Subwords
    Paper

  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
    Paper Code
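
To make the subword approaches above concrete, here is a minimal sketch (assuming the sentencepiece Python package and a placeholder corpus.txt file with one sentence per line) that trains a small BPE model and segments a sentence with it; the vocabulary size and other settings are illustrative, not recommendations from the papers.

# Minimal SentencePiece BPE sketch; settings and file names are placeholders.
import sentencepiece as spm

# Train a small BPE model on a plain-text corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_bpe",
    vocab_size=8000,
    model_type="bpe",
    character_coverage=1.0,
)

# Load the trained model and segment a sentence into subword units.
sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
pieces = sp.encode("Neural machine translation of rare words.", out_type=str)
print(pieces)              # e.g. ['▁Ne', 'ural', '▁machine', ...]
print(sp.decode(pieces))   # round-trips back to the original sentence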

Prominent Massively Multilingual NMT systems

  • Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
    Paper

  • Massively Multilingual Neural Machine Translation
    Paper

  • Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
    Paper

  • Beyond English-Centric Multilingual Machine Translation (M2M-100)
    Paper Code

  • Multilingual Denoising Pre-training for Neural Machine Translation (MBART-25)
    Paper Code

  • Multilingual Translation from Denoising Pre-Training (MBART-50)
    Paper Code

  • DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders
    Paper Code

  • No Language Left Behind: Scaling Human-Centered Machine Translation (NLLB-200)
    Paper Code

  • MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
    Paper Model

  • Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning
    Paper
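
For a quick hands-on start with one of the open systems above, the hedged sketch below runs the publicly released NLLB-200 distilled checkpoint through the Hugging Face transformers API; the model ID, language codes, and generation settings are assumptions about that public release, not prescriptions from the papers.

# Hedged sketch: English-to-Hindi translation with an open massively
# multilingual model (assumes `pip install transformers sentencepiece`).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-600M"   # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Multilingual models share parameters across related languages."
inputs = tokenizer(text, return_tensors="pt")

# NLLB expects the target language code as the forced first decoder token.
target_lang_id = tokenizer.convert_tokens_to_ids("hin_Deva")
outputs = model.generate(**inputs, forced_bos_token_id=target_lang_id, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])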

Models for related languages

African

  • MMTAfrica: Multilingual Machine Translation for African Languages
    Paper Code

  • AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
    Paper Code

  • ANVITA-African: A Multilingual Neural Machine Translation System for African Languages
    Paper

  • AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
    Paper Code

Middle-East / North-African

  • AraBERT: Transformer-based Model for Arabic Language Understanding
    Paper Code

  • The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models (CAMeLBERT)
    Paper Code

South-East Asian

  • SG Translate Together - Uplifting Singapore’s translation standards with the community through technology
    Paper

  • IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation (IndoBART, IndoGPT)
    Paper Code

  • WangchanBERTa: Pretraining transformer-based Thai Language Models
    Paper Code

European languages

  • OPUS-MT – Building open translation services for the World
    Paper Code

Indigenous languages of America

  • IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
    Paper Code

  • Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
    Paper

Indian subcontinent

  • Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages (IndicTrans1)
    Paper Code

  • IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
    Paper Code

  • IndicBART: A Pre-trained Model for Indic Natural Language Generation
    Paper Code

China

  • ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
    Paper Code

  • CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
    Paper Code

Creoles

  • KreolMorisienMT: A Dataset for Mauritian Creole Machine Translation
    Paper Code

  • CreoleVal: Multilingual Multitask Benchmarks for Creoles
    Paper Code

Dataset Curation

Monolingual Data Curation - Large scale

  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Data:C4, Model:T5)
    Paper Code

  • mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer (Data:mC4, Model:mT5)
    Paper Code

  • The Pile: An 800GB Dataset of Diverse Text for Language Modeling
    Paper Data

  • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
    Paper

Monolingual Data Curation - Language-family specific


NOTE

We refer the reader to the papers on language-family-specific models listed above, as these cover monolingual data creation, bitext mining, and model training.

This subsection lists additional papers beyond those already mentioned.


  • IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
    Paper Code

  • Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages
    Paper Code

  • Varta: A Large-Scale Headline-Generation Dataset for Indic Languages
    Paper Code

  • WebCrawl African: A Multilingual Parallel Corpora for African Languages
    Paper Code

Parallel Corpora Creation

  • CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
    Paper Data

  • Billion-scale similarity search with GPUs (FAISS)
    Paper Code

  • CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
    Paper Code

  • xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
    Paper Data
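
The mining pipelines above (e.g. CCMatrix) rely on nearest-neighbour search over sentence embeddings. The sketch below shows only that retrieval step with FAISS, using random vectors as stand-ins for real embeddings; actual pipelines additionally rescore candidates (e.g. with margin-based scoring) before accepting a sentence pair as mined bitext.

# Hedged sketch of the nearest-neighbour step in embedding-based bitext
# mining (assumes `pip install faiss-cpu numpy`; random vectors stand in
# for real sentence embeddings such as those from LASER or LaBSE).
import faiss
import numpy as np

dim = 1024                                            # assumed embedding size
src = np.random.rand(10_000, dim).astype("float32")   # "source" sentences
tgt = np.random.rand(12_000, dim).astype("float32")   # "target" sentences

# Normalise so that inner product equals cosine similarity.
faiss.normalize_L2(src)
faiss.normalize_L2(tgt)

index = faiss.IndexFlatIP(dim)   # exact inner-product index
index.add(tgt)                   # index the target-side embeddings

# For each source sentence, retrieve its 4 nearest target candidates.
scores, neighbours = index.search(src, 4)
print(neighbours[0], scores[0])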

Sentence Embedding Models

  • Language-agnostic BERT Sentence Embedding
    Paper Code

  • LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation
    Paper Code

  • Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond (LASER1)
    Paper Code

  • Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages (LASER3)
    Paper Code

  • Multilingual Representation Distillation with Contrastive Learning (LASER3-CO)
    Paper

  • Learning Multilingual Sentence Representations with Cross-lingual Consistency Regularization (MuSR)
    Paper Code

  • SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
    Paper Code
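
As a small, hedged illustration of how such embeddings are used, the snippet below encodes English and Hindi sentences with the community LaBSE port available through sentence-transformers and compares them by cosine similarity; the model ID is an assumption, and LASER or SONAR would be used analogously through their own packages.

# Hedged sketch: cross-lingual sentence similarity with a language-agnostic
# encoder (assumes `pip install sentence-transformers`).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")   # assumed model ID

english = ["The cat sat on the mat.", "Economic growth slowed last year."]
hindi = ["बिल्ली चटाई पर बैठी।", "पिछले साल आर्थिक वृद्धि धीमी हुई।"]

emb_en = model.encode(english, normalize_embeddings=True)
emb_hi = model.encode(hindi, normalize_embeddings=True)

# Cosine similarity matrix; high diagonal values suggest parallel pairs.
print(util.cos_sim(emb_en, emb_hi))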

Data Quality vs. Scale

  • Data and Parameter Scaling Laws for Neural Machine Translation
    Paper Code

  • Data Scaling Laws in NMT: The Effect of Noise and Architecture
    Paper

  • “A Little is Enough”: Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation
    Paper

Human-annotated Seed Corpora

  • The TDIL Program and the Indian Language Corpora Initiative (ILCI)
    Paper

  • Small Data, Big Impact: Leveraging Minimal Data for Effective Machine Translation
    Paper Data

  • MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
    Paper Data

Benchmarks

  • The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
    Paper

  • The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
    Paper Data

  • NTREX-128 – News Test References for MT Evaluation of 128 Languages
    Paper Data

Modeling


NOTE

We refer the reader to the papers on massively multilingual models listed above, as these already cover several aspects of modeling.

This subsection lists additional papers beyond those already mentioned.


Vocabulary

  • How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
    Paper

  • Out-of-the-box Universal Romanization Tool uroman
    Paper

  • The IndicNLP Library
    Paper Code

  • Pre-training via Leveraging Assisting Languages for Neural Machine Translation
    Paper Code

  • BPE-Dropout: Simple and Effective Subword Regularization
    Paper

  • Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages
    Paper Code

  • Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study
    Paper Code

  • Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT2021
    Paper

  • Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation
    Paper

  • Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages
    Paper Code

  • Transfer Learning in Multilingual Neural Machine Translation with Dynamic Vocabulary
    Paper

Leveraging Ordering Information

  • Addressing word-order Divergence in Multilingual Neural Machine Translation for extremely Low Resource Languages
    Paper Code

  • Language Related Issues for Machine Translation between Closely Related South Slavic Languages
    Paper

  • A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
    Paper Code

  • Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review
    Paper

  • Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models
    Paper Code

  • JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
    Paper Code

Training

Joint Training / Language-Relatedness

  • Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism
    Paper

  • Multi-Task Learning for Multiple Language Translation
    Paper

  • Contact Relatedness can help improve multilingual NMT: Microsoft STCI-MT @ WMT20
    Paper

  • Investigating Multilingual NMT Representations at Scale
    Paper

  • Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages
    Paper

  • Multilingual Neural Machine Translation with Language Clustering
    Paper

  • Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations
    Paper Code

  • Delexicalized Cross-lingual Dependency Parsing for Xibe
    Paper

  • An Empirical Study of Language Relatedness for Transfer Learning in Neural Machine Translation
    Paper

  • Efficient Unsupervised NMT for Related Languages with Cross-Lingual Language Models and Fidelity Objectives
    Paper

  • Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
    Paper

  • Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation
    Paper Code

Data Curriculum / Multi-stage training

  • Instance Weighting for Neural Machine Translation Domain Adaptation
    Paper Code

  • Exploiting Multilingualism through Multistage Fine-Tuning for Low-Resource Neural Machine Translation
    Paper

  • Data Selection Curriculum for Neural Machine Translation
    Paper

Modeling

Mixture of Experts

  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
    Paper

  • ST-MoE: Designing Stable and Transferable Sparse Expert Models
    Paper Code

  • Towards Understanding Mixture of Experts in Deep Learning
    Paper Code

  • Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
    Paper

  • Universal Neural Machine Translation for Extremely Low Resource Languages
    Paper Code

  • Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
    Paper Code
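
As a rough, hedged illustration of the sparse routing idea behind GShard/ST-MoE-style layers, the sketch below implements a tiny top-1 mixture-of-experts feed-forward block in PyTorch; real systems add expert capacity limits, load-balancing losses, and model-parallel sharding, all omitted here.

# Minimal top-1 mixture-of-experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                               # x: (tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)  # routing distribution
        top_prob, top_idx = gate_probs.max(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                         # tokens routed to expert e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 512)).shape)   # torch.Size([10, 512])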

Decoder-only MT models

  • Examining Scaling and Transfer of Language Model Architectures for Machine Translation (LM4MT)
    Paper

  • ALMA: Advanced Language Model-based translator
    Paper Code

Zero-shot transfer learning / Adaptation to new languages

  • Rapid Adaptation of Neural Machine Translation to New Languages
    Paper Code

  • Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise
    Paper

  • Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages
    Paper

  • Improving Zero-Shot Translation by Disentangling Positional Information
    Paper Code

  • Simple, Scalable Adaptation for Neural Machine Translation
    Paper

  • T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation
    Paper

  • Parameter Sharing Methods for Multilingual Self-Attentional Translation Models
    Paper Code

  • From Bilingual to Multilingual Neural Machine Translation by Incremental Training
    Paper

  • Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation
    Paper

  • Improving Neural Machine Translation of Indigenous Languages with Multilingual Transfer Learning
    Paper
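
A minimal, hedged sketch of the bottleneck-adapter idea from "Simple, Scalable Adaptation for Neural Machine Translation" follows: a small residual feed-forward module that can be trained per language (or language pair) while the pretrained model stays frozen. The dimensions and placement are illustrative assumptions.

# Minimal residual bottleneck adapter (illustrative; typical recipes freeze
# the base model and insert one such module after each Transformer sub-layer,
# selecting the adapter by language or language pair).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, d_model)     # project back up

    def forward(self, hidden):
        # The residual connection preserves the frozen model's behaviour
        # when the adapter contributes little.
        return hidden + self.up(torch.relu(self.down(self.norm(hidden))))

adapter = Adapter()
hidden_states = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(adapter(hidden_states).shape)       # torch.Size([2, 16, 512])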

Model Compression

  • Sequence-Level Knowledge Distillation
    Paper Code

  • Learning both Weights and Connections for Efficient Neural Networks
    Paper

  • Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model
    Paper Code

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
    Paper Code

  • The case for 4-bit precision: k-bit Inference Scaling Laws
    Paper

  • An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models
    Paper Code

  • Multilingual Neural Machine Translation with Language Clustering
    Paper
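
On the quantization side (LLM.int8() and the 4-bit scaling-laws paper above), the hedged sketch below loads a translation checkpoint with 8-bit weights via transformers and bitsandbytes; the model ID and the availability of a CUDA GPU are assumptions.

# Hedged sketch: 8-bit weight loading for a translation model
# (assumes a CUDA GPU and `pip install transformers accelerate bitsandbytes`).
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

model_id = "facebook/nllb-200-distilled-600M"          # assumed public checkpoint
quant_config = BitsAndBytesConfig(load_in_8bit=True)   # LLM.int8()-style loading

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",   # place quantized weights on available devices
)
print(model.get_memory_footprint())   # roughly half the fp16 footprint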

Evaluation

Automatic Evaluation

  • Bleu: a Method for Automatic Evaluation of Machine Translation
    Paper

  • chrF: character n-gram F-score for automatic MT evaluation
    Paper

  • chrF++: words helping character n-grams
    Paper

  • A Call for Clarity in Reporting BLEU Scores
    Paper Code

  • BLEURT: Learning Robust Metrics for Text Generation
    Paper Code

  • Learning Compact Metrics for MT
    Paper

  • IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation Metrics for Indian Languages
    Paper Code

  • COMET: A Neural Framework for MT Evaluation
    Paper Code

  • Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET
    Paper Code

  • Extrinsic Evaluation of Machine Translation Metrics
    Paper

  • Large Language Models Are State-of-the-Art Evaluators of Translation Quality
    Paper Code

  • The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
    Paper
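
Several of the string-based metrics above are implemented in the sacrebleu package; the hedged sketch below computes corpus-level BLEU and chrF++ on toy placeholder sentences and prints the metric signature advocated by "A Call for Clarity in Reporting BLEU Scores".

# Hedged sketch: corpus-level BLEU and chrF++ with sacrebleu
# (assumes `pip install sacrebleu`; the sentences are toy placeholders).
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the cat sat on the mat", "growth slowed last year"]
references = [["the cat sat on a mat", "economic growth slowed last year"]]

bleu = BLEU()               # sacreBLEU's standardised BLEU
chrf = CHRF(word_order=2)   # word_order=2 yields chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
print(bleu.get_signature())   # reproducible metric signature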

Human Evaluation

  • Continuous Measurement Scales in Human Evaluation of Machine Translation
    Paper

  • Is Machine Translation Getting Better over Time?
    Paper

  • Multidimensional quality metrics: a flexible system for assessing translation quality
    Paper

  • Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
    Paper Code

  • SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
    Paper

  • Consistent Human Evaluation of Machine Translation across Language Pairs
    Paper

Toolkits

Citation

@InProceedings{gala-chitale-dabre:2023:ijcnlp,
  author    = {Gala, Jay and Chitale, Pranjal A. and Dabre, Raj},
  title     = {Developing State-Of-The-Art Massively Multilingual Machine Translation Systems for Related Languages},
  booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
  month     = {November},
  year      = {2023},
  address   = {Nusa Dua, Bali},
  publisher = {Association for Computational Linguistics},
  pages     = {35--42},
  url       = {https://aclanthology.org/2023.ijcnlp-tutorials.6}
}