aacl23-mnmt-tutorial

Additional resources from our AACL 2023 tutorial.

📘 Slides | ▶️ Recording

Reading List

Fundamental concepts

Architecture

  • Sequence to Sequence Learning with Neural Networks
    Paper

  • Neural Machine Translation by Jointly Learning to Align and Translate
    Paper

  • Attention Is All You Need
    Paper

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Paper

  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
    Paper Code

Vocabulary

  • Neural Machine Translation of Rare Words with Subword Units
    Paper Code

  • Neural Machine Translation with Byte-Level Subwords
    Paper

  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
    Paper Code
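
To make the subword approaches above concrete, here is a minimal sketch (assuming the sentencepiece Python package and a placeholder corpus.txt file with one sentence per line) that trains a small BPE model and segments a sentence with it; the vocabulary size and other settings are illustrative, not recommendations from the papers.

# Minimal SentencePiece BPE sketch; settings and file names are placeholders.
import sentencepiece as spm

# Train a small BPE model on a plain-text corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_bpe",
    vocab_size=8000,
    model_type="bpe",
    character_coverage=1.0,
)

# Load the trained model and segment a sentence into subword units.
sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
pieces = sp.encode("Neural machine translation of rare words.", out_type=str)
print(pieces)              # e.g. ['▁Ne', 'ural', '▁machine', ...]
print(sp.decode(pieces))   # round-trips back to the original sentence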

Prominent Massively Multilingual NMT systems

  • Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
    Paper

  • Massively Multilingual Neural Machine Translation
    Paper

  • Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
    Paper

  • Beyond English-Centric Multilingual Machine Translation (M2M-100)
    Paper Code

  • Multilingual Denoising Pre-training for Neural Machine Translation (MBART-25)
    Paper Code

  • Multilingual Translation from Denoising Pre-Training (MBART-50)
    Paper Code

  • DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders
    Paper Code

  • No Language Left Behind: Scaling Human-Centered Machine Translation (NLLB-200)
    Paper Code

  • MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
    Paper Model

  • Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning
    Paper
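
For a quick hands-on start with one of the open systems above, the hedged sketch below runs the publicly released NLLB-200 distilled checkpoint through the Hugging Face transformers API; the model ID, language codes, and generation settings are assumptions about that public release, not prescriptions from the papers.

# Hedged sketch: English-to-Hindi translation with an open massively
# multilingual model (assumes `pip install transformers sentencepiece`).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-600M"   # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Multilingual models share parameters across related languages."
inputs = tokenizer(text, return_tensors="pt")

# NLLB expects the target language code as the forced first decoder token.
target_lang_id = tokenizer.convert_tokens_to_ids("hin_Deva")
outputs = model.generate(**inputs, forced_bos_token_id=target_lang_id, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])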

Models for related languages

African

  • MMTAfrica: Multilingual Machine Translation for African Languages
    Paper Code

  • AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
    Paper Code

  • ANVITA-African: A Multilingual Neural Machine Translation System for African Languages
    Paper

  • AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
    Paper Code

Middle-East / North-African

  • AraBERT: Transformer-based Model for Arabic Language Understanding
    Paper Code

  • The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models (CAMeLBERT)
    Paper Code

South-East Asian

  • SG Translate Together - Uplifting Singapore’s translation standards with the community through technology
    Paper

  • IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation (IndoBART, IndoGPT)
    Paper Code

  • WangchanBERTa: Pretraining transformer-based Thai Language Models
    Paper Code

European languages

  • OPUS-MT – Building open translation services for the World
    Paper Code

Indigenous languages of America

  • IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
    Paper Code

  • Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
    Paper

Indian subcontinent

  • Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages (IndicTrans1)
    Paper Code

  • IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
    Paper Code

  • IndicBART: A Pre-trained Model for Indic Natural Language Generation
    Paper Code

China

  • ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
    Paper Code

  • CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
    Paper Code

Creoles

  • KreolMorisienMT: A Dataset for Mauritian Creole Machine Translation
    Paper Code

  • CreoleVal: Multilingual Multitask Benchmarks for Creoles
    Paper Code

Dataset Curation

Monolingual Data Curation - Large scale

  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Data:C4, Model:T5)
    Paper Code

  • mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer (Data:mC4, Model:mT5)
    Paper Code

  • The Pile: An 800GB Dataset of Diverse Text for Language Modeling
    Paper Data

  • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
    Paper

Monolingual Data Curation - Language-family specific


NOTE

We refer the reader to the papers on language-family-specific models listed above, as these cover monolingual data creation, bitext mining, and model training.

This subsection lists additional papers beyond those already mentioned.


  • IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
    Paper Code

  • Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages
    Paper Code

  • Varta: A Large-Scale Headline-Generation Dataset for Indic Languages
    Paper Code

  • WebCrawl African: A Multilingual Parallel Corpora for African Languages
    Paper Code

Parallel Corpora Creation

  • CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
    Paper Data

  • Billion-scale similarity search with GPUs (FAISS)
    Paper Code

  • CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
    Paper Code

  • xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
    Paper Data
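
The mining pipelines above (e.g. CCMatrix) rely on nearest-neighbour search over sentence embeddings. The sketch below shows only that retrieval step with FAISS, using random vectors as stand-ins for real embeddings; actual pipelines additionally rescore candidates (e.g. with margin-based scoring) before accepting a sentence pair as mined bitext.

# Hedged sketch of the nearest-neighbour step in embedding-based bitext
# mining (assumes `pip install faiss-cpu numpy`; random vectors stand in
# for real sentence embeddings such as those from LASER or LaBSE).
import faiss
import numpy as np

dim = 1024                                            # assumed embedding size
src = np.random.rand(10_000, dim).astype("float32")   # "source" sentences
tgt = np.random.rand(12_000, dim).astype("float32")   # "target" sentences

# Normalise so that inner product equals cosine similarity.
faiss.normalize_L2(src)
faiss.normalize_L2(tgt)

index = faiss.IndexFlatIP(dim)   # exact inner-product index
index.add(tgt)                   # index the target-side embeddings

# For each source sentence, retrieve its 4 nearest target candidates.
scores, neighbours = index.search(src, 4)
print(neighbours[0], scores[0])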

Sentence Embedding Models

  • Language-agnostic BERT Sentence Embedding
    Paper Code

  • LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation
    Paper Code

  • Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond (LASER1)
    Paper Code

  • Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages (LASER3)
    Paper Code

  • Multilingual Representation Distillation with Contrastive Learning (LASER3-CO)
    Paper

  • Learning Multilingual Sentence Representations with Cross-lingual Consistency Regularization (MuSR)
    Paper Code

  • SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
    Paper Code
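
As a small, hedged illustration of how such embeddings are used, the snippet below encodes English and Hindi sentences with the community LaBSE port available through sentence-transformers and compares them by cosine similarity; the model ID is an assumption, and LASER or SONAR would be used analogously through their own packages.

# Hedged sketch: cross-lingual sentence similarity with a language-agnostic
# encoder (assumes `pip install sentence-transformers`).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")   # assumed model ID

english = ["The cat sat on the mat.", "Economic growth slowed last year."]
hindi = ["बिल्ली चटाई पर बैठी।", "पिछले साल आर्थिक वृद्धि धीमी हुई।"]

emb_en = model.encode(english, normalize_embeddings=True)
emb_hi = model.encode(hindi, normalize_embeddings=True)

# Cosine similarity matrix; high diagonal values suggest parallel pairs.
print(util.cos_sim(emb_en, emb_hi))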

Data Quality vs. Scale

  • Data and Parameter Scaling Laws for Neural Machine Translation
    Paper Code

  • Data Scaling Laws in NMT: The Effect of Noise and Architecture
    Paper

  • “A Little is Enough”: Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation
    Paper

Human-annotated Seed Corpora

  • The TDIL Program and the Indian Language Corpora Initiative (ILCI)
    Paper

  • Small Data, Big Impact: Leveraging Minimal Data for Effective Machine Translation
    Paper Data

  • MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
    Paper Data

Benchmarks

  • The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
    Paper

  • The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
    Paper Data

  • NTREX-128 – News Test References for MT Evaluation of 128 Languages
    Paper Data

Modeling


NOTE

We refer the reader to the papers on massively multilingual models listed above, as these already cover several aspects of modeling.

This subsection lists additional papers beyond those already mentioned.


Vocabulary

  • How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
    Paper

  • Out-of-the-box Universal Romanization Tool uroman
    Paper

  • The IndicNLP Library
    Paper Code

  • Pre-training via Leveraging Assisting Languages for Neural Machine Translation
    Paper Code

  • BPE-Dropout: Simple and Effective Subword Regularization
    Paper

  • Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages
    Paper Code

  • Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study
    Paper Code

  • Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT2021
    Paper

  • Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation
    Paper

  • Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages
    Paper Code

  • Transfer Learning in Multilingual Neural Machine Translation with Dynamic Vocabulary
    Paper

Leveraging Ordering Information

  • Addressing word-order Divergence in Multilingual Neural Machine Translation for extremely Low Resource Languages
    Paper Code

  • Language Related Issues for Machine Translation between Closely Related South Slavic Languages
    Paper

  • A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
    Paper Code

  • Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review
    Paper

  • Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models
    Paper Code

  • JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
    Paper Code

Training

Joint Training / Language-Relatedness

  • Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism
    Paper

  • Multi-Task Learning for Multiple Language Translation
    Paper

  • Contact Relatedness can help improve multilingual NMT: Microsoft STCI-MT @ WMT20
    Paper

  • Investigating Multilingual NMT Representations at Scale
    Paper

  • Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages
    Paper

  • Multilingual Neural Machine Translation with Language Clustering
    Paper

  • Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations
    Paper Code

  • Delexicalized Cross-lingual Dependency Parsing for Xibe
    Paper

  • An Empirical Study of Language Relatedness for Transfer Learning in Neural Machine Translation
    Paper

  • Efficient Unsupervised NMT for Related Languages with Cross-Lingual Language Models and Fidelity Objectives
    Paper

  • Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
    Paper

  • Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation
    Paper Code

Data Curriculum / Multi-stage training

  • Instance Weighting for Neural Machine Translation Domain Adaptation
    Paper Code

  • Exploiting Multilingualism through Multistage Fine-Tuning for Low-Resource Neural Machine Translation
    Paper

  • Data Selection Curriculum for Neural Machine Translation
    Paper

Modeling

Mixture of Experts

  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
    Paper

  • ST-MoE: Designing Stable and Transferable Sparse Expert Models
    Paper Code

  • Towards Understanding Mixture of Experts in Deep Learning
    Paper Code

  • Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
    Paper

  • Universal Neural Machine Translation for Extremely Low Resource Languages
    Paper Code

  • Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
    Paper Code
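
As a rough, hedged illustration of the sparse routing idea behind GShard/ST-MoE-style layers, the sketch below implements a tiny top-1 mixture-of-experts feed-forward block in PyTorch; real systems add expert capacity limits, load-balancing losses, and model-parallel sharding, all omitted here.

# Minimal top-1 mixture-of-experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                               # x: (tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)  # routing distribution
        top_prob, top_idx = gate_probs.max(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                         # tokens routed to expert e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 512)).shape)   # torch.Size([10, 512])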

Decoder-only MT models

  • Examining Scaling and Transfer of Language Model Architectures for Machine Translation (LM4MT)
    Paper

  • ALMA: Advanced Language Model-based translator
    Paper Code

Zero-shot transfer learning / Adaptation to new languages

  • Rapid Adaptation of Neural Machine Translation to New Languages
    Paper Code

  • Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise
    Paper

  • Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages
    Paper

  • Improving Zero-Shot Translation by Disentangling Positional Information
    Paper Code

  • Simple, Scalable Adaptation for Neural Machine Translation
    Paper

  • T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation
    Paper

  • Parameter Sharing Methods for Multilingual Self-Attentional Translation Models
    Paper Code

  • From Bilingual to Multilingual Neural Machine Translation by Incremental Training
    Paper

  • Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation
    Paper

  • Improving Neural Machine Translation of Indigenous Languages with Multilingual Transfer Learning
    Paper
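
A minimal, hedged sketch of the bottleneck-adapter idea from "Simple, Scalable Adaptation for Neural Machine Translation" follows: a small residual feed-forward module that can be trained per language (or language pair) while the pretrained model stays frozen. The dimensions and placement are illustrative assumptions.

# Minimal residual bottleneck adapter (illustrative; typical recipes freeze
# the base model and insert one such module after each Transformer sub-layer,
# selecting the adapter by language or language pair).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, d_model)     # project back up

    def forward(self, hidden):
        # The residual connection preserves the frozen model's behaviour
        # when the adapter contributes little.
        return hidden + self.up(torch.relu(self.down(self.norm(hidden))))

adapter = Adapter()
hidden_states = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(adapter(hidden_states).shape)       # torch.Size([2, 16, 512])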

Model Compression

  • Sequence-Level Knowledge Distillation
    Paper Code

  • Learning both Weights and Connections for Efficient Neural Networks
    Paper

  • Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model
    Paper Code

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
    Paper Code

  • The case for 4-bit precision: k-bit Inference Scaling Laws
    Paper

  • An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models
    Paper Code

  • Multilingual Neural Machine Translation with Language Clustering
    Paper
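
On the quantization side (LLM.int8() and the 4-bit scaling-laws paper above), the hedged sketch below loads a translation checkpoint with 8-bit weights via transformers and bitsandbytes; the model ID and the availability of a CUDA GPU are assumptions.

# Hedged sketch: 8-bit weight loading for a translation model
# (assumes a CUDA GPU and `pip install transformers accelerate bitsandbytes`).
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

model_id = "facebook/nllb-200-distilled-600M"          # assumed public checkpoint
quant_config = BitsAndBytesConfig(load_in_8bit=True)   # LLM.int8()-style loading

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",   # place quantized weights on available devices
)
print(model.get_memory_footprint())   # roughly half the fp16 footprint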

Evaluation

Automatic Evaluation

  • Bleu: a Method for Automatic Evaluation of Machine Translation
    Paper

  • chrF: character n-gram F-score for automatic MT evaluation
    Paper

  • chrF++: words helping character n-grams
    Paper

  • A Call for Clarity in Reporting BLEU Scores
    Paper Code

  • BLEURT: Learning Robust Metrics for Text Generation
    Paper Code

  • Learning Compact Metrics for MT
    Paper

  • IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation Metrics for Indian Languages
    Paper Code

  • COMET: A Neural Framework for MT Evaluation
    Paper Code

  • Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET
    Paper Code

  • Extrinsic Evaluation of Machine Translation Metrics
    Paper

  • Large Language Models Are State-of-the-Art Evaluators of Translation Quality
    Paper Code

  • The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
    Paper
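
Several of the string-based metrics above are implemented in the sacrebleu package; the hedged sketch below computes corpus-level BLEU and chrF++ on toy placeholder sentences and prints the metric signature advocated by "A Call for Clarity in Reporting BLEU Scores".

# Hedged sketch: corpus-level BLEU and chrF++ with sacrebleu
# (assumes `pip install sacrebleu`; the sentences are toy placeholders).
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the cat sat on the mat", "growth slowed last year"]
references = [["the cat sat on a mat", "economic growth slowed last year"]]

bleu = BLEU()               # sacreBLEU's standardised BLEU
chrf = CHRF(word_order=2)   # word_order=2 yields chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
print(bleu.get_signature())   # reproducible metric signature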

Human Evaluation

  • Continuous Measurement Scales in Human Evaluation of Machine Translation
    Paper

  • Is Machine Translation Getting Better over Time?
    Paper

  • Multidimensional quality metrics: a flexible system for assessing translation quality
    Paper

  • Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
    Paper Code

  • SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
    Paper

  • Consistent Human Evaluation of Machine Translation across Language Pairs
    Paper

Toolkits

Citation

@InProceedings{gala-chitale-dabre:2023:ijcnlp,
  author    = {Gala, Jay and Chitale, Pranjal A. and Dabre, Raj},
  title     = {Developing State-Of-The-Art Massively Multilingual Machine Translation Systems for Related Languages},
  booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
  month     = {November},
  year      = {2023},
  address   = {Nusa Dua, Bali},
  publisher = {Association for Computational Linguistics},
  pages     = {35--42},
  url       = {https://aclanthology.org/2023.ijcnlp-tutorials.6}
}