-
Sequence to Sequence Learning with Neural Networks
Paper -
Neural Machine Translation by Jointly Learning to Align and Translate
Paper -
Attention Is All You Need
Paper -
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper -
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Paper Code
-
Neural Machine Translation of Rare Words with Subword Units
Paper Code -
Neural Machine Translation with Byte-Level Subwords
Paper -
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Paper Code
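The subword papers above (BPE, byte-level subwords, SentencePiece) all follow the same train-then-segment workflow. Below is a minimal sketch using the `sentencepiece` Python package released with the SentencePiece paper; the corpus file name, vocabulary size, and model type are illustrative assumptions, not settings prescribed by any of the papers.

```python
import sentencepiece as spm  # pip install sentencepiece

# Train a subword model on a plain-text corpus (one sentence per line).
# "corpus.txt", the BPE model type, and the 8k vocabulary are illustrative choices.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="subword_demo",
    vocab_size=8000,
    model_type="bpe",
    character_coverage=1.0,
)

# Load the trained model, segment text into subword pieces, and restore it.
sp = spm.SentencePieceProcessor(model_file="subword_demo.model")
pieces = sp.encode("Neural machine translation of rare words", out_type=str)
print(pieces)
print(sp.decode(sp.encode("Neural machine translation of rare words")))
```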
-
Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
Paper -
Massively Multilingual Neural Machine Translation
Paper -
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Paper -
Beyond English-Centric Multilingual Machine Translation (M2M-100)
Paper Code -
Multilingual Denoising Pre-training for Neural Machine Translation (MBART-25)
Paper Code -
Multilingual Translation from Denoising Pre-Training (MBART-50)
Paper Code -
DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders
Paper Code -
No Language Left Behind: Scaling Human-Centered Machine Translation (NLLB-200)
Paper Code -
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Paper Model -
Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning
Paper
-
MMTAfrica: Multilingual Machine Translation for African Languages
Paper Code -
AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
Paper Code -
ANVITA-African: A Multilingual Neural Machine Translation System for African Languages
Paper -
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
Paper Code
-
AraBERT: Transformer-based Model for Arabic Language Understanding
Paper Code -
The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models (CAMeLBERT)
Paper Code
-
SG Translate Together - Uplifting Singapore’s translation standards with the community through technology
Paper -
IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation (IndoBART, IndoGPT)
Paper Code -
WangchanBERTa: Pretraining transformer-based Thai Language Models
Paper Code
-
IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
Paper Code -
Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
Paper
-
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages (IndicTrans1)
Paper Code -
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
Paper Code -
IndicBART: A Pre-trained Model for Indic Natural Language Generation
Paper Code
-
ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
Paper Code -
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
Paper Code
-
KreolMorisienMT: A Dataset for Mauritian Creole Machine Translation
Paper Code -
CreoleVal: Multilingual Multitask Benchmarks for Creoles
Paper Code
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Data:C4, Model:T5)
Paper Code -
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer (Data:mC4, Model:mT5)
Paper Code -
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Paper Data -
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Paper
NOTE
We refer the reader to the papers on language-family-specific models above, as those already cover monolingual data creation, bitext mining, and model training.
This subsection lists additional papers beyond those mentioned above.
-
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
Paper Code -
Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages
Paper Code -
Varta: A Large-Scale Headline-Generation Dataset for Indic Languages
Paper Code -
WebCrawl African: A Multilingual Parallel Corpora for African Languages
Paper Code
-
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
Paper Data -
Billion-scale similarity search with GPUs (FAISS)
Paper Code -
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
Paper Code -
xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
Paper Data
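The mining papers above (CCAligned, CCMatrix, xSIM++) pair multilingual sentence embeddings with FAISS nearest-neighbour search to find candidate translation pairs. The sketch below is a rough illustration only: it runs an exact cosine-similarity search with FAISS over random vectors standing in for real sentence embeddings, and omits the margin-based scoring used in the actual mining pipelines.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 1024  # embedding dimension; real pipelines use LASER-style sentence encoders
rng = np.random.default_rng(0)
src_emb = rng.standard_normal((10_000, d)).astype("float32")  # stand-in source embeddings
tgt_emb = rng.standard_normal((100, d)).astype("float32")     # stand-in target embeddings

# L2-normalise rows so that inner product equals cosine similarity.
faiss.normalize_L2(src_emb)
faiss.normalize_L2(tgt_emb)

index = faiss.IndexFlatIP(d)  # exact search; approximate IVF/PQ indexes are used at billion scale
index.add(src_emb)

scores, ids = index.search(tgt_emb, 4)  # top-4 source candidates per target sentence
print(ids[0], scores[0])
```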
-
LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation
Paper Code -
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond (LASER1)
Paper Code -
Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages (LASER3)
Paper Code -
Multilingual Representation Distillation with Contrastive Learning (LASER3-CO)
Paper -
Learning Multilingual Sentence Representations with Cross-lingual Consistency Regularization (MuSR)
Paper Code -
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
Paper Code
-
Data and Parameter Scaling Laws for Neural Machine Translation
Paper Code -
Data Scaling Laws in NMT: The Effect of Noise and Architecture
Paper -
“A Little is Enough”: Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation
Paper
-
The TDIL Program and the Indian Language Corpora Initiative (ILCI)
Paper -
Small Data, Big Impact: Leveraging Minimal Data for Effective Machine Translation
Paper Data -
MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
Paper Data
-
The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
Paper -
The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
Paper Data -
NTREX-128 – News Test References for MT Evaluation of 128 Languages
Paper Data
NOTE
We refer the reader to the papers on massively multilingual models above, as those already cover some aspects of modeling.
This subsection lists additional papers beyond those mentioned above.
-
How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
Paper -
Out-of-the-box Universal Romanization Tool uroman
Paper -
Pre-training via Leveraging Assisting Languages for Neural Machine Translation
Paper Code -
BPE-Dropout: Simple and Effective Subword Regularization
Paper -
Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages
Paper Code -
Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study
Paper Code -
Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT2021
Paper -
Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation
Paper -
Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages
Paper Code -
Transfer Learning in Multilingual Neural Machine Translation with Dynamic Vocabulary
Paper
-
Addressing word-order Divergence in Multilingual Neural Machine Translation for extremely Low Resource Languages
Paper Code
-
Language Related Issues for Machine Translation between Closely Related South Slavic Languages
Paper -
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
Paper Code -
Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review
Paper -
Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models
Paper Code -
JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
Paper Code
-
Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism
Paper -
Multi-Task Learning for Multiple Language Translation
Paper -
Contact Relatedness can help improve multilingual NMT: Microsoft STCI-MT @ WMT20
Paper -
Investigating Multilingual NMT Representations at Scale
Paper -
Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages
Paper -
Multilingual Neural Machine Translation with Language Clustering
Paper -
Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations
Paper Code -
Delexicalized Cross-lingual Dependency Parsing for Xibe
Paper -
An Empirical Study of Language Relatedness for Transfer Learning in Neural Machine Translation
Paper -
Efficient Unsupervised NMT for Related Languages with Cross-Lingual Language Models and Fidelity Objectives
Paper -
Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
Paper -
Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation
Paper Code
-
Instance Weighting for Neural Machine Translation Domain Adaptation
Paper Code -
Exploiting Multilingualism through Multistage Fine-Tuning for Low-Resource Neural Machine Translation
Paper -
Data Selection Curriculum for Neural Machine Translation
Paper
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Paper -
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Paper Code -
Towards Understanding Mixture of Experts in Deep Learning
Paper Code -
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Paper -
Universal Neural Machine Translation for Extremely Low Resource Languages
Paper Code -
Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
Paper Code
-
Examining Scaling and Transfer of Language Model Architectures for Machine Translation (LM4MT)
Paper
-
Rapid Adaptation of Neural Machine Translation to New Languages
Paper Code -
Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise
Paper -
Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages
Paper -
Improving Zero-Shot Translation by Disentangling Positional Information
Paper Code -
Simple, Scalable Adaptation for Neural Machine Translation
Paper
-
T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation
Paper -
Parameter Sharing Methods for Multilingual Self-Attentional Translation Models
Paper Code -
From Bilingual to Multilingual Neural Machine Translation by Incremental Training
Paper -
Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation
Paper -
Improving Neural Machine Translation of Indigenous Languages with Multilingual Transfer Learning
Paper
-
Learning both Weights and Connections for Efficient Neural Networks
Paper -
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model
Paper Code -
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Paper Code -
The case for 4-bit precision: k-bit Inference Scaling Laws
Paper
-
An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models
Paper Code -
Multilingual Neural Machine Translation with Language Clustering
Paper
-
Bleu: a Method for Automatic Evaluation of Machine Translation
Paper -
chrF: character n-gram F-score for automatic MT evaluation
Paper -
chrF++: words helping character n-grams
Paper -
BLEURT: Learning Robust Metrics for Text Generation
Paper Code -
Learning Compact Metrics for MT
Paper -
IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation Metrics for Indian Languages
Paper Code -
Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET
Paper Code -
Extrinsic Evaluation of Machine Translation Metrics
Paper -
Large Language Models Are State-of-the-Art Evaluators of Translation Quality
Paper Code -
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation
Paper
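For the string-based metrics above (BLEU, chrF, chrF++), scores are commonly reproduced with the `sacrebleu` package; the package itself is not referenced in this list, so treat the snippet below as one possible implementation rather than the setup used in any particular paper.

```python
from sacrebleu.metrics import BLEU, CHRF  # pip install sacrebleu

hypotheses = ["the cat sat on the mat", "a dog barks"]
references = [["the cat is sitting on the mat", "the dog barks"]]  # one reference per hypothesis

bleu = BLEU()
chrf = CHRF()                  # character n-gram F-score (chrF)
chrf_pp = CHRF(word_order=2)   # chrF++ adds word unigrams and bigrams

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
print(chrf_pp.corpus_score(hypotheses, references))
```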
-
Continuous Measurement Scales in Human Evaluation of Machine Translation
Paper -
Is Machine Translation Getting Better over Time?
Paper -
Multidimensional quality metrics: a flexible system for assessing translation quality
Paper -
Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
Paper Code -
SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
Paper -
Consistent Human Evaluation of Machine Translation across Language Pairs
Paper
@InProceedings{gala-chitale-dabre:2023:ijcnlp,
author = {Gala, Jay and Chitale, Pranjal A. and Dabre, Raj},
title = {Developing State-Of-The-Art Massively Multilingual Machine Translation Systems for Related Languages},
booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
month = {November},
year = {2023},
address = {Nusa Dua, Bali},
publisher = {Association for Computational Linguistics},
pages = {35--42},
url = {https://aclanthology.org/2023.ijcnlp-tutorials.6}
}