🧠 Training a Multilingual Translator

A practical and extensible project for fine-tuning multilingual translation models using LoRA. Covers English ↔ Chinese, English ↔ Nepali, and combines both for multi-task learning on a compact Llama-3.2-3B base.


🚀 Project Highlights

  • ✅ English ↔ Chinese and English ↔ Nepali translation models fine-tuned via LoRA (Low-Rank Adaptation) on top of Qwen/Qwen2.5-0.5B
  • ✅ Combined multilingual model (Chinese ↔ English ↔ Nepali) fine-tuned via LoRA on top of meta-llama/Llama-3.2-3B
  • ✅ Hugging Face Transformers and Trainer APIs for reproducible training in Jupyter notebooks (see the sketch below)
  • ✅ Evaluation metrics: BLEU and chrF++ (details below)
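
For orientation, here is a minimal sketch of a LoRA + Trainer setup of the kind used in the notebooks. The hyperparameters and the `train_dataset` placeholder are illustrative assumptions, not the exact values from the notebooks.

```python
# Minimal sketch of LoRA fine-tuning with Hugging Face Transformers + peft.
# Hyperparameters and train_dataset are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_model = "Qwen/Qwen2.5-0.5B"  # or "meta-llama/Llama-3.2-3B" for the combined multilingual run
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach low-rank adapters; only the adapter weights are updated during training.
lora_config = LoraConfig(
    r=16,                               # adapter rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # sanity check: only a small fraction is trainable

training_args = TrainingArguments(
    output_dir="lora-translator",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=50,
)

# train_dataset: a tokenized dataset of translation prompts (see the dataset format below)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```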

📊 Evaluation Results (Multilingual Model)

| Language Pair | BLEU Score Progression | chrF++ Score Progression |
|---------------|------------------------|--------------------------|
| ZH → EN | 2.86 → 18.63 → 19.05 | 14.10 → 45.92 → 45.81 |
| EN → ZH | 0.00 → - → 0.00 | 1.42 → 18.80 → 19.21 |
| NE → EN | 0.00 → 19.34 → 20.42 | 8.73 → 41.94 → 42.01 |
| EN → NE | 0.00 → - → 0.00 | 0.12 → 27.90 → 30.38 |

📎 These improvements were achieved through progressive fine-tuning with LoRA using merged datasets.
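
For reference, a minimal sketch of how BLEU and chrF++ scores can be computed with the sacrebleu library; the sentences below are toy examples, not from the project's test sets.

```python
# Minimal sketch of scoring translations with sacreBLEU (BLEU and chrF++).
import sacrebleu

hypotheses = ["The weather is nice today.", "He is reading a book."]   # model outputs
references = [["The weather is good today.", "He reads a book."]]      # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)     # word_order=2 gives chrF++

print(f"BLEU:   {bleu.score:.2f}")
print(f"chrF++: {chrf.score:.2f}")
```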


📁 Notebooks

You can open and run each notebook for training or inference:

  • En-Zh.ipynb: English ↔ Chinese LoRA fine-tuning
  • En-Ne.ipynb: English ↔ Nepali LoRA fine-tuning
  • Zh-En-Ne.ipynb: Multilingual fine-tuning with merged datasets

🧪 All training was done in Jupyter notebooks using the Hugging Face Trainer
🧠 Easily extendable to new language pairs by updating the lang_pair field in the dataset format:
{"input": ..., "output": ..., "lang_pair": "en-zh"}


✨ Future Plans

  • 🔄 Investigate the use of English as a bridging high-resource language to enhance zero-shot or few-shot translation between unseen low-resource pairs (e.g., Nepali ↔ Chinese via English)
  • 🧪 Experiment with pivot translation and triangular training setups to evaluate whether they improve generalization (see the sketch after this list)
  • 🔗 Explore multilingual token alignment and shared embedding space techniques for better bridging
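
As a rough illustration of the pivot idea, the sketch below chains two translation steps through English. The translate helper and its prompt format are hypothetical, not the project's actual inference code.

```python
# Hypothetical sketch of pivot translation: Nepali -> English -> Chinese.
def translate(model, tokenizer, text, lang_pair):
    prompt = f"[{lang_pair}] {text}\nTranslation:"       # assumed prompt template
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated continuation, dropping the prompt tokens.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

def pivot_translate(model, tokenizer, text, src="ne", tgt="zh", pivot="en"):
    # Step 1: low-resource source -> English (a direction seen during fine-tuning).
    english = translate(model, tokenizer, text, f"{src}-{pivot}")
    # Step 2: English -> low-resource target (also seen during fine-tuning).
    return translate(model, tokenizer, english, f"{pivot}-{tgt}")
```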

🙏 Acknowledgements


📄 License

This project is open-sourced under the MIT License.
Please credit the original dataset providers and model creators when reusing this work.