
[EACL 2023] ViPubmed: Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation

Primary language: Python. License: MIT.

ViPubmed: Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation


Overview

Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretraining and supervised data in the biomedical domain. Further, we release ViMedNLI, a new NLP task in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts.

📝 Paper

📝 Blog Post

Methods

We translate 20M PubMed abstracts from English to Vietnamese at large scale and pretrain a biomedical encoder-decoder model on the translated dataset.


1. Pretrained Models (ViPubmedT5)

Vocabulary: ViT5_vocab

| Model | Gin File Location | Checkpoint Location | Domain | Pretraining Corpus |
|---|---|---|---|---|
| ViPubmedT5 Base | ViT5_base.gin | gs://vietai_public/vipubmedt5_base/checkpoint_1500000 | Biomedical | Translated ViPubmed |
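
The checkpoint location above is a Google Cloud Storage URI whose last path component appears to follow a `checkpoint_<step>` naming pattern. As a minimal sketch, assuming that pattern holds, the snippet below splits such a URI into bucket, object path, and training step. `CheckpointRef` and `parse_checkpoint_uri` are hypothetical helper names for illustration and are not part of this repository.

```python
import re
from typing import NamedTuple


class CheckpointRef(NamedTuple):
    """Hypothetical container for the parts of a gs:// checkpoint URI."""
    bucket: str
    object_path: str
    step: int


def parse_checkpoint_uri(uri: str) -> CheckpointRef:
    """Split a gs:// URI into bucket, object path, and training step.

    Assumes the last path component looks like `checkpoint_<step>`.
    """
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"expected a gs:// URI, got: {uri!r}")
    # Everything up to the first slash is the bucket; the rest is the object path.
    bucket, _, object_path = uri[len(scheme):].partition("/")
    object_path = object_path.rstrip("/")
    # The final path component carries the step number.
    leaf = object_path.rsplit("/", 1)[-1]
    match = re.fullmatch(r"checkpoint_(\d+)", leaf)
    if match is None:
        raise ValueError(f"no checkpoint_<step> component in: {uri!r}")
    return CheckpointRef(bucket, object_path, int(match.group(1)))


ref = parse_checkpoint_uri(
    "gs://vietai_public/vipubmedt5_base/checkpoint_1500000"
)
print(ref.bucket, ref.step)
```

A helper like this can be handy for logging which pretraining step a finetuning run starts from. It only inspects the string and does not contact Cloud Storage.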

2. Finetuning

Finetuning example with T5X and Flaxformer: finetunning_vipubmedt5_example.ipynb

3. Released Datasets

  • ViMedNLI: A Natural Language Inference Dataset For The Vietnamese Clinical Domain
  • ViPubmed: 20M Vietnamese biomedical abstracts generated by large-scale translation
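
To finetune an encoder-decoder model such as ViPubmedT5 on an NLI dataset like ViMedNLI, each premise/hypothesis pair is typically cast as a text-to-text example. The sketch below shows one possible formatting. The task prefix, the `premise:`/`hypothesis:` field markers, and the helper names are assumptions for illustration and may differ from what the finetuning notebook uses. The three labels are those of MedNLI, and the Vietnamese sentences are invented for this example, not taken from the dataset.

```python
from dataclasses import dataclass

# MedNLI's three-way label set, which ViMedNLI inherits through translation.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    label: str


def to_text_to_text(example: NLIExample, task_prefix: str = "vimednli") -> dict:
    """Format an NLI pair as an (inputs, targets) text pair.

    The prefix and field markers are hypothetical choices, not the
    repository's confirmed format.
    """
    if example.label not in LABELS:
        raise ValueError(f"unknown label: {example.label!r}")
    inputs = (
        f"{task_prefix}: premise: {example.premise} "
        f"hypothesis: {example.hypothesis}"
    )
    return {"inputs": inputs, "targets": example.label}


# Invented illustration: "The patient has a high fever." versus
# "The patient has a normal body temperature."
example = NLIExample(
    premise="Bệnh nhân bị sốt cao.",
    hypothesis="Bệnh nhân có nhiệt độ cơ thể bình thường.",
    label="contradiction",
)
formatted = to_text_to_text(example)
print(formatted["inputs"])
print(formatted["targets"])
```

Emitting the label as a plain string keeps the task in the same text-to-text form as pretraining, so no classification head is needed.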

Citation

If you find our work helpful, please cite the following:

@misc{vipubmed,
  doi = {10.48550/ARXIV.2210.05598},
  url = {https://arxiv.org/abs/2210.05598},
  author = {Phan, Long and Dang, Tai and Tran, Hieu and Trinh, Trieu H. and Phan, Vy and Chau, Lam D. and Luong, Minh-Thang},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Acknowledgment

We would like to thank the Google TPU Research Cloud (TRC) program and Soonson Kwon (Google ML Ecosystem programs Lead) for their support.