X-retroMAE-2: Duplex Masked Auto-Encoder for the RoBERTa language model

X-DupMAE

This project runs RetroMAE v2 (Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models) on a RoBERTa backbone.

The source in this repository is cloned from @hieudx149's X-RetroMAE repository.

X-DupMAE modifies RetroMAE v2 to be compatible with RoBERTa and XLM-RoBERTa. Hopefully this project helps anyone who wants to apply RetroMAE v2 to a language other than English.
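
For readers new to the method, the sketch below illustrates the extra decoding task that distinguishes RetroMAE v2 (DupMAE) from RetroMAE: the sentence embedding taken from the encoder's <s> token is projected onto the vocabulary and trained with a bag-of-words objective over the input tokens, alongside the usual masked-token reconstruction. This is a rough illustration based on the paper's description, not the code in modeling_duplex.py; the function and argument names are hypothetical.

import torch
import torch.nn.functional as F

def bag_of_words_loss(cls_embedding, vocab_projection, input_ids, pad_token_id):
    # cls_embedding: [batch, hidden], the encoder's <s> representation
    # vocab_projection: an nn.Linear(hidden, vocab_size) head
    # input_ids: [batch, seq_len] token ids of the original (unmasked) input
    logits = vocab_projection(cls_embedding)        # [batch, vocab_size]
    logp = F.log_softmax(logits, dim=-1)
    mask = (input_ids != pad_token_id).float()      # ignore padding positions
    token_logp = torch.gather(logp, 1, input_ids)   # log-prob assigned to each input token
    # The <s> embedding is pushed to "cover" the bag of words of its own input.
    return -(token_logp * mask).sum() / mask.sum()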

Modification

Compared to the hieudx149 version:

  • Copy RetroMAE v2's modeling_duplex.py and change every Bert* class to its Roberta* counterpart (see the sketch after this list)
  • Copy the DupMAECollator class into data.py
  • Copy the code that switches between retromae and dupmae into run.py
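
The snippet below is a minimal sketch of what the Bert* to Roberta* substitution looks like; the wrapper class name is made up for illustration, and the repository's modeling_duplex.py remains the actual implementation.

import torch.nn as nn
from transformers import RobertaForMaskedLM   # was: BertForMaskedLM; use XLMRobertaForMaskedLM for XLM-R

class RobertaDupMAEEncoder(nn.Module):         # hypothetical wrapper, for illustration only
    def __init__(self, model_name_or_path="roberta-base"):
        super().__init__()
        # was: self.lm = BertForMaskedLM.from_pretrained(model_name_or_path)
        self.lm = RobertaForMaskedLM.from_pretrained(model_name_or_path)

    def forward(self, input_ids, attention_mask):
        out = self.lm(input_ids=input_ids,
                      attention_mask=attention_mask,
                      output_hidden_states=True,
                      return_dict=True)
        # Sentence embedding = last hidden state of the <s> token (RoBERTa's CLS equivalent).
        sentence_embedding = out.hidden_states[-1][:, 0]
        return out.logits, sentence_embedding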

Setup

pip install --upgrade pip
pip install -r requirements.txt

Run pretraining

First, make sure you have preprocessed your data by running preprocessing.py in examples/pretrain (a rough sketch of this step follows below), then run:

sh src/run_pretrain.sh
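
For reference, the preprocessing step mentioned above typically tokenizes a raw-text corpus into input_ids and saves it to disk. The sketch below is an assumption about what preprocessing.py does, not its actual code; the corpus and output paths are placeholders, so check the script for its real arguments.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # assumed backbone

def tokenize(batch):
    # Keep raw token ids only; special tokens and masking are applied later by the collator.
    return tokenizer(batch["text"], truncation=True, max_length=512, add_special_tokens=False)

raw = load_dataset("text", data_files={"train": "corpus.txt"})        # placeholder corpus path
tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("pretrain_data")   # output path consumed by src/run_pretrain.sh (assumption)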

Citation

@inproceedings{RetroMAE,
  title={RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder},
  author={Shitao Xiao and Zheng Liu and Yingxia Shao and Zhao Cao},
  url={https://arxiv.org/abs/2205.12035},
  booktitle={EMNLP},
  year={2022},
}