Pre-training has become an essential part for NLP tasks. UER-py (Universal Encoder Representations) is a toolkit for pre-training on general-domain corpus and fine-tuning on downstream task. UER-py maintains model modularity and supports research extensibility. It facilitates the use of existing pre-training models, and provides interfaces for users to further extend upon. With UER-py, we build a model zoo which contains pre-trained models based on different corpora, encoders, and targets. See the Wiki for Full Documentation.
- Features
- Requirements
- Quickstart
- Datasets
- Modelzoo
- Instructions
- Competition solutions
- Citation
- Contact information
UER-py has the following features:
- Reproducibility UER-py has been tested on many datasets and should match the performances of the original pre-training model implementations such as BERT, GPT-2, ELMo, and T5.
- Multi-GPU UER-py supports CPU mode, single GPU mode, and distributed training mode.
- Model modularity UER-py is divided into multiple components: embedding, encoder, and target. Ample modules are implemented in each component. Clear and robust interface allows users to combine modules to construct pre-training models with as few restrictions as possible.
- Efficiency UER-py refines its pre-processing, pre-training, fine-tuning, and inference stages, which largely improves speed while requires less memory and disk space.
- Model zoo With the help of UER-py, we pre-trained models with different corpora, encoders, and targets. Proper selection of pre-trained models is important to the performances of downstream tasks.
- SOTA results UER-py supports comprehensive downstream tasks (e.g. classification and machine reading comprehension) and provides winning solutions of many NLP competitions.
- Abundant functions UER-py provides abundant functions related with pre-training, such as feature extractor and mixed precision training.
- Python >= 3.6
- torch >= 1.1
- six >= 1.12.0
- argparse
- packaging
- For the mixed precision training you will need apex from NVIDIA
- For the pre-trained model conversion (related with TensorFlow) you will need TensorFlow
- For the tokenization with sentencepiece model you will need SentencePiece
- For developing a stacking model you will need LightGBM and BayesianOptimization
- For the pre-training with whole word masking you will need word segmentation tool such as jieba
- For the use of CRF in sequence labeling downstream task you will need pytorch-crf
This section uses several commonly-used examples to demonstrate how to use UER-py. More details are discussed in Instructions section. We firstly use BERT model on Douban book review classification dataset. We pre-train model on book review corpus and then fine-tune it on classification dataset. There are three input files: book review corpus, book review classification dataset, and vocabulary. All files are encoded in UTF-8 and included in this project.
The format of the corpus for BERT is as follows (one sentence per line and documents are delimited by empty lines):
doc1-sent1
doc1-sent2
doc1-sent3
doc2-sent1
doc3-sent1
doc3-sent2
The book review corpus is obtained from book review classification dataset. We remove labels and split a review into two parts from the middle (see book_review_bert.txt in corpora folder).
The format of the classification dataset is as follows:
label text_a
1 instance1
0 instance2
1 instance3
Label and instance are separated by \t . The first row is a list of column names. The label ID should be an integer between (and including) 0 and n-1 for n-way classification.
We use Google's Chinese vocabulary file models/google_zh_vocab.txt, which contains 21128 Chinese characters.
We firstly pre-process the book review corpus. We need to specify the model's target in pre-processing stage (--target):
python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --target bert
Notice that six>=1.12.0 is required.
Pre-processing is time-consuming. Using multiple processes can largely accelerate the pre-processing speed (--processes_num). BERT tokenizer is used in default (--tokenizer bert). After pre-processing, the raw text is converted to dataset.pt, which is the input of pretrain.py. Then we download Google's pre-trained Chinese BERT model google_zh_model.bin (in UER format and the original model is from here), and put it in models folder. We load the pre-trained Chinese BERT model and further pre-train it on book review corpus. Pre-training model is composed of embedding, encoder, and target layers. To build a pre-training model, we should explicitly specify model's embedding (--embedding), encoder (--encoder and --mask), and target (--target). Suppose we have a machine with 8 GPUs:
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
--pretrained_model_path models/google_zh_model.bin \
--output_model_path models/book_review_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32 \
--embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
mv models/book_review_model.bin-5000 models/book_review_model.bin
--mask specifies the attention mask types. BERT uses bidirectional LM. The word token can attend to all tokens and therefore we use fully_visible mask type. The embedding layer of BERT is the sum of word (token), position, and segment embeddings and therefore --embedding word_pos_seg is specified. By default, models/bert/base_config.json is used as configuration file, which specifies the model hyper-parameters. Notice that the model trained by pretrain.py is attacted with the suffix which records the training step (--total_steps). We could remove the suffix for ease of use.
Then we fine-tune the pre-trained model on downstream classification dataset. We use book_review_model.bin, which is the output of pretrain.py:
python3 finetune/run_classifier.py --pretrained_model_path models/book_review_model.bin \
--vocab_path models/google_zh_vocab.txt \
--train_path datasets/douban_book_review/train.tsv \
--dev_path datasets/douban_book_review/dev.tsv \
--test_path datasets/douban_book_review/test.tsv \
--epochs_num 3 --batch_size 32 \
--embedding word_pos_seg --encoder transformer --mask fully_visible
It is noticeable that we don't need to specify the target in fine-tuning stage. Pre-training target is replaced with task-specific target.
The default path of the fine-tuned classifier model is models/finetuned_model.bin . Then we do inference with the fine-tuned model.
python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
--vocab_path models/google_zh_vocab.txt \
--test_path datasets/douban_book_review/test_nolabel.tsv \
--prediction_path datasets/douban_book_review/prediction.tsv \
--labels_num 2 \
--embedding word_pos_seg --encoder transformer --mask fully_visible
--test_path specifies the path of the file to be predicted. The file should contain text_a column.
--prediction_path specifies the path of the file with prediction results.
We need to explicitly specify the number of labels by --labels_num. Douban book review is a two-way classification dataset.
The above content provides basic ways of using UER-py to pre-process, pre-train, fine-tune, and do inference. More use cases can be found in complete ➡️ quickstart ⬅️ . The complete quickstart contains comprehensive use cases, covering most of the pre-training related application scenarios. It is recommended that users read the complete quickstart in order to use the project reasonably.
We collected a range of ➡️ downstream datasets ⬅️ and converted them into the format that UER can load directly.
With the help of UER, we pre-trained models with different corpora, encoders, and targets. Detailed introduction of pre-trained models and their download links can be found in ➡️ modelzoo ⬅️ . All pre-trained models can be loaded by UER directly. More pre-trained models will be released in the future.
UER-py is organized as follows:
UER-py/
|--uer/
| |--encoders/ # contains encoders such as RNN, CNN, Transformer
| |--targets/ # contains targets such as language modeling, masked language modeling
| |--layers/ # contains frequently-used NN layers, such as embedding layer, normalization layer
| |--models/ # contains model.py, which combines embedding, encoder, and target modules
| |--utils/ # contains frequently-used utilities
| |--model_builder.py
| |--model_loader.py
| |--model_saver.py
| |--trainer.py
|
|--corpora/ # contains corpora for pre-training
|--datasets/ # contains downstream tasks
|--models/ # contains pre-trained models, vocabularies, and configuration files
|--scripts/ # contains useful scripts for pre-training models
|--finetune/ # contains fine-tuning scripts for downstream tasks
|--inference/ # contains inference scripts for downstream tasks
|
|--preprocess.py
|--pretrain.py
|--README.md
|--README_ZH.md
|--requirements.txt
|--logo.jpg
The code is well-organized. Users can use and extend upon it with little efforts.
More examples of using UER can be found in ➡️ instructions ⬅️ , which help users quickly implement pre-training models such as BERT, GPT-2, ELMo, T5 and fine-tune pre-trained models on a range of downstream tasks.
UER-py has been used in winning solutions of many NLP competitions. In this section, we provide some examples of using UER-py to achieve SOTA results on NLP competitions, such as CLUE. See ➡️ competition solutions ⬅️ for more detailed information.
If you are using the work (e.g. pre-trained model) in UER-py for academic work, please cite the system paper published in EMNLP 2019:
@article{zhao2019uer,
title={UER: An Open-Source Toolkit for Pre-training Models},
author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
journal={EMNLP-IJCNLP 2019},
pages={241},
year={2019}
}
For communication related to this project, please contact Zhe Zhao (helloworld@ruc.edu.cn; nlpzhezhao@tencent.com) or Yudong Li (liyudong123@hotmail.com) or Cheng Hou (chenghoubupt@bupt.edu.cn) or Xin Zhao (zhaoxinruc@ruc.edu.cn).
This work is instructed by my enterprise mentors Qi Ju, Xuefeng Yang, Haotang Deng and school mentors Tao Liu, Xiaoyong Du.
We also got a lot of help from Weijie Liu, Lusheng Zhang, Jianwei Cui, Xiayu Li, Weiquan Mao, Hui Chen, Jinbin Zhang, Zhiruo Wang, Peng Zhou, Haixiao Liu, and Weijian Wu.