Pre-training has become an essential part for NLP tasks. UER-py (Universal Encoder Representations) is a toolkit for pre-training on general-domain corpus and fine-tuning on downstream task. UER-py maintains model modularity and supports research extensibility. It facilitates the use of existing pre-training models, and provides interfaces for users to further extend upon. With UER-py, we build a model zoo which contains pre-trained models of different properties. See the Wiki for Full Documentation.
News: Our refactored new version supports multi-modal pre-training and large language models like LLaMA, see TencentPretrain.
Table of Contents
- Features
- Requirements
- Quickstart
- Datasets
- Modelzoo
- Instructions
- Competition solutions
- Citation
- Contact information
Features
UER-py has the following features:
- Reproducibility UER-py has been tested on many datasets and should match the performances of the original pre-training model implementations such as BERT, GPT-2, ELMo, and T5.
- Model modularity UER-py is divided into the following components: embedding, encoder, target embedding (optional), decoder (optional), and target. Ample modules are implemented in each component. Clear and robust interface allows users to combine modules to construct pre-training models with as few restrictions as possible.
- Model training UER-py supports CPU mode, single GPU mode, distributed training mode, and gigantic model training with DeepSpeed.
- Model zoo With the help of UER-py, we pre-train and release models of different properties. Proper selection of pre-trained models is important to the performances of downstream tasks.
- SOTA results UER-py supports comprehensive downstream tasks (e.g. classification and machine reading comprehension) and provides winning solutions of many NLP competitions.
- Abundant functions UER-py provides abundant functions related with pre-training, such as feature extractor and text generation.
Requirements
- Python >= 3.6
- torch >= 1.1
- six >= 1.12.0
- argparse
- packaging
- regex
- For the mixed precision training you will need apex from NVIDIA
- For the pre-trained model conversion (related with TensorFlow) you will need TensorFlow
- For the tokenization with sentencepiece model you will need SentencePiece
- For developing a stacking model you will need LightGBM and BayesianOptimization
- For the pre-training with whole word masking you will need word segmentation tool such as jieba
- For the use of CRF in sequence labeling downstream task you will need pytorch-crf
- For the gigantic model training you will need DeepSpeed
Quickstart
This section uses several commonly-used examples to demonstrate how to use UER-py. More details are discussed in Instructions section. We firstly use BERT model on Douban book review classification dataset. We pre-train model on book review corpus and then fine-tune it on book review classification dataset. There are three input files: book review corpus, book review classification dataset, and vocabulary. All files are encoded in UTF-8 and included in this project.
The format of the corpus for BERT is as follows (one sentence per line and documents are delimited by empty lines):
doc1-sent1
doc1-sent2
doc1-sent3
doc2-sent1
doc3-sent1
doc3-sent2
The book review corpus is obtained from book review classification dataset. We remove labels and split a review into two parts from the middle to construct a document with two sentences (see book_review_bert.txt in corpora folder).
The format of the classification dataset is as follows:
label text_a
1 instance1
0 instance2
1 instance3
Label and instance are separated by \t . The first row is a list of column names. The label ID should be an integer between (and including) 0 and n-1 for n-way classification.
We use Google's Chinese vocabulary file models/google_zh_vocab.txt, which contains 21128 Chinese characters.
We firstly pre-process the book review corpus. In the pre-processing stage, the corpus needs to be processed into the format required by the specified pre-training model (--data_processor):
python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --data_processor bert
Notice that six>=1.12.0 is required.
Pre-processing is time-consuming. Using multiple processes can largely accelerate the pre-processing speed (--processes_num). BERT tokenizer is used in default (--tokenizer bert). After pre-processing, the raw text is converted to dataset.pt, which is the input of pretrain.py. Then we download Google's pre-trained Chinese BERT model google_zh_model.bin (in UER format and the original model is from here), and put it in models folder. We load the pre-trained Chinese BERT model and further pre-train it on book review corpus. Pre-training model is usually composed of embedding, encoder, and target layers. To build a pre-training model, we should provide related information. Configuration file (--config_path) specifies the modules and hyper-parameters used by pre-training models. More details can be found in models/bert/base_config.json. Suppose we have a machine with 8 GPUs:
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
--pretrained_model_path models/google_zh_model.bin \
--config_path models/bert/base_config.json \
--output_model_path models/book_review_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32
mv models/book_review_model.bin-5000 models/book_review_model.bin
Notice that the model trained by pretrain.py is attacted with the suffix which records the training step (--total_steps). We could remove the suffix for ease of use.
Then we fine-tune the pre-trained model on downstream classification dataset. We use embedding and encoder layers of book_review_model.bin, which is the output of pretrain.py:
python3 finetune/run_classifier.py --pretrained_model_path models/book_review_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/base_config.json \
--train_path datasets/douban_book_review/train.tsv \
--dev_path datasets/douban_book_review/dev.tsv \
--test_path datasets/douban_book_review/test.tsv \
--epochs_num 3 --batch_size 32
The default path of the fine-tuned classifier model is models/finetuned_model.bin . It is noticeable that the actual batch size of pre-training is --batch_size times --world_size ; The actual batch size of downstream task (e.g. classification) is --batch_size . Then we do inference with the fine-tuned model.
python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/base_config.json \
--test_path datasets/douban_book_review/test_nolabel.tsv \
--prediction_path datasets/douban_book_review/prediction.tsv \
--labels_num 2
--test_path specifies the path of the file to be predicted. The file should contain text_a column. --prediction_path specifies the path of the file with prediction results. We need to explicitly specify the number of labels by --labels_num. Douban book review is a two-way classification dataset.
The above content provides basic ways of using UER-py to pre-process, pre-train, fine-tune, and do inference. More use cases can be found in complete
Datasets
We collected a range of
Modelzoo
With the help of UER, we pre-trained models of different properties (e.g. models based on different corpora, encoders, and targets). Detailed introduction of pre-trained models and their download links can be found in
Instructions
UER-py is organized as follows:
UER-py/
|--uer/
| |--embeddings/ # contains embeddings
| |--encoders/ # contains encoders such as RNN, CNN,
| |--decoders/ # contains decoders
| |--targets/ # contains targets such as language modeling, masked language modeling
| |--layers/ # contains frequently-used NN layers, such as embedding layer, normalization layer
| |--models/ # contains model.py, which combines embedding, encoder, and target modules
| |--utils/ # contains frequently-used utilities
| |--model_builder.py
| |--model_loader.py
| |--model_saver.py
| |--trainer.py
|
|--corpora/ # contains corpora for pre-training
|--datasets/ # contains downstream tasks
|--models/ # contains pre-trained models, vocabularies, and configuration files
|--scripts/ # contains useful scripts for pre-training models
|--finetune/ # contains fine-tuning scripts for downstream tasks
|--inference/ # contains inference scripts for downstream tasks
|
|--preprocess.py
|--pretrain.py
|--README.md
|--README_ZH.md
|--requirements.txt
|--logo.jpg
The code is well-organized. Users can use and extend upon it with little efforts.
Comprehensive examples of using UER can be found in
Competition solutions
UER-py has been used in winning solutions of many NLP competitions. In this section, we provide some examples of using UER-py to achieve SOTA results on NLP competitions, such as CLUE. See
Citation
system paper published in EMNLP 2019:
If you are using the work (e.g. pre-trained models) in UER-py for academic work, please cite the@article{zhao2019uer,
title={UER: An Open-Source Toolkit for Pre-training Models},
author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
journal={EMNLP-IJCNLP 2019},
pages={241},
year={2019}
}
Contact information
For communication related to this project, please contact Zhe Zhao (helloworld@ruc.edu.cn; nlpzhezhao@tencent.com) or Yudong Li (liyudong123@hotmail.com) or Cheng Hou (chenghoubupt@bupt.edu.cn) or Wenhang Shi (wenhangshi@ruc.edu.cn).
This work is instructed by my enterprise mentors Qi Ju, Xuefeng Yang, Haotang Deng and school mentors Tao Liu, Xiaoyong Du.
We also got a lot of help from Weijie Liu, Lusheng Zhang, Jianwei Cui, Xiayu Li, Weiquan Mao, Xin Zhao, Hui Chen, Jinbin Zhang, Zhiruo Wang, Peng Zhou, Haixiao Liu, and Weijian Wu.