PyTorch implementation of Microsoft's Unified Language Model Pre-training for Korean
Unified Language Model Pre-training Paper URL : https://arxiv.org/abs/1905.03197
A large portion of this implementation is from BERT-pytorch
Place a train text file and a test text file (both one sentence per line) in the ./data directory, and pass each path to main.py via the train_dataset_path and test_dataset_path arguments.
I used the heads of crawled Korean articles for both the train and test text files.
Since the model is trained on Korean, I used the Korean SentencePiece tokenizer from KoBERT.
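A minimal sketch of loading and using a Korean SentencePiece model with the sentencepiece package is shown below; the model file name is a placeholder, and the tokenizer actually used here is the one distributed with KoBERT.

```python
# Minimal sketch: tokenizing Korean text with a SentencePiece model.
# "kobert.spiece" is a placeholder file name; in this repo the SentencePiece
# model comes from KoBERT.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("kobert.spiece")  # placeholder path to the KoBERT SentencePiece model

pieces = sp.encode_as_pieces("한국어 문장을 토큰으로 나눕니다.")  # subword pieces
ids = sp.encode_as_ids("한국어 문장을 토큰으로 나눕니다.")        # vocabulary ids
print(pieces, ids)
```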
python main.py
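For example (assuming argparse-style flags named after the arguments above; check main.py for the exact option names and defaults):

```
python main.py --train_dataset_path ./data/train.txt --test_dataset_path ./data/test.txt
```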
In the paper, the authors present new language model training methods: "masked language model" and "predict next sentence".
Original Paper : 3.3.1 Task #1: Masked LM
Input Sequence : The man went to [MASK] store with [MASK] dog
Target Sequence : the his
Randomly, 15% of the input tokens are changed according to the following sub-rules (a minimal sketch follows the list):
- 80% of those tokens are replaced with the [MASK] token
- 10% are replaced with a [RANDOM] token (i.e. another word from the vocabulary)
- 10% are left unchanged, but still need to be predicted
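A minimal sketch of this 15% / 80-10-10 masking rule is shown below; mask_tokens, the mask_id argument, and the -1 "no prediction" marker are illustrative placeholders, not the exact code in dataset.py.

```python
# Sketch of the masked-LM corruption rule: 15% of tokens are selected,
# and of those 80% become [MASK], 10% become a random token, 10% stay unchanged.
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Return (corrupted input ids, target ids); targets are -1 where no prediction is needed."""
    inputs, targets = [], []
    for tok in token_ids:
        if random.random() < mask_prob:
            targets.append(tok)                      # this position must be predicted
            dice = random.random()
            if dice < 0.8:                           # 80%: replace with [MASK]
                inputs.append(mask_id)
            elif dice < 0.9:                         # 10%: replace with a random token
                inputs.append(random.randrange(vocab_size))
            else:                                    # 10%: keep the original token
                inputs.append(tok)
        else:
            inputs.append(tok)
            targets.append(-1)                       # positions that are not predicted
    return inputs, targets
```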
As stated in the paper, within one training batch, 1/3 of the time we use the bidirectional LM objective, 1/3 of the time we employ the sequence-to-sequence LM objective, and both the left-to-right and right-to-left LM objectives are sampled with a rate of 1/6 each.
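As an illustration, the per-batch objective choice can be expressed as weighted sampling; the sketch below uses placeholder objective names, and the actual selection (together with the corresponding self-attention masks) lives in dataset.py.

```python
# Sketch of sampling the pre-training objective for each batch with the
# 1/3 bidirectional, 1/3 seq-to-seq, 1/6 left-to-right, 1/6 right-to-left mixture.
import random

OBJECTIVES = ["bidirectional", "seq2seq", "left_to_right", "right_to_left"]
WEIGHTS = [1 / 3, 1 / 3, 1 / 6, 1 / 6]

def sample_objective():
    """Pick the LM objective for one training batch."""
    return random.choices(OBJECTIVES, weights=WEIGHTS, k=1)[0]
```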
Please refer to dataset.py