This Japanese BERT model was pre-trained with our own web corpus, following the original BERT and this Japanese BERT. So far, both a base model (12-layer, 768-hidden, 12-heads, 110M parameters) and a large model (24-layer, 1024-hidden, 16-heads, 340M parameters) pre-trained with the same web corpus have been released.
Download
base model with unigram tokenizer
large model with unigram tokenizer
base model with BPE tokenizer
large model with BPE tokenizer
The models have been evaluated on two tasks: the Livedoor news classification task and the driving-domain question answering (DDQA) task. In Livedoor news classification, each piece of news must be classified into one of nine categories. In the DDQA task, given question-article pairs, answers to the questions must be found in the articles. The evaluation results are shown below, in comparison with a baseline model pre-trained on the Japanese Wikipedia corpus released by this Japanese BERT repository. Note that the results are averages over multiple measurements. Because the evaluation datasets are small, the results may differ slightly from run to run.
For the Livedoor news classification task:
model size | corpus | corpus size | eval environment | batch size | epochs | learning rate | measurements | mean accuracy (%) | standard deviation |
---|---|---|---|---|---|---|---|---|---|
Base | JA-Wikipedia | 2.9G | GPU | 4 | 10 | 2e-5 | 5 | 97.23 | 2.38e-1 |
Base | Web Corpus | 12G | GPU | 4 | 10 | 2e-5 | 5 | 97.72 | 2.27e-1 |
Large | Web Corpus | 12G | TPU | 32 | 7 | 2e-5 | 30 | 98.07 | 2.45e-3 |
For the Driving-domain QA task:
model size | corpus | corpus size | eval environment | batch size | epochs | learning rate | measurements | mean EM (%) | standard deviation |
---|---|---|---|---|---|---|---|---|---|
Base | JA-Wikipedia | 2.9G | TPU | 32 | 3 | 5e-5 | 100 | 76.3 | 5.16e-3 |
Base | Web Corpus | 12G | TPU | 32 | 3 | 5e-5 | 100 | 75.5 | 5.06e-3 |
Large | Web Corpus | 12G | TPU | 32 | 3 | 5e-5 | 30 | 77.3 | 4.96e-3 |
We haven't published a paper on this work. Please cite this repository instead:

```
@misc{laboro-bert-japanese,
  title = {Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus},
  author = {Zhao, Xinyi and Hamamoto, Masafumi and Fujihara, Hiromasa},
  year = {2020},
  howpublished = {\url{https://github.com/laboroai/Laboro-BERT-Japanese}}
}
```
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
For commercial use, please contact Laboro.AI Inc.
Text classification means assigning labels to text. Because labels can be defined to describe any aspect of a text, text classification has a wide range of applications. The most straightforward is categorizing the topic or sentiment of the text. Other examples include recognizing spam email and judging whether two sentences have the same or similar meanings.
In the evaluation of English BERT models on classification tasks, several datasets (e.g. SST-2, MRPC) serve as common benchmarks. For Japanese BERT models, the Livedoor news corpus can be used in the same fashion: each piece of news in this corpus is classified into one of nine categories.
The original corpus is not divided into training, evaluation, and testing data. The dataset we provide in this repository was pre-processed from the Livedoor News Corpus in the following steps:
- concatenating all of the data
- shuffling randomly
- dividing into train:dev:test = 6:2:2
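The steps above can be sketched as follows. This is an illustrative sketch, not the exact script used for this repository; the tab-separated example records and the fixed seed are assumptions.

```python
import random

def split_dataset(lines, seed=0):
    """Shuffle all examples, then split into train:dev:test = 6:2:2."""
    random.Random(seed).shuffle(lines)
    n = len(lines)
    n_train = int(n * 0.6)
    n_dev = int(n * 0.2)
    return lines[:n_train], lines[n_train:n_train + n_dev], lines[n_train + n_dev:]

# Example: 10 concatenated examples -> 6 train, 2 dev, 2 test
examples = [f"text_{i}\tlabel_{i % 9}" for i in range(10)]
train, dev, test = split_dataset(examples)
print(len(train), len(dev), len(test))  # 6 2 2
```

Shuffling before splitting matters here because the original corpus is grouped by category; without it, the dev and test splits would not cover all nine labels.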
- Python 3.6.9
- tensorflow==1.13.0
- sentencepiece==0.1.85
- GPU is recommended
Before running the code, make sure
- the livedoor dataset is in the data folder
- the pre-trained BERT model is in the model folder, including model.ckpt.data, model.ckpt.meta, model.ckpt.index, bert_config.json
- the sentencepiece model is also in the model folder, including webcorpus.model, webcorpus.vocab
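The checklist above can be verified with a short script before launching the run. The `missing_files` helper and the flat `model` folder layout are assumptions for illustration (the checkpoint data shard, whose exact name carries a shard suffix, is left out):

```python
import os

# Files expected under the model folder (checkpoint data shard omitted,
# since its name includes a shard suffix such as -00000-of-00001).
REQUIRED = {
    "model": ["model.ckpt.meta", "model.ckpt.index", "bert_config.json",
              "webcorpus.model", "webcorpus.vocab"],
}

def missing_files(base="."):
    """Return the required files that are not present under the expected folders."""
    missing = []
    for folder, names in REQUIRED.items():
        for name in names:
            path = os.path.join(base, folder, name)
            if not os.path.exists(path):
                missing.append(path)
    return missing

# Before launching ./run_classifier.sh:
# if missing_files(): raise SystemExit("missing: %s" % missing_files())
```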
```
git clone https://github.com/laboroai/Laboro-BERT-Japanese.git
cd ./Laboro-BERT-Japanese/src
./run_classifier.sh
```
The question answering task is another way to evaluate and apply a BERT model. In English NLP, SQuAD is one of the most widely used datasets for this task. In SQuAD, questions and corresponding Wikipedia pages are given, and the answers to the questions must be found in the Wikipedia pages.
For the QA task, we used the Driving Domain QA dataset for evaluation. It consists of the PAS-QA dataset and the RC-QA dataset. So far, we have evaluated our model only on the RC-QA dataset. The dataset is already in SQuAD 2.0 format, so no pre-processing is needed.
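Since the RC-QA data is already in SQuAD 2.0 format, it can be consumed with a generic reader. The field names below follow the SQuAD 2.0 schema; the sample record itself is a made-up illustration, not an actual DDQA entry.

```python
import json

# A minimal SQuAD-2.0-style record (illustrative, not real DDQA data).
sample = {
    "version": "v2.0",
    "data": [{
        "title": "doc0",
        "paragraphs": [{
            "context": "The car stopped at the red light.",
            "qas": [{
                "id": "q0",
                "question": "Where did the car stop?",
                "answers": [{"text": "at the red light", "answer_start": 16}],
                "is_impossible": False,
            }],
        }],
    }],
}

def iter_qas(squad):
    """Yield (question, context, answers) triples from a SQuAD 2.0 dict."""
    for article in squad["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                yield qa["question"], para["context"], qa["answers"]

pairs = list(iter_qas(json.loads(json.dumps(sample))))
print(len(pairs))  # 1
```

Note that SQuAD 2.0 (unlike 1.1) allows unanswerable questions via the `is_impossible` flag, which is why the format needs no conversion for reading-comprehension data that may lack answers.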
- Python 3.6.9
- tensorflow==1.13.0
- sentencepiece==0.1.85
- TPU is recommended (in our experiments, an out-of-memory error occurs when using a GPU)
- Google Cloud Storage if TPU is used
TPU is recommended for this evaluation. Because a TPU can only read from and write to Google Cloud Storage, we recommend placing the BERT model and the output in a Cloud Storage bucket. Before running the code, make sure
- the Driving Domain QA dataset is in the data folder
- the pre-trained BERT model is in the model folder in cloud storage bucket, including model.ckpt.data, model.ckpt.meta, model.ckpt.index, bert_config.json
- the sentencepiece model is in the local model folder, including webcorpus.model, webcorpus.vocab
```
git clone https://github.com/laboroai/Laboro-BERT-Japanese.git
cd ./Laboro-BERT-Japanese/src
./run_squad.sh
```
Our Japanese BERT model is pre-trained with a web-based corpus built especially for this project. The corpus was collected with a web crawler; in total, 2,605,280 webpages from 4,307 websites were crawled. The source websites range from news sites and part of Wikipedia to personal blogs, covering both formal and informal written Japanese.
The original English BERT model was trained on a 13GB corpus consisting of English Wikipedia and BooksCorpus. The size of raw text in our web-based corpus is 12GB, which is similar to the original one.
SentencePiece is used as the tokenizer. The parameters used when training the SentencePiece model are as follows:
```
vocab_size = 32000
shuffle_input_sentence = True
input_sentence_size = 18000000
character_coverage = 0.9995  # default
model_type = 'unigram'  # default
control_symbols = '[CLS],[SEP],[MASK]'
```
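These parameters map directly onto `spm_train` command-line flags. The snippet below only assembles the command string (it does not require sentencepiece to be installed); the input file name `webcorpus.txt` and the model prefix are assumptions for illustration.

```python
# Assemble the spm_train invocation from the parameters above.
params = {
    "input": "webcorpus.txt",            # assumed corpus file name
    "model_prefix": "webcorpus",         # produces webcorpus.model / webcorpus.vocab
    "vocab_size": 32000,
    "shuffle_input_sentence": "true",
    "input_sentence_size": 18000000,
    "character_coverage": 0.9995,
    "model_type": "unigram",
    "control_symbols": "[CLS],[SEP],[MASK]",
}
cmd = "spm_train " + " ".join(f"--{k}={v}" for k, v in params.items())
print(cmd)
```

Declaring `[CLS]`, `[SEP]`, and `[MASK]` as control symbols keeps them out of the learned subword vocabulary so BERT can use them as special tokens.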
The pre-training consists of two phases, in which `train_batch_size` and `max_seq_length` are changed.
Phase 1

```
train_batch_size = 256
max_seq_length = 128
num_train_steps = 2900000
num_warmup_steps = 10000
learning_rate = 1e-4
```
Phase 2

```
train_batch_size = 64
max_seq_length = 512
num_train_steps = 3900000
num_warmup_steps = 10000
learning_rate = 1e-4
```
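Assuming phase 2 resumes from the phase-1 checkpoint (as in the original BERT recipe, where the long-sequence phase continues from the short-sequence one), `num_train_steps` in phase 2 would be a cumulative count, so phase 2 adds 1,000,000 steps at sequence length 512:

```python
# Phase configurations copied from the values above.
phase1 = {"train_batch_size": 256, "max_seq_length": 128, "num_train_steps": 2_900_000}
phase2 = {"train_batch_size": 64, "max_seq_length": 512, "num_train_steps": 3_900_000}

# Steps actually run in phase 2 if it continues from the phase-1 checkpoint
# (an assumption; num_train_steps would then be cumulative across phases).
phase2_steps = phase2["num_train_steps"] - phase1["num_train_steps"]
print(phase2_steps)  # 1000000
```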
- Cloud TPU v3-8 on Google Cloud Platform
- tensorflow==1.13.0