BERT model converted to PyTorch.
Please do NOT use this repo; instead, use the (better) library from HuggingFace: https://github.com/huggingface/pytorch-pretrained-BERT.git
This repo is kept as an example of converting a TF model to PyTorch (the utils may be handy in case I need to do something like this again).
This is a literal port of BERT code from TensorFlow to PyTorch. See the original TF BERT repo here.
We provide a script to convert a TF BERT pre-trained checkpoint to tBERT format: tbert.cli.convert
Tests ensure that the tBERT code behaves exactly like TF BERT.
This work is released under the MIT license.
Original code is covered by Apache 2.0 License.
Python 3.6 or better is required.
The easiest way to install is with pip:
pip install tbert
Now you can start using tBERT models in your code!
Google-trained models, converted to tBERT format. For a description of the models, see the original TF BERT repo here:
- Base, Uncased
- Large, Uncased
- Base, Cased
- Large, Cased
- Base, Multilingual Cased (New, recommended)
- Base, Multilingual Uncased (Not recommended)
- Base, Chinese
This is the main juice - the Bert transformer. It is a normal PyTorch module. You can use it stand-alone or in combination with other PyTorch modules.
import torch
from tbert.bert import Bert

config = dict(
    attention_probs_dropout_prob=0.1,
    directionality="bidi",
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    hidden_size=768,
    initializer_range=0.02,
    intermediate_size=3072,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=2,
    vocab_size=105879
)

bert = Bert(config)
# ... should load trained parameters (see below)

input_ids = torch.LongTensor([[1, 2, 3, 4, 5, 0]])
input_type_ids = torch.LongTensor([[0, 0, 1, 1, 1, 0]])
input_mask = torch.LongTensor([[1, 1, 1, 1, 1, 0]])

activations = bert(input_ids, input_type_ids, input_mask)
Returns a list of activations (one for each hidden layer). Typically only the topmost layer, or a few top layers, are used. Each element of the list is a tensor of shape [B*S, H], where B is the batch size, S is the sequence length, and H is the size of the hidden layer.
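For example, to work with just the topmost layer, you can reshape its activations back to a per-token view. This is a minimal sketch, assuming the bert instance and inputs from the snippet above and that the list is ordered bottom-to-top; the variable names are illustrative:

top = activations[-1]            # topmost hidden layer, shape [B*S, H]

B, S = input_ids.shape           # batch size and sequence length
H = config['hidden_size']

# view the activations as one H-sized vector per token
top_per_token = top.view(B, S, H)

# e.g. the vector for the first token of the first sequence
first_token_vec = top_per_token[0, 0]   # shape [H]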
This is the Bert transformer with a pooling layer on top. It is convenient for sequence classification tasks. Usage is very similar to that of the tbert.bert.Bert module:
import torch
from tbert.bert import BertPooler

config = dict(
    attention_probs_dropout_prob=0.1,
    directionality="bidi",
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    hidden_size=768,
    initializer_range=0.02,
    intermediate_size=3072,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=2,
    vocab_size=105879
)

bert_pooler = BertPooler(config)
# ... should load trained parameters (see below)

input_ids = torch.LongTensor([[1, 2, 3, 4, 5, 0]])
input_type_ids = torch.LongTensor([[0, 0, 1, 1, 1, 0]])
input_mask = torch.LongTensor([[1, 1, 1, 1, 1, 0]])

activation = bert_pooler(input_ids, input_type_ids, input_mask)
Returns a single tensor of size [B, H], where B is the batch size, and H is the size of the hidden layer.
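As an illustration only, a simple classification head can be put on top of the pooled activation. This sketch is hypothetical and is not the library's own classifier (for that, see tbert.cli.run_classifier below); it assumes the config and bert_pooler from the snippet above:

import torch

num_labels = 2   # hypothetical binary classification task

# linear head mapping the pooled [B, H] activation to class logits
classifier = torch.nn.Linear(config['hidden_size'], num_labels)

pooled = bert_pooler(input_ids, input_type_ids, input_mask)   # [B, H]
logits = classifier(pooled)                                    # [B, num_labels]
probs = torch.nn.functional.softmax(logits, dim=-1)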
To initialize tbert.bert.Bert or tbert.bert.BertPooler from a pre-trained saved checkpoint:
...
bert = Bert(config)
bert.load_pretrained(dir_name)
Here, dir_name should be a directory containing a pre-trained tBERT model, with bert_model.pickle and pooler_model.pickle files. See below to learn how to convert published TF BERT pre-trained models to tBERT format. Similarly, the load_pretrained method can be used on a tbert.bert.BertPooler instance.
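For example (a minimal sketch; dir_name is assumed to point at a converted checkpoint directory, as described below):

from tbert.bert import BertPooler

bert_pooler = BertPooler(config)
bert_pooler.load_pretrained(dir_name)   # reads the pickled parameters from dir_name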
Optional dependencies are needed to use the CLI utilities:
- to convert a TF BERT checkpoint to tBERT format
- to extract features from a sequence
- to run training of a classifier
pip install -r requirements.txt
mkdir tf
cd tf
git clone https://github.com/google-research/bert
cd ..
export PYTHONPATH=.:tf/bert
Now all is set up:
python -m tbert.cli.extract_features --help
python -m tbert.cli.convert --help
python -m tbert.cli.run_classifier --help
pip install pytest
pytest tbert/test
- Download TF BERT checkpoint and unzip it
mkdir data
cd data
wget https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip
unzip multilingual_L-12_H-768_A-12.zip
cd ..
- Run the converter
python -m tbert.cli.convert \
    data/multilingual_L-12_H-768_A-12 \
    data/tbert-multilingual_L-12_H-768_A-12
Make sure that you have a pre-trained tBERT model (see the section above).
echo "Who was Jim Henson ? ||| Jim Henson was a puppeteer" > /tmp/input.txt
echo "Empty answer is cool!" >> /tmp/input.txt
python -m tbert.cli.extract_features \
/tmp/input.txt \
/tmp/output-tbert.jsonl \
data/tbert-multilingual_L-12_H-768_A-12
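The result is a JSONL file with one JSON record per input line. A quick way to peek at it is shown below (a minimal sketch; the exact record layout follows the TF BERT extract_features output format):

import json

with open('/tmp/output-tbert.jsonl') as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))   # inspect the structure of each record
        break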
Run TF BERT extract_features:
echo "Who was Jim Henson ? ||| Jim Henson was a puppeteer" > /tmp/input.txt
echo "Empty answer is cool!" >> /tmp/input.txt
export BERT_BASE_DIR=data/multilingual_L-12_H-768_A-12
python -m extract_features \
--input_file=/tmp/input.txt \
--output_file=/tmp/output-tf.jsonl \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--layers=-1,-2,-3,-4 \
--max_seq_length=128 \
--batch_size=8
This creates the file /tmp/output-tf.jsonl. Now, compare it to the JSONL file created by tBERT:
python -m tbert.cli.cmp_jsonl \
--tolerance 5e-5 \
/tmp/output-tbert.jsonl \
/tmp/output-tf.jsonl
Expect output similar to this:
Max float values delta: 3.6e-05
Structure is identical
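Conceptually, the comparison walks both JSON structures in parallel and reports the largest absolute difference between corresponding floats. The sketch below illustrates that idea only; it is not the actual tbert.cli.cmp_jsonl implementation:

def max_float_delta(a, b):
    # recursively find the largest absolute difference between
    # corresponding floats in two parallel JSON structures
    if isinstance(a, dict):
        return max((max_float_delta(a[k], b[k]) for k in a), default=0.0)
    if isinstance(a, list):
        return max((max_float_delta(x, y) for x, y in zip(a, b)), default=0.0)
    if isinstance(a, float):
        return abs(a - b)
    return 0.0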
Download the GLUE datasets, as explained here. In the following, we assume that the GLUE datasets are in the glue_data directory.
To train the MRPC task, do this:
python -m tbert.cli.run_classifier \
data/tbert-multilingual_L-12_H-768_A-12 \
/tmp \
--problem mrpc \
--data_dir glue_data/MRPC \
--do_train \
--num_train_steps 600 \
--num_warmup_steps 60 \
--do_eval
Expect to see something similar to this:
...
Step: 550, loss: 0.039, learning rates: 1.888888888888889e-06
Step: 560, loss: 0.014, learning rates: 1.5185185185185186e-06
Step: 570, loss: 0.017, learning rates: 1.1481481481481482e-06
Step: 580, loss: 0.021, learning rates: 7.777777777777779e-07
Step: 590, loss: 0.053, learning rates: 4.074074074074075e-07
Step: 600, loss: 0.061, learning rates: 3.703703703703704e-08
Saved trained model
*** Evaluating ***
Number of samples evaluated: 408
Average per-sample loss: 0.4922609218195373
Accuracy: 0.8504901960784313