BERT model converted to PyTorch.
Please do NOT use this repo; instead, use the (better) library from HuggingFace: https://github.com/huggingface/pytorch-pretrained-BERT.git
This repo is kept as an example of converting a TF model to PyTorch (the utils may be handy in case I need to do something like this again).
This is a literal port of BERT code from TensorFlow to PyTorch. See the original TF BERT repo here.
We provide a script to convert a TF BERT pre-trained checkpoint to tBERT format: tbert.cli.convert
Tests ensure that the tBERT code behaves exactly like TF BERT.
This work is released under the MIT license.
Original code is covered by Apache 2.0 License.
Python 3.6 or better is required.
The easiest way to install is with pip:
pip install tbert
Now you can start using tBERT models in your code!
Google-trained models, converted to tBERT format. For a description of the models, see the original TF BERT repo here:
- Base, Uncased
- Large, Uncased
- Base, Cased
- Large, Cased
- Base, Multilingual Cased (New, recommended)
- Base, Multilingual Uncased (Not recommended)
- Base, Chinese
This is the main juice - the Bert transformer. It is a normal PyTorch module. You can use it stand-alone or in combination with other PyTorch modules.
import torch
from tbert.bert import Bert

config = dict(
    attention_probs_dropout_prob=0.1,
    directionality="bidi",
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    hidden_size=768,
    initializer_range=0.02,
    intermediate_size=3072,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=2,
    vocab_size=105879
)

bert = Bert(config)
# ... should load trained parameters (see below)

input_ids = torch.LongTensor([[1, 2, 3, 4, 5, 0]])
input_type_ids = torch.LongTensor([[0, 0, 1, 1, 1, 0]])
input_mask = torch.LongTensor([[1, 1, 1, 1, 1, 0]])

activations = bert(input_ids, input_type_ids, input_mask)
Returns a list of activations (one for each hidden layer). Typically only the topmost layer, or a few top layers, are used. Each element of the list is a tensor of shape [B*S, H], where B is the batch size, S is the sequence length, and H is the size of the hidden layer.
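For example, to work with just the topmost layer, you can reshape its activations back to a per-token view. This is a minimal sketch, assuming the bert instance and inputs from the snippet above and that the list is ordered bottom-to-top; the variable names are illustrative:

top = activations[-1]            # topmost hidden layer, shape [B*S, H]

B, S = input_ids.shape           # batch size and sequence length
H = config['hidden_size']

# view the activations as one H-sized vector per token
top_per_token = top.view(B, S, H)

# e.g. the vector for the first token of the first sequence
first_token_vec = top_per_token[0, 0]   # shape [H]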
This is the Bert transformer with a pooling layer on top. It is convenient for sequence classification tasks. Usage is very similar to that of the tbert.bert.Bert module:
import torch
from tbert.bert import BertPooler

config = dict(
    attention_probs_dropout_prob=0.1,
    directionality="bidi",
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    hidden_size=768,
    initializer_range=0.02,
    intermediate_size=3072,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=2,
    vocab_size=105879
)

bert_pooler = BertPooler(config)
# ... should load trained parameters (see below)

input_ids = torch.LongTensor([[1, 2, 3, 4, 5, 0]])
input_type_ids = torch.LongTensor([[0, 0, 1, 1, 1, 0]])
input_mask = torch.LongTensor([[1, 1, 1, 1, 1, 0]])

activation = bert_pooler(input_ids, input_type_ids, input_mask)
Returns a single tensor of size [B, H], where B is the batch size, and H is the size of the hidden layer.
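As an illustration only, a simple classification head can be put on top of the pooled activation. This sketch is hypothetical and is not the library's own classifier (for that, see tbert.cli.run_classifier below); it assumes the config and bert_pooler from the snippet above:

import torch

num_labels = 2   # hypothetical binary classification task

# linear head mapping the pooled [B, H] activation to class logits
classifier = torch.nn.Linear(config['hidden_size'], num_labels)

pooled = bert_pooler(input_ids, input_type_ids, input_mask)   # [B, H]
logits = classifier(pooled)                                    # [B, num_labels]
probs = torch.nn.functional.softmax(logits, dim=-1)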
To initialize tbert.bert.Bert or tbert.bert.BertPooler from a pre-trained saved checkpoint:
...
bert = Bert(config)
bert.load_pretrained(dir_name)
Here, dir_name should be a directory containing a pre-trained tBERT model, with bert_model.pickle and pooler_model.pickle files. See below to learn how to convert published TF BERT pre-trained models to tBERT format. Similarly, the load_pretrained method can be used on a tbert.bert.BertPooler instance.
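For example (a minimal sketch; dir_name is assumed to point at a converted checkpoint directory, as described below):

from tbert.bert import BertPooler

bert_pooler = BertPooler(config)
bert_pooler.load_pretrained(dir_name)   # reads the pickled parameters from dir_name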
Optional dependencies are needed to use the CLI utilities:
- to convert a TF BERT checkpoint to tBERT format
- to extract features from a sequence
- to run training of a classifier
pip install -r requirements.txt
mkdir tf
cd tf
git clone https://github.com/google-research/bert
cd ..
export PYTHONPATH=.:tf/bert
Now all is set up:
python -m tbert.cli.extract_features --help
python -m tbert.cli.convert --help
python -m tbert.cli.run_classifier --help
pip install pytest
pytest tbert/test
- Download TF BERT checkpoint and unzip it
mkdir data
cd data
wget https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip
unzip multilingual_L-12_H-768_A-12.zip
cd ..
- Run the converter
python -m tbert.cli.convert \
    data/multilingual_L-12_H-768_A-12 \
    data/tbert-multilingual_L-12_H-768_A-12
Make sure that you have a pre-trained tBERT model (see the section above).
echo "Who was Jim Henson ? ||| Jim Henson was a puppeteer" > /tmp/input.txt
echo "Empty answer is cool!" >> /tmp/input.txt
python -m tbert.cli.extract_features \
/tmp/input.txt \
/tmp/output-tbert.jsonl \
data/tbert-multilingual_L-12_H-768_A-12
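The result is a JSONL file with one JSON record per input line. A quick way to peek at it is shown below (a minimal sketch; the exact record layout follows the TF BERT extract_features output format):

import json

with open('/tmp/output-tbert.jsonl') as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))   # inspect the structure of each record
        break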
Run TF BERT extract_features:
echo "Who was Jim Henson ? ||| Jim Henson was a puppeteer" > /tmp/input.txt
echo "Empty answer is cool!" >> /tmp/input.txt
export BERT_BASE_DIR=data/multilingual_L-12_H-768_A-12
python -m extract_features \
--input_file=/tmp/input.txt \
--output_file=/tmp/output-tf.jsonl \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--layers=-1,-2,-3,-4 \
--max_seq_length=128 \
--batch_size=8
This creates the file /tmp/output-tf.jsonl. Now, compare it to the JSONL file created by tBERT:
python -m tbert.cli.cmp_jsonl \
--tolerance 5e-5 \
/tmp/output-tbert.jsonl \
/tmp/output-tf.jsonl
Expect output similar to this:
Max float values delta: 3.6e-05
Structure is identical
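Conceptually, the comparison walks both JSON structures in parallel and reports the largest absolute difference between corresponding floats. The sketch below illustrates that idea only; it is not the actual tbert.cli.cmp_jsonl implementation:

def max_float_delta(a, b):
    # recursively find the largest absolute difference between
    # corresponding floats in two parallel JSON structures
    if isinstance(a, dict):
        return max((max_float_delta(a[k], b[k]) for k in a), default=0.0)
    if isinstance(a, list):
        return max((max_float_delta(x, y) for x, y in zip(a, b)), default=0.0)
    if isinstance(a, float):
        return abs(a - b)
    return 0.0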
Download the GLUE datasets, as explained here. In the following, we assume that the GLUE datasets are in the glue_data directory.
To train the MRPC task, do this:
python -m tbert.cli.run_classifier \
data/tbert-multilingual_L-12_H-768_A-12 \
/tmp \
--problem mrpc \
--data_dir glue_data/MRPC \
--do_train \
--num_train_steps 600 \
--num_warmup_steps 60 \
--do_eval
Expect to see something similar to this:
...
Step: 550, loss: 0.039, learning rates: 1.888888888888889e-06
Step: 560, loss: 0.014, learning rates: 1.5185185185185186e-06
Step: 570, loss: 0.017, learning rates: 1.1481481481481482e-06
Step: 580, loss: 0.021, learning rates: 7.777777777777779e-07
Step: 590, loss: 0.053, learning rates: 4.074074074074075e-07
Step: 600, loss: 0.061, learning rates: 3.703703703703704e-08
Saved trained model
*** Evaluating ***
Number of samples evaluated: 408
Average per-sample loss: 0.4922609218195373
Accuracy: 0.8504901960784313