allenai/specter

How to create the vocab files? (tokens.txt, non_padded_namespaces.txt, venue.txt)

piegu opened this issue · 5 comments

piegu commented

Hi,

I'm testing the SPECTER training method with a model other than SciBERT (a non-English model from the Hugging Face hub).

To do that, I downloaded the SciBERT tar file, untarred it, and replaced its files with the other BERT model's files.

Here are the files after untarring:

data/
data/scibert_scivocab_uncased/
data/scibert_scivocab_uncased/scibert.tar.gz
data/scibert_scivocab_uncased/vocab.txt
data/vocab/
data/vocab/non_padded_namespaces.txt
data/vocab/tokens.txt
data/vocab/venue.txt

As we can see, the SciBERT tar file has a vocab folder with 3 files.

How were these files created?
How can I create them from a non-English BERT model on the Hugging Face hub?

armancohan commented

The vocab files are specific to the SciBERT model (which can be used within the AllenNLP framework).
For Hugging Face I believe you don't need those.
If you want to train SPECTER using another HF model, you can use our HF-specific training scripts here:
https://github.com/allenai/specter/tree/master/scripts/pytorch_lightning_training_script.
You would just need to modify your model and tokenizer here.
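For illustration, here is a minimal sketch (my own, not from the repo) of what that change usually amounts to, assuming the training script builds its encoder and tokenizer with the transformers Auto classes; the model name below is a placeholder:

from transformers import AutoModel, AutoTokenizer

# Placeholder model id; replace with the non-English BERT you want to use.
MODEL_NAME = "bert-base-multilingual-cased"

# In the pytorch_lightning training script, point the encoder and tokenizer
# at this model instead of SciBERT.
model = AutoModel.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)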

piegu commented

Thanks @armancohan.

A question: do I still need to create the preprocessed training files (with the following command) before using https://github.com/allenai/specter/tree/master/scripts/pytorch_lightning_training_script?

python specter/data_utils/create_training_files.py \
--data-dir data/training \
--metadata data/training/metadata.json \
--outdir data/preprocessed/

armancohan commented

Yes, exactly. Please see here for more details on how to create your own training data: https://github.com/allenai/specter#advanced-training-your-own-model
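For illustration, here is a rough sketch of the metadata file (my own example with made-up paper ids; the README linked above documents the exact format, including the companion file that lists which papers are related to which and the train/val/test id lists):

import json

# Toy metadata: paper id -> textual fields consumed by create_training_files.py.
metadata = {
    "paper01": {"title": "First paper title", "abstract": "First abstract ..."},
    "paper02": {"title": "Second paper title", "abstract": "Second abstract ..."},
}

with open("data/training/metadata.json", "w") as f:
    json.dump(metadata, f, ensure_ascii=False)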

This should work just fine for training the model from scratch but it is not optimal. If you are curious:

The current code for training the model using Hugging Face is not an optimal process, because it still depends on AllenNLP. That is, it first creates pickled training data by tokenizing the input text with the AllenNLP backend and creating AllenNLP Instance objects. The pytorch_lightning_training_script then reads these pickled instances, detokenizes them, and uses Hugging Face's tokenizers to tokenize the input again.

It should be relatively straightforward to modify the data loader part of pytorch_lightning_training_script.py to read json textual input directly instead of AllenNLP instances. For this, the get_instance function in the create_training_files.py script needs to be modified to return simple text json instead of AllenNLP "Fields". Then, instead of pickled instances, we can directly write a dictionary of plain-text key/values to a jsonlines file (see here).
Finally, the dataset reader in the pytorch_lightning_training_script can be replaced with a simple iterator over the json lines file, as in the sketch below.
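For example, such an iterator could be as simple as this sketch (mine, with a hypothetical file name and keys; the real keys would be whatever the modified get_instance writes out):

import json

def read_jsonl(path):
    # Each line is a dict of plain-text fields (e.g. query/positive/negative
    # titles and abstracts) written by a modified create_training_files.py.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# The dataset can then tokenize each example directly with the HF tokenizer,
# skipping the AllenNLP detokenization step, e.g.:
# for example in read_jsonl("data/preprocessed/train.jsonl"):
#     encoded = tokenizer(example["query_title"], example["query_abstract"], truncation=True)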

We would welcome a PR on this if you wanted to go this route and I'd be happy to help resolve questions.

piegu commented

Hi @armancohan,

I ran the script for 1 epoch.

The terminal output is as follows:

  | Name        | Type        | Params
--------------------------------------------
0 | model       | BertModel   | 108 M
1 | triple_loss | TripletLoss | 0
/opt/conda/envs/specter1/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 256 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/opt/conda/envs/specter1/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:45: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 256 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 0: : 2784it [16:45,  2.77it/s,Epoch 0: avg_val_loss reached 1.00000 (best 1.00000), saving model to /workspace/SPECTER/specter/scripts/pytorch_lightning_training_script/save/version_0/checkpoints/ep-epoch=0_avg_val_loss-avg_val_loss=1.000.ckpt as top 1: 435it [00:59,  6.88it/s]
/opt/conda/envs/specter1/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
Epoch 0: : 2785it [16:47,  2.76it/s,Epoch 0: avg_val_loss was not in top 1n_loss=1.02, rate=1.99e-5, avg_val_loss=1]
Epoch 0: : 2785it [16:47,  2.76it/s, loss=1.029, v_num=0, val_loss=1, train_loss=1.02, rate=1.99e-5, avg_val_loss=1]

It looks strange to get avg_val_loss=1, and my question is: how do I get the SPECTER model in Hugging Face format from the file ep-epoch=0_avg_val_loss-avg_val_loss=1.000.ckpt?

piegu commented

Hi @armancohan,

To get the SPECTER model in Hugging Face format from the file ep-epoch=0_avg_val_loss-avg_val_loss=1.000.ckpt, can you validate this method?

# Specter is the Lightning module class from the pytorch_lightning training
# script; its .model attribute holds the underlying Hugging Face transformer.
path_to_ckpt = "/workspace/SPECTER/specter/scripts/pytorch_lightning_training_script/save/version_0/checkpoints/ep-epoch=0_avg_val_loss-avg_val_loss=1.000.ckpt"

# PyTorch model wrapped in the Specter Lightning class
model = Specter.load_from_checkpoint(path_to_ckpt)

# files saved: pytorch_model.bin and config.json
model.model.save_pretrained('./hf/')
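As a sanity check (my own sketch, not from the thread), the exported folder should load back with transformers; note that save_pretrained above only writes the model files, so the tokenizer still has to be loaded from the original base model (placeholder id below):

from transformers import AutoModel, AutoTokenizer

hf_model = AutoModel.from_pretrained('./hf/')
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')  # placeholder base model id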