Deep Keyphrase extraction using BERT.
This model requires a lot of computational power to work correctly, so the notebook is set up to run only in a Colab environment.
If you want to run it on your local machine, follow these steps:
- Clone this repository and install `transformers` with this command: `pip3 install transformers`
- From the `bert` repo, untar the weights (rename their weight dump file to `pytorch_model.bin`) and the vocab file into a new folder `model`. (This step can be skipped if resources are limited, but doing so leads to poor performance.)
- Change the parameters accordingly in `experiments/base_model/params.json`. We recommend a batch size of 4 and a sequence length of 512, with 6 epochs, if the GPU's VRAM is around 11 GB (see the sketch after this list).
- For training, run the command: `python train.py --data_dir data/task1/ --bert_model_dir model/ --model_dir experiments/base_model`
- For evaluation, run the command: `python evaluate.py --data_dir data/task1/ --bert_model_dir model/ --model_dir experiments/base_model --restore_file best`
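
As an illustration of the parameter step above, here is a hypothetical way to apply the recommended settings programmatically. The key names (`batch_size`, `max_seq_length`, `num_epochs`) are assumptions for this sketch and may differ from the keys the repo's `params.json` actually uses:

```python
# Hypothetical sketch: write the recommended settings into params.json.
# The key names below are assumptions, not necessarily the repo's actual keys.
import json

params_path = "experiments/base_model/params.json"

with open(params_path) as f:
    params = json.load(f)

params["batch_size"] = 4        # recommended for ~11 GB of GPU VRAM
params["max_seq_length"] = 512
params["num_epochs"] = 6

with open(params_path, "w") as f:
    json.dump(params, f, indent=4)
```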
We used the IO tagging format here: each token is labeled I if it is inside a keyphrase and O otherwise. Unlike the original BERT repo, we only use a simple linear layer on top of the token embeddings.
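
A minimal sketch of that idea (not the repository's actual model code), assuming the Hugging Face `transformers` API installed above; class and variable names are illustrative:

```python
# Sketch: BERT produces one embedding per token; a single linear layer maps each
# embedding to IO tag logits (I = inside a keyphrase, O = outside).
import torch.nn as nn
from transformers import BertModel


class BertIOTagger(nn.Module):
    def __init__(self, bert_model_dir: str, num_tags: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_dir)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
        return self.classifier(token_embeddings)      # (batch, seq_len, num_tags)
```

Keyphrases are then read off as maximal runs of tokens predicted as I.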
On the test set, we got:
- F1 score: 0.34
- Precision: 0.45
- Recall: 0.27
- Support: 921
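
As a quick sanity check, the reported F1 is consistent with the precision and recall above (F1 = 2PR / (P + R)):

```python
# F1 from the reported precision and recall.
precision, recall = 0.45, 0.27
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.34
```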
- Some tokens have more than one annotation; we did not consider multi-label classification.
- We only considered a linear layer on top of BERT embeddings. We need to see whether SciBERT + BiLSTM + CRF makes a difference.
- SciBERT: https://github.com/allenai/scibert
- HuggingFace: https://github.com/huggingface/pytorch-pretrained-BERT
- PyTorch NER: https://github.com/lemonhu/NER-BERT-pytorch
- BERT: https://github.com/google-research/bert