Deep Keyphrase extraction using BERT.
This model requires a lot of computational power to work correctly, so the notebook is set up to run only in a Colab environment.
If you want to run it on your local machine, follow these steps:
- Clone this repository and install `transformers` with this command: `pip3 install transformers`
- From the `bert` repo, untar the weights (rename their weight dump file to `pytorch_model.bin`) and the vocab file into a new folder `model`. (This step can be skipped if resources are limited, but doing so leads to poor performance.)
- Change the parameters accordingly in `experiments/base_model/params.json`. We recommend a batch size of 4 and a sequence length of 512, with 6 epochs, if the GPU's VRAM is around 11 GB (see the sketch after this list).
- For training, run the command: `python train.py --data_dir data/task1/ --bert_model_dir model/ --model_dir experiments/base_model`
- For evaluation, run the command: `python evaluate.py --data_dir data/task1/ --bert_model_dir model/ --model_dir experiments/base_model --restore_file best`
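
As an illustration of the parameter step above, here is a hypothetical way to apply the recommended settings programmatically. The key names (`batch_size`, `max_seq_length`, `num_epochs`) are assumptions for this sketch and may differ from the keys the repo's `params.json` actually uses:

```python
# Hypothetical sketch: write the recommended settings into params.json.
# The key names below are assumptions, not necessarily the repo's actual keys.
import json

params_path = "experiments/base_model/params.json"

with open(params_path) as f:
    params = json.load(f)

params["batch_size"] = 4        # recommended for ~11 GB of GPU VRAM
params["max_seq_length"] = 512
params["num_epochs"] = 6

with open(params_path, "w") as f:
    json.dump(params, f, indent=4)
```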
We used the IO tagging format here: each token is labeled I if it is inside a keyphrase and O otherwise. Unlike the original BERT repo, we only use a simple linear layer on top of the token embeddings.
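
A minimal sketch of that idea (not the repository's actual model code), assuming the Hugging Face `transformers` API installed above; class and variable names are illustrative:

```python
# Sketch: BERT produces one embedding per token; a single linear layer maps each
# embedding to IO tag logits (I = inside a keyphrase, O = outside).
import torch.nn as nn
from transformers import BertModel


class BertIOTagger(nn.Module):
    def __init__(self, bert_model_dir: str, num_tags: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_dir)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
        return self.classifier(token_embeddings)      # (batch, seq_len, num_tags)
```

Keyphrases are then read off as maximal runs of tokens predicted as I.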
On the test set, we got:
- F1 score: 0.34
- Precision: 0.45
- Recall: 0.27
- Support: 921
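
As a quick sanity check, the reported F1 is consistent with the precision and recall above (F1 = 2PR / (P + R)):

```python
# F1 from the reported precision and recall.
precision, recall = 0.45, 0.27
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.34
```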
- Some tokens have more than one annotation; we did not consider multi-label classification.
- We only considered a linear layer on top of BERT embeddings. We need to see whether SciBERT + BiLSTM + CRF makes a difference.
- SciBERT: https://github.com/allenai/scibert
- HuggingFace: https://github.com/huggingface/pytorch-pretrained-BERT
- PyTorch NER: https://github.com/lemonhu/NER-BERT-pytorch
- BERT: https://github.com/google-research/bert