Use BERT-based models to extract references to scientific literature from patent texts. This repository includes the preprocessing pipeline, the script to train and evaluate the BERT models, and the data used to finetune the models.
Read more in the paper *Improving reference mining in patents with BERT*.
I finetuned three models for this project:
- Based on BERT
- Based on BioBERT
- Based on SciBERT (this one seems to perform the best, though the differences are small; see the example below)
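
For instance, to finetune on SciBERT instead of plain BERT, you can point the `--bert_model` argument of the training command below at a SciBERT checkpoint. A minimal sketch, assuming the Hugging Face identifier `allenai/scibert_scivocab_cased` (an assumption; use whichever checkpoint matches the library version pinned in requirements.txt, or a local copy):

```
# Sketch: finetune on SciBERT rather than bert-base-cased.
# The model identifier is an assumption; a local path works as well.
python run_ner.py --data_dir=data/bio --bert_model=allenai/scibert_scivocab_cased \
  --output_dir=out_scibert --max_seq_length=64 --do_train --num_train_epochs 5 \
  --do_eval --warmup_proportion=0.1
```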
This project requires Python 3. Install the dependencies with:

```
pip3 install -r requirements.txt
```
Below are three example scenarios for using this project.
To train and evaluate a model:

```
python run_ner.py --data_dir=data/bio --bert_model=bert-base-cased --output_dir=out_base --max_seq_length=64 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1
```
This example uses the `bert-base-cased` model hosted by Hugging Face, which will be downloaded automatically if necessary. You can also use a local model by supplying a path to the `--bert_model` argument, as in the sketch below.
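
A sketch of training from a locally stored base model, where `./models/bert-base-cased` is a hypothetical path:

```
# Hypothetical local directory containing a saved BERT model.
python run_ner.py --data_dir=data/bio --bert_model=./models/bert-base-cased \
  --output_dir=out_base_local --max_seq_length=64 --do_train --num_train_epochs 5 \
  --do_eval --warmup_proportion=0.1
```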
To run leave-one-out training and evaluation:

```
python run_ner.py --data_dir=data/bio --bert_model=bert-base-cased --max_seq_length=64 --do_leave_one_out --num_train_epochs 5 --do_eval --warmup_proportion=0.1 --output_dir=out_leave_one_out
```
To evaluate an existing (local) model on a new dataset without training:

```
python run_ner.py --data_dir=data/new_data --bert_model=./path/to/model --task_name=ner --output_dir=out_results --max_seq_length=64 --do_eval --train_ratio=0
```
Note the `--train_ratio=0` argument: none of the data files are held out for training, and evaluation runs on all of them.
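
Conversely, a nonzero value should reserve part of the data for training. A sketch, under the assumption that `--train_ratio` takes a fraction between 0 and 1 (check run_ner.py's argument parser to confirm):

```
# Assumption: keep 80% of the data files for training and evaluate on the rest.
python run_ner.py --data_dir=data/new_data --bert_model=./path/to/model --task_name=ner \
  --output_dir=out_results --max_seq_length=64 --do_train --num_train_epochs 5 \
  --do_eval --train_ratio=0.8
```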