patent-citation-extraction


Extract references from patents

Use BERT-based models to extract references to scientific literature from patent texts. This repository includes the preprocessing pipeline, the script to train and evaluate the BERT models, and the data used to finetune them.

Read more at Improving reference mining in patents with BERT.

Finetuned models

I finetuned three models for this project:

Requirements

  • python3
  • pip3 install -r requirements.txt

Usage

Below are three example scenarios for using this project.

Train a new model and evaluate

python run_ner.py --data_dir=data/bio --bert_model=bert-base-cased --output_dir=out_base --max_seq_length=64 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.1

This example uses the bert-base-cased model hosted by Hugging Face, which will be downloaded automatically if necessary. You can also use a local model by supplying a path to the --bert_model argument.
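If you only need predictions outside of run_ner.py, a plain transformers token-classification setup accepts the same kind of identifier as --bert_model: either a Hub model name or a local directory. The snippet below is a minimal sketch and not part of this repository; the example sentence is illustrative, and the label names come from whatever model you load (with plain bert-base-cased the classification head is untrained, so a finetuned model is needed for meaningful tags).

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Either a Hugging Face Hub name or a local directory, analogous to --bert_model
model_path = "bert-base-cased"  # or "./path/to/finetuned/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

sentence = "See Smith et al., Nature 521, 436-444 (2015)."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each subword token to its predicted label
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[int(label_id)])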

Run leave-one-out evaluation

python run_ner.py --data_dir=data/bio --bert_model=bert-base-cased --max_seq_length=64 --do_leave_one_out --num_train_epochs 5 --do_eval --warmup_proportion=0.1 --output_dir=out_leave_one_out
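For context, leave-one-out evaluation trains a separate model for each annotated patent, holding that patent out and evaluating on it. The sketch below only illustrates the splitting idea; run_ner.py performs the actual training and evaluation, and the assumption of one annotated file per patent in data/bio is mine.

from pathlib import Path

# Hypothetical layout: one BIO-annotated file per patent
files = sorted(Path("data/bio").glob("*"))
for held_out in files:
    train_files = [f for f in files if f != held_out]
    # Train a fresh model on train_files, then evaluate on the held-out file
    print(f"train on {len(train_files)} files, evaluate on {held_out.name}")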

Use a finetuned model on new data

python run_ner.py --data_dir=data/new_data --bert_model=./path/to/model --task_name=ner --output_dir=out_results --max_seq_length=64 --do_eval --train_ratio=0

Note the --train_ratio=0: no files are held out for training, and evaluation is run on all files in the data directory.
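New data has to follow the same BIO layout as the annotated files in data/bio. As a rough sketch only, the snippet below writes tokenized sentences in the common CoNLL-style format (one token and label per line, blank line between sentences) with placeholder "O" labels; the file name, the use of dummy labels, and the exact column layout are assumptions, so compare against the files in data/bio before running.

from pathlib import Path

# Illustrative input: pre-tokenized sentences from new patent texts
sentences = [
    "See Smith et al., Nature 521, 436-444 (2015).".split(),
]

out_dir = Path("data/new_data")
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "example.txt", "w") as f:  # hypothetical file name
    for sentence in sentences:
        for token in sentence:
            f.write(f"{token} O\n")  # "O" as a placeholder label
        f.write("\n")  # blank line separates sentences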