
Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT

Authors: Zaiqiao Meng, Fangyu Liu, Thomas Hikaru Clark, Ehsan Shareghi, Nigel Collier.

Code for our paper Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT (EMNLP 2021).

News:

[26 August 2021] - Our paper has been accepted to appear at EMNLP 2021 as a short paper.


Introduction

Infusing factual knowledge into pre-trained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that can handle a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate MoP with three biomedical BERTs (SciBERT, BioBERT, PubMedBERT) on six downstream tasks (including NLI, QA, and classification). The results show that MoP consistently improves the task performance of the underlying BERTs and achieves new SOTA results on five of the evaluated datasets.
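For intuition, the sketch below (PyTorch) shows one way the mixture idea can be realised: each sub-graph gets its own lightweight bottleneck adapter, and a small gating layer produces softmax mixture weights that combine the adapter outputs on top of the BERT hidden states. This is only an illustrative sketch under assumed names and sizes (SubGraphAdapter, AdapterMixture, the bottleneck width, and the temperature-scaled softmax gate are all hypothetical), not the repository's implementation, which lives in src/adapter-transformers and src/knowledge_infusion.

# Illustrative sketch only (not the repository code): a mixture over
# partition-specific adapters applied to BERT hidden states.
import torch
import torch.nn as nn

class SubGraphAdapter(nn.Module):
    """Lightweight bottleneck adapter for one KG partition (hypothetical class)."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))  # residual adapter

class AdapterMixture(nn.Module):
    """Combines K partition adapters with temperature-scaled softmax weights."""
    def __init__(self, hidden_size: int, n_partitions: int, temperature: float = 1.0):
        super().__init__()
        self.adapters = nn.ModuleList(
            [SubGraphAdapter(hidden_size) for _ in range(n_partitions)]
        )
        self.gate = nn.Linear(hidden_size, n_partitions)
        self.temperature = temperature

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_size) hidden states from the underlying BERT
        outputs = torch.stack([adapter(h) for adapter in self.adapters], dim=-2)
        weights = torch.softmax(self.gate(h) / self.temperature, dim=-1)
        return (weights.unsqueeze(-1) * outputs).sum(dim=-2)

# Example: 20 partitions (matching the S20Rel setup below), BERT-base hidden size 768.
mixture = AdapterMixture(hidden_size=768, n_partitions=20)
hidden_states = torch.randn(2, 16, 768)
print(mixture(hidden_states).shape)  # torch.Size([2, 16, 768])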



File structure

  • data_dir: downstream task datasets used in the experiments.
  • kg_dir: folder to save the knowledge graphs as well as the partitioned files.
  • model_dir: folder to save pre-trained models.
  • src: source code.
    • adapter-transformers: a fork of adapter-transformers v1.1.1, modified to support different mixture approaches.
    • evaluate_tasks: code for the downstream tasks.
    • knowledge_infusion: main code for knowledge infusion.

kg_dir and model_dir can be downloaded at this link.

Installation

The code is tested with Python 3.8.5, torch 1.7.0 and Hugging Face transformers 3.5.0; please see requirements.txt for more details. Our models use a modified adapter-transformers. To install it, run pip install . inside the ./src/adapter-transformers folder of this project.
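For a quick sanity check of the environment, a short script like the one below can be used. This is only a sketch: it assumes the modified adapter-transformers fork is importable under the usual transformers module name (as adapter-transformers releases of that generation install), and the exact version strings may differ slightly from those listed above.

# Environment sanity check for the versions mentioned above (Python 3.8.5,
# torch 1.7.0, transformers 3.5.0). Assumption: the modified adapter-transformers
# fork installs under the usual `transformers` module name.
import sys

import torch
import transformers

print("python      :", sys.version.split()[0])    # expected: 3.8.x
print("torch       :", torch.__version__)         # expected: 1.7.0
print("transformers:", transformers.__version__)  # expected: ~3.5.x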

Datasets

Train knowledge infusion and downstream tasks

Train Knowledge Infusion

To train knowledge infusion, you can run the following command in the src/knowledge_infusion/entity_prediction folder.

MODEL="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
TOKENIZER="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
INPUT_DIR="kg_dir"
OUTPUT_DIR="model_dir"
DATASET_NAME="S20Rel"
ADAPTER_NAMES="entity_predict"
PARTITION=20

python run_pretrain.py \
--model $MODEL \
--tokenizer $TOKENIZER \
--input_dir $INPUT_DIR \
--data_name $DATASET_NAME \
--output_dir $OUTPUT_DIR \
--n_partition $PARTITION \
--use_adapter \
--non_sequential \
--adapter_names $ADAPTER_NAMES \
--amp \
--cuda \
--num_workers 32 \
--max_seq_length 64 \
--batch_size 256 \
--lr 1e-04 \
--epochs 1 \
--save_step 2000

Train Downstream Tasks

To evaluate the model on a downstream task, go to the corresponding task folder and see its *.sh file for an example. For instance, the following command trains a model on the PubMedQA dataset over different shuffle_rates.

MODEL="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
TOKENIZER="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
ADAPTER_NAMES="entity_predict"
PARTITION=20
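# set INPUT_DIR and OUTPUT_DIR before running (e.g. the task's data folder and a folder for saved models)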
python run_pretrain.py \
 --model $MODEL \
 --tokenizer $TOKENIZER \
 --input_dir $INPUT_DIR \
 --output_dir $OUTPUT_DIR \
 --n_partition $PARTITION \
 --use_adapter \
 --non_sequential \
 --adapter_names $ADAPTER_NAMES \
 --amp \
 --cuda \
 --num_workers 32 \
 --max_seq_length 64 \
 --batch_size 256 \
 --bi_direction \
 --lr 1e-04 \
 --epochs 2 \
 --save_step 2000
# loop over shuffle_rate values as needed (see the task's *.sh script)

Hyper-parameters

Pre-train

Parameter        Value
lr               1e-04
epoch            1-2
batch_size       256
max_seq_length   64

BioASQ7b, BioASQ8b, PubMedQA

Parameter        Value
lr               1e-05
epoch            25
patient          5
batch_size       8
max_seq_length   512
repeat_run       10

MedQA

Parameter        Value
lr               1e-05, 2e-05
epoch            25
patient          5
batch_size       12
max_seq_length   512
repeat_run       3
temperature      1

MedNLI

Parameter        Value
lr               1e-05
epoch            25
patient          5
batch_size       16
max_seq_length   256
repeat_run       3
temperature      -15, -10, 1

HoC

Parameter        Value
lr               1e-05, 3e-05
epoch            25
patient          5
batch_size       16, 32
max_seq_length   256
repeat_run       5
temperature      1

If you find our paper and resources useful, please kindly cite our paper:

@inproceedings{meng2021mixture,
  title={Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT},
  author={Meng, Zaiqiao and Liu, Fangyu and Clark, Thomas and Shareghi, Ehsan and Collier, Nigel},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={4672--4681},
  year={2021}
}

Contact

If you have any questions, feel free to contact me at zm324@cam.ac.uk.