Biomedical-Relation-Classification
To understand what relationship types are possible and map unstructured natural language descriptions onto these structured classes, we used labeled sentences in Percha B 2018 to classify relations bewteen chemicals, genes and diseases.
Requirements
torch='1.3.1'
transformers='2.7.0'
keras
Dataset
All train dataset can be found here. You can download them manually and put them in the ./data
.
We used Kaggle-COVID-19-dataset to extract sentences what contains more than two biomedical entities and put them in ./data/COVID-19
. If you have a new dataset, please save it in ./data/YourDatasetName
, the format as follows.
start_entity | end_entity | start_entity_type | end_entity_type | marked_sentence |
---|---|---|---|---|
COVID-19 | Pharyngitis | Disease | Disease | end_entity, bronchitis, and start_entity represent the most common respiratory tract infections |
remdesivir | MERS | Chemical | Disease | The data presented here support testing of the efficacy of start_entity treatment in the context of a end_entity clinical trial |
remdesivir | COVID-19 | Chemical | Disease | Drugs are possibly effective for end_entity include: start_entity, lopinavir ritonavir, lopinavir ritonavir combined with interferon-, convalescent plasma, and monoclonal antibodies |
Pretrained model
We used ./pretrained/bert-base-cased
as pretrained model. You can use other pretrained bert model by specifying --bert_path.
Training
Parameters:
--task_type
: task type:chemical-disease,chemical-gene,gene-disease.
--confidence_limit
: dependency path lower confidence limit, use suggestion value if it equal -1.0. suggestion value:0.9 for chemical-disease; 0.5 for chemical-gene; 0.6 for gene-disease; 0.9 for gene-gene.
--prediction_path
: prediction data path.
--max_seq_len
: padding length of sequence.
--bert_path
: pretrained bert model path.
--lr
: learning rate.
--train_bs
: train batch size.
--eval_bs
: evaluate batch size.
--epochs
: training epochs.
--cuda
: which gpu be used.
Fine-tuning models:
./model/chemical-disease
./model/chemical-gene
./model/gene-disease
./model/gene-gene
Reproducing results
python3 main.py --task_type chemical-disease --prediction_path COVID-19 --max_seq_len 64 --bert_path ./pretrained/bert-base-cased --lr 1e-5 --train_bs 128 --eval_bs 64 --epochs 10 --cuda 0
COVID-19 knowledge graph
Acknowledgement
This work is supported by Aladdin Healthcare Technologies