BioKGrapher is a tool for the automatic construction of knowledge graphs (KGs) from large-scale biomedical literature, taking PubMed IDs (PMIDs) as input. Using NLP techniques, it extracts and ranks biomedical concepts and integrates them into structured KGs. It can be used to build specialized KGs that give a conceptual view of a topic of interest, or to export a KG for downstream applications such as predictive modeling, drug repurposing, document classification, retrieval-augmented generation (RAG), and decision support systems.
- Automatic Knowledge Graph Construction: Extracts and integrates biomedical concepts from large PMID sets
- Named Entity Recognition and Linking (NER+NEL): Utilizes MedCAT for identifying and normalizing biomedical concepts using the UMLS Metathesaurus
- Concept Weighting and Re-Ranking: Applies Kullback-Leibler divergence and local frequency weighting to identify prevalent concepts specific to the provided set
- Hierarchical Structuring and Relationship Mapping: Constructs hierarchical knowledge graphs with semantic triples using UMLS's MRHIER and MRREL files
- Evaluation: Evaluates constructed KGs by comparing them with concepts extracted from evidence-based clinical practice guidelines
- Downstream Applications: Demonstrates utility in document classification and in an example drug repurposing task
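To make the concept weighting step more concrete, here is a minimal Python sketch of scoring concepts by their pointwise contribution to the Kullback-Leibler divergence between a local concept distribution (the provided PMID set) and a background distribution. The function name `kl_concept_weights`, the add-one smoothing, and the toy counts are assumptions made for illustration; this is not necessarily the exact formula BioKGrapher implements.

```python
import math
from collections import Counter


def kl_concept_weights(local_counts, background_counts, smoothing=1.0):
    """Score each local concept by p * log(p / q), its term in KL(local || background).

    local_counts / background_counts map a concept ID to its frequency.
    Additive smoothing (an assumption of this sketch) avoids division by zero.
    """
    vocab = set(local_counts) | set(background_counts)
    local_total = sum(local_counts.values()) + smoothing * len(vocab)
    background_total = sum(background_counts.values()) + smoothing * len(vocab)

    weights = {}
    for concept in local_counts:
        p = (local_counts[concept] + smoothing) / local_total
        q = (background_counts.get(concept, 0) + smoothing) / background_total
        weights[concept] = p * math.log(p / q)
    return weights


# Toy frequencies with placeholder concept IDs (not real UMLS CUIs).
local = Counter({"topic_specific": 40, "moderately_common": 30, "generic": 30})
background = Counter({"topic_specific": 50, "moderately_common": 400, "generic": 5000})

ranked = sorted(
    kl_concept_weights(local, background).items(),
    key=lambda item: item[1],
    reverse=True,
)
for concept, weight in ranked:
    print(f"{concept}\t{weight:.3f}")
```

Concepts that are frequent in the local set but rare in the background rank highest, while concepts that are common everywhere receive a low or negative score.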
Clone the Repository:
git clone https://github.com/rtg-wispermed/BioKGrapher.git
Navigate to the project directory:
cd BioKGrapher
Install the requirements:
pip install -r requirements.txt
BioKGrapher requires a valid UMLS license to access and use the UMLS Metathesaurus files. Obtain a license from the UMLS Terminology Services.
Once you have a license, sign in with your NIH / UMLS account and download one of the following public MedCAT models:
- UMLS Full (>4MM concepts, trained self-supervised on MIMIC-III); this is the model used in this work
- SNOMED International (Full SNOMED modelpack trained on MIMIC-III)
Unzip the model into the empty models folder.
Download the Full UMLS Release Files and replace the following UMLS placeholder files with the ones from your UMLS release:
- MRCONSO.RRF
- MRHIER.RRF
- MRREL.RRF
- MRDEF.RRF
It is recommended to use a UMLS release that is the same version as, or newer than, the one used to build the MedCAT model, e.g. UMLS Release 2022AA or newer.
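As a quick sanity check after copying the release files, the sketch below reads MRCONSO.RRF-style rows and collects one preferred English name per CUI. RRF files are pipe-delimited with a trailing pipe on every row, and the column order follows the UMLS documentation for MRCONSO. The helper names `iter_mrconso` and `preferred_names` are hypothetical, the sample rows are synthetic rather than taken from UMLS, and the location of the real file inside the repository is not assumed here.

```python
import io

# Column order of MRCONSO.RRF as documented in the UMLS Reference Manual.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def iter_mrconso(handle, language="ENG"):
    """Yield one dict per row of an MRCONSO-style file, filtered by language."""
    for line in handle:
        fields = line.rstrip("\n").split("|")
        # The trailing pipe yields an extra empty field, which zip() ignores.
        row = dict(zip(MRCONSO_COLUMNS, fields))
        if row["LAT"] == language:
            yield row


def preferred_names(handle):
    """Map each CUI to its first preferred term (TS=P, STT=PF, ISPREF=Y)."""
    names = {}
    for row in iter_mrconso(handle):
        if row["TS"] == "P" and row["STT"] == "PF" and row["ISPREF"] == "Y":
            names.setdefault(row["CUI"], row["STR"])
    return names


# Synthetic rows for illustration only; with a real release, use
# open("MRCONSO.RRF", encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|Example concept|0|N||\n"
    "C0000001|ENG|S|L0000002|PF|S0000002|Y|A0000002||||SRC|SY|X1|Example synonym|0|N||\n"
    "C0000002|SPA|P|L0000003|PF|S0000003|Y|A0000003||||SRC|PT|X2|Concepto de ejemplo|0|N||\n"
)
names = preferred_names(sample)
print(names)
```

If every row in a real file has the expected number of fields and yields sensible names, the placeholder was most likely replaced correctly.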
Navigate to the index/baseline folder:
cd index/baseline
Download the PubMed baseline files:
wget -nc ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/*.xml.gz
Optionally, also download the update files to include the most recent publications:
wget -nc ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/*.xml.gz
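To illustrate what the downloaded files contain, the following sketch streams one PubMed `.xml.gz` file and pulls out the PMID, title, and abstract of selected articles using only the Python standard library. The function name `iter_abstracts` and the in-memory sample document are assumptions for demonstration; how BioKGrapher itself indexes these files is not described here.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def _text(node):
    """Return the full text of an element, including inline markup, or ''."""
    return "".join(node.itertext()).strip() if node is not None else ""


def iter_abstracts(xml_gz_file, wanted_pmids=None):
    """Yield (pmid, title, abstract) tuples from a PubMed .xml.gz file.

    xml_gz_file may be a path or a binary file object. If wanted_pmids is
    given, only articles whose PMID is in that set are yielded.
    """
    with gzip.open(xml_gz_file, "rb") as handle:
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "PubmedArticle":
                continue
            pmid = elem.findtext("MedlineCitation/PMID")
            if wanted_pmids is None or pmid in wanted_pmids:
                title = _text(elem.find("MedlineCitation/Article/ArticleTitle"))
                parts = elem.iterfind("MedlineCitation/Article/Abstract/AbstractText")
                abstract = " ".join(_text(part) for part in parts)
                yield pmid, title, abstract
            elem.clear()  # free memory; baseline files are large


# Tiny synthetic document standing in for a real baseline file.
sample_xml = """<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID Version="1">1</PMID><Article>
<ArticleTitle>First <i>toy</i> title</ArticleTitle>
<Abstract><AbstractText Label="BACKGROUND">Part one.</AbstractText>
<AbstractText Label="RESULTS">Part two.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID Version="1">2</PMID><Article>
<ArticleTitle>Second title</ArticleTitle>
</Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

buffer = io.BytesIO(gzip.compress(sample_xml.encode("utf-8")))
records = list(iter_abstracts(buffer, wanted_pmids={"1"}))
print(records)
```

With real data, pass the path of a downloaded `.xml.gz` file instead of the in-memory buffer. Streaming with `iterparse` keeps memory use low even though each file holds many thousands of articles.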