PheneBank aims at automatic extraction and validation of a database of human phenotype-disease associations in the scientific literature. This package provides code, data, and models for the following three purposes:
Tag the entities in an input text. The tagging model is trained to support 9 categories of entities:
- Phenotype
- Disease
- Anatomy
- Cell
- Cell_line
- GPR
- Gene_variant
- Molecule
- Pathway
Map an entity to its corresponding concept in any of the following 5 ontologies:
- SNOMED (Phenotype, Disease, GPR, Anatomy, Molecule, Cell, Cell_line, Gene_variant)
- HPO (Phenotype, Disease)
- MESH (Phenotype, Disease)
- FMA (Anatomy)
- PRO (GPR)
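The coverage listed above can be written down as a small lookup table, which makes it easy to check which ontologies an entity type can be mapped to. This is only a transcription of the list for illustration; the names `ONTOLOGY_ENTITY_TYPES` and `ontologies_for` are hypothetical and not part of the package.

```python
# Transcription of the ontology coverage listed above.
# NOTE: this table and the helper below are illustrative, not PheneBank code.
ONTOLOGY_ENTITY_TYPES = {
    "SNOMED": ["Phenotype", "Disease", "GPR", "Anatomy", "Molecule",
               "Cell", "Cell_line", "Gene_variant"],
    "HPO": ["Phenotype", "Disease"],
    "MESH": ["Phenotype", "Disease"],
    "FMA": ["Anatomy"],
    "PRO": ["GPR"],
}


def ontologies_for(entity_type):
    """Return the ontologies that list the given entity type, in table order."""
    return [name for name, types in ONTOLOGY_ENTITY_TYPES.items()
            if entity_type in types]


print(ontologies_for("Phenotype"))  # ['SNOMED', 'HPO', 'MESH']
print(ontologies_for("Pathway"))    # [] (Pathway appears under no ontology above)
```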
Given an input text, extract its entities and map each to its corresponding concept in the ontologies (a pipeline containing both previous stages).
- To be activated soon!
Download the following:
- embeddings.zip (around 1.1GB)
- data.zip (around 200MB)
To get started with the pipeline, first obtain the required data and decompress it in the project directory.
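If you prefer to decompress the archives from Python rather than with a desktop tool, a minimal sketch using the standard `zipfile` module could look like this. The helper name `extract_archives` is hypothetical, and the sketch assumes the two archives have already been downloaded into the project directory.

```python
import zipfile
from pathlib import Path


def extract_archives(archive_paths, target_dir="."):
    """Extract each existing zip archive into target_dir.

    Returns the file names of the archives that were extracted;
    paths that do not exist are skipped silently.
    """
    extracted = []
    for archive in archive_paths:
        archive = Path(archive)
        if not archive.is_file():
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target_dir)
        extracted.append(archive.name)
    return extracted


# Example (run from the project directory):
# extract_archives(["embeddings.zip", "data.zip"])
```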
Then, import pipeline into your project:
from pipeline import pipe
pp = pipe()
input_text = "Risk factors for recurrent respiratory infections in preschool children in China."
Find entities in an input text:
pp.tag(input_text)
The output will look like the following (formatted for clarity): a list of tuples, one tuple per sentence. Each tuple contains two lists: the words and their corresponding tags.
[
(['Risk', 'factors', 'for', 'recurrent', 'respiratory', 'infections', 'in', 'China.'],
['O', 'O', 'O', 'B-Phenotype', 'I-Phenotype', 'I-Phenotype', 'O', 'O'])
]
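The tags in this output follow a B-/I-/O scheme, so multi-word entities have to be reassembled from consecutive tags. The following is a minimal sketch of that grouping, assuming the output format shown above; `bio_to_spans` is a hypothetical helper, not a function provided by the package.

```python
def bio_to_spans(tagged_sentences):
    """Group B-/I- tagged words into (entity_text, entity_type) pairs.

    Expects the format shown above: a list of (words, tags) tuples.
    An I- tag that does not continue an open entity of the same type
    is treated like 'O' and ignored.
    """
    spans = []
    for words, tags in tagged_sentences:
        current_words, current_type = [], None
        for word, tag in zip(words, tags):
            if tag.startswith("B-"):
                # A new entity starts; close any entity still open.
                if current_words:
                    spans.append((" ".join(current_words), current_type))
                current_words, current_type = [word], tag[2:]
            elif (tag.startswith("I-") and current_words
                  and tag[2:] == current_type):
                current_words.append(word)
            else:
                if current_words:
                    spans.append((" ".join(current_words), current_type))
                current_words, current_type = [], None
        # Close an entity that runs to the end of the sentence.
        if current_words:
            spans.append((" ".join(current_words), current_type))
    return spans


example = [
    (['Risk', 'factors', 'for', 'recurrent', 'respiratory', 'infections', 'in', 'China.'],
     ['O', 'O', 'O', 'B-Phenotype', 'I-Phenotype', 'I-Phenotype', 'O', 'O'])
]
print(bio_to_spans(example))
# [('recurrent respiratory infections', 'Phenotype')]
```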
Find entities in the text and harmonise (map) them to their corresponding ontologies:
pp.tag_harmonise(input_text)
The output represents each sentence as a list of tuples. Each tuple has three parts: the word or entity phrase, its tag (Null if it is not an entity), and the list of corresponding concept IDs ([] if no mapping was found).
[
[
('Risk', 'Null', []),
('factors', 'Null', []),
('for', 'Null', []),
('recurrent respiratory infections', 'Phenotype', [('HP:0002205', 1.0)]),
('in', 'Null', []),
('China', 'Null', [])
]
]
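To work with this output downstream, it is often enough to keep only the recognised entities and one concept per entity. The sketch below assumes the format shown above and further assumes that the number paired with each concept ID is a score where higher is better; that interpretation is not stated in this README. `best_concepts` is a hypothetical helper, not part of the package.

```python
def best_concepts(harmonised_sentences):
    """Return (text, tag, concept_id) for every tagged entity.

    concept_id is the ID with the highest paired value, or None when
    the mapping list is empty.
    ASSUMPTION: the second element of each concept pair is a score
    where larger means a better match.
    """
    results = []
    for sentence in harmonised_sentences:
        for text, tag, concepts in sentence:
            if tag == "Null":
                continue  # not an entity
            if concepts:
                concept_id = max(concepts, key=lambda pair: pair[1])[0]
            else:
                concept_id = None
            results.append((text, tag, concept_id))
    return results


harmonised = [
    [
        ('Risk', 'Null', []),
        ('factors', 'Null', []),
        ('for', 'Null', []),
        ('recurrent respiratory infections', 'Phenotype', [('HP:0002205', 1.0)]),
        ('in', 'Null', []),
        ('China', 'Null', []),
    ]
]
print(best_concepts(harmonised))
# [('recurrent respiratory infections', 'Phenotype', 'HP:0002205')]
```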
- Place the new ontology file (e.g., hp.obo) under the data directory.
- Fix the corresponding path in utils/project_config.py.
- Use the ontology_embedding.py script under grounding to create a new semantic embedding.
You can use the following command in the "embeddings" directory to binarise the ontology embedding:
$ ./convertvec txt2bin [embedding.txt] [embedding.bin]
(convertvec script from https://github.com/marekrei/convertvec)
The tagging stage relies on Anago, a Bidirectional LSTM-CRF for Sequence Labeling: https://github.com/Hironsan/anago
M.T. Pilehvar, D. Smedley, A. Bernard, and N. Collier: PheneBank: a literature-based database of phenotypes. Bioinformatics, Volume 38, Issue 4, 2022.