Overview | Installation | Datasets | Examples | How You Can Help
Bio Covid is an open source project to build a knowledge graph to enable research in COVID-19 and related disease areas.
We're excited to release an open source knowledge graph to speed up the research into Covid-19. Our goal is to provide a way for researchers to easily analyse and query large amounts of data and papers related to the virus.
Bio Covid makes it easy to trace information sources quickly and to identify articles and the information they contain. This first release includes entities extracted from Covid-19 papers and from additional datasets covering proteins, genes, disease-gene associations, coronavirus proteins, protein expression, biological pathways, and drugs.
For example, by querying for the virus SARS-CoV-2, we can find the associated human protein, proteasome subunit alpha type-2 (PSMA2), a component of the proteasome, implicated in SARS-CoV-2 replication, and its encoding gene (PSMA2). Additionally, we can identify the drug carfilzomib, a known inhibitor of the proteasome that could therefore be researched as a potential treatment for patients with Covid-19. To support the plausibility of this association and its implications, we can easily identify papers in the Covid-19 literature where this protein has been mentioned.
By examining these specific relationships and their attributes, we are directed to the data sources, including publications. This helps researchers to study the mechanisms of coronaviral infection and the immune response more efficiently, and to find targets for the development of treatments or vaccines. We can also expand our search to include entities such as publications, organisms, proteins and genes.
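The chain described above (virus, human protein, proteasome, drug) can be pictured as a path search over a small graph. The following is only a toy, in-memory sketch in plain Python, not how TypeDB stores or queries the data; the relation labels (`associated-with`, `encoded-by`, `component-of`, `inhibits`) are hypothetical names chosen for illustration, and the four edges are taken from the example in the text.

```python
from collections import defaultdict

# Edges taken from the SARS-CoV-2 / PSMA2 / carfilzomib example in the text.
# The relation labels are illustrative assumptions, not the project's schema.
EDGES = [
    ("SARS-CoV-2", "associated-with", "PSMA2 (protein)"),
    ("PSMA2 (protein)", "encoded-by", "PSMA2 (gene)"),
    ("PSMA2 (protein)", "component-of", "proteasome"),
    ("carfilzomib", "inhibits", "proteasome"),
]


def build_index(edges):
    """Index edges in both directions so the graph can be walked from any node."""
    index = defaultdict(list)
    for source, label, target in edges:
        index[source].append((label, target))
        index[target].append((label + " (inverse)", source))
    return index


def find_paths(index, start, goal, max_hops=4):
    """Depth-first search returning every simple path from start to goal."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        if len(path) > max_hops:
            continue  # give up on paths that grow too long
        for _label, neighbour in index[node]:
            if neighbour not in path:  # never revisit a node on the same path
                stack.append((neighbour, path + [neighbour]))
    return paths


if __name__ == "__main__":
    graph = build_index(EDGES)
    for path in find_paths(graph, "SARS-CoV-2", "carfilzomib"):
        print(" -> ".join(path))
```

In the real knowledge graph the same idea is expressed declaratively in TypeQL, and each relation additionally carries attributes that point back to its data source.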
The project is currently a partnership between GSK, Oxford PharmaGenesis and Vaticle.
The schema that models the underlying knowledge graph, together with the descriptive query language TypeQL, makes writing complex queries straightforward and intuitive. Furthermore, TypeDB's automated reasoning allows Bio Covid to act as an intelligent database of biomedical data for the Covid research field, inferring implicit knowledge from the explicitly stored data. TypeDB Data - Bio Covid can understand biological facts, infer based on new findings and enforce research constraints, all at query (run) time.
Prerequisites: Python >3.6, TypeDB Core 2.3.3, TypeDB Python Client API 2.2.0, Workbase 2.1.2.
Clone this repo:
git clone https://github.com/vaticle/typedb-data-bio-covid.git
Manually download all source datasets and put them in the Datasets folder. You can find the links below.
Set up a virtual environment and install the dependencies:
cd <path/to/typedb-data-bio-covid>/
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Start TypeDB:
typedb server
Start the migrator script:
python migrator.py -n 4 # insert using 4 threads
For help with the migrator script command line options:
python migrator.py -h
Now grab a coffee (or two) while the migrator builds the database and schema for you!
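Before starting a long migration it can save time to confirm that the basics from the steps above are in place. The sketch below is not part of the repository; `preflight` is a hypothetical helper that only checks the Python version and that the Datasets folder exists and contains at least one file. It makes no assumption about which dataset file names the migrator expects.

```python
import sys
from pathlib import Path

MIN_PYTHON = (3, 6)  # the prerequisites ask for a Python version newer than 3.6


def preflight(repo_root, python_version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    version = tuple(python_version or sys.version_info[:2])
    problems = []

    if version <= MIN_PYTHON:
        problems.append("Python %d.%d found, but a version newer than 3.6 is required" % version)

    datasets = Path(repo_root) / "Datasets"
    if not datasets.is_dir():
        problems.append("Datasets folder not found at %s" % datasets)
    elif not any(p.is_file() for p in datasets.rglob("*")):
        problems.append("Datasets folder is empty; download the source datasets first")

    return problems


if __name__ == "__main__":
    issues = preflight(Path.cwd())
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Ready to run: python migrator.py -n 4")
```

Run it from the repository root inside the activated virtual environment; it only reads the file system and does not contact the TypeDB server.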
TypeQL queries can be run in the TypeDB Console, in Workbase, or through the client APIs. However, we encourage running the queries in Workbase for the best visual experience.
# Return drugs associated with genes that are mentioned in the same
# paper as a gene associated with SARS.
match
$virus isa virus, has virus-name "SARS";
$gene isa gene;
$1 ($gene, $virus) isa gene-virus-association;
$2 ($gene, $pub) isa mention;
$3 ($pub, $gene2) isa mention;
$gene2 isa gene;
not {$gene2 is $gene;};
$4 ($gene2, $drug); $drug isa drug;
offset 0; limit 30;
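For readers who prefer the client API route, the query above could be run from Python roughly as follows. This is a minimal sketch, assuming the `typedb-client` 2.x package (imported as `typedb.client`) and a server on the default address `localhost:1729`; it has not been verified against the exact client version listed in the prerequisites. The database name is not stated in this document, so it is left as a required argument, and `build_query` and `fetch_drug_iids` are hypothetical helper names.

```python
# The TypeQL text mirrors the query above. Literal braces are doubled
# because the template is filled in with str.format.
QUERY_TEMPLATE = """
match
$virus isa virus, has virus-name "{virus_name}";
$gene isa gene;
$1 ($gene, $virus) isa gene-virus-association;
$2 ($gene, $pub) isa mention;
$3 ($pub, $gene2) isa mention;
$gene2 isa gene;
not {{$gene2 is $gene;}};
$4 ($gene2, $drug); $drug isa drug;
offset 0; limit {limit};
"""


def build_query(virus_name="SARS", limit=30):
    """Fill the template; reject quotes so the string literal stays well formed."""
    if '"' in virus_name:
        raise ValueError("virus_name must not contain double quotes")
    return QUERY_TEMPLATE.format(virus_name=virus_name, limit=int(limit))


def fetch_drug_iids(database, address="localhost:1729", virus_name="SARS", limit=30):
    """Run the query in a read transaction and return the internal ids of the drugs."""
    # Imported lazily so build_query can be used without the client installed.
    from typedb.client import TypeDB, SessionType, TransactionType

    with TypeDB.core_client(address) as client:
        with client.session(database, SessionType.DATA) as session:
            with session.transaction(TransactionType.READ) as tx:
                answers = tx.query().match(build_query(virus_name, limit))
                # Consume the answer stream before the transaction closes.
                return [answer.get("drug").get_iid() for answer in answers]
```

Calling `fetch_drug_iids` with the name of the database created by the migrator returns up to 30 drug identifiers; Workbase remains the better choice when you want to explore the surrounding relationships visually.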
Currently the datasets we've integrated include:
- CORD-NER: The CORD-19 dataset that the White House released has been annotated and made publicly available. It uses various NER methods to recognise named entities on CORD-19 with distant or weak supervision.
- Uniprot: We’ve downloaded the reviewed human subset, and ingested genes, transcripts and protein identifiers.
- Coronaviruses: This is an annotated dataset of coronaviruses and their potential drug targets put together by Oxford PharmaGenesis based on literature review.
- DGIdb: We’ve taken the Interactions TSV which includes all drug-gene interactions.
- Human Protein Atlas: The Normal Tissue Data includes the expression profiles for proteins in human tissues.
- Reactome: This dataset connects pathways and their participating proteins.
- DisGeNet: We’ve taken the curated gene-disease-associations dataset, which contains associations from Uniprot, CGI, ClinGen, Genomics England, CTD, PsyGeNET and Orphanet.
- SemMed: This is a subset of the SemMed version 4.0 database.
In progress:
- CORD-19: We incorporate the original corpus, which includes peer-reviewed publications from bioRxiv, medRxiv and others.
  - TODO: write migrator script
- TissueNet
  - TODO: ./Migrators/TissueNet/TissueNetMigrator.py is incomplete: it only migrates a single data file and is not called in ./migrator.py.
  - TODO:
We plan to add many more datasets!
This is an ongoing project and we need your help! If you want to contribute, here are some ways you can help:
- Migrate more data sources (e.g. clinical trials, DrugBank, Excelra)
- Extend the schema by adding relevant rules
- Create a website
- Write tutorials and articles for researchers to get started
If you wish to get in touch, please talk to us on the #typedb-data-bio-covid channel on our Discord (link here).