Extract ontology terms referenced from PubMed abstracts as per the MEDLINE/PubMed Baseline Repository by using SciGraph against a set of ontologies.
Using OmniCorp requires the following open source tools:
- Git
- Maven
- Scala and sbt
- wget
On macOS, these can be installed using Homebrew by running the command `brew install git maven scala sbt wget`.
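Before building, it can help to confirm that the tools listed above are actually on your `PATH`. The following is a minimal sketch, not part of OmniCorp itself: the `missing_tools` function is a hypothetical helper, and it assumes Maven is invoked as `mvn`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OmniCorp): print the name of every
# tool passed as an argument that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Executables assumed for the tools listed above (Maven's is `mvn`).
missing=$(missing_tools git mvn scala sbt wget)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```

Any tool reported as missing can then be installed with your package manager before running the `make` targets.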
We need to use a specially modified version of SciGraph in order to carry out text annotations. To install this version locally, run `make SciGraph`. This will download, compile and install the customized SciGraph we use. You will then need to run `make omnicorp-scigraph` to generate the SciGraph instance for the ontologies specified in `ontologies.ofn`.
Extract ontology terms used in the COVID-19 Open Research Dataset (CORD) as tab-delimited files for further processing in COVID-KOP.
In order to generate OmniCORD output files, you should:
- Update the `ROBOCORD_DATE` variable in `Makefile`. You can look up the latest CORD-19 release date on their website.
- Download the CORD-19 dataset by running `make robocord-download`. This will automatically create a directory in the `robocord-datas` directory and download the CORD-19 dataset for `$ROBOCORD_DATE` into that directory.
- Uncompress the dataset by running `make robocord-data`.
- Test the extraction program by running `make robocord-test`. This will extract data from some articles to ensure that the program is working correctly. It will also create a directory in the `robocord-outputs` directory to store the results in. It's usually a good idea to clear the `robocord-output` directory after running the test and confirming that the output files look correct.
- Use `robocord.job` to attempt to run all the jobs on a SLURM cluster. Any number of jobs can be specified, but values of around 4000 seem to work well. Example: `sbatch --array=0-3999 robocord.job`.
- Use RoboCORDManager to re-run any jobs that failed to complete. You can use the `--dry-run` option to see what jobs will be executed before they are run. Jobs are executed using the `robocord-sbatch.sh` script, so modify that if necessary. Example: `srun sbt "runMain org.renci.robocord.RoboCORDManager --job-size 20"`
Currently, we look for terms from the following ontologies:
- Uberon (base) (OWL)
- ChEBI (OWL)
- Cell Ontology (OWL)
- Environment Ontology (OWL)
- Gene Ontology (plus) (OWL)
- NCBITaxon (OWL)
- Relation Ontology (OWL)
- PRotein Ontology (PRO) (OWL)
- Biological Spatial Ontology (OWL)
- Mondo Disease Ontology (OWL)
- The Human Phenotype Ontology (OWL)
- Ontology for Biomedical Investigations (OWL)
- Sequence Ontology (OWL)
- HUGO Gene Nomenclature Committee (OWL)
- Experimental Factor Ontology (OWL)