
Primary language: Scala. License: MIT.


OmniCorp

OmniCorp extracts ontology terms referenced in PubMed abstracts, as provided by the MEDLINE/PubMed Baseline Repository, by running SciGraph against a set of ontologies.

Prerequisites

Using OmniCorp requires the following open source tools:

  • Git
  • Maven
  • Scala and sbt
  • wget

On macOS, these can be installed using Homebrew by running the command: brew install git maven scala sbt wget.
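
Before going further, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch, not part of OmniCorp: missing_tools is a hypothetical helper name, and it assumes Maven is invoked as mvn while the other tools use the executable names listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OmniCorp): print each tool that
# cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: Maven's executable is `mvn`; the rest match the list above.
missing=$(missing_tools git mvn scala sbt wget)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on a single line.
    echo "Missing tools:" $missing
else
    echo "All prerequisites found."
fi
```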

Setting up SciGraph

We need to use a specially modified version of SciGraph in order to carry out text annotations.

To install this version locally, run make SciGraph. This will download, compile and install the customized SciGraph we use.

You will then need to run make omnicorp-scigraph to generate the SciGraph instance for the ontologies specified in ontologies.ofn.
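
The two targets have to run in that order, since the second depends on the customized SciGraph being installed. A minimal sketch of a wrapper follows; run and setup_scigraph are hypothetical helper names and DRY_RUN is an assumed variable, none of which exist in the repository. Only the two make targets come from the text above.

```shell
# Hypothetical helper: print the command instead of running it
# when DRY_RUN=1, otherwise execute it as given.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Install the customized SciGraph, then build the SciGraph instance
# for the ontologies in ontologies.ofn. Stops if the first step fails.
setup_scigraph() {
    run make SciGraph && run make omnicorp-scigraph
}
```

Calling setup_scigraph with DRY_RUN=1 set prints both commands without executing them, which is a cheap way to check the sequence before starting a long build.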

OmniCORD

Extract ontology terms used in the COVID-19 Open Research Dataset (CORD) as tab-delimited files for further processing in COVID-KOP.

In order to generate OmniCORD output files, you should:

  1. Update the ROBOCORD_DATE variable in Makefile. You can look up the latest CORD-19 release date on the CORD-19 website.
  2. Download the CORD-19 dataset by running make robocord-download. This will automatically create a directory in the robocord-datas directory and download the CORD-19 dataset for $ROBOCORD_DATE into that directory.
  3. Uncompress the dataset by running make robocord-data.
  4. Test the extraction program by running make robocord-test. This will extract data from some articles in order to ensure that the program is working correctly. It will also create a directory in the robocord-outputs directory to store the results in. It's usually a good idea to clear the robocord-output directory after running the test and ensuring that the output files look correct.
  5. Use robocord.job to attempt to run all the jobs on a SLURM cluster. Any number of jobs can be specified, but values of around 4000 seem to work well. Example: sbatch --array=0-3999 robocord.job.
  6. Use RoboCORDManager to re-run any jobs that failed to complete. You can use the --dry-run option to see what jobs will be executed before they are run. Jobs are executed using the robocord-sbatch.sh script, so modify that if necessary. Example: srun sbt "runMain org.renci.robocord.RoboCORDManager --job-size 20
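
The --array range in step 5 follows directly from the number of jobs: the example starts at index 0, so N jobs correspond to the range 0 to N-1. A minimal sketch of that arithmetic follows; array_range is a hypothetical helper name, not something shipped with OmniCorp.

```shell
# Hypothetical helper: print the sbatch --array range for N jobs,
# starting at index 0 as in the example above (4000 jobs -> 0-3999).
array_range() {
    case "$1" in
        ''|*[!0-9]*)
            echo "array_range: expected a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$1" -lt 1 ]; then
        echo "array_range: need at least one job" >&2
        return 1
    fi
    echo "0-$(($1 - 1))"
}

# Example submission (commented out so that sourcing this file
# does not submit anything):
# sbatch --array="$(array_range 4000)" robocord.job
```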

Ontologies used

Currently, we look for terms from the following ontologies: