/release-orthopairs

Data-release-pipeline: Retrieval of homology data (human to other species) from PANTHER (http://www.pantherdb.org/)

Primary Language: Java

Orthopairs: Protein ortholog file generation

To see how to run this script, check out the Running Orthopairs section.

This step has been rewritten in Java and completely overhauled from the Perl version.

The overall goal of Orthopairs remains the same: For each of Reactome's model organisms, find all human-model organism protein orthologs. These orthologs are the foundation of the Orthoinference step, which produces all electronically inferred ReactionlikeEvents and Pathways in the knowledgebase.

Old Orthopairs

When the script was initially written, getting protein orthologs programmatically wasn't simple. Reactome accomplished it through three steps:

  1. From Ensembl BioMart, obtain all protein-gene relationships for Human.

For each model organism:

  1. From Ensembl BioMart, obtain all gene-protein relationships for the organism.
     • BioMart has become increasingly unstable as Ensembl's focus has moved to their RESTful API. Unfortunately, the information we want is not available through the API.
  2. From Ensembl Compara, obtain all human-organism gene orthologs.
     • While Compara is stable, the process of obtaining the gene orthologs typically took around 24 hours.
  3. With the organism's gene-protein list, the human protein-gene list and the human-organism gene ortholog list, map the organism's proteins to the human proteins.
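The final mapping step above amounts to a three-way join. The following is a minimal sketch of that join, not the original Perl logic; the class name, method name and placeholder IDs are all hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the old three-way join; not taken from the Reactome code base.
final class OldOrthopairsJoin {

    /**
     * humanProteinToGenes:    human protein ID  -> human gene IDs
     * geneOrthologs:          human gene ID     -> organism gene IDs
     * organismGeneToProteins: organism gene ID  -> organism protein IDs
     * Returns human protein ID -> orthologous organism protein IDs.
     */
    static Map<String, Set<String>> mapProteins(
            Map<String, List<String>> humanProteinToGenes,
            Map<String, List<String>> geneOrthologs,
            Map<String, List<String>> organismGeneToProteins) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : humanProteinToGenes.entrySet()) {
            Set<String> orthologProteins = new TreeSet<>();
            for (String humanGene : entry.getValue()) {
                // Follow human gene -> organism gene -> organism protein.
                for (String organismGene : geneOrthologs.getOrDefault(humanGene, List.of())) {
                    orthologProteins.addAll(
                            organismGeneToProteins.getOrDefault(organismGene, List.of()));
                }
            }
            // Human proteins without any ortholog are left out of the result.
            if (!orthologProteins.isEmpty()) {
                result.put(entry.getKey(), orthologProteins);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder identifiers, not real accessions.
        Map<String, Set<String>> mapping = mapProteins(
                Map.of("HUMAN_PROTEIN_1", List.of("HUMAN_GENE_1")),
                Map.of("HUMAN_GENE_1", List.of("MOUSE_GENE_1")),
                Map.of("MOUSE_GENE_1", List.of("MOUSE_PROTEIN_1", "MOUSE_PROTEIN_2")));
        System.out.println(mapping);
    }
}
```

Because every lookup depends on two BioMart downloads and one Compara download, a failure or delay in any of them blocked the whole step.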

New Orthopairs

Due to the above-mentioned shortcomings of the original script, as well as the introduction of a new resource, PANTHER, the algorithm for the step has been overhauled. PANTHER provides quarterly data releases in easily parsed flat files that bundle both Gene and Protein ortholog information together. Because a single file replaces the BioMart and Compara queries, the runtime of the script is significantly reduced. The files also provide information about how diverged the ortholog is from the source species (in Reactome's case, Human), which gives more confidence in the computational inferences we provide to our users. Additionally, the Protein IDs are all from UniProt, which simplifies things during the inference process and harmonizes Reactome's protein ID linkouts.
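To illustrate why a flat file is easy to work with, here is a minimal parsing sketch. The line layout is an assumption made for this example, not something stated in this README: tab-separated columns where the first two have the form `ORGANISM|GeneDb=GeneId|UniProtKB=Accession` and the third is an ortholog type code. The class name, field names and sample line are hypothetical; check the layout against the PANTHER file in use before relying on it.

```java
// Hypothetical sketch: parses one line of an ASSUMED PANTHER-style ortholog file.
final class PantherLineSketch {

    /** One side of an ortholog pair. */
    static final class Member {
        final String organism;
        final String geneDb;
        final String geneId;
        final String uniprotId;

        Member(String organism, String geneDb, String geneId, String uniprotId) {
            this.organism = organism;
            this.geneDb = geneDb;
            this.geneId = geneId;
            this.uniprotId = uniprotId;
        }
    }

    /** A source/target pair plus the ortholog type column. */
    static final class Pair {
        final Member source;
        final Member target;
        final String orthologType;

        Pair(Member source, Member target, String orthologType) {
            this.source = source;
            this.target = target;
            this.orthologType = orthologType;
        }
    }

    // Assumed field layout: ORGANISM|GeneDb=GeneId|UniProtKB=Accession
    static Member parseMember(String field) {
        String[] parts = field.split("\\|");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 '|'-separated parts: " + field);
        }
        int geneSep = parts[1].indexOf('=');
        int proteinSep = parts[2].indexOf('=');
        if (geneSep < 0 || proteinSep < 0) {
            throw new IllegalArgumentException("Expected 'Db=Id' in: " + field);
        }
        // Split on the first '=' only, since the ID itself may contain '='.
        return new Member(parts[0],
                parts[1].substring(0, geneSep),
                parts[1].substring(geneSep + 1),
                parts[2].substring(proteinSep + 1));
    }

    static Pair parseLine(String line) {
        String[] columns = line.split("\t");
        if (columns.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 tab-separated columns");
        }
        return new Pair(parseMember(columns[0]), parseMember(columns[1]), columns[2]);
    }

    public static void main(String[] args) {
        // Made-up sample line with placeholder IDs.
        Pair pair = parseLine(
                "HUMAN|HGNC=0000|UniProtKB=P00000\tMOUSE|MGI=MGI=0000|UniProtKB=Q00000\tLDO");
        System.out.println(pair.source.uniprotId + " -> " + pair.target.uniprotId
                + " (" + pair.orthologType + ")");
    }
}
```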

There is a catch with using this new resource, though: unlike Protein IDs, Gene IDs in PANTHER don't come from a single resource. The old Orthopairs system provided all Gene IDs from Ensembl, which made those linkouts simpler. The PANTHER file provides a mix of Ensembl IDs and IDs from Model Organism Databases (MODs). As a result, we've had to add a step that maps all MOD IDs to Ensembl IDs to ensure our linkouts are stable. This entails downloading mapping files from various MODs, which adds some complexity to the code and creates a potential source of future issues, since the step now depends on those files.
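The MOD-to-Ensembl step can be pictured as a lookup table applied to every non-Ensembl gene ID. The sketch below assumes a simplified two-column, tab-separated mapping (MOD ID, then Ensembl ID); the real MOD files each have their own layout and need their own parsing. The class name, method names, the "Ensembl" database label and all IDs are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of MOD ID -> Ensembl ID resolution; not the actual Orthopairs code.
final class ModToEnsemblSketch {

    /** Builds a MOD ID -> Ensembl ID table from ASSUMED "modId<TAB>ensemblId" lines. */
    static Map<String, String> loadMapping(List<String> lines) {
        Map<String, String> modToEnsembl = new HashMap<>();
        for (String line : lines) {
            String[] columns = line.split("\t");
            // Skip rows that lack either identifier.
            if (columns.length >= 2 && !columns[0].isEmpty() && !columns[1].isEmpty()) {
                modToEnsembl.put(columns[0], columns[1]);
            }
        }
        return modToEnsembl;
    }

    /** Ensembl IDs pass through unchanged; MOD IDs are looked up and may be unmapped. */
    static Optional<String> toEnsemblId(String geneDb, String geneId,
                                        Map<String, String> modToEnsembl) {
        if ("Ensembl".equals(geneDb)) {
            return Optional.of(geneId);
        }
        return Optional.ofNullable(modToEnsembl.get(geneId));
    }

    public static void main(String[] args) {
        // Placeholder IDs for illustration only.
        Map<String, String> mapping = loadMapping(List.of("MOD_ID_1\tENSEMBL_ID_1"));
        System.out.println(toEnsemblId("MGI", "MOD_ID_1", mapping));
        System.out.println(toEnsemblId("MGI", "MOD_ID_2", mapping));
    }
}
```

Returning an `Optional` makes unmapped MOD IDs explicit, so the caller decides whether to skip or report them.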

Model Organism Databases used in new Orthopairs

Mouse Genome Informatics (MGI) -- Mapping file: http://www.informatics.jax.org/downloads/reports/HGNC_homologene.rpt

Rat Genome Database (RGD) -- Mapping file: https://download.rgd.mcw.edu/data_release/GENES.RAT.txt

Xenbase (Frog) -- Mapping file: https://ftp.xenbase.org/pub/GenePageReports/GenePageEnsemblModelMapping.txt

Zebrafish Information Network (ZFIN) -- Mapping file: https://zfin.org/downloads/ensembl_1_to_1.txt

Saccharomyces Genome Database (SGD) -- Mapping file: not programmatically accessible, so it was downloaded manually and the mapping information is stored in src/main/resources/sgd_ids.txt

Running Orthopairs

Orthopairs is executed through the bash script runOrthopairs.sh. To run it:

    bash runOrthopairs.sh

Checking Orthopairs output

Two files should be produced for each species, in the directory named after the release number. For example, for release 70 you would expect to find two files for "mmus" (Mouse): 70/mmus_gene_protein_mapping.txt and 70/hsap_mmus_mapping.txt.

Compare the line counts of the files to the same ones produced during the previous release. If they are similar, Orthopairs was likely run successfully.
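
The line-count comparison can be scripted. The following is a minimal sketch, assuming the previous and current release directories sit side by side; the class name, the argument order and the 10% warning threshold are hypothetical choices for illustration, not part of Orthopairs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper for comparing Orthopairs output between two releases.
final class OrthopairsOutputCheck {

    /** The two file names expected for a species code such as "mmus". */
    static String[] expectedFiles(String species) {
        return new String[] {
            species + "_gene_protein_mapping.txt",
            "hsap_" + species + "_mapping.txt"
        };
    }

    static long countLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        }
    }

    /** Absolute change in line count, as a fraction of the previous count. */
    static double relativeChange(long previous, long current) {
        if (previous == 0) {
            return current == 0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return Math.abs(current - previous) / (double) previous;
    }

    // Usage (assumed): java OrthopairsOutputCheck.java 69 70 mmus
    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("Usage: <previousReleaseDir> <currentReleaseDir> <species>");
            return;
        }
        double threshold = 0.10; // arbitrary illustrative cutoff
        for (String name : expectedFiles(args[2])) {
            Path previous = Paths.get(args[0], name);
            Path current = Paths.get(args[1], name);
            if (!Files.exists(previous) || !Files.exists(current)) {
                System.out.println(name + ": missing in one of the releases");
                continue;
            }
            long before = countLines(previous);
            long after = countLines(current);
            String verdict = relativeChange(before, after) <= threshold ? "similar" : "CHECK";
            System.out.println(name + ": " + before + " -> " + after + " lines (" + verdict + ")");
        }
    }
}
```

A similar line count is only a sanity check; it does not confirm that the individual mappings are correct.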