/release-download-directory

Data-release-pipeline: Generation of files for the Reactome website download directory

Download Directory

The Download Directory step generates a number of files that can be downloaded from Reactome's download page. It also generates an archive file during the CreateReleaseTarball step that acts as a snapshot of the Release repository and some services on the release server. This module has been rewritten from Perl to Java.

The steps of Download Directory and files produced are:

Files no longer generated by download directory include:

  • All SBML and SBGN files (generated elsewhere)
  • diagrams.pdf.zip and diagrams.png.zip
  • curated_complex.txt and curated_complexes.stid.txt
  • compiled_pathway_images files
  • st_id_2_uniprot.txt
  • TheReactomeBook.pdf.zip and TheReactomeBook.rtf.zip
  • reactome.tar.gz

Preparing and running Download Directory

To run the Download Directory step, a few things need to be taken into account.

Local installation of the Pathway-Exchange dependency

Download Directory depends on a local installation of Pathway-Exchange. This requires generating a new ant build using PathwayExchangeJar.xml. It can be built using most IDEs (such as Eclipse or IntelliJ) or using the following command:
ant -buildfile ant/PathwayExchangeJar.xml

This makes sure that the Data Model is up to date. If not updated, an unexpected data structure could cause the BioPAX, GSEAOutput or CreateReactome2BioSystems steps to fail.

Once the jar file has been built, it should appear one level above the Pathway-Exchange directory, in the RESTfulAPI/web/WEB-INF/lib/ folder. The jar file then needs to be installed locally in accordance with the POM file in the Download Directory repo.

Here is a sample snippet from the pom.xml file in Download Directory that contains the Pathway-Exchange information we care about:

<!-- Locally installed pathway exchange jar -->
  <dependency>
    <groupId>org.reactome.pathway-exchange</groupId>
    <artifactId>pathwayExchange</artifactId>
    <version>1.0.1</version>
  </dependency>

Based on the above POM file snippet, to locally install the pathwayExchange.jar we would use the following command:

mvn install:install-file -Dfile=pathwayExchange.jar -DgroupId=org.reactome.pathway-exchange -DartifactId=pathwayExchange -Dversion=1.0.1 -Dpackaging=jar

  • Make sure the groupId, artifactId, and version arguments match the same fields in the POM

If the command succeeds, an up-to-date Pathway-Exchange jar file is installed locally and will be used during Download Directory.
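
One way to keep the install command in sync with the POM is to derive it from the dependency element itself. The following is a minimal sketch under that assumption; the class and method names are hypothetical and are not part of the Download Directory code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of Download Directory: derives the
// install:install-file command from a POM <dependency> element.
class PomInstallCommandSketch {

    static String installCommand(String dependencyXml, String jarPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    dependencyXml.getBytes(StandardCharsets.UTF_8)));
            return "mvn install:install-file"
                + " -Dfile=" + jarPath
                + " -DgroupId=" + text(doc, "groupId")
                + " -DartifactId=" + text(doc, "artifactId")
                + " -Dversion=" + text(doc, "version")
                + " -Dpackaging=jar";
        } catch (Exception e) {
            // Covers malformed XML as well as a missing coordinate element.
            throw new IllegalArgumentException("Could not read dependency XML", e);
        }
    }

    // Returns the trimmed text of the first element with the given tag name.
    private static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String dependency =
            "<dependency>"
            + "<groupId>org.reactome.pathway-exchange</groupId>"
            + "<artifactId>pathwayExchange</artifactId>"
            + "<version>1.0.1</version>"
            + "</dependency>";
        System.out.println(installCommand(dependency, "pathwayExchange.jar"));
    }
}
```

Because the coordinates are read from the same XML that Maven resolves, a version bump in the POM cannot silently drift from the installed jar.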

Setting config.properties

Next the config.properties file must be set or updated in the src/main/resources/ folder. Below is a sample file:

## Sample config.properties file for Download Directory
username=mySQLUsername
password=mySQLPassword
database=release_current
host=localhost
port=3306
release=releaseNumber
## Filepaths important to the Download Directory step
absoluteReleaseDirectoryPath=/usr/local/gkb/scripts/release/
releaseDownloadDirectoryPath=/usr/local/gkb/scripts/release/download_directory/
speciesConfigPath=src/main/resources/Species.json
stepsToRunConfigPath=src/main/resources/stepsToRun.config
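
Before a run, it can help to confirm that every key from the sample file is present. The sketch below uses the standard java.util.Properties loader; the class name and the list of expected keys (taken from the sample above) are assumptions for illustration and may not match what the real code requires.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check for config.properties.
class ConfigCheckSketch {

    // Keys copied from the sample file; the real step may need more or fewer.
    static final String[] EXPECTED_KEYS = {
        "username", "password", "database", "host", "port", "release",
        "absoluteReleaseDirectoryPath", "releaseDownloadDirectoryPath",
        "speciesConfigPath", "stepsToRunConfigPath"
    };

    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader); // lines starting with '#' are treated as comments
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Lists every expected key that is absent or blank.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : EXPECTED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Properties props = load(new StringReader("host=localhost\nport=3306\n"));
        System.out.println("Missing keys: " + missingKeys(props));
    }
}
```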

Running the program

Now that the Pathway-Exchange project is accessible and the config.properties file set, the step can be run using the script runner runDownloadDirectory.sh.

Note: The ReactomeBook and CreateReleaseTarball steps are deprecated. The ReactomeBook is generated from the event-pdf project and the release tarball has been retired. The legacy implementations still use the old Perl scripts found here. This means that, to generate their output files, Download Directory must be run either in a Docker container with the Release project usable (see the release-container project) or on the release server. Download Directory can still be run anywhere, but if these steps aren't commented out in stepsToRun.config (see below), they will report errors.

If the DownloadDirectory step is run from the release.pl wrapper on the release server, the generated files appear in /usr/local/reactomes/Reactome/production/Website/static/download/67/ (for release 67). When the jar file is run directly, the files appear in /usr/local/gkb/scripts/release/download-directory/67/.

Running specific modules of Download Directory

Specific files can be generated via the stepsToRun.config file found in the src/main/resources/ folder. This file contains a list of all steps that will be run during the Download Directory process. Commenting out a step in this file excludes it from the run. Sample below:

# This file is used to specify which steps in DownloadDirectory to execute.
# Comment out any steps that don't need to be run.
DatabaseDumps
#BioPAX2
BioPAX3
GSEAOutput
FetchTestReactomeOntologyFiles
#PathwaySummationMappingFile
MapOldStableIds
gene_association.reactome
models2pathways.tsv
CreateReactome2BioSystems

In this example, the BioPAX2 and PathwaySummationMappingFile steps will not be run.
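
The selection rule described above (skip blank lines and lines starting with #, keep everything else) can be sketched as follows. This is an illustration of the file convention only; the class and method names are hypothetical and the actual parser in Download Directory may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for the stepsToRun.config convention.
class StepsToRunSketch {

    // Returns the step names that are not commented out, in file order.
    static List<String> activeSteps(List<String> lines) {
        List<String> steps = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank line, comment, or disabled step
            }
            steps.add(trimmed);
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "# Comment out any steps that don't need to be run.",
            "DatabaseDumps",
            "#BioPAX2",
            "BioPAX3");
        System.out.println(activeSteps(sample));
    }
}
```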

Running specific modules in Jenkins

When running Download Directory in Jenkins, the code is cloned from GitHub each time, so updating the stepsToRun file locally is not possible. Instead, upload the modified stepsToRun file as a Jenkins 'credential' before re-running Download Directory. This is explained step by step below:

  1. Modify the stepsToRun.config file found in src/main/resources folder so that only the step(s) you want to run are not commented out.
  2. In Jenkins, navigate to Releases -> releaseNumber (eg: 70).
  3. On the left-hand side, select Credentials. You should see a table of different credentials used by Jenkins. Look for the one with the ID 'stepsToRun' and select 'stepsToRun.config' under the Name column.
  4. On the left-hand side of this page, click Update, and then click the check-box for 'Upload stepsToRun.config' again.
  5. Upload the stepsToRun file you modified in step 1. Save and re-run Download Directory -- only the steps you specified will be run.

Verifying Download Directory Results

This section covers the files produced by each step in Download Directory and how to verify that they were produced correctly. Often, comparing the output against the files produced in the previous release is the way to go, but where needed this guide provides additional suggestions for checking the output.

Note: During Download Directory, the output files are temporarily held in a folder corresponding to the current release. For release 67, the output files would appear in release-download-directory/download-directory/67/.

DatabaseDumps

This step generates two MySQL dump files from the stable_identifiers and release_current databases. The files produced are gk_stable_ids.sql.gz and gk_current.sql.gz, respectively. They are placed in the release-download-directory/download-directory/67/databases/ folder (for release 67). Comparing the size of these files to those from the previous release is sufficient for verifying the success of this step.
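
The size comparison can be automated along these lines. This is a sketch only: the class name is hypothetical and the 10% tolerance is an arbitrary assumption, not a threshold used by the release process.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical check: is the current dump about as large as the previous one?
class DumpSizeCheckSketch {

    // True if the relative size change is within the given tolerance (0.10 = 10%).
    static boolean withinTolerance(long previousBytes, long currentBytes, double tolerance) {
        if (previousBytes <= 0) {
            return false; // nothing sensible to compare against
        }
        double change = Math.abs(currentBytes - previousBytes) / (double) previousBytes;
        return change <= tolerance;
    }

    // Usage: java DumpSizeCheckSketch <previous dump file> <current dump file>
    public static void main(String[] args) throws IOException {
        long previous = Files.size(Paths.get(args[0]));
        long current = Files.size(Paths.get(args[1]));
        System.out.println(withinTolerance(previous, current, 0.10)
            ? "Size is within 10% of the previous release"
            : "Size differs by more than 10%; inspect the dump");
    }
}
```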

BioPAX

This step generates BioPAX level 2 and level 3 files for each species in the release_current database. Additional information on BioPAX can be found at its website. It makes use of the Pathway-Exchange jar that should have been locally installed during the preparation step of Download Directory.

Note: Due to the dependency on a local installation of Pathway-Exchange, this is also the most error-prone step of Download Directory. Any attribute or instance errors that result from BioPAX might mean that this installation will need to be updated to the most recent version. See above for instructions on installing/updating the Pathway-Exchange module.

Each zip file produced should contain a number of files (owl or validation xml) corresponding to the species found in the Species.json file.

biopax2.zip: This zip file should contain BioPAX level 2 files for each species in Species.json. Inspect a few of the files for the string biopax-level2 near the beginning. Next, look at the corresponding validation files (found in biopax2_validator.zip) (see below).

biopax.zip: This zip file should contain BioPAX level 3 files for each species in Species.json. Inspect a few of the files for the string biopax-level3 near the beginning. Next, look at the corresponding validation files (found in biopax_validator.zip) (see below).
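
The "inspect a few of the files" check can be scripted once the owl files have been extracted from the zip. The sketch below scans the first lines of a file for the level marker; the class name and the choice of how many lines count as "near the beginning" are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical check for the biopax-level marker near the top of an owl file.
class BioPaxLevelCheckSketch {

    // True if "biopax-level<level>" occurs within the first maxLines lines.
    static boolean mentionsLevelNearTop(Reader source, int level, int maxLines) {
        String marker = "biopax-level" + level;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            for (int i = 0; i < maxLines && (line = reader.readLine()) != null; i++) {
                if (line.contains(marker)) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative header line, not copied from a real release file.
        String header = "<rdf:RDF xmlns:bp=\"biopax-level3.owl#\">\n<bp:Pathway/>\n";
        System.out.println(mentionsLevelNearTop(new StringReader(header), 3, 20));
    }
}
```

For a real file, pass a reader from java.nio.file.Files.newBufferedReader instead of the StringReader.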

biopax2_validator.zip & biopax_validator.zip: These zip files should contain a validation.xml file for each species that has an owl file. These validation files can be quite large since they report problems at both the warning and error levels, and typically there have been many warnings. Checking the <validation description> tag in the validation file will list the number of problems found, including any errors that might have come up. Any errors found will need to be investigated. Additionally, comparing the number of warnings between releases is another way to ensure that the BioPAX process ran successfully.

Finally, a BioPAX validator tool exists online. The owl files can be run through here as well to check file validity.

GSEAOutput

This step uses the ReactomeToMsigDBExport method found in Pathway-Exchange. Ensure that the jar has been installed locally, as described in an earlier section of this document. It takes all Human Pathway instances in the release_current database and converts the data to MSigDB format, which can be used in Gene Set Enrichment Analysis (GSEA). More information can be found at the GSEA website.

The Reactome.gmt.zip file produced is tab-separated, with each row having varying numbers of columns. The first few columns for each row are a Pathway instance's displayName and stableIdentifier values, and the string Reactome Pathway. This third column is added manually after initially generating the Reactome.gmt file. The remaining columns are all gene names that are associated with the Pathway instance (needs confirmation). The GSEAOutput step attempts to add a line for each Human Pathway, but some entries are excluded due to missing attributes in the instance, meaning that the file should have nearly as many lines as there are Human Pathway instances in the release_current database.
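
The row layout described above can be illustrated with a small parser. This is a sketch under the stated column order (displayName, stableIdentifier, the string Reactome Pathway, then gene names); the class name is hypothetical and the sample row uses placeholder values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of one row in the Reactome.gmt file.
class GmtLineSketch {
    final String displayName;
    final String stableId;
    final String source;
    final List<String> genes;

    private GmtLineSketch(String displayName, String stableId,
                          String source, List<String> genes) {
        this.displayName = displayName;
        this.stableId = stableId;
        this.source = source;
        this.genes = genes;
    }

    // Splits a tab-separated row into its three fixed columns and the gene list.
    static GmtLineSketch parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 3) {
            throw new IllegalArgumentException(
                "Expected at least 3 columns, found " + cols.length);
        }
        return new GmtLineSketch(cols[0], cols[1], cols[2],
            Arrays.asList(cols).subList(3, cols.length));
    }

    public static void main(String[] args) {
        // Placeholder row, not real release data.
        GmtLineSketch row = parse(
            "Example pathway\tR-HSA-1234567\tReactome Pathway\tGENE1\tGENE2");
        System.out.println(row.stableId + " has " + row.genes.size() + " genes");
    }
}
```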

FetchTestReactomeOntologyFiles

This step produces 3 different files: reactome_data_model.pprj, reactome_data_model.pont, and reactome_data_model.pins. All are parsed from the ontology attribute in the Ontology table in the release_current database. The value in Ontology.ontology is a blob that contains all 3 files. The contents of each file are parsed out of the blob during the FetchTestReactomeOntologyFiles step. These files are associated with Protégé 2.0 (website) and can be used with that software. Additional information about each file type can be found here.

Compare each file with its equivalent from the previous release. The beginning and end of each file should have the same formatting between them, although the content may differ.

PathwaySummationMappingFile

This step creates a tab-separated file, pathway2summation.txt. The file contains information on all Human Pathways in the release_current database. The 3 columns of the file are stableIdentifier, name, and summation. The file is populated from all Human Pathway instances.

The file should have the same number of lines as there are Human Pathway instances in the release_current database.

MapOldStableIds

This step will match old stableIdentifiers to ones in the new format. The file contains two tab-separated columns of stable IDs in the new (eg: R-HSA-1234567) and old (eg: REACT_98765) formats. The file should include stableIdentifiers for all species in Reactome, and should have approximately the same number of lines as stableIdentifier instances in the release_current database.
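
A quick format check on the mapping rows might look like the sketch below. The two regular expressions are inferred only from the examples above (R-HSA-1234567 and REACT_98765), and the sketch assumes one identifier per column with the new format first; real identifiers or rows may not follow these assumptions. The class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical row check for the old/new stable identifier mapping file.
class StableIdFormatSketch {

    // Patterns inferred from the examples in this README, not from a specification.
    private static final Pattern NEW_FORMAT = Pattern.compile("R-[A-Z]{3}-\\d+");
    private static final Pattern OLD_FORMAT = Pattern.compile("REACT_\\d+");

    // True if the row has exactly two columns: a new-format ID, then an old-format ID.
    static boolean isWellFormedRow(String line) {
        String[] cols = line.split("\t", -1);
        return cols.length == 2
            && NEW_FORMAT.matcher(cols[0]).matches()
            && OLD_FORMAT.matcher(cols[1]).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedRow("R-HSA-1234567\tREACT_98765"));
        System.out.println(isWellFormedRow("REACT_98765\tR-HSA-1234567"));
    }
}
```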

GenerateGOAnnotationFile

This step generates the 'gene_association.reactome' GO Annotation file. Information about the file format can be found here. This step goes through all curated ReactionlikeEvents in the database and generates GOA lines for a variety of instances pertaining to all 3 of the Gene Ontology annotation types: Cellular Component, Molecular Function and Biological Process.

Further information on the details of the GenerateGOAnnotationFile step can be found here.

To check the file, compare it with the previous release. Since the number of curated proteins does not change drastically between releases, the files should contain a similar number of annotations.
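
To compare annotation counts between releases, the rows can be tallied per GO aspect. The sketch below assumes the file follows the standard GO Annotation File (GAF) layout, in which comment lines start with ! and the ninth tab-separated column holds the aspect code (P, F or C); whether gene_association.reactome matches this layout exactly should be confirmed against the format documentation. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical tally of annotation rows per GO aspect, assuming GAF layout.
class GoaAspectCountSketch {

    static Map<String, Integer> countByAspect(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("!")) {
                continue; // GAF header or comment line
            }
            String[] cols = line.split("\t", -1);
            if (cols.length < 9) {
                continue; // too short to carry an aspect column
            }
            counts.merge(cols[8], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder rows, not real annotations.
        List<String> sample = Arrays.asList(
            "!comment line",
            "DB\tID\tSYM\t\tGO:0000000\tREF\tTAS\t\tC",
            "DB\tID\tSYM\t\tGO:0000000\tREF\tTAS\t\tP");
        System.out.println(countByAspect(sample));
    }
}
```

Running this on the current and previous files and comparing the three totals gives a slightly finer check than a raw line count.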

models2pathways.tsv

This step copies the models2pathways.tsv file that is produced during the BioModels step. Comparing the file to previous releases is sufficient for this step.

Protege exporter

This step runs the protege export code in `GKB::WebUtils`. The Perl code creates an archive file containing a pins, pont, and pprj file for a given pathway. This code is designed to export only TOP-LEVEL pathways, defined as pathways associated with the FrontPageItem in the database.

This step takes the following configuration options, specified in config.properties:

  • protegeexporter.pathToWrapperScript - This is the absolute path to the directory containing the Perl wrapper script run_protege_exporter.pl. The script should be located in src/main/resources for this project.
  • protegeexporter.parallelism - The number of concurrent protege export jobs to run. Try to keep this smaller than the number of available cores (MySQL and your operating system should get 1 core each, at least). If you do not specify anything for this value, then parallelism will be the default value used by the ForkJoinPool class, which is usually the number of cores minus 1.
  • protegeexporter.extraIncludes - If you need to specify additional include paths for Perl, use this option. This should be a comma-separated string formatted as: -I/alt/path/to/perl/libs,-I/other/alt/path/to/libs.
  • protegeexporter.filterIds - If you want to filter to only export specific pathways, specify them here as a comma-separated list of DB_IDs.
  • protegeexporter.filterSpecies - If you want to filter to only export pathways of a specific species, you can specify a comma-separated list here. Normally, you would just set this to Homo sapiens.
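
How these options might be read from config.properties is sketched below. The property names come from the list above; the class and method names are hypothetical, and the fallback assumes that "the default value used by the ForkJoinPool class" refers to the common pool's parallelism, which may not be how the actual step resolves it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ForkJoinPool;

// Hypothetical reader for the protegeexporter.* options.
class ProtegeExporterOptionsSketch {

    // Uses the configured value, or falls back to the common pool's parallelism.
    static int parallelism(Properties props) {
        String value = props.getProperty("protegeexporter.parallelism");
        if (value == null || value.trim().isEmpty()) {
            return Math.max(1, ForkJoinPool.getCommonPoolParallelism());
        }
        return Integer.parseInt(value.trim());
    }

    // Splits a comma-separated option into trimmed, non-empty items.
    static List<String> splitCommaList(String value) {
        List<String> items = new ArrayList<>();
        if (value == null) {
            return items;
        }
        for (String item : value.split(",")) {
            String trimmed = item.trim();
            if (!trimmed.isEmpty()) {
                items.add(trimmed);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("protegeexporter.extraIncludes",
            "-I/alt/path/to/perl/libs,-I/other/alt/path/to/libs");
        System.out.println("parallelism = " + parallelism(props));
        System.out.println(splitCommaList(
            props.getProperty("protegeexporter.extraIncludes")));
    }
}
```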

CreateReactome2BioSystems

This step creates a zip file containing an NCBI BioSystems-formatted xml file for each of Reactome's primary model organisms. More information on NCBI BioSystems is found at the website.

This module uses the ReactomeToBioSystemsConverter method in Pathway-Exchange. Ensure that the jar has been installed locally, as described in an earlier section of this document.

Compare the files produced with those from an earlier release to verify that the process ran successfully.

Internal Reactome files

The steps in this section, and the files they produce, exist solely for Reactome internal QA purposes.

HumanPathwaysWithDiagrams

This step creates a tab-delimited text file recording:

  • pathway database identifier
  • pathway name
  • disease status (i.e. is it a disease pathway - true or false)

for all human pathway instances which have their own pathway diagram (i.e. not a subpathway in a larger pathway diagram and not a diagram composed solely of subpathway nodes).
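
A row of this file could be read as in the sketch below. It assumes exactly three tab-separated columns in the order listed above, no header row, and a disease status written literally as true or false; none of these details are confirmed here, and the class name is hypothetical.

```java
// Hypothetical model of one row in the HumanPathwaysWithDiagrams output.
class DiagramPathwayRowSketch {
    final long dbId;
    final String name;
    final boolean disease;

    private DiagramPathwayRowSketch(long dbId, String name, boolean disease) {
        this.dbId = dbId;
        this.name = name;
        this.disease = disease;
    }

    // Parses "<database identifier>\t<pathway name>\t<true|false>".
    static DiagramPathwayRowSketch parse(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 3) {
            throw new IllegalArgumentException(
                "Expected 3 columns, found " + cols.length);
        }
        return new DiagramPathwayRowSketch(
            Long.parseLong(cols[0].trim()),
            cols[1],
            Boolean.parseBoolean(cols[2].trim()));
    }

    public static void main(String[] args) {
        // Placeholder row, not real release data.
        DiagramPathwayRowSketch row = parse("1234567\tExample pathway\tfalse");
        System.out.println(row.dbId + " disease=" + row.disease);
    }
}
```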