/rain

Pipeline generating the RAIN interaction files


RAIN - RNA Association and Interactions Networks

This document describes the RAIN pipeline that generates the interaction files available on the RAIN download page. The download files are then loaded into an internal MySQL database that backs the RAIN web interface. Since that step is specific to our local setup, it is not described here.

To create the interaction files, follow the instructions below.

Checkout repository

Check out the repository using git. Via ssh:

git clone git@github.com:JungeAlexander/rain.git
cd rain

Or, equivalently, via https:

git clone https://github.com/JungeAlexander/rain.git
cd rain

Create and activate virtual environment

Create a virtual environment in which to run the RAIN pipeline, using the package and environment manager conda. conda can be obtained with either Anaconda or Miniconda, as described here. We ran and tested the pipeline using the Python 3.5 version of Anaconda 4.2.0, which can be obtained here.

After downloading and installing conda, create the Python virtual environment for RAIN. (Preferred) On a 64-bit Linux OS:

conda create --name rain --file explicit-spec-file.txt

(Alternative) If the above does not apply to your system:

conda env create -f environment.yml

Then activate the virtual environment before running the pipeline:

source activate rain

Testing the RAIN virtual environment

The RAIN pipeline includes a test suite that ensures basic functionality of the RAIN environment prior to running the complete RAIN pipeline generating the interaction files. The test suite can be executed as follows:

python -m unittest discover example/

If all tests pass, the following will be printed at the end of the command execution:

Ran 3 tests in 0.306s
OK

An explanation of the input and output files used by the test suite:

Test suite input files

  • example/experiments_example.tsv
  • example/predictions_example.tsv
  • example/textmining_example.tsv

These input files to the RAIN test suite resemble the interaction files for different evidence channels which are produced by RAIN and offered on the RAIN download page. Each file has the following tab-separated columns (no header):

Organism (NCBI taxonomy id.); ID1; ID2; Directed (true/false); Evidence; Score; Source; URL; Comment
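
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not code from the RAIN repository: the names Interaction and parse_interaction_line are hypothetical, and it assumes that the organism column is a plain integer, that the Directed column holds the literal text true or false, and that the Score column parses as a floating point number.

```python
from collections import namedtuple

# Hypothetical record type mirroring the nine columns listed above.
Interaction = namedtuple(
    "Interaction",
    ["organism", "id1", "id2", "directed", "evidence",
     "score", "source", "url", "comment"],
)


def parse_interaction_line(line):
    """Parse one tab-separated line (no header) into an Interaction."""
    # Strip only the line ending so that an empty trailing column survives.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(
            "expected 9 tab-separated columns, got {}".format(len(fields))
        )
    organism, id1, id2, directed, evidence, score, source, url, comment = fields
    return Interaction(
        organism=int(organism),                        # NCBI taxonomy id
        id1=id1,
        id2=id2,
        directed=directed.strip().lower() == "true",   # assumed true/false text
        evidence=evidence,
        score=float(score),                            # assumed numeric score
        source=source,
        url=url,
        comment=comment,
    )
```

For example, passing each line of example/experiments_example.tsv to parse_interaction_line would yield one record per interaction, provided the file matches the assumptions stated above.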

Test suite expected output file

  • example/integrated_score_expected.tsv

This file is the output the test suite is expected to generate and contains RAIN combined scores across all evidence channels. The file contains the following tab-separated columns (no header):

Organism (NCBI taxonomy id.); ID1; ID2; Evidence; Combined confidence score

Running RAIN pipeline

The RAIN pipeline generating the interaction files can be started by executing:

./make_master_files.sh

Pipeline logging output is automatically written to make_master_files.log.

If the pipeline finishes successfully, the message "RAIN pipeline ran successfully." is printed at the end. If not, please check the log file for error messages. At the end of the pipeline, a small set of unit tests is run to make sure the interaction files exist, are in the expected format and have the expected size.

The complete pipeline runs for about 8 hours on a single core and uses about 32 GB of RAM. By far the most expensive step in the RAIN pipeline is benchmarking and integrating the miRNA prediction tools into the RAIN prediction evidence channel. If runtime or memory usage of the pipeline is an issue, consider changing the following line in ./make_master_files.sh from

USE_PRECOMPUTED_PREDICTION_CHANNEL=false

to

USE_PRECOMPUTED_PREDICTION_CHANNEL=true

This will download a pre-computed version of the miRNA-target prediction channel from the RAIN website.
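
If you prefer to script this change rather than edit the file by hand, it could be done along the following lines. This is a sketch only: the helper set_precomputed_flag is hypothetical, and it assumes the variable is assigned on a line of its own with the value true or false, as shown above.

```python
import re

FLAG = "USE_PRECOMPUTED_PREDICTION_CHANNEL"

# Match the assignment on its own line; [ \t]* keeps the match from
# swallowing line breaks before or after the assignment.
_FLAG_LINE = re.compile(
    r"^([ \t]*)" + FLAG + r"=(?:true|false)[ \t]*$", re.MULTILINE
)


def set_precomputed_flag(script_text, enabled):
    """Return script_text with the flag set to 'true' or 'false'."""
    value = "true" if enabled else "false"
    new_text, count = _FLAG_LINE.subn(
        lambda match: match.group(1) + FLAG + "=" + value, script_text
    )
    if count == 0:
        # Assumption violated: the script does not contain the expected line.
        raise ValueError("no {} assignment found".format(FLAG))
    return new_text


# In-memory demonstration on a made-up script body.
example = "#!/bin/bash\n" + FLAG + "=false\necho run\n"
print(set_precomputed_flag(example, True))
```

To apply it, read ./make_master_files.sh into a string, pass it through set_precomputed_flag and write the result back. The function leaves all other lines untouched.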

Pipeline output files

Output files are located in the folder download_files/ and contain interactions for different organisms split into one file per evidence channel and a file containing combined scores.

Evidence | Homo sapiens | Mus musculus | Rattus norvegicus | Saccharomyces cerevisiae | All organisms
--- | --- | --- | --- | --- | ---
Curated | 9606.v1.database.tsv.gz | N/A | N/A | N/A | v1.database.tsv.gz
Experiments | 9606.v1.experiments.tsv.gz | 10090.v1.experiments.tsv.gz | 10116.v1.experiments.tsv.gz | 4932.v1.experiments.tsv.gz | v1.experiments.tsv.gz
Text mining | 9606.v1.textmining.tsv.gz | 10090.v1.textmining.tsv.gz | 10116.v1.textmining.tsv.gz | 4932.v1.textmining.tsv.gz | v1.textmining.tsv.gz
Predictions | 9606.v1.predictions.tsv.gz | 10090.v1.predictions.tsv.gz | 10116.v1.predictions.tsv.gz | 4932.v1.predictions.tsv.gz | v1.predictions.tsv.gz
Combined | 9606.v1.combined.tsv.gz | 10090.v1.combined.tsv.gz | 10116.v1.combined.tsv.gz | 4932.v1.combined.tsv.gz | v1.combined.tsv.gz
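
The file names in the table follow a regular pattern (NCBI taxonomy id, version, evidence channel), which can be expressed in a small helper. The sketch below is illustrative only: download_file_name is a hypothetical function, and the two dictionaries simply restate the table.

```python
# NCBI taxonomy ids of the organisms listed in the table.
TAXONOMY_IDS = {
    "Homo sapiens": 9606,
    "Mus musculus": 10090,
    "Rattus norvegicus": 10116,
    "Saccharomyces cerevisiae": 4932,
}

# File name component used for each evidence channel in the table.
CHANNEL_SUFFIXES = {
    "Curated": "database",
    "Experiments": "experiments",
    "Text mining": "textmining",
    "Predictions": "predictions",
    "Combined": "combined",
}


def download_file_name(evidence, organism=None):
    """Return the file name for an evidence channel.

    organism=None selects the all-organisms file. None is returned for
    combinations marked N/A in the table.
    """
    suffix = CHANNEL_SUFFIXES[evidence]
    if organism is None:
        return "v1.{}.tsv.gz".format(suffix)
    if evidence == "Curated" and organism != "Homo sapiens":
        return None  # curated interactions are listed for Homo sapiens only
    return "{}.v1.{}.tsv.gz".format(TAXONOMY_IDS[organism], suffix)


print(download_file_name("Text mining", "Mus musculus"))
```

The returned names are relative to the download_files/ folder.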

Note that the files above are not filtered based on a confidence score cutoff. In contrast, the RAIN web interface only contains interactions with a confidence score greater than 0.15.

The Croft et al. gold standard file is v1.croft.tsv.gz, while the complete gold standard, extended with highly reliable interactions from miRTarBase/NPInter, is located at v1.extended.gold.standard.tsv.gz.

Furthermore, v1.rna.aliases.search.interface.txt.gz contains all aliases accepted by the RAIN search interface. v1.rna.aliases.complete.txt.gz contains aliases of instances present in the interaction files above. v1.rna.aliases.universe.txt.gz contains all ncRNA aliases we gathered while constructing RAIN; this file also contains entities with no known interactions in RAIN.
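
To reproduce the web interface cutoff on a downloaded file, a filter along the following lines could be used. This is a sketch under assumptions: the function names are hypothetical, and the score is assumed to sit in the fifth tab-separated column, as in the combined score layout described for the test suite. Check the RAIN download page for the actual column layout of each file and adjust score_column accordingly.

```python
import gzip


def filter_by_score(lines, cutoff=0.15, score_column=4):
    """Yield only the lines whose score is strictly greater than cutoff.

    score_column is the zero-based index of the score column (assumed to
    be the fifth column here).
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if float(fields[score_column]) > cutoff:
            yield line


def filter_gzipped_file(in_path, out_path, cutoff=0.15, score_column=4):
    """Write the filtered lines of a gzipped TSV file to a new gzipped file."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in filter_by_score(src, cutoff, score_column):
            dst.write(line)
```

For instance, filter_gzipped_file("download_files/9606.v1.combined.tsv.gz", "9606.filtered.tsv.gz") would keep only interactions scoring above 0.15, where the output file name is an arbitrary choice.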

The format of all files mentioned above is further described on the RAIN download page.

Contributors

RAIN is being developed at CBS, CPR, RTH, SIB, DTU, KU, and UZH.

Contributors:

  • Alexander Junge
  • Jan C. Refsgaard
  • Christian Garde
  • Xiaoyong Pan
  • Alberto Santos
  • Ferhat Alkan
  • Christian Anthon
  • Christian von Mering
  • Christopher T. Workman
  • Lars Juhl Jensen
  • Jan Gorodkin