tRNA gene regulation downstream analysis

This repository contains the downstream analysis pipeline of the Schmitt, Rudolph et al. paper. The following sections explain how to run the pipeline or use its individual components. Enjoy!

Please consult the online methods for more details.

Update

A previous version of the code contained a small error in the plotting code for the PCA summary plots. A corrected version of the code has been committed. The update does not affect the data or the interpretation of the results in our manuscript. The changes are highlighted in commit 4860f48.

Dependencies

This pipeline requires Python 2.7 and R 3.0. Dependencies on R packages are listed in the DEPENDS file. Further dependencies are:

Meme (4.9.0)
klmr/rcane (included as submodule)
BioPython (>= 1.59)
klmr/pygoo (included as submodule)

After cloning the repository, please run

git submodule init
git submodule update

to initialise the submodules.
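
Before running anything, it can help to confirm that the external tools listed above are available. The following is a minimal sketch, not part of the pipeline: missing_tools is a hypothetical helper, and the executable names (python, Rscript, meme) are assumptions that may differ on your system. It does not check version numbers.

```shell
# Print each command from the argument list that is not found on PATH.
# Returns 0 if all commands are present, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Executable names are assumptions; adjust them to the local installation.
if ! missing_tools git python Rscript meme; then
    echo "The tools listed above were not found on PATH." >&2
fi
```

The R packages from the DEPENDS file and BioPython are not covered by this check and need to be verified separately.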

How to run

Before running the pipeline, please ensure that all the necessary upstream data is inside the common/data directory. This data can be obtained from the supplementary material of the publication.
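
A quick sanity check for this step might look as follows. This is a sketch under the assumption that an empty or missing common/data directory is the most likely mistake; has_data is a hypothetical helper and does not verify that the right files are present.

```shell
# Succeed if the given directory exists and contains at least one entry.
has_data() {
    dir=$1
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Run from within the base directory of the repository.
if has_data common/data; then
    echo "common/data looks populated."
else
    echo "common/data is missing or empty; copy the supplementary data there first." >&2
fi
```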

To re-generate all results, run the command

Rscript --vanilla common/scripts/generate-all.R

from within the base directory. This will take a while. Individual results can be generated by running the respective script inside either the rna or chip folder. For instance, to generate just the PCA plot for the ChIP data, execute the following:

cd chip
Rscript --vanilla scripts/pca.R

Result files

This section gives an overview of the result files and folders, with instructions on how to generate and interpret each file.

Outline

This is what the directory tree of the project folder looks like:

chip
- plots
- results
  - active-genes
  - compensation
  - de-acc
  - de-genes
  - de-type
  - meme
  - usage-sampling
- scripts
common
- cache
- data
  - meme
- scripts
  - rcane
rna
- plots
  - correlation
  - de
  - distribution
  - usage
  - usage-sampling
- results
  - de
  - usage-sampling
- scripts
  - goo

The */scripts directories contain the source code of the pipeline.

The common/data directory contains the source data files from the supplementary materials.

The common/cache directory contains cache files which speed up re-generation of the results by caching some intermediate results. These files are not checked for staleness – changing parameters or input data may require manually deleting these files, otherwise results may not correctly update.
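
Since stale cache files have to be removed by hand, a small helper can make this less error-prone. The sketch below is an assumption about how one might do it, not something the pipeline provides: clear_cache is a hypothetical function that deletes all non-hidden regular files below the cache directory and leaves the directory structure and dotfiles (for example a .gitignore, if one exists) in place.

```shell
# Delete every non-hidden regular file below the given cache directory.
# Directories and dotfiles are kept. Defaults to common/cache.
clear_cache() {
    cache_dir=${1:-common/cache}
    # Nothing to do if the directory does not exist.
    [ -d "$cache_dir" ] || return 0
    find "$cache_dir" -type f ! -name '.*' -exec rm -f {} +
}

# Usage, from within the base directory:
#   clear_cache common/cache
```

After clearing the cache, the next run of the pipeline recomputes the intermediate results, which takes correspondingly longer.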

The {chip,rna}/plots directories contain result plots. Plots which contrast tRNA ChIP data with RNA-seq derived data are found in the chip directory. The rna directory contains only results which are based solely on the RNA-seq data and not on ChIP-seq data.

The {chip,rna}/results directories contain any additional results which are not in the form of plots. These are tables and raw text files, as well as HTML files for the Meme analysis.

The following sections give a breakdown of the individual directories.

Colocalization

Figures are generated by chip/scripts/colocalization.R.

Each PDF file corresponds to one colocalisation test. The parameters for the test are given in the title of the plot. The p-value of the significance test is shown as a label in the plot.

Compensation

Figures and tables are generated by chip/scripts/compensation.R.

results/compensation/tested-isoacceptors.tsv is a table of the anticodon isoacceptor families for which compensation was tested, and their respective (raw and adjusted) p-values.
