This repository contains the downstream analysis pipeline of the Schmitt, Rudolph et al. paper. The following sections explain how to run the pipeline or use individual components. Enjoy!
Please consult the online methods for more details.
A previous version of the code contained a small error in the plotting code for the PCA summary plots. A corrected version of the code has been committed. The update does not affect the data or the interpretation of the results in our manuscript. The changes are highlighted in commit 4860f48.
This pipeline requires Python 2.7 and R 3.0. Dependencies on R packages are
listed in the DEPENDS file.
Further dependencies are:
- Meme (4.9.0)
- klmr/rcane (included as submodule)
- BioPython (>= 1.59)
- klmr/pygoo (included as submodule)
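As a quick sanity check before running anything, the prerequisites listed above can be probed from a small helper script. This is a minimal sketch and not part of the pipeline: it is written for Python 3.3 or later (because it uses `shutil.which`), the executable name `meme` is an assumption about how Meme is installed on your system, and the helper names are hypothetical. It only checks that the tools are present, not that their versions match the ones listed.

```python
import shutil
import sys

# Executable names are assumptions: "Rscript" and "git" are used elsewhere in
# this README, while "meme" is a guess at how the Meme suite is installed.
REQUIRED_EXECUTABLES = ["Rscript", "meme", "git"]


def missing_executables(names, which=shutil.which):
    """Return the entries of `names` that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


def has_biopython():
    """Return True if BioPython can be imported by this interpreter."""
    try:
        import Bio  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Report missing tools; versions are NOT checked by this sketch.
    for name in missing_executables(REQUIRED_EXECUTABLES):
        print("not found on PATH: " + name)
    if not has_biopython():
        print("BioPython is not importable from " + sys.executable)
```

If the script prints nothing, all tools were found. Whether BioPython is importable depends on the interpreter used to run the check, so run it with the same Python that the pipeline will use where possible.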
After cloning the repository, please run
git submodule init
git submodule update
to initialise the submodules.
Before running the pipeline, please ensure that all the necessary upstream data
is inside the common/data directory. This data can be obtained from the
supplementary material of the publication.
To re-generate all results, run the command
Rscript --vanilla common/scripts/generate-all.R
from within the base directory. This will take a while. Individual results can
be generated by running the respective script inside either the rna or chip
folder. For instance, to generate just the PCA plot for the ChIP data, execute
the following:
cd chip
Rscript --vanilla scripts/pca.R
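If you want to drive individual scripts from Python, for example to run several of them in sequence, the commands above can be wrapped as follows. This is only a sketch: `script_command` and `run_script` are hypothetical helper names that are not part of the repository, and the code simply reproduces the `cd` plus `Rscript --vanilla` invocation shown above.

```python
import os
import subprocess


def script_command(script_name):
    """Build the Rscript invocation used in this README for one script."""
    return ["Rscript", "--vanilla", "scripts/" + script_name]


def run_script(base_dir, folder, script_name):
    """Run one analysis script from inside `folder` ("chip" or "rna")."""
    if folder not in ("chip", "rna"):
        raise ValueError("folder must be 'chip' or 'rna', got %r" % folder)
    working_dir = os.path.join(base_dir, folder)
    # check_call raises CalledProcessError if Rscript exits with an error.
    subprocess.check_call(script_command(script_name), cwd=working_dir)
```

Calling `run_script(".", "chip", "pca.R")` from the base directory corresponds to the two commands above.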
This section gives an overview of the result files and folders, with instructions on how to generate and interpret each file.
This is what the directory tree of the project folder looks like:
chip
    plots
    results
        active-genes
        compensation
        de-acc
        de-genes
        de-type
        meme
        usage-sampling
    scripts
common
    cache
    data
    meme
    scripts
        rcane
rna
    plots
        correlation
        de
        distribution
        usage
        usage-sampling
    results
        de
        usage-sampling
    scripts
        goo
The */scripts directories contain the source code of the pipeline.
The common/data directory contains the source data files from the
supplementary materials.
The common/cache directory contains cache files which speed up re-generation
of the results by caching some intermediate results. These files are not
checked for staleness, so changing parameters or input data may require manually
deleting them; otherwise results may not update correctly.
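Clearing the cache by hand can be scripted. The sketch below is not part of the pipeline and rests on two assumptions: the cache files sit directly inside the cache directory, and hidden files (names starting with a dot, such as a possible `.gitignore` placeholder) should be kept. `clear_cache` is a hypothetical helper name. By default it only reports what it would delete.

```python
import os


def clear_cache(cache_dir, dry_run=True):
    """Delete regular, non-hidden files directly inside `cache_dir`.

    Returns the paths of the affected files. With dry_run=True (the default)
    nothing is deleted and the paths are only reported.
    """
    if not os.path.isdir(cache_dir):
        return []
    affected = []
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        # Assumption: dotfiles are placeholders, not cache entries; keep them.
        if name.startswith(".") or not os.path.isfile(path):
            continue
        if not dry_run:
            os.remove(path)
        affected.append(path)
    return affected
```

From the base directory, `clear_cache("common/cache")` lists the files that would be removed, and `clear_cache("common/cache", dry_run=False)` actually removes them.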
The {chip,rna}/plots directories contain result plots. Plots which
contrast tRNA ChIP data with RNA-seq derived data are found in the
chip directory. The rna directory contains only results
which are based solely on the RNA-seq data and not on ChIP-seq data.
The {chip,rna}/results directories contain any additional results which
are not in the form of plots. These are tables and raw text files, as well as
HTML files for the Meme analysis.
The following is a breakdown of the individual directories.
Figures are generated by chip/scripts/colocalization.R.
Each PDF file corresponds to one colocalisation test. The parameters for the test are given in the title of the plot. The p-value of the significance test is labelled in the plot.
Figures and tables are generated by chip/scripts/compensation.R.
results/compensation/tested-isoacceptors.tsv is a table of the anticodon
isoacceptor families for which compensation was tested, and their respective
(raw and adjusted) p-values.
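To inspect a result table such as this one outside of R, a tab-separated file can be loaded with Python's standard `csv` module. The sketch below (Python 3) assumes that the first line of the file is a header row with one name per column; the actual column names are not documented here, so the code reads them from the file instead of hard-coding them. `read_tsv` is a hypothetical helper name.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file with a header row.

    Returns (header, rows), where each row is a dict keyed by column name.
    All values are kept as strings; convert p-values with float() as needed.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)  # Assumption: the first line holds column names.
        rows = [dict(zip(header, fields)) for fields in reader]
    return header, rows
```

For example, `header, rows = read_tsv("chip/results/compensation/tested-isoacceptors.tsv")`, run from the base directory, returns the column names in `header` and one dictionary per tested isoacceptor family in `rows`.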