RiboWave is a funtional Ribo-seq analysis tool to identify translated ORF based on Ribo-seq data.
The RiboWave workflow consists of:
-
Pre-processing :
- Create the annotation file for the subsequent analysis. [
create_annotation.sh
] - Determine the P-site position of Ribo-seq. [
P-site_determination.sh
] - Generate P-site tracks from Ribo-seq.[
create_track_Ribo.sh
]
- Create the annotation file for the subsequent analysis. [
-
Main function :
- Denoise [
Ribowave
] - Predict translated ORFs [
Ribowave
] - Estimate reads density for each given ORF [
Ribowave
] - Estimate frameshift potential for each given ORF [
Ribowave
]
- Denoise [
- R
- bedtools v2.25.0
- reshape
- ggplot2
- rhdf5
- methods
- wmtsa
- parallel
It is recommanded to make a new directory and move the Ribo-seq bam file
into that directory;
This step scans for and annotates all putative ORFs as well as define annotated ORF according to the annotation file
./create_annotation.sh -h
A helper message is shown:
----------------------------------------------------------------------------------------------------
RiboWave : version 1.0
This step is set for the purpose of genome annotation.
Several of the output is necessary for the following steps.
----------------------------------------------------------------------------------------------------
Usage:
create_annotation.sh [-h] -G annotation.gtf -f fasta -o annotation_dir -s scripts_dir
Options:
-G <filename> (annotation.gtf )
-f <filename> (genome fasta )
-o <directory> (Output annotation directory )
-s <directory> (Script directory )
-h (Help )
----------------------------------------------------------------------------------------------------
Run create_annotation.sh
on example:
scripts/create_annotation.sh -G annotation_fly/dmel-all-r6.18.gtf -f annotation_fly/dmel-all-chromosome-r6.18.fasta -o annotation_fly -s scripts;
-
<annotation.gtf> : the annotation gtf should contain start_codon and stop_codon information,eg:
dmel-all-r6.18.gtf
-
<genome.fasta> : genome fasta ,eg:
dmel-all-chromosome-r6.18.fasta
-
<annotation_dir> : the directory for all the annotation output
-
<scripts_dir> : the directory of all the scripts in the package
annotation
directory, including :
-
start_codon.bed : the bed file annotating start codon
-
final.ORFs : all identified ORFs, eg:
FBtr0300105_0_31_546
whereFBtr0300105
refers to the transcript,0
refers to the reading frame relative to the start of transcript,31
refers to the start site,546
refers to the stop codon.
This step determines the P-site position for each Ribo-seq reads length by overlapping with the annotated start codons from previous step
./P-site_determination.sh -h
A helper message is shown:
----------------------------------------------------------------------------------------------------
RiboWave : version 1.0
This step is for determining the P-site position for each read length.
----------------------------------------------------------------------------------------------------
Usage:
P-site_determination.sh [-h] -i Ribo.bam -S start_codon.bed -o out_dir -s scripts_dir [-n study name]
Options:
-i <filename> (Ribo-seq bam )
-S <filename> (start codon annotation file )
-o <directory> (Output annotation directory )
-s <directory> (Script directory )
-n <string> (study name, defult:test )
-h (Help )
----------------------------------------------------------------------------------------------------
Run P-site_determination.sh
on example :
scripts/P-site_determination.sh -i GSE52799/SRR1039770.sort.bam -S annotation_fly/start_codon.bed -o GSE52799 -n SRR1039770 -s scripts;
-
<Ribo_bam> : secondary alignment removed to ensure one genomic position per aligned read and sorted
-
annotation
:- <start_codon.bed> : annotated start site
start_codon.bed
. It is generated in thecreate_annotation.sh
step.
- <start_codon.bed> : annotated start site
-
<out_dir> : the directory of the output result, eg:
GSE52799
-
<study_name> : the name of all the output file, default: test. eg:
SRR1039770
-
<scripts_dir> : the directory of all the scripts in the package
P-site
directory, including :
-
name.psite1nt.txt : the Ribo-seq reads length and its corresponding P-sites position(= offset + 1). It may look this this :
30 13
-
name.psite.pdf : the PDF displaying the histogram of aggregated reads
This step creats the P-site track for transcripts of interests using determined P-sites position from previous step.
./create_track_Ribo.sh -h
A helper message is shown:
----------------------------------------------------------------------------------------------------
RiboWave : version 1.0
This step is the last step for data processing where P-site track is generated for each transcript.
----------------------------------------------------------------------------------------------------
Usage:
create_track_Ribo.sh [-h] -i Ribo.bam -G exons.gtf -g genome_size -P psite.position -o out_dir -s scripts_dir [-n study name]
Options:
-i <filename> (Ribo-seq bam )
-G <filename> (exons gtf file )
-g <filename> (Genome size annotation file )
-P <filename> (P-site position (offset+1) )
-o <directory> (Output annotation directory )
-s <directory> (Script directory )
-n <string> (study name, defult:test )
-h (Help )
----------------------------------------------------------------------------------------------------
Run create_track_Ribo.sh
on example:
scripts/create_track_Ribo.sh -i GSE52799/SRR1039770.sort.bam -G annotation_fly/X.exons.gtf -g annotation_fly/genome -P GSE52799/P-site/SRR1039770.psite1nt.txt -o GSE52799 -n SRR1039770 -s scripts;
-
<Ribo_bam>
-
<exons.gtf> : a gtf file for only the exons from transcripts of intersect, eg:
X.exons.gtf
-
<genome> : the file including all the chromosomes and its genome size. Noted:
genome
can be obtained by usingsamtools faidx
function with the input of fasta file.genome
may look like this:2L 23513712 2R 25286936 3L 28110227 3R 32079331
-
P-site
:- <P-site_position> : the file listing the P-site position for each read length. This file can be found in the output of previous step, eg: name.psite1nt.txt
-
<out_dir> : the directory of the output result, eg:
GSE52799
-
<study_name> : the name of all the output file, default: test. eg:
SRR1039770
-
<scripts_dir> : the directory of all the scripts in the package
bedgraph/name
directory, including :
-
final.psite : P-site track at transcriptome wide. It may look like this :
FBtr0070533 0,0,0,0,0,0,0,0,0,0,0,0,6,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,1,0,2,1,0,0,0,0,0,0,4,8,0,0,3,0,5,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 FBtr0073886 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0 FBtr0070604 0,0,0,0,0,0,0,0,0,0,0,0,59,6,0,1,0,0,2,6,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 FBtr0070603 0,0,0,0,0,0,0,0,0,0,0,0,75,2,7,10,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
This step can achieve multiple functions :
-
denoising [
denoise
] -
providing predicted p.value for each given ORF to identify its translation status [
pvalue
,-P
] -
providing reads density (P-site/PF P-site) for each given ORF [
density
,-D
] -
providing translation efficiency (TE) estimation for each given ORF [
TE
,-T
] -
providing frameshift potential (CRF score) for each given ORF [
CRF
,-F
]
./Ribowave -h
A helper message will be shown:
----------------------------------------------------------------------------------------------------
RiboWave : version 1.0
This step is main function of RiboWave.
Functions are provided including : predicting translated ORF, estimating reads density, estimating translation efficiency and predicting frameshift events.
----------------------------------------------------------------------------------------------------
Usage:
Ribowave [Options] -a P-site track -b ORF_list -o output_dir -s scripts_dir
Options:
-P (providing P value for each ORF )
-D (providing reads abundance for each ORF )
-F (predicting frameshift potential for each ORF )
-T <int> <RNA_FPKM> (estimating TE for each ORF.<int> requires total Ribo-seq sequence depth;<RNA_FPKM> requires RNA-seq FPKM)
-a <filename> (psite track )
-b <filename> (ORF list )
-o <directory> (Output directory )
-s <directory> (Script directory )
-n <string> (The name of the study, default: test )
-p <int> (The number of threads, default: 1 )
-h (Help )
----------------------------------------------------------------------------------------------------
It might take hours to perform the analysis if the input is large. It is recommended to specify the number of CPU cores through the -p
option.
Run Ribowave
on example:
mkdir -p Ribowave;
scripts/Ribowave -a GSE52799/bedgraph/SRR1039770/final.psite -b annotation_fly/final.ORFs -o GSE52799/Ribowave -n SRR1039770 -s scripts -p 8;
mkdir -p Ribowave;
scripts/Ribowave -P -a GSE52799/bedgraph/SRR1039770/final.psite -b annotation_fly/final.ORFs -o GSE52799/Ribowave -n SRR1039770 -s scripts -p 8;
mkdir -p Ribowave;
scripts/Ribowave -D -a GSE52799/bedgraph/SRR1039770/final.psite -b annotation_fly/final.ORFs -o GSE52799/Ribowave -n SRR1039770 -s scripts -p 8;
IMPORTANT : when estimating TE, user should input the sequenced depth of Ribo-seq and the FPKM value from paired RNA-seq
mkdir -p Ribowave;
scripts/Ribowave -T 9012445 GSE52799/mRNA/SRR1039761.RPKM -a GSE52799/bedgraph/SRR1039770/final.psite -b annotation_fly/final.ORFs -o GSE52799/Ribowave -n SRR1039770 -s scripts -p 8;
mkdir -p Ribowave;
awk -F '\t' '$3=="anno"' annotation_fly/final.ORFs > annotation_fly/aORF.ORFs;
scripts/Ribowave -F -a GSE52799/bedgraph/SRR1039770/final.psite -b annotation_fly/aORF.ORFs -o GSE52799/Ribowave -n SRR1039770 -s scripts -p 8;
mkdir -p Ribowave;
scripts/Ribowave -PD -T 9012445 GSE52799/mRNA/SRR1039761.RPKM -a GSE52799/bedgraph/SRR1039770/final.psite -b annotation_fly/final.ORFs -o GSE52799/Ribowave -n SRR1039770 -s scripts -p 8;
-
bedgraph/name
:- <P-site track> : output from the previous step, containing the P-site track of transcripts of interest, eg:
final.psite
- <P-site track> : output from the previous step, containing the P-site track of transcripts of interest, eg:
-
<ORF_list> : ORFs of interest ,eg :
final.ORFs
. It is generated in the step ofcreate_annotation.sh
-
<Ribo-seq sequenced depth> : the sequenced depth of Ribo-seq to calculate FPKM , eg:
9012445
-
<RNA FPKM> : FPKM table. It may look like this :
FBtr0100871 22262 FBtr0070604 18682 FBtr0100231 14746.5 FBtr0100874 14024.5 FBtr0100864 11475.6
-
name.PF_psite : the denoised signal track(PF P-sites signal track) at transcriptome wide. It looks similar as the input
final psite
. -
name.feats1 : the features of ORFs including chi-square P-value information. It may look like this :
Column | Explanation |
---|---|
Column1-Column4 | Basic information about the ORF |
Column5 | Reads coverage within the ORF |
Column6 | P-value predicted by RiboWave |
Column7 | Values estimating the relative abundance of PF P-sites outside of the studied ORF |
Column8 | Reads intensity at the current start codon |
result
directory, including :
- name.95%.mx : RiboWave makes the prediction on the translation initiation sites and gives the final translated product output (p.value < 0.05) . It may look like this :
FBtr0070007_2_93_1028
FBtr0070008_1_128_943
FBtr0070025_2_135_1094
- name.density : reads density ( PF P-site ) of given ORFs. It may look like this :
Column | Explanation |
---|---|
Column1-Column4 | Basic information about the ORF |
Column5 | number of PF P-sites in transcript |
Column6 | number of PF P-sites in given ORF |
Column7 | Density of PF P-sites in given ORF |
- name.TE : TE of given ORFs. It may look like this :
Column | Explanation |
---|---|
Column1 | transcript |
Column2 | ORF |
Column3 | TE |
- name.CRF.final : ORFs that might experience reading frame translocation. It may look like this :
Column | Explanation |
---|---|
Column1 | ORF |
Column2 | Start of frameshift |
Column3 | Stop of frameshift |
Column4 | PF P-sites' reading frames after the change point ,eg: 2_2,0_1 where 2_2 refers to continuous two PF P-sites of frame 2 followed by continuous one PF P-sites of frame 0. |
Column5 | Relative position of PF P-sites after the shift ,eg : 1413,1440;1789 where 1413,1440 corresponds to the exact position of 2_2 within the transcript. Discontinuity in the reading frame is separated by ; |
Column6 | CRF score describing the potential of frameshift |
An example file is packed and found in here.
Enclosed in the RiboWave_v1.0.tar.gz, run_Ribowave_dmel.sh
combines all the steps together into a pipeline.
Zhiyu Xu [xanthexu18@gmail.com]