
A tutorial review of some methods for ORF calling

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0)

ORFcalling

This repository provides scripts to reproduce some of the results from the RiboCode paper. All data used in these instructions are stored on Google Drive: https://drive.google.com/open?id=1Ibpeu9r-TpCce4qQG-gwZvj4XRpR7RKn

Note

This document gives a general overview of how the simulation datasets are created, how the ROC data are generated, and which scripts run each method. Because RiboTaper tests each annotated exon and its performance was originally benchmarked at the exon level, we first compare RiboCode, RiboTaper, and ORFscore at the exon level. In the following section, we compare RiboCode, Rp-Bp, RiboORF, and ORF-RATER at the gene level, because these methods work on genes or transcripts. The simulation datasets were generated from published HEK293 data.

Detailed step-by-step instructions for data preprocessing and for using these methods can also be found on the following pages:

Comparison with RiboTaper and ORFscore

This section reproduces the results in Fig. 2c.

  1. Run RiboTaper to create the data and to calculate the ORFscore and the RiboTaper p values (results_ccds file):

    create_annotations_files.bash genecode.v19.annotation.gtf hg19_genome.fa true true ribotaper_annot

    Ribotaper.bash mapping_files/rpf_aligned.bam mapping_files/mRNA_aligned.bam ribotaper_annot 26,27,28,29 12,12,12,12 8

    The following files generated by RiboTaper are used in later steps:

    • P_sites_all_tracks_ccds
    • Centered_RNA_tracks_ccds
    • results_ccds
    • frames_ccds

    The P_sites_all_tracks_ccds data are used as true positives, and the Centered_RNA_tracks_ccds data are used as negatives. The ribotaper_data folder on Google Drive contains the complete data.

  2. Calculate the p values using RiboCode:

    Note: change the input file paths in the script before running it.

    python generate_ribocode_pvalue.py

    This step will generate "ribocode_result.txt".

  3. Extract the p values produced by the RiboTaper algorithm and generate the ROC input file:

    python generate_ROC_input.py

    This step will generate "ROC_input.txt".

  4. Plot the ROC curves:

    Rscript ROC.R

    This step will generate "ROC.pdf" and "auc.txt".
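
The ROC computation in step 4 is performed by ROC.R, which is not reproduced here. Purely as an illustration of the underlying idea, the Python sketch below ranks items by p value (smaller meaning more confident) and computes ROC points and the AUC with the trapezoidal rule. The function name `roc_from_pvalues` and the toy data are hypothetical, and no assumption is made about the column layout of "ROC_input.txt".

```python
def roc_from_pvalues(pvalues, labels):
    """Return (fpr, tpr, auc) for p values where smaller = more likely positive.

    labels: 1 for true positives, 0 for negatives.
    Hypothetical helper for illustration; not part of this repository.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Sort by ascending p value so the most significant calls come first.
    ranked = sorted(zip(pvalues, labels), key=lambda pair: pair[0])

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Treat all items sharing one p value as a single threshold step.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
        i = j

    # Trapezoidal area under the (fpr, tpr) curve.
    auc = sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
              for k in range(1, len(fpr)))
    return fpr, tpr, auc


if __name__ == "__main__":
    # Toy data: three positives and three negatives (made-up p values).
    toy_p = [0.001, 0.01, 0.2, 0.03, 0.5, 0.9]
    toy_y = [1, 1, 1, 0, 0, 0]
    _, _, auc = roc_from_pvalues(toy_p, toy_y)
    print("AUC = %.3f" % auc)
```

Grouping tied p values into one step avoids an arbitrary ordering of ties inflating or deflating the AUC.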

Comparison with Rp-Bp, ORF-RATER, RiboORF

This section reproduces the results in Fig. 3a. We randomly selected 1000 annotated protein-coding genes (RPF read count > 5). The RPF reads uniquely mapped to these genes are used as the true positives, and the mRNA reads are used as the true negatives. The mapping_data folder on Google Drive contains the original RPF alignment files, and the simulation_data folder contains the simulation files and the results of each method.
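
The selection rule described above (protein-coding genes with an RPF read count above 5, sampled at random) can be sketched as follows. This is an illustration only, not the code in generate_simulatedFq.py; the function name `select_genes`, the two input dictionaries, and the fixed random seed are assumptions.

```python
import random


def select_genes(rpf_counts, gene_types, n_genes=1000, min_reads=5, seed=0):
    """Randomly pick protein-coding genes with more than `min_reads` RPF reads.

    rpf_counts: dict mapping gene_id -> RPF read count (hypothetical input)
    gene_types: dict mapping gene_id -> gene type, e.g. "protein_coding"
    """
    # Sort the eligible IDs so the result depends only on the seed,
    # not on dictionary ordering.
    eligible = sorted(
        gene for gene, count in rpf_counts.items()
        if count > min_reads and gene_types.get(gene) == "protein_coding"
    )
    if len(eligible) < n_genes:
        raise ValueError(
            "only %d eligible genes, need %d" % (len(eligible), n_genes)
        )
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(eligible, n_genes)


if __name__ == "__main__":
    # Toy inputs with made-up gene IDs and counts.
    counts = {"g1": 10, "g2": 3, "g3": 6, "g4": 100}
    types = {"g1": "protein_coding", "g2": "protein_coding",
             "g3": "protein_coding", "g4": "lincRNA"}
    print(sorted(select_genes(counts, types, n_genes=2)))
```

In the toy call, g2 is excluded by the read-count filter and g4 by its gene type, so only g1 and g3 remain eligible.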

  1. Original alignment files for RPF and mRNA:

    These files are available in the mapping_data folder.

    Alternatively, users can generate these files with the run_star.sh script.

  2. Randomly select 1000 protein-coding genes:

    python generate_simulatedFq.py

    This step will generate the "selected.gtf" file and a simulated FASTQ file ("simulated_rpf.fq") in which the RPF reads of the selected genes are replaced by mRNA reads. The mRNA reads of the selected genes in the simulated FASTQ file are used as the true negatives.

  3. Change the type of the selected genes from "protein_coding" to "lincRNA" in the GTF file:

    python generate_simulatedGTF.py

    This step uses "selected.gtf" as the input file and will generate "simulation.gtf". Since ORF-RATER is a supervised method, the other protein_coding genes are also included in this file as the training set.

  4. Alignment using STAR:

    See the ./run_star.sh script. The simulated FASTQ file generated in step 2 is aligned to the human genome with the annotation file "simulation.gtf".

  5. Run each of the following methods separately on the original alignment files and on the simulated alignment files (generated in the previous step).

    • Rp-Bp

      Run the "RpBp/run_RpBp.sh" script.

      Note: please change the file path in "RpBp/prepare.yaml" and "RpBp/config.yaml".

    • ORF-RATER

      Run the "ORF-RATER/run_ORF-RATER.sh" script.

    • RiboORF

      Run the "RiboORF/run_RiboORF.sh" script. We also provide a "generate_CDS_ORF_genePred.py" script to generate the genePred file for RiboORF.

    • RiboCode

      Run the "RiboCode/run_RiboCode.sh" script.

    Note: Users need to change the input and output file paths in the scripts. Each method generates two sets of results: one in the "result_true" folder (using the original alignment files as input) and one in the "result_simulation" folder (using the simulated alignment files as input). Both are available in the "simulation_data" folder on Google Drive.

  6. Generate the ROC data:

    python generate_ROCdata.py

    This script collects the results of each method (produced in the previous step) and extracts their statistics to generate the ROC input data (such as "ribocode_roc.txt"). Then use the "roc.sh" script to remove "None" values from the ROC input data and to call the "roc.R" script, which generates the ROC curves and AUC.

    ./roc.sh
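
The "None" filtering done by roc.sh is not shown in this document. As a minimal sketch of what such a cleanup could look like in Python, the function below drops any row that has a field equal to the literal string "None". The function name `drop_none_rows`, the tab-separated layout, and the toy rows are assumptions; the real format of files such as "ribocode_roc.txt" may differ.

```python
def drop_none_rows(lines, sep="\t"):
    """Yield only the lines in which no field is the literal string "None".

    Hypothetical helper; assumes one record per line with `sep`-separated
    fields, which may not match the repository's actual file format.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if "None" not in fields:
            yield line


if __name__ == "__main__":
    # Toy rows with made-up names and values.
    toy = ["geneA\t0.001\t1\n", "geneB\tNone\t0\n", "geneC\t0.7\t0\n"]
    for kept in drop_none_rows(toy):
        print(kept, end="")
```

Matching whole fields rather than substrings avoids discarding a row merely because an identifier happens to contain the text "None".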

Contact

If you have any questions about RiboCode or this study, please feel free to contact:

xzt41[at]126.com

xudonxing_bioinf[at]sina.com

THUhry12[at]163.com