/DeepBound

Predict the boundary of transcript start and end from RNA-seq reads alignment

Primary LanguageC++

#=================
DeepBound (v0.1)
date: 2017.02.10
#=================

Author: Sheng Wang, Jianzhu Ma, Mingfu Shao
Contact email: realbigws@gmail.com, majianzhu@gmail.com,  shaomingfu@gmail.com


#=================== Abstract =========================

Predict the boundary of transcript (i.e., start and end) from RNA-seq reads alignment, 
     via AUC-Maximized Deep Convolutional Neural Fields (DeepCNF) model.

[Reference]:
DeepBound: Accurate Identification of Transcript Boundaries via Deep Convolutional Neural Fields.
  Mingfu Shao, Jianzhu Ma, and Sheng Wang
                  (submitted to ISMB 2017)


#=================== Install ==========================

1. download the package

git clone --recursive https://github.com/realbigws/DeepBound
cd DeepBound/

--------------

2. compile

cd source_code/
        ./oneline_make.sh
cd ../

--------------

3. update the package

git pull
git submodule foreach --recursive git pull origin master



#================
 Overall Package
#================

#================= Usage and Example =====================#

USAGE: ./oneline_command.sh <-i input_BAM_file> [-o output_boundary]
        [-f chromosomes_dir] [-g annotation_file] [-m min_expression]
        [-O premature_bound] [-p prob_threshold] [-w window_size]

Options:

***** required arguments *****
-i input_BAM_file  : specifies the input reads alignemnt file in BAM format.
                     It has to be sorted (for example, by samtools).
-o output_boundary : specifies the predicted boundaries in FASTA format.
                     0 for end-boundary, 1 for non-boundary, and 2 for start-boundary.

***** optional arguments *****
--| Step 1. from BAM to generate ReadCount by bin/gsamples
-f chromosomes_dir : is optional but strongly recommended,
                     which specifies directory of chromosome sequences in FASTA format.
-g annotation_file : is optional, which specifies the ground-truth expression abundance in GTF file.
-m min_expression  : is together with '-g' option, which will make gsamples
                     ignore these transcripts in GTF file with expression abundance under this value.

--| Step 2. predict premature boundary by DeepBound.sh
-O premature_bound : is the directory containing predicted premature boundary,
                     by DeepBound.sh which is based on AUC-maximizd DeepCNF model.

--| Step 3. process premature boundary by bin/pinpoint
-p prob_threshold  : is optional (default 0.6), which specifies the minimum average probability
                     over a window that will be considerred as a true boundary.
-w window_size     : is optional (default 10), which gives the window size for average calculation.


#--------------------------

oneline command example

./oneline_command.sh -i example/test.bam

the output file for predicted boundary could be found in 'test.boundary' in FASTA format.



#==============
 Detail Steps
#==============


#======== Section I: from BAM, generate ReadCount by bin/gsamples ==============#

1.1 Usage of bin/gsamples

The usage of gsamples is: 
     bin/gsamples <in-bam-file> <out-sample-file> [-f fasta-dir] [-g gtf-file] [-e min-expression]

Options:
     The parameter <in-bam-file> specifies the input reads alignemnt file. It has to be sorted (for example, by samtools).
     The parameter <out-sample-file> specifies the the output sample ReadCount file. See below for more details.
     The parameter [fasta-dir] is optional but strongly recommended, which specifies directory of all sequences files of all chromosomes. These sequence files will be used as features.
     The parameter [gtf-file] is optional, which specifies the ground-truth expressed transcripts. If this file is given, the true boundaries will be contained in the output sample file as true labels; otherwise, the true labels will be -1. When evaluting mode is on (i.e., to evalute the performance of this method, this file should be given.
     The parameter [min-expression] is together with [-g gtf-file], which will make gsamples ignore these transcripts in the gtf-file with expression abundance under this value.

#+++++++++++++++++++++++++++++++++++

1.2 oneline command example

bin/gsamples example/test.bam test.ReadCount



#======== Section II: predict premature boundary by DeepBound.sh ==================#

2.1 Usage of DeepBound.sh

Usage: ./DeepBound.sh <ReadCount_input> <output_folder>
[Note]:    <ReadCount_input> should contain concatenated ReadCount files.
           <output_folder> would contain predicted premature boundary.

#+++++++++++++++++++++++++++++++++++

2.2 oneline command example

./DeepBound.sh example/test.ReadCount test_out/



#======== Section III: process premature boundary prediction by bin/pinpoint ===========#

3.1 Usage of bin/pinpoint

The usage of pinpoint is: 
     bin/pinpoint: <sample-file> <prediction-file> <output-file> [-p probability-threshold] [-w window-size]

Options:
     The parameter <sample-file> specifies the sample ReadCount file used for prediction (see gsamples program).
     The parameter <prediction-file> specifies the probability file generated by the DeepBound.
     The parameter <output-file> specifies the predicted boundaries: 0 for end-boundary, 1 for non-boundary, and 2 for start-boundary
     The parameter [probability-threshold] is optional (default value is 0.25), which specifies the minimum average probability over a window that will be considerred as a true boundary.
     The parameter [window-size] is optional (default value is 10), which gives the window size.


#+++++++++++++++++++++++++++++++++++

3.2 oneline command example

bin/pinpoint example/test.ReadCount example/test.premature test.boundary



#======================
 ReadCount file format
#======================

#=================== format of ReadCount file ======================#

each sequence consists of 23 rows.

1. the initial row, starting with '#', is the header showing the comment or description of this sequence.

2. the first row below '#' header records the 'Start Mid End' label.
        If real-world data is applied, then just put any number here (e.g., -1);
        If for training purpose, then MUST set 0,1,2 for END, MID, and START, respectively.

3. the second row below '#' header records the positive abundance value.
        If real-world data is applied, then just put any number here (e.g., 0);
        If for training purpose, then MUST put a real positive value here to indicate the abundance value.

4. each data entry should have 23 lines in total, including the '#' header row, first boundary label row, and second abundance value row.
        The following 14 rows are the feature row, with the 1st row of them indicating the read coverage.
        The next 4 rows are the A,T,C,G sequence in 0/1 one-hot matrix format.
        The final 2 rows indicates the binary features of TRANSCRIPT_START and TRANSCRIPT_START.


#=======================
 Premature output files
#=======================

#=================== output files and folders ====================#

#-> input data list
1. sample_test_sample_list
   the file list shows the number of sequences in <ReadCount_input>, with the same order as in the original file.
   all following folders contain each sequence as a file with the same name as 'sample_x', where x is the order of sequence, starting from 0.

#-> output prediction
2. bou_pred_out/
   the predicted boundary

#-> concatenated prediction
3. <input_name>.premature
    concatenated premature prediction in one file, where <input_name> should be the same as the <ReadCount_input> file.