/SEASTAR

SEASTAR - Systematic Evaluation of Alternative transcription STArt site in RNA

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

SEASTAR: Systematic Evaluation of Alternative STArt site in RNA

Zhiyi Qin, Xuegong Zhang, Yi Xing

First Edition June 20, 2015
Last revised Oct 18, 2017

SEASTAR (Systematic Evaluation of Alternative STArt site in RNA) is a software package for Transcription Start Site (TSS) identification and quantification using only RNA-seq data. It assembles novel TSSs based only on RNA-Seq data and merges them with known TSSs from a public database. This package enables high-quality TSS identification that is comparable to the highly sophisticated CAGE technology. This package is particularly useful for finding novel TSSs that contribute to transcriptome complexity along with identifying differential promoter utilization.

version 1.0.0 - updates several descriptions and tests. To achieve v0.9.4, one can visit https://github.com/zhyqin/SEASTAR-0.9.4 for download.

Usage

bash SEASTAR.sh [-A sample1.bam] [-B sample2.bam] [-o outputfolder] [-g publicgtf]  [-i genome_sizes] [-s sequence] [-d range_of_non-redundant_TSSs] [-p num_threads] [-c AFE_detection_diff_cutoff] [-t modeltype] [-b batchprocess] [-S strand-specific]

This will identify Alternative First Exons (AFEs) from multiple RNA-Seq runs. Please see installation instructions below before running the pipeline for the first time. See the parameters section for details.

Installation

SEASTAR requires the following packages on your Linux environment to run:

Add the Cufflinks, samtools, bedtools, R and Python directories to the $PATH environment variable, in the given order.

You will also need:

  • the Bowtie indexes for the genome of interest. If you are working with Human (hg19) or Mouse (mm9) data, you may download and use pre-built Bowtie indexes. Otherwise you may build your own Bowtie index from the genome fasta sequence using the bowtie-build tool.
  • the GTF annotation file for species of interest. This may be obtained from a public database.

For OS system user, please see Q&A section for further help.

Examples

The test folder provided with this package contains a test files that can be used to verify the software was installed and is functioning correctly.

  • Test data generation

The reads of test BAM files are abstracted from genomic region chr7:1-20,000,000 of human adipose and adrenal samples. The original BAM files are aligned to GRCh37 of human genome. The raw GTF file is made for Homo sapiens GRCh37 generated by Ensembl v72. The GTF file in the test folder only includes the genomic region of chromosome 7 including all the region chr7:1-20,000,000.

####* Start the test

To verify the package is installed correctly and working, use the following command to start analysis on the test samples. Replace [BowtieIndexPath] with the path to the Bowtie index on your system.

bash [scriptPath]test_SEASTAR.sh [BowtieIndexPath]

Alternatively, you may run the full command on the command line:

bash ./SEASTAR.sh -A ./test/adipose1.bam,./test/adipose2.bam -B ./test/adrenal1.bam,./test/adrenal2.bam -o ./testresult -g ./test/Homo_sapiens.Ensembl.GRCh37.72.chr7.gtf -i ./test/hg19.chrom.sizes -s [BowtieIndexPath]/hg19.fa -c 0.1 -p 1 -b U

####* De novo mode example

The parameter -g can be used to run SEASTAR in a De novo mode.

The following line will generate bash scripts for the example data set described above:

bash ./SEASTAR.sh -A ./test/adipose1.bam,./test/adipose2.bam -B ./test/adrenal1.bam,./test/adrenal2.bam -o ./testresult -g ./test/Homo_sapiens.Ensembl.GRCh37.72.chr7.gtf -i ./test/hg19.chrom.sizes -s [BowtieIndexPath]/hg19.fa -c 0.1 -p 1 -b U

####* Reference mode example

The parameter -G can be used to run SEASTAR in a Reference mode skipped the RABT assembly steps.

The following line will generate bash scripts for the example data set described above:

bash ./SEASTAR.sh -A ./test/adipose1.bam,./test/adipose2.bam -B ./test/adrenal1.bam,./test/adrenal2.bam -o ./testresult -G ./test/Homo_sapiens.Ensembl.GRCh37.72.chr7.gtf -i ./test/hg19.chrom.sizes -s [BowtieIndexPath]/hg19.fa -c 0.1 -p 1 -b U

####* Batch mode example

The parameter -b B can be used to run SEASTAR in a compute cluster environment. This parameter instructs SEASTAR to generate bash scripts that can be submitted to the cluster individually instead of running the analysis interactively.

The following line will generate bash scripts for the example data set described above:

bash ./SEASTAR.sh -A ./test/adipose1.bam,./test/adipose2.bam -B ./test/adrenal1.bam,./test/adrenal2.bam -o ./testresult -g ./test/Homo_sapiens.Ensembl.GRCh37.72.chr7.gtf -i ./test/hg19.chrom.sizes -s [BowtieIndexPath]/hg19.fa -c 0.1 -p 1 -b B

Technical Overview

The flowchart below provides a high-level view of the package: Overview

Parameters

Required parameters:

  • -A A_r1.bam[,A_r2.bam] The mapping results for the sample in bam format for the case group. Multiple alignments must be in a comma separated list (if using bam).

  • -B B_r1.bam[,B_r2.bam] The mapping results for the sample in bam format for the control group. Multiple alignments must be in a comma separated list (if using bam).

  • -g gtfFile Annotation of genes and transcripts in GTF format in De novo mode

  • -G gtfFile Annotation of genes and transcripts in GTF format in Reference mode skipping RABT assembly step

  • -i genomeSizes The lengths of all chromosomes. The format can be achieved from UCSC. More details can be found from the command genomeCoverageBed in bedtools.

  • -s bowtieIndexBase The fasta file of the bowtie indexes (fa files). The name should use hg19.fa instead of hg19. (Only used for assembly)

  • -o outDir The output directory for the generated results

Optional parameters:

  • -p <int> The number of processors to be used. The default value is 1.

  • -d <int> The distance among the TSSs derived from the same promoter region. The default is 100 (bps).

  • -c <float> The splicing difference cutoff. The cutoff is used in the null hypothesis test for differential splicing. The default is 0.1 for a 10% difference. The valid range is 0 ≤ cutoff < 1.

  • -t modeType Mode used in the MATS analysis. The options are 'U' for unpaired data and 'P' for paired data. Default is unpaired data.

  • -b batchProcess Type of batch processing to be done. The 'B' option generates several scripts to be used for batch processing on a parallel computing cluster. The 'U' option starts the analysis immediately. Default is 'U'.

  • -S strand-specific The strand specific type of input data. The 's' option represents strand-specific data. The 'U' option represents non-strand specific data. Default is 'U'.

The cluster computing parameters should be set in the user-defined "header.txt" file in the installation folder before running the program in batch mode. The contents of this file will be added to each script that is generated.

Output

SEASTAR will generate the following files in the [outputFolder]:

merged.gtf

Assembly result generated by Cufflinks, with raw annotation including all transcripts with their respective exons.

FilteredNrtss.annotation

TSS annotation results after running SEASTAR pipeline. The columns in this file are described below:

  • tss_id: The ID of each TSS. Each tss_id represents one clustering of the raw TSSs whose distance are smaller than the provided threshold (default is 100). Example: TSS103

  • class_code: This field is defined by Cufflinks. Example: =

  • gene_id: The ID of each gene given by Cufflinks Example: XLOC_001358

  • gene_short_name: RefSeq gene name Example: COX19

  • locus: Gene location Example: chr7:1005710-1015235

  • length: Gene location without chromosome information Example: 1005710-1015235

  • tss_chr: The chromosomal position of the TSS Example: 3666

  • tss_ss: The start site of the corresponding first exon, without considering the strand direction. Example: 1015064

  • tss_es: The start site of the corresponding first exon, without considering the strand direction. Example: 1015235

  • strand: The strand direction of the TSS Example: -

  • in_out: Whether the TSS is internal (denoted by 1) or external (denoted by 0). Example: 0

  • over_cnt: The number of non-initial exons overlapping this TSS (i.e. overlapping the corresponding first exon). Example: 0

  • over_len: The length of non-initial exons overlapping this TSS (i.e. the corresponding first exon). Example: 0

  • tss_len: The length of other initial exons overlapping this TSS (i.e. the corresponding first exon). Example: 0

  • over_frc: The fraction of the length of other initial exons overlapping this TSS (i.e. the corresponding first exon). Example: 0

  • tss_ovcnt: How many other TSSs exist in the locus of this TSS (i.e. the corresponding first exon). Example: 1

  • e_max: Maximum coverage of the corresponding first exon Example: 22

  • j_max: Maximum coverage of the corresponding first exon only considering the junction reads. Example: 2

  • result: Whether the TSS passes the filtering (i.e. the logistic regression) Example: TRUE

  • logit: The prediction result of the logistic regression model Example: 0.94

  • known: With De novo mode, whether this TSS is known in public database or is a novel one discovered by the pipeline. 0 denotes novel TSS (i.e. not present before the assembly step). Example: 0

table.cov

TSS quantitation of all input samples, in the same order as input. Each column denotes one sample and each row denotes one TSS.

table_j.cov

TSS quantitation of all input samples, in the same order as input, but only for junction reads. Each column represent one sample and each row represent one TSS.

rMATS_Result.txt

The method as described in rMATS documentation.

  • IC_SAMPLE_1: counts of the first exons for SAMPLE_1, replicates are separated by comma
  • SC_SAMPLE_1: counts of other first exons in this gene for SAMPLE_1, replicates are separated by comma
  • IC_SAMPLE_2: counts of the first exons for SAMPLE_2, replicates are separated by comma
  • SC_SAMPLE_2: counts of other first exons in this gene for SAMPLE_2, replicates are separated by comma
  • IncFormLen: length of the first exon we studying, used for normalization
  • SkipFormLen: average length of other first exons in this gene, used for normalization
  • IncLevel1: usage ratio for SAMPLE_1 replicates (comma separated) calculated from normalized counts
  • IncLevel2: usage ratio for SAMPLE_2 replicates (comma separated) calculated from normalized counts
  • IncLevelDifference: $average(IncLevel1) - average(IncLevel2)$
  • PValue: the probability whether the absolute difference is larger than the user-defined cutoff
  • FDR: adjusted p-value using the Benjamini method

utr_result_All_Prediction_Results.txt

We predicted whether the usage of a FE would be shortened or lengthened between two different biological conditions. The method as described in the Dapars documentation.

  • Predicted_Proximal_5UTR: predicted switched site of lengthening and shortening first exons
  • Group_A_Mean_PDUI: mean PDUI of sample group A (PDUI: Percentage of Distal start site Usage Index)
  • Group_B_Mean_PDUI: mean PDUI of sample group B
  • PDUI_Group_diff: the difference of PDUI between group A and B
  • P_val: the probability whether the absolute difference is larger than 0
  • adjusted.P_val: adjusted p-value using Benjamini method
  • Pass_Filter: final step checked by the parameters in the user defined configure file. Y and N represent passed and filtered, respectively.

rMATS_utr_Result.txt

With the prediction model of DaPars, the switch point between the lengthening and shortening regions was predicted and reported. Then the whole region of a FE was split into two regions based its predicted switch point. The following statistical inference was performed based on rMATS by testing whether the relative usage of these two split regions would be significantly changed between two different samples.

  • IC_SAMPLE_1: counts of the first region of split exon for SAMPLE_1, replicates are separated by comma
  • SC_SAMPLE_1: counts of other region of split exon for SAMPLE_1, replicates are separated by comma
  • IC_SAMPLE_2: counts of the first region of split exon for SAMPLE_2, replicates are separated by comma
  • SC_SAMPLE_2: counts of other region of split exon for SAMPLE_2, replicates are separated by comma
  • IncFormLen: length of the first region of split exon we studying, used for normalization
  • SkipFormLen: length of other region of split exon, used for normalization
  • IncLevel1: usage ratio for SAMPLE_1 replicates (comma separated) calculated from normalized counts
  • IncLevel2: usage ratio for SAMPLE_2 replicates (comma separated) calculated from normalized counts
  • IncLevelDifference: $average(IncLevel1) - average(IncLevel2)$
  • PValue: the probability whether the absolute difference is larger than the user-defined cutoff
  • FDR: adjusted p-value using the Benjamini method

Q&A

  1. What version of OS X system can use this pipeline?

We suggest to use a OS X system => 10.7. Additionally, when running SEASTAR on OS X systems, we have the following suggestions for users:

    1. Use the version Cufflinks 2.1.0 (or lower version) for Mac (http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.1.0.OSX_x86_64.tar.gz). Its higher version does not work on Mac OS X sometimes. This bug has been reported several times on some Cufflinks user forums.
    1. Use the lower version of samtools such as 0.1.9. Otherwise Cufflinks may not link with it and may not work.
    1. Use the higher version of bedtools such as 2.26 as there are several major updates in its recent version.
  1. How to download v0.9.4? To achieve v0.9.4, one can visit https://github.com/zhyqin/SEASTAR-0.9.4 for download.

Contact and Bug Reports

Please contact Zhiyi Qin qzyfy.dq@gmail.com for questions and bug reports.

Acknowledgements

We wish to thank Tevfik Umut Dincer, Shihao Shen, Jinkai Wang and Juw Won Park for technical assistance and comments.