This RNA-seq analysis pipeline measures the frequency of each type of substitution (e.g., A->G) in each of a set of human RNA-seq samples on the Sequence Read Archive. The samples, specified in a manifest file, are divided into two groups. The pipeline looks for a statistically significant differences in frequencies of substitutions between the two groups via ??????. Every line of the manifest file is either a comment (denoted by an initial # character) or three tab-separated fields:
SRA run accession number <TAB> sample group <TAB> sample name
A run accession number always begins with "SRR". The pipeline assumes that there are exactly two sample groups, e.g., "tumor" or "normal." Both sample group and sample name should contain no spaces.
The pipeline has several dependencies:
- HISAT, a fast spliced alignment tool.
- HISAT index for UCSC hg19 (ftp://ftp.ccb.jhu.edu/pub/data/hisat_indexes/hg19_hisat.tar.gz), a hierarchical FM index required by HISAT.
- Bambino, a variant caller that pools data from multiple input files.
- UCSC hg19 Illumina iGenome, whose genome annotation
genes.gtf
is used to provide a list of known introns to HISAT. - SAMTools for conversion to sorted, indexed BAM and mpileup.
- SRA Toolkit for downloading SRAs and converting them to raw FASTQs.
To install these dependencies on an Amazon EC2 instance running Ubuntu, enter
sh ec2_bootstrap.sh WORK
where WORK is a work directory on the instance with at least ~three times as much space as the sample FASTQs demand. The pipeline is run as follows.
a) Download and align data with align.py. Command-line parameters can be viewed by running
python align.py --help
An example alignment command run by the hackathon team is
python align.py -m /blast/rna/hackathon_v001_rnaseq/testset.txt
--hisat-idx /blast/rna/hg19_hisat/hg19_hisat
--gtf /blast/rna/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
--out /blast/rna/aligned/
--fastq-dump-exe /blast/rna/sratoolkit.2.4.2-ubuntu64/bin/fastq-dump
--num-processes 6 --gzip-output --temp /blast/rna/temp
--hisat-args "--trim3 10" 2>/blast/rna/hackathon_v001_rnaseq/6.log
Note that the manifest file described above is specified with the -m
parameter.
b) Run prepare_bam.pl to process the alignment files (SAM), remove the unmapped, low-quality or ambiguous reads (e.g. reads that map at multiple different locations). Command-line parameters can be viewed by running the script without parameters
perl prepare_bam.pl
An example command is
perl prepare_bam.pl -v /blast/rna/aligned/GSM823518.normal.SRR358994.sam.gz /blast/rna/BAM/output/
c) Run run_bambino.pl to generate the variant call table. Each line of the output contains a call variant at a particular location within the genome and the reference base at the same location. Command-line parameters can be viewed by running the script without parameters
perl run_bambino.pl
An example command is
perl run_bambino.pl /blast/rna/BAM/output/chr1_GSM823518.normal.bam /blast/rna/BAM/output/chr1_GSM823518.normal.bambino
d) Before start to map bambino output chromosome location to refGene, you need to download refFlat.txt(from UCSC http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/). Then, download and run mapping2gene.py. Inside the result file, header is "Gene name refGene name Chromosome Transcription start Transcription end Variant found at Alternate base Reference base".
An example command is
python mapping2gene.py /absolute/path/to/output/file(from run_bambino) /absolute/path/to/refFlat.txt
This will produce hit.txt and miss.txt. hit.txt has candidate variant info. which mapped to inside the trascription region of the gene. This process also makes count_mapping2gene.txt, which is a statistical table for every transition(A>C, A>G ,and etc) in each samples.