RNASeq Pipeline

This pipeline has two files. pipeline.sh is a bash script that:

Maps the reads in the samples' fastq files onto the genome using hisat2
Sorts the BAM files using samtools and re-saves them as BAM files
Indexes the BAM files using samtools so that reads can be viewed in IGV
Creates QA files using samtools flagstat
Assembles transcript structures and quantitates their levels using stringtie, focusing on known genes
Stringtie then expands the transcript set to include transcripts that have experimental evidence but are not in the RefSeq set. It is this expanded set that is used for Salmon.
All transcripts, including the new ones, are quantitated with stringtie. This is only for comparison with Salmon.
A FASTA file of all known and newly established transcripts is created with gffread
Salmon indexes these transcript sequences
salmon quant is run to estimate the TPM level of each transcript in each sample
post-process-salmon.R is run to gather TPM levels from all salmon quant.sf files and sum them across transcripts of the same gene.
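
The steps above can be sketched as a sequence of tool invocations. This is a minimal illustration and not the contents of pipeline.sh: the sample names, directory layout, variable names, paired-end FASTQ naming, and the choice of flags are all assumptions, and the helper functions (run, flagstat_to_file, write_mergelist) are hypothetical. By default the sketch only prints the commands (DRY_RUN=1), so it can be read and tried without any of the tools installed.

```shell
#!/usr/bin/env bash
# Sketch only: all paths, sample names, and defaults below are hypothetical.
set -eu

DRY_RUN="${DRY_RUN:-1}"                           # 1 = print commands, 0 = execute them
THREADS="${THREADS:-4}"
GENOME_INDEX="${GENOME_INDEX:-index/genome}"      # hisat2 index prefix (assumed)
GENOME_FASTA="${GENOME_FASTA:-genome/genome.fa}"  # genome sequence (assumed)
REF_GTF="${REF_GTF:-genes/refseq.gtf}"            # known-gene annotation (assumed)
OUT="${OUT:-out}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Wrappers so that output redirections also respect dry-run mode.
flagstat_to_file() { samtools flagstat "$1" > "$2"; }
write_mergelist() {
    local list="$1" s; shift
    for s in "$@"; do printf '%s\n' "$OUT/$s.gtf"; done > "$list"
}

# Per-sample steps: align, sort, index, QA, assemble (paired-end reads assumed).
align_and_assemble() {
    local s="$1"
    run hisat2 -p "$THREADS" --dta -x "$GENOME_INDEX" \
        -1 "fastq/${s}_1.fastq.gz" -2 "fastq/${s}_2.fastq.gz" -S "$OUT/$s.sam"
    run samtools sort -@ "$THREADS" -o "$OUT/$s.bam" "$OUT/$s.sam"
    run samtools index "$OUT/$s.bam"
    run flagstat_to_file "$OUT/$s.bam" "$OUT/$s.flagstat.txt"
    run stringtie -p "$THREADS" -G "$REF_GTF" -o "$OUT/$s.gtf" "$OUT/$s.bam"
}

# Cross-sample steps: merge assemblies, extract transcript sequences, index them.
merge_and_index() {
    run write_mergelist "$OUT/mergelist.txt" "$@"
    run stringtie --merge -p "$THREADS" -G "$REF_GTF" \
        -o "$OUT/merged.gtf" "$OUT/mergelist.txt"
    run gffread -w "$OUT/transcripts.fa" -g "$GENOME_FASTA" "$OUT/merged.gtf"
    run salmon index -t "$OUT/transcripts.fa" -i "$OUT/salmon_index"
}

# Per-sample quantification against the merged transcript set.
quantify() {
    local s="$1"
    run mkdir -p "$OUT/ballgown/$s"
    run stringtie -e -B -p "$THREADS" -G "$OUT/merged.gtf" \
        -o "$OUT/ballgown/$s/$s.gtf" "$OUT/$s.bam"
    run salmon quant -i "$OUT/salmon_index" -l A -p "$THREADS" \
        -1 "fastq/${s}_1.fastq.gz" -2 "fastq/${s}_2.fastq.gz" -o "$OUT/salmon/$s"
}

main() {
    local s
    run mkdir -p "$OUT"
    for s in "$@"; do align_and_assemble "$s"; done
    merge_and_index "$@"
    for s in "$@"; do quantify "$s"; done
}

main sampleA sampleB
```

Running the sketch as written prints one line per command. Setting DRY_RUN=0 would execute the commands instead, which requires hisat2, samtools, stringtie, gffread, and salmon on the PATH and input files at the assumed locations.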

This pipeline is a modified version of one written by the Perteas in the Salzberg lab. To run the pipeline, it is assumed that the following directory structure exists and contains certain files. The CONFIG.sh file is used to set many of the otherwise hard-coded parameters.
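
As an illustration of how a settings file such as CONFIG.sh can feed a bash pipeline, the sketch below sources a small config and checks that the required values are present before any work starts. It does not reproduce the real CONFIG.sh: every variable name and value (NUMCPUS, GENOME_INDEX, REF_GTF, FASTQ_DIR) and the require_vars helper are hypothetical, and the config is written to a temporary file only to keep the example self-contained.

```shell
#!/usr/bin/env bash
# Sketch only: every variable name and value here is hypothetical.
set -eu

# Stand-in for CONFIG.sh, written to a temp file so the example is self-contained.
config="$(mktemp)"
cat > "$config" <<'EOF'
NUMCPUS=8
GENOME_INDEX="./ref/genome_index"
REF_GTF="./ref/refseq.gtf"
FASTQ_DIR="./fastq"
EOF

# Fail early, naming the first required setting that is unset or empty.
require_vars() {
    local name value
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "CONFIG error: $name is not set" >&2
            return 1
        fi
    done
}

# Pipeline side: load the settings, then validate them before doing any work.
. "$config"
require_vars NUMCPUS GENOME_INDEX REF_GTF FASTQ_DIR
echo "Using $NUMCPUS threads with annotation $REF_GTF"
rm -f "$config"
```

Keeping the parameters in one sourced file means the pipeline script itself does not need editing when paths or thread counts change, and the up-front check turns a missing setting into an immediate, named error instead of a failure partway through a long run.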

keslingmj/RNASeq-Pipeline