RNASeq Pipeline
This pipeline consists of two files. pipeline.sh is a bash script that:
- Maps the reads in the samples' fastq files onto the genome using hisat2
- Sorts the bam files with samtools and saves the sorted output as bam files
- Indexes the bam files using samtools for viewing reads in the IGV viewer
- Creates QA files using samtools flagstat
- Assembles transcript structures and quantitates their expression levels using stringtie, focusing on known genes
- Expands the transcript set with stringtie to include transcripts that have experimental evidence but are not in the RefSeq set. This expanded set is the one used for Salmon.
- Quantitates all transcripts, including the new ones. This is only for comparison with Salmon.
- Creates a fasta file of all known and newly established transcript structures with gffread
- Indexes these transcript sequences with Salmon
- Runs salmon quant to estimate the TPM level of each transcript in each sample
- Runs post-process-salmon.R to gather TPM levels across all salmon quant.sf files and sum them across transcripts of the same gene
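
The last step above, summing transcript-level TPM to gene level, can be illustrated with a minimal shell sketch. It is not the contents of post-process-salmon.R and it handles a single sample only. The function name sum_tpm_by_gene and the two-column, tab-separated mapping file tx2gene.tsv (transcript ID, gene ID) are assumptions made for illustration. The only thing taken from Salmon itself is the quant.sf layout: a header line, then the columns Name, Length, EffectiveLength, TPM, NumReads.

```shell
# Sketch only: sum transcript-level TPM to gene-level TPM for ONE quant.sf file.
# $1 = hypothetical transcript-to-gene map (transcript_id<TAB>gene_id, non-empty)
# $2 = a salmon quant.sf file
sum_tpm_by_gene() {
    awk -F'\t' '
        NR == FNR { gene[$1] = $2; next }      # first file: load the transcript -> gene map
        FNR == 1  { next }                     # second file: skip the quant.sf header line
        {
            g = ($1 in gene) ? gene[$1] : $1   # unmapped transcripts keep their own ID
            tpm[g] += $4                       # column 4 of quant.sf is TPM
        }
        END { for (g in tpm) printf "%s\t%.6f\n", g, tpm[g] }
    ' "$1" "$2" | LC_ALL=C sort                # sort for a stable, reproducible order
}

# Example call (both paths are hypothetical):
# sum_tpm_by_gene tx2gene.tsv salmon_out/sample1/quant.sf > sample1.gene_tpm.tsv
```

Running this once per sample and joining the outputs on the gene ID column would give a gene-by-sample TPM table, which is roughly what the R post-processing step is described as producing.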
This pipeline is a modified version of one written by the Perteas in the Salzberg lab. To run it, the following directory structure, containing certain files, is assumed to exist. The CONFIG.sh file is used to set many of the otherwise hard-coded parameters.