RNA-seq Analysis Pipeline

Repository for scripts that perform the following analyses of short paired-end RNA-seq reads:

  • Differential expression (DE)
  • Gene ontology (GO) term enrichment
  • Gene set enrichment (GSE)

Running Scripts

  • Set the input and output paths in the inputPaths.txt and outputPaths.txt files in the InputData directory.
  • Read the usage notes at the top of each script before you run it.

Running Scripts on Servers

  • To submit a BASH job script to the queue: qsub SCRIPTNAME.sh INPUT_1 ... INPUT_N
  • To view the jobs you have submitted and corresponding task ID numbers: qstat -u USERNAME
  • To delete a job from the queue: qdel TASKIDNUMBER
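
As a rough illustration of how the INPUT_1 ... INPUT_N values reach a submitted script, the sketch below shows a hypothetical job script, example_job.sh. The `#$` directive lines assume a Grid Engine style scheduler, which this README does not confirm; the script name, the job name, and the report_inputs function are invented for the example and are not part of this repository.

```shell
#!/bin/bash
# Hypothetical job script (example_job.sh), not part of this repository.
# The '#$' lines are Grid Engine directives: an ASSUMPTION about the scheduler.
# Plain bash treats them as ordinary comments.
#$ -N example_job
#$ -cwd
#$ -j y

# Print every input that followed the script name on the command line.
# The body runs in a subshell so the counter does not leak into the caller.
report_inputs() (
  i=1
  for arg in "$@"; do
    echo "INPUT_${i}=${arg}"
    i=$((i + 1))
  done
)

report_inputs "$@"
```

Under these assumptions, `qsub example_job.sh sampleA sampleB` would queue the script, and the two values would arrive inside it as `$1` and `$2`.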

Running Scripts Locally

bash SCRIPTNAME.sh INPUT_1 ... INPUT_N

Alternative Method of Running Scripts Locally

  • To make the script executable before running (this sets a file permission; it does not compile anything): chmod +x SCRIPTNAME.sh
  • To run the executable script: ./SCRIPTNAME.sh INPUT_1 ... INPUT_N
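
The two steps above can be tried safely on a throwaway file. The sketch below writes a small script into a temporary directory, marks it executable, and runs it directly; the file name demo.sh and the variable names are placeholders chosen for the example.

```shell
# Create a throwaway script in a temporary directory (demo.sh is a placeholder name).
workdir="$(mktemp -d)"
script="$workdir/demo.sh"
printf '#!/bin/bash\necho "received $# input(s): $*"\n' > "$script"

# Set the executable bit; without it, running the file directly is refused.
chmod +x "$script"

# Run the file directly; its shebang line selects bash as the interpreter.
demo_output="$("$script" INPUT_1 INPUT_2)"
echo "$demo_output"
```

Running a script with `bash SCRIPTNAME.sh` does not need the executable bit, which is why the earlier method works without the chmod step.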

Resources

Required Software

  • FastQC: A quality control tool for high-throughput raw sequence data. It generates quality reports for NGS data and gives pass/fail results for the following checks: per base sequence quality, per sequence quality scores, per base sequence content, per base GC content, per sequence GC content, per base N content, sequence length distribution, sequence duplication levels, overrepresented sequences, and Kmer content. It also has a graphical user interface.
  • Trimmomatic: A flexible read trimming tool for Illumina NGS data. It can trim adapter sequences and remove low-quality reads and bases.
  • HISAT2: A fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes as well as to a single reference genome. It builds on HISAT and Bowtie2 and uses a graph FM index (GFM) to index the genome before read mapping.
  • TopHat2: A spliced read mapper for RNA-Seq. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
  • Bowtie2: An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. Bowtie2 first extracts "seed" substrings in reads, aligns seeds in an ungapped way, and then performs extension in a gapped way.
  • Cufflinks: Assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. A protocol paper describes how to use it within a pipeline.
  • Cuffdiff: A tool for differential analysis of gene regulation at transcript resolution with RNA-seq. It estimates expression at transcript-level resolution and controls for variability across replicate libraries.
  • Samtools: Utilities for the Sequence Alignment/Map (SAM) format. Samtools provides multiple commands for processing SAM/BAM files. The subcommand "samtools flagstat" prints summary statistics for a SAM/BAM file based on the FLAG field of each record.
  • HTSeq-count: A tool that counts mapped reads for genomic features.
  • edgeR: Empirical Analysis of Digital Gene Expression Data. It performs differential expression analysis on raw read counts and implements a range of statistical methods based on the negative binomial distribution, including empirical Bayes estimation, exact tests, generalized linear models, and quasi-likelihood tests.
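
The FLAG field mentioned in the Samtools entry is an integer whose bits each record one property of an alignment, as defined in the SAM specification. The sketch below decodes a FLAG value into the properties that are set. The function name decode_flag and the short labels are chosen for this example only; this is an illustration of the bit layout, not how samtools itself is implemented.

```shell
# Decode a SAM FLAG integer into a comma-separated list of the bits that are set.
# Labels are listed in bit order, starting at 0x1 (paired) and ending at 0x800
# (supplementary). The body runs in a subshell so its variables stay private.
decode_flag() (
  flag=$1
  bit=1
  out=""
  for name in paired proper_pair unmapped mate_unmapped reverse mate_reverse \
              read1 read2 secondary qc_fail duplicate supplementary; do
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
    bit=$(( bit * 2 ))
  done
  echo "$out"
)

decode_flag 99   # prints: paired,proper_pair,mate_reverse,read1
decode_flag 4    # prints: unmapped
```

A FLAG of 99 is 1 + 2 + 32 + 64, which is why it decodes to a properly paired first read whose mate maps to the reverse strand. Summaries such as the number of mapped or properly paired reads come from tallying these bits across all records.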