rna-seq

Pipeline for RNA-seq scripts used by the Essigmann Lab.

Part I: Pre-Process, Align, and Assemble

Download genome from UCSC Genome Browser: chromFa.tar.gz
Unzip FASTA: tar -xvzf chromFa.tar.gz
Remove mitochondrial chromosome and other noncanonical chromosomes (chr#_#########_random) from directory
Compile chromosomal FASTA files to single file: cat *.fa > mm10.fa
If necessary, modify FASTA file to match naming convention for GTF file: sed -i 's/chr//g' mm10.fa
Index reference: hisat2-build -f mm10.fa mm10

Download GTF from Ensembl: Mus_musculus.GRCm38.93.gtf.gz
Unzip GTF: tar -xzvf Mus_musculus.GRCm38.93.gtf.gz
(Optional) Rename file: mv Mus_musculus.GRCm38.93.gtf mm10.gtf
Format known splice junctions to format used by HISAT2: hisat2_extract_splice_sites.py mm10.gtf > mm10.gtf.ss

Trim adapter sequences and ends: trimmomatic-0.38.jar SE $seq.fastq ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:30 LEADING:30 TRAILING:30 MINLEN:25

Map to whole genome, accounting for known splice sites: hisat2 --dta -x ref/mm10 --known-splicesite-infile ref/mm10.gtf.ss -U $trimmed.fastq -S $sample.hisat2.sam
Convert to BAM: samtools view -bS $sample.hisat2.sam > $sample.hisat2.unsorted.bam
Sort BAM file: samtools sort -o $sample.hisat2.bam $sample.hisat2.unsorted.bam

Estimate abundances for differential expression analysis: stringtie -e -B -G ref/mm10.gtf -A $sample\_abund.tab -o $directory/$sample/$sample.gtf $sample.hisat2.bam
- Note: This is considered StringTie's "alternate" workflow, relying on a well-annotated reference; it will not search for novel isoforms. Suggested by the StringTie creator here.
- Historical note: Originally the pipeline used StringTie's recommended workflow, but identifying gene names caused trouble as many gene_id values were given MSTRG assignments. StringTie author made note of it here.

Download Python script (prepDE.py) provided by StringTie developers
Run script to extract read count information from StringTie outputs: python2.7 prepDE.py -l 40 [-i $directory]
- Note: Without the -i parameter, this assumes default directory structure created by StringTie, with a ballgown folder in the working directory. In our case, use the -i parameter to denote the directory where outputs are contained.
- Note: The script requires a Python version between 2.7 and 3.
- Note: The -l parameter takes in average read length. While this doesn't affect relative transcript levels, it will impact your absolute values! The default parameter for -l is 75.

R environment must have the following Bioconductor packages installed: DESeq2, org.Mm.eg.db, and biomaRt.
- If not installed, install Bioconductor: source("https://bioconductor.org/biocLite.R")
- Install packages: biocLite("DESeq2"); biocLite("org.Mm.eg.db"); biocLite("biomaRt")
- Call each library in the R environment with the library() function.
Read in expression gdata calculated by StringTie: pheno_data <- read.table("180711Ess_phenodata.csv", header=TRUE, sep=","); count_data <- read.table("gene_count_matrix.csv", header=TRUE, sep=",", row.names=1)
Retrieve gene IDs and gene names by creating a biomaRt instance: ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl")
Create key-value pairs for Ensembl IDs to gene names: gene_names <- getBM(c("ensembl_gene_id", "external_gene_name"), filters="ensembl_gene_id", values=rownames(count_data), mart=ensembl); colnames(gene_names) <- c("ensembl_id", "gene_name"); gene_set <- setNames(as.list(gene_names$gene_name), as.list(gene_names$ensembl_id))
Read in all data to a DESeqDataSet: dds <- DESeqDataSetFromMatrix(countData=count_data, colData=pheno_data, design~1)
Filter to remove low-count genes: keep.rows <- rowSums(counts(dds)) >= 10; dds <- dds[keep.rows,]

Use in-house R function dea.group to query between either sexes or treatment groups. For sex differences, the function requires the DESeq data structure and experimental treatment group to query; TCPOBOP addition changes require the DESeq data structure,the base group (i.e. non-TCPOBOP treatment), and the sex to query.
- Conduct differential gene analysis on DMSO-treated mice, by sex: dds.sex_dmsona <- dea.group(dds, "sex", "DMSO")
- Conduct differential gene analysis on DMSO-treated males, by TCPOBOP addition: dds.tc_dmsom <- dea.group(dds, "TCPOBOP", "DMSO", "M")