/taffeta

RNAseq/DGE analysis pipeline

Primary LanguageHTMLMIT LicenseMIT

taffeta

Reproducible analysis and validation of RNA-Seq data

Authors: Maya Shumyatcher, Mengyuan Kan, Blanca Himes.

Overview

The goal of taffeta is to preform reproducible analysis and validation of RNA-Seq data, as a part of RAVED pipeline:

  • Download SRA .fastq data
  • Perform preliminary QC
  • Align reads to a reference genome
  • Perform QC on aligned files
  • Create a report that can be used to verify that sequencing was successful and/or identify sample outliers
  • Perform differential expression of reads aligned to transcripts according to a given reference genome
  • Create a report that summarizes the differential expression results

Generate LSF scripts in each step for HPC use.

Bioinformatic Tools

  • Alignment and quantification: STAR, HTSeq, kallisto (older versions used bowtie2, tophat, cufflinks, cummerbund, whose options are no longer available).
  • Differential expression (DE) analysis: R packages DESeq2 and sleuth
  • QC: fastqc, trimmomatic, samtools, bamtools, picardtools.
  • Pipeline scripts: the Python scripts make use of various modules including subprocess, os, argparse, sys.
  • R markdown scripts for summary reporot: Require various R libraries such as DT, gplots, ggplot2, rmarkdown, RColorBrewer, dplyr, ginefilter, biomaRt. Note that the current RMD scripts require pandoc version 1.12.3 or higher to generate HTML report.

Prerequisite Files

  • Spesis-specific genome reference files: reference genome fasta, gtf, refFlat, and index files. As our example uses ERCC spike-ins, the reference files include ERCC mix transcripts.
  • Adapter and primer sequences: a list of adapter and primer sequences for different library preparation kits is provided. For fastqc, replace . For trimming we provide adapter and primer sequences for the following types: Ilumina TruSeq single index, Illumina unique dual (UD) index adpter and PrepX. Users can tailor this file by adding sequences from other protocols.

Workflow

Download data from GEO/SRA

Run pipeline_scripts/rnaseq_sra_download.py to download .fastq files from SRA. Read in template_files/rnaseq_sra_download_Rmd_template.txt from specified directory template_dir to create a RMD script. Ftp addresses for corresponding samples are obtained from SRA SQLite database using R package SRAdb.

rnaseq_sra_download.py --geo_id GEO_Accession --path_start output_path --project_name output_prefix --template_dir templete_file_directory --fastqc

bsub < project_name_download.lsf

The option --pheno_info refers to using user provided SRA ID for download which is included in the SRA_ID column in the provided phenotype file. If the phenotype file is not provided, use phenotype information from GEO. SRA_ID is retrieved from the field relation.1.

The option --fastqc refers to running FastQC for downloaded .fastq files.

Output files:

Fastq files downloaded from SRA are saved in the following directory.

path_start/project_name_SRAdownload/

Fastqc results of corresponding sample are saved in:

path_start/sample_name/fastq_name_fastqc.zip

The RMD and corresponding HTML report files:

path_start/project_name_SRAdownload/project_name__SRAdownload_RnaSeqReport.Rmd path_start/project_name_SRAdownload/project_name__SRAdownload_RnaSeqReport.html

GEO phenotype file generated by RMD reports:

path_start/project_name_SRAdownload/GEO_Accession_withoutQC.txt

SRA information file generated by RMD reports:

path_start/project_name_SRAdownload/project_name_sraFile.info

User-tailored phentoype file

The sample info file used in the following steps should be provided by users.

Required columns:

  • 'Sample' column containing sample ID
  • 'Status' column containing variables of comparison state
  • 'R1' and/or 'R2' columns containing full paths of .fastq files

Other columns:

'Treatment', 'Disease', 'Donor' (i.e. cell line ID if in vitro treatment is used), 'Tissue', 'ERCC_Mix' (i.e. ERCC mix ID if ERCC spike-in sample is used), 'protocol' designating sample preparation kit information.

'Index' column contains index sequence for each sample. If provided, trim raw .fastq files based on corresponding adapter sequences.

If use data from GEO, most GEO phenotype data do not have index information. However, FastQC is able to detect them as "Overrepresented sequences". Users can tailor the 'Index' column based on FastQC results. We provide a file with most updated adapter and primer sequences for FastQC detection.

An example phenotype file can be found here: example_files/SRP033351_Phenotype_withoutQC.txt. Note that column naming is rigid for the following columns: 'Sample', 'Status', 'Index', 'R1', 'R2', 'ERCC_Mix', 'Treatment', 'Disease', 'Donor', because pipeline scripts will recognize these name strings, but the column order can be changed.

Alignment, quantification and QC

  1. Run pipeline_scripts/rnaseq_align_and_qc.py to: 1) trim adapter and primer sequences if index information is available, 2) run FastQC for (un)trimmed .fastq files, 3) align reads and quantify reads mapped to genes/transcripts, and 5) obtain various QC metrics from .bam files.

Edit pipeline_scripts/rnaseq_userdefine_variables.py with a list user-defined variables (e.g. paths of genome reference file, paths of bioinformatics tools, versions of bioinformatics tools), and save the file under an executable search path.

If perform adapter trimming, read in template_files/rnaseq_adapter_primer_sequences.txt from specified directory template_dir used as a reference list of index and primer sequences for various library preparation kits.

rnaseq_align_and_qc.py --project_name output_prefix --sample_in sample_info_file.txt --aligner star --ref_genome hg38 --librar_type PE --index_type truseq_single_index --strand nonstrand --path_start output_path --template_dir templete_file_directory

for i in *.lsf; do bsub < $i; done

The "--aligner" option indicates aligner should be used (default: star).

The "--ref_genome" option refers to using selected version of genome reference.

The "--library_type" option refers to PE (paired-end) or SE (single-end) library.

The "--index_type" option refers to index used in sample library preparation.

The index types provided in template_files/rnaseq_adapter_primer_sequences.txt are:

  • truseq_single_index (TruSeq Single Indexes)
  • illumina_ud_sys1 (Illumina UD indexes for NovaSeq, MiSeq, HiSeq 2000/2500)
  • illumina_ud_sys2 (Illumina UD indexed for MiniSeq, NextSeq, HiSeq 3000/4000)
  • prepX (PrepX for Apollo 324 NGS Library Prep System)
  • Nextera_DU (Nextera XT indexes for MiniSeq, NextSeq, HiSeq 3000/4000)
  • NebNext_DU (NebNext multiplex oligos for MiniSeq, NextSeq, HiSeq 3000/4000)

template_files/rnaseq_adapter_primer_sequences.txt contains four columns (i.e. Type, Index, Description, Sequence). Sequences in the Index column is used to match those in Index column in sample info file. This column naming is rigid.

The list is based on the following resources:

If users provide new sequences, add the new index type in the 1st column 'Type' and specify it in "--index_type".

The "--strand" option refers to sequencing that captures both strands (nonstrand) or the 1st synthesized strand (reverse) or the 2nd synthesized strand (forward) of cDNA. If the 2nd strand is synthesized using dUTP, this strand will extinct during PCA amplification, thus only 1st (reverse) strand will be sequenced.

Read sample preparation protocal carefully. Reads not in the specified strand will be discarded. Double check proprotion of reads mapped to no feature category in QC report. If a lot of reads are mapped to 'no feature', the strand option setting is likely incorrect.

Output files:

Various output files will be written for each sample in directories structured as:

path_start/sample_name/sample_name_R1_Trimmed.fastq
path_start/sample_name/sample_name_R2_Trimmed.fastq
path_start/sample_name/sample_name_R1_Trimmed_fastqc.zip
path_start/sample_name/sample_name_R2_Trimmed_fastqc.zip
path_start/sample_name/sample_name_ReadCount
path_start/sample_name/aligner_out
path_start/sample_name/quantification_tool_out

  1. Run pipeline_scripts/rnaseq_align_and_qc_report.py to create an HTML report of QC and alignment summary statistics for RNA-seq samples. Read in template_files/rnaseq_align_and_qc_report_Rmd_template.txt from specified directory template_dir to create a RMD script.

If ERCC_Mix column exists in phenotype file, it will report the concordance between ERCC spike-in transcript-level read counts and its molecular concentrations. Read in ERCC molecular concentration file template_files/ERCC_SpikeIn_Controls_Analysis.txt from specified directory template_dir which can be downloaded here.

rnaseq_align_and_qc_report.py --project_name output_prefix --sample_in sample_info_file.txt --aligner star --ref_genome hg38 --library_type PE --strand nonstrand --path_start output_path --template_dir templete_file_directory

bsub < project_name_qc.lsf

Output files:

This script uses the many output files created in step 1), converts these sample-specific files into matrices that include data for all samples, and then creates an Rmd document.

The report and accompanying files are contained in:

path_start/project_nameAlignment_QC_Reportaligner/

The RMD and corresponding HTML report files:

path_start/project_nameAlignment_QC_Reportaligner/project_name_QC_RnaSeqReport.Rmd path_start/project_nameAlignment_QC_Reportaligner/project_name_QC_RnaSeqReport.html

Gene-based differential expression analysis - htseq-count/DESeq2

Run pipeline_scripts/rnaseq_de_report.py to perform DE analysis and create an HTML report of differential expression summary statistics. Read in template_files/rnaseq_deseq2_Rmd_template.txt from specified directory template_dir to create a RMD script.

rnaseq_de_report.py --project_name output_prefix --sample_in sample_info_file_withQC.txt --comp sample_comp_file_withQC.txt --de_package deseq2 --ref_genome hg38 --path_start output_path --template_dir templete_file_directory

bsub < project_name_deseq2.lsf

The "--sample_in" option specifies user provided phenotype file for DE analysis (e.g. example_files/SRP033351_Phenotype_withQC.txt). The columns are the same as example_files/SRP033351_Phenotype_withoutQC.txt but with an additional column "QC_Pass" designating samples to be included (QC_Pass=1) or excluded (QC_Pass=0) after QC. This column naming is rigid which will be recoganized in pipeline scripts, but column order can be changed.

The "--comp" option specifies comparisons of interest in a tab-delimited text file with one comparison per line with three columns (i.e. Condition1, Condition0, Design), designating Condition1 vs. Condition2. The DE analysis accommodates a "paired" or "unpaired" option specified in Design column. For paired design, specify the condition to correct for that should match the column name in the sample info file - e.g. paired:Donor. Note that if there are any samples without a pair in any given comparison, the script will automatically drop these samples from that comparison, which will be noted in the report.

Find the example comp file here example_files/SRP033351_comp_file.txt.

Output files:

The pairwise DE results and normalized counts for all samples and samples from pairwise comparisons are contained in:

path_start/project_name/project_namedeseq2_out/ project_nameCondition1_vs_Condition2_full_DESeq2_results.txt project_name_Condition1_vs_Condition2_counts_normalized_by_DESeq2.txt project_name_counts_normalized_by_DESeq2.txt

The RMD and corresponding HTML report file:

path_start/project_name_deseq2_out/project_name_DESeq2_Report.Rmd path_start/project_name_deseq2_out/project_name_DESeq2_Report.html

Transcript-based differential expression analysis - kallisto/sleuth

Updating...

Acknowledgements

This set of scripts was initially developed to analyze RNA-Seq and DGE data at the Partners Personalized Medicine PPM. Barbara Klanderman is the molecular biologist who led the establishment of PPM RNA-seq lab protocols and played an essential role in determining what components of the reports would be most helpful to PPM wet lab staff.