
The project explores RNA-Seq analysis pipelines and their scripts. A comparison report between the pipelines, with an outline of the preferred tools, is provided at the end of the analysis.


RNA-Seq-mini-project

RNA-seq (RNA sequencing) is a technique that examines the quantity and sequences of RNA in a sample using next-generation sequencing (NGS). Over the past few years, RNA-seq has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. It is rapidly replacing gene expression microarrays in many labs because it lets you quantify, discover, and profile RNAs.

Several tools and pipelines exist for RNA-Seq data analysis, and different consortia and institutions use different sets of guidelines and standards for their data analysis. H3ABioNet has developed a standard SOP and guidelines for RNA-Seq data analysis, with recommendations for gene expression analysis in humans.

In this repo, we document RNA-seq data analysis following these guidelines developed by H3ABioNet. The data used in this project are available here.

Roadmap

Phase I (Pre-processing analysis), Time: 1 week.

Tools: FastQC v0.11.9, Trimmomatic v0.39, Cutadapt v2.8

  • Download raw reads
  • Check quality of the raw reads
  • Adapter removal and quality trimming
  • Quality recheck

Phase II (Gene Expression Analysis), Time: 3 weeks.

Generate gene/transcript level counts

Tools: Kallisto v0.46.2, Salmon v0.12.0, HISAT2 v2.1.0, featureCounts v2.0.0

  • Align reads to reference genome
  • Generate estimated counts using pseudo-alignment approach
  • Collecting and tabulating alignment stats

Phase III (R Analysis), Time: 2 weeks.

Tools: DESeq2 v3.12, edgeR v3.12

  • QC and outlier removal / Batch detection.
  • Answer the general questions of the project
  • Wrap-up

Report Generation (1 week)

  • Comparison of outputs from each tool in each processing step.

Setup

Create conda environment

$ conda create --name [environ_name]

Activate conda environment

$ conda activate [environ_name]

Install tools

$ conda install [toolname] -c bioconda

Tools:

Tool name     | Version | Use
FastQC        | 0.11.9  | Check the quality of the reads
Trimmomatic   | 0.39    | Trim adapter remnants and low-quality reads
Kallisto      | 0.46.2  | Pseudo-alignment and gene counts
featureCounts | 2.0.0   | Perform gene counts
Salmon        | 0.12.0  | Pseudo-alignment and gene counts
Cutadapt      | 2.8     | Trim adapter remnants and low-quality reads
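
For convenience, all the tools above can be installed into a single environment in one step. The following is only a minimal sketch, assuming the bioconda and conda-forge channels are reachable and that these exact versions are still packaged for your platform; the environment name rnaseq is arbitrary, and featureCounts is provided by the subread package:

# Create one environment with all pinned tools (HISAT2 included for the alignment pipeline)
$ conda create --name rnaseq -c bioconda -c conda-forge \
      fastqc=0.11.9 trimmomatic=0.39 cutadapt=2.8 \
      kallisto=0.46.2 salmon=0.12.0 hisat2=2.1.0 subread=2.0.0
$ conda activate rnaseq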

R-Analysis

Package      | Use
DESeq2       | Analyse count data and test for differential expression
rhdf5        | Read the abundance.h5 files
tximport     | Import the abundance.h5 files
pheatmap     | Draw clustered heatmaps
RColorBrewer | Ready-to-use color palettes for heatmaps
tximportData | Provides example output from running Kallisto

Workflow:

Phase 1

  • Download raw reads
  • Quality check of the raw reads
  • Adapter removal and quality trimming
  • Quality recheck

Phase 2

  • Alignment
  • Transcript/gene counts
  • Collect and tabulate statistics

Phase 3

  • Statistical analysis:
    • QC check
    • Outlier removal and normalization
    • Differential expression

How to use the provided scripts for analysis

Hisat pipeline

  • The documents are found here

  • First, put your raw reads and metadata in one directory. On an HPC system, make sure you module load all the tools required for this pipeline. Begin by checking the quality of your reads with FastQC; the reports tell you which reads need trimming. Reads that need trimming to remove low-quality bases, or that are shorter than your preferred length, are trimmed with Trimmomatic. This was done using this script (a command-line sketch of the full pipeline follows this list).

  • Alignment of the reads requires a reference genome, which is used to create an index for the alignment. Using wget you can download the FASTA file corresponding to your reads, then use HISAT2 to build the indices and align the reads. This was done using this script.

  • When using HISAT2, the counts are obtained with a separate tool. In this pipeline, we used featureCounts to count the reads that aligned to the indices created from the reference genome. The counting was done using this script.

  • The counts obtained from featureCounts were used for statistical analysis in R with DESeq2. The statistical analyses are contained in the DESeq2 Rmd.
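
As an orientation, here is a minimal command-line sketch of the HISAT2 pipeline for a single paired-end sample. It is not the exact content of the provided scripts: the file names (sample_R1.fastq.gz, the TruSeq3-PE.fa adapter file, the Homo_sapiens.GRCh38.101.gtf annotation) and the trimming/counting parameters are illustrative assumptions.

# Phase I: quality check of the raw reads
$ mkdir -p fastqc_raw
$ fastqc -o fastqc_raw sample_R1.fastq.gz sample_R2.fastq.gz

# Trim adapter remnants and low-quality bases with Trimmomatic (paired-end mode)
$ trimmomatic PE sample_R1.fastq.gz sample_R2.fastq.gz \
      sample_R1.paired.fq.gz sample_R1.unpaired.fq.gz \
      sample_R2.paired.fq.gz sample_R2.unpaired.fq.gz \
      ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:25

# Phase II: build a HISAT2 index from the reference genome, then align the trimmed reads
$ hisat2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa GRCh38_index
$ hisat2 -x GRCh38_index -p 4 \
      -1 sample_R1.paired.fq.gz -2 sample_R2.paired.fq.gz \
      -S sample.sam

# Count reads overlapping annotated genes with featureCounts (-p: count paired-end fragments)
$ featureCounts -p -T 4 -a Homo_sapiens.GRCh38.101.gtf -o counts.txt sample.sam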

Salmon Pipeline

The scripts are found here.

Data

Place the raw data/reads from the sequencer, the metadata (downloaded from here), and the reference genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, downloaded here) in the same directory as the scripts.

Phase I (Pre-processing)

The quality control check was done using the fastqc_quality_check.sh script.

Data cleaning involves removal of adapter remnants, short reads, and low-quality bases. The Cutadapt trimming tool was preferred, and the cutadapt.sh script was used.

A quality recheck after trimming is necessary to examine how well the data were cleaned; this was achieved with the fastqc_quality_recheck.sh script.
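
As a rough illustration of these three steps for a single paired-end sample (the file names, adapter sequence, and cutoffs below are assumptions for illustration, not the parameters fixed in the provided scripts):

# Quality check of the raw reads
$ mkdir -p fastqc_raw fastqc_trimmed
$ fastqc -o fastqc_raw sample_R1.fastq.gz sample_R2.fastq.gz

# Remove adapter remnants and low-quality bases with Cutadapt (paired-end mode)
# -q 20 trims bases below Q20; -m 25 drops reads shorter than 25 bp after trimming
$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20 -m 25 \
      -o sample_R1.trimmed.fastq.gz -p sample_R2.trimmed.fastq.gz \
      sample_R1.fastq.gz sample_R2.fastq.gz

# Recheck quality after trimming
$ fastqc -o fastqc_trimmed sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz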

Phase II (Gene Expression Analysis)

This phase involves alignment of reads, gene counts, and tabulation of the statistics; the salmon.sh script was used (a command sketch is shown below).
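
A minimal sketch of the Salmon steps for one sample, assuming a transcriptome (cDNA) FASTA is used for indexing and illustrative file names; the exact options are in salmon.sh:

# Build a Salmon index from the reference transcriptome (cDNA FASTA)
$ salmon index -t Homo_sapiens.GRCh38.cdna.all.fa.gz -i salmon_index

# Quantify one paired-end sample; -l A auto-detects the library type
$ salmon quant -i salmon_index -l A \
      -1 sample_R1.trimmed.fastq.gz -2 sample_R2.trimmed.fastq.gz \
      -p 4 -o salmon_quant/sample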

Phase III (Statistical analysis/Differential Expression)

The edgeR package was used for normalization, statistical analysis, and visualization of the gene counts, using the EdgeR_Analysis_script.Rmd script. The generated HTML document for the edgeR analysis can be accessed here.

Kallisto Pipeline

The scripts can be accessed from this repo or here.

Data

The code for downloading the required data is included in the scripts; because the data are large, the download takes time. If you have already downloaded the data as described above, you can comment out the download commands in the scripts, or, if you wish to obtain the raw data separately, use the links below.

The raw reads and metadata can be downloaded from here in case you did not download them for the pipelines above. The reference genome can also be downloaded from here.

Phase I (Pre-processing)

The quality control check was done using FastQC, which informed the data-cleaning parameters in the next step. Data cleaning involves removal of adapter remnants, short reads, and low-quality bases. The Trimmomatic trimming tool was preferred in this pipeline.

A quality recheck after trimming is necessary to examine how well the data were cleaned; this was also done with FastQC.

The combined script for FastQC and Trimmomatic, with details, is the fastqc-trimmomatic.sh script.

Phase II (Gene Expression Analysis)

This phase involves alignment of reads, gene counts, and tabulation of the statistics; the kallisto.sh script was used to achieve this (a command sketch is shown below).
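
A minimal sketch of the Kallisto steps for one sample (file names are illustrative assumptions; the exact options are in kallisto.sh). The quantification step writes the abundance.h5 file that is later imported into R with tximport:

# Build a Kallisto index from the reference transcriptome (cDNA FASTA)
$ kallisto index -i kallisto_index.idx Homo_sapiens.GRCh38.cdna.all.fa.gz

# Pseudo-align and quantify one paired-end sample
# (writes abundance.h5 and abundance.tsv into the output directory)
$ kallisto quant -i kallisto_index.idx -o kallisto_quant/sample -t 4 \
      sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz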

Phase III (Statistical analysis/Differential Expression)

The DESeq2 package was used for normalization, statistical analysis, and visualization of the gene counts, using the kallisto_Deseq_analysis.Rmd script.

The generated HTML document for the DESeq2 analysis can be accessed here.

Conclusion

The final analysis report is available here