/RNAseq_analysis_scripts

Scripts for RNA seq analysis

Primary LanguageRGNU General Public License v3.0GPL-3.0

RNAseq_analysis_scripts

Scripts for RNA seq analysis. They can be used directly using the outputs from [IARCbioinfo's RNA seq workflow].(https://github.com/IARCbioinfo/RNAseq-nf)

Unsupervised analysis: RNAseq_unsupervised.R

This script performs unsupervised analyses (Principal Component Analysis and clustering) from htseq-count outputs.

Prerequisites

This R script requires the following packages:

  • ConsensusClusterPlus
  • ade4
  • DESeq2
  • fpc

Usage

Rscript RNAseq_unsupervised.R [options]
PARAMETER DEFAULT DESCRIPTION
-f . folder with count files
-o out output directory name
-p count.txt pattern for count file names
-n 500 number of genes to use for clustering
-t auto count transformation method; 'rld', 'vst', or 'auto'
-c hc clustering algorithm to be passed to ConsensusClusterPlus
-l complete method for hierarchical clustering to be passed to ConsensusClusterPlus
-h Show help message and exit

For example, one can type

Rscript RNAseq_unsupervised.R -f input -p count -t rld -o output/ -n 500

Details

The script involves 3 steps

  • Data transformation using either variance-stabilization log2-tranform or the r-log tranform from package DESeq2
  • PCA of tranformed counts with package ade4
  • Clustering of transformed counts with package ConsensusClusterPlus

Output

  • a .RData file with 4 objects: transformed counts di, the ade4 pca object pca,the ConsensusClusterPlus object clusters, and the list of cluster and item consensus icl

PCA

  • PCA plots with the first 2 PC, with colors corresponding to the clusters from ConsensusClusterPlus, for K=1,2,...,6 clusters.
  • loadings plot for the first 2 PC
  • csv file with the genes with the greatest loadings in the first PC

Clustering

  • a folder with plots from ConsensusClusterPlus

Compare an unsupervised analysis with a list of variables: RNAseq_unsupervised_compare.R

This script compares the result of an unsupervised analyses (Principal Component Analysis and clustering) obtained for example using script RNAseq_unsupervised_compare.R with an arbitrary number of variables (categorical or continuous).

Prerequisites

This R script requires the following packages:

  • ade4
  • permute

Usage

Rscript RNAseq_unsupervised_compare.R [options]
PARAMETER DEFAULT DESCRIPTION
-R . .RData file with results from clustering in variable clusters and results from PCA in variable pca
-i . name of input file with variables in column and variable names as first line
-m 2 minimum number of clusters
-M 5 maximum number of clusters
-o out output file preffix
-h Show help message and exit

For example, one can type

Rscript RNAseq_unsupervised_compare.R -R RNAseq_unsupervise.RData -i variables.txt -o output/

Details

For each clustering present in variable cluster, the script involves 3 steps

  • Plotting the variables on the first two PC axes
  • Computing the best matching between clusters and the levels of the variable (note: if the variable is continuous, categories are formed by subdividing the range of the variable into intervals of the same size)
  • Testing the significance of the best matching by computing a null distribution of matchings across 1000 permutations

Output

For each column (i.e., variable) of the input table, a .pdf file with K rows, where K is the number of clusterings in variable cluster, and 3 columns:

  • Column 1 represents the first 2 PCs of the PCA; colored ellipses correspond to the clusters of the clusters variable from the .RData file, and point colors correspond to the levels of the variable
  • Column 2 represents the best matching between the clustering and the variable
  • Column 3 represents the null distribution of the best matching (gray), the observed best matching and the p-value of the best matching (red)

Supervised analysis: RNAseq_supervised.R

This script performs supervised analyses (Differential Expression Analysis) at the gene level from htseq-count outputs.

Prerequisites

This R script requires the following package:

  • DESeq2
  • gtools

Depending on the options used, the following packages are also required:

  • BiocParallel (with option -c)
  • IHW (with option -m)

Usage

Rscript RNAseq_supervised.R [options]
PARAMETER DEFAULT DESCRIPTION
-f . folder with count files
-g . file with sample groups
-o out output directory name
-p count.txt pattern for count file names
-c 1 number of cores for statistical computation
-q 0.1 False Discovery Rate
-m FALSE Use Independent Hypothesis Weighting for multiple-testing procedure
-h Show help message and exit

For example, one can type

Rscript RNAseq_supervised.R -f input -g groups.txt -o output/

Details

The script performs DE analysis of gene count data under a Poisson glm with package DESeq2. When multiple groups are present (e.g., A, B, and C), computes results for contrasts corresponding to all combinations of 2 groups (A vs B, A vs C, and B vs C).

Output

  • a plot of fold changes as a function of normalized counts
  • plots of normalized counts of the most significant DE gene, as a function of the group, for each pair of group levels
  • .csv files with gene names, fold changes, and p-values, for each combination of 2 groups
  • a .RData file with 1 object: the deseq2 results table

Supervised analysis: RNAseq_supervised_transcript.R

This script performs supervised analyses (Differential Expression Analysis) at the transcript level from StringTie outputs.

Prerequisites

This R script requires the following package:

  • ballgown
  • genefilter
  • dplyr

Usage

Rscript RNAseq_supervised_transcript.R [options]
PARAMETER DEFAULT DESCRIPTION
-f . folder with folders of sample input files
-g . file with sample groups
-o out output directory name
-p . pattern for input folder names
-t 1 Threshold variance in gene expression (FPKM)
-r Row names for group file
-c 2 Column index of covariable to use for regression; other columns are treated as adjustment variables
-h Show help message and exit

For example, one can type

Rscript RNAseq_supervised_transcript.R -f input -g groups.txt -o output/

Details

The script performs DE analysis of transcript FPKM data under a hierarchical linear model with package ballgown, optionally correcting for additional variables.

Output

  • plots of expression levels for each gene with DE transcripts
  • a .csv file with gene names, IDs, p-values, and q-values
  • a .RData file with 1 object: the ballgown object