/arpir

Automatic RNA-Seq Pipelines with Interactive Report

Primary LanguageRGNU General Public License v3.0GPL-3.0

ARPIR (Automatic RNA-Seq Pipelines with Interactive Report)

Introduction

ARPIR makes RNA-Seq analysis: quality control, pre-processing, alignment (primary-analysis), transcript quantification and differential expression analysis on BAM files (secondaty-analysis). Given the input files and the working directory, the pipeline is completely automated. First, quality control on FastQ files is performed with FastQC and FastQ Screen. FastQC makes quality control and creates one report for sample. FastQ Screen estimates approximately the percentage of reads that can be mapped on genomes other than human, like ribosomal genomes, PhiX genome and mouse genome. This allows to evaluate the presence of contaminating genomes. Pre-processing follows quality control: the reads are aligned on PhiX genome and ribosomal genome to eliminate contaminations. Alignment can be performed with TopHat2, HISAT2 or STAR; in the first case quantification is performed with Cufflinks and DEA with cummeRbund, in the second case quantification is performed with featureCounts and DEA with DESeq2 or edgeR, in the third case it is possible to choose one of the previous methods for quantification and DEA. A second intermediate quality control analysis is also performed on the aligned BAM files with some of the RSeQC scripts and in particular: inner_distance, junction_annotation, junction_saturation, bam_stat, read_distribution, geneBody_coverage. A final report is generated from the results using multiqc. It is possible to perform an optional tertiary-analysis on the results. It consists in Gene Ontology enrichment analysis and KEGG Pathway enrichment analysis on the differentially expressed genes (with absolute Fold Change value higher than 1.5 and adjusted p-value lower than 0.05).

Prerequisites

For ARPIR to work properly, you must first make sure that you have installed the necessary applications for the pipeline. You can use the conf.sh configuration file to install all the required software and dependencies, as well as download the genome files. The script runs on Debian-based systems and you can run it with the following command:

sudo bash conf.sh

If instead you want to manually proceed with the installation of the prerequisites, the necessary ones are the following:

Applications

For Graphical User Interface and scripts running:

For quality control on fastq files and pre-processing:

For alignment and quality control on BAM files:

For quantification:

Input files

The necessary input files are as follows:

  • FastQ files of the samples. Since this is a differential analysis, at least two cases and two controls must be present, furthermore, if it is a paired-end analysis, files of read 1 and read 2 must be provided for each sample while if it is a single-end analysis only files of read 1 must be provided.

  • Reference genome for Bowtie2 with relative indexes, necessary for alignment with TopHat2. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/index/bowtie2/hg19. It is not necessary to report the extension, but only the name of the file and the indexes must be present in the same folder.

  • Reference genome with relative indexes, necessary for alignment with HISAT2. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/index/hisat2/hg19. The reference genome file is the same as TopHat, but the indexes are obtained differently (see below).

  • Reference genome with relative indexes, necessary for alignment with STAR. The path in which the directory with indexes is searched by default is as follows: /opt/genome/human/hg19/index/STAR. The reference genome file is the same as TopHat2 and HISAT2, but the indexes are obtained differently (see below).

  • BED file, necessary for quality control on BAM files. The path in which the file is searched by default is as follows: /opt/genome/human/hg19/annotation/hg19.refseq.bed12.

  • PhiX genome with relative indexes, necessary for pre-processing. The path in which the file with its indexes is searched by default is as follows: /opt/genome/control/phix174/bwa/phiX174.fa.

  • Ribosomal genome 5s with relative indexes, necessary for pre-processing. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/contam/bwa/hum5SrDNA.fa.

  • Ribosomal genome with relative indexes, necessary for pre-processing. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/contam/bwa/humRibosomal.fa.

  • GTF file, necessary for alignment with TopHat2 and quantification with featureCounts and Cufflinks. The path in which the file is searched by default is as follows: /opt/genome/human/hg19/annotation/hg19.refgene.sorted.gtf.

  • Reference genome with relative indexes, necessary for quantification with Cufflinks. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/index/hg19.fa. The reference genome file is the same, but the indexes change.

Creating indexes:

The indexes can be created starting from the Fasta file and require different commands based on the alignment software for which they will be used.

In the case of TopHat, the necessary command is the following:

bowtie2-build file.fa file

In the case of HISAT, the necessary command is:

hisat2-build file.fa file

In the case of STAR, the necessary command is:

STAR --runMode genomeGenerate --genomeDir index_directory --genomeFastaFiles file.fa --sjdbGTFfile file.gtf

In the case of BWA, instead, the necessary command is:

bwa index file.fa

Finally, in the case of BWA, the necessary command is:

samtools faidx file.fa

Running application

Command line

After installing the necessary tools and getting the required input files, you can launch the application using command-line. The following is the use for analysis in single-end and paired-end, with the arguments that must necessarily be inserted, as they do not provide default options, and the optional arguments, which can be omitted if the paths and options you want match the standard ones.

Usage

Paired-end

python ARPIR.py -n <project_name> -pn <pool_name> -sn <sample_A,sample_B,sample_C,sample_D> -r1 <sample_A_read1,sample_B_read1,sample_C_read1,sample_D_read1> -r2 <sample_A_read2,sample_B_read2,sample_C_read2,sample_D_read2> -type <cntrl,cntrl,treat,treat> -comp <cntrl_VS_treat> -o <output_directory> [options]*

Single-end

python ARPIR.py -n <project_name> -pn <pool_name> -sn <sample_A,sample_B,sample_C,sample_D> -r1 <sample_A_read1,sample_B_read1,sample_C_read1,sample_D_read1> -type <cntrl,cntrl,treat,treat> -comp <cntrl_VS_treat> -o <output_directory> [options]*

Arguments

-n Project name. No default options.
-pn Pool name. No default options.
-sn Sample names (',' sep). No default options.
-r1 Read 1 FastQ path (',' sep). No default options. Files must appear in the same order as sample names.
-r2 Read 2 FastQ path (',' sep). Required only for paired-end analysis. Files must appear in the same order as sample names.
-type Sample types (',' sep). No default options. Types must appear in the same order as sample names.
-comp Comparisons between samples (allows multiple comparisons comma separated). No default options.
-o Output directory. No default options.

Options

-rb Reference genome file path for bowtie. Default: /opt/genome/human/hg19/index/bowtie2/hg19
-rh Reference genome file path for HISAT2. Default: /opt/genome/human/hg19/index/hisat2/hg19
-rs Index directory path for STAR. Default: /opt/genome/human/hg19/index/STAR
-bed Reference genome annotation file path. Default: /opt/genome/human/hg19/annotation/hg19.refseq.bed12
-ph PhiX genome file path. Default: /opt/genome/control/phix174/bwa/phiX174.fa
-rib1 Ribosomal genome file path. Default: /opt/genome/human/hg19/contam/bwa/hum5SrDNA.fa
-rib2 Ribosomal genome file path. Default: /opt/genome/human/hg19/contam/bwa/humRibosomal.fa
-t Max thread number. Default: 12.
-g GTF file path. Default: /opt/genome/human/hg19/annotation/hg19.refgene.sorted.gtf
-a Alignment method. Default: hisat; alternative: tophat, star.
-l Library type. Default: fr-firststrand; alternative: fr-secondstrand.
-q Quantification method. Default: featureCounts; alternative: Cufflinks.
-r Reference genome file path. Default: /opt/genome/human/hg19/index/hg19.fa
-dea Differential Expression Analysis method. Default: edgeR; alternatives: DESeq2, cummeRbund.
-r_path Script directory (alignment, quantification and DEA). Default: /opt/applications/src/arpir/ARPIR.
-meta Analysis with or without final tertiary-analysis. Default: full; alternative: quant.
-cat Max number of categories showed in R plots for tertiary-analysis. Default: 5.

Example

To make the launch of the command easier, we have created a script run.sh where you can enter all the desired variables. Once done, you can launch the pipeline with the following command:

bash run.sh

Graphical User Interface

Usage

bash GUI_ARPIR.sh

A series of windows allows to indicate input files and to choose between various options.

Example

bash /path/to/script/ARPIR/GUI_ARPIR.sh

A first window will appear in which you will be asked to enter the data relating to the experiment, as in the example below.

This is followed by a series of windows that allow you to choose between different options or ask to select the input files or folders, as shown in the following two images.

If they exist, the default paths will be shown automatically. Finally, once all the inputs and the parameters have been selected, a summary window will appear. If the data entered are correct, it will be sufficient to confirm to start the analysis, otherwise it is possible to cancel and start from the beginning.

Shiny app

To observe the results of the analysis in an interactive form, it is possible to launch a shiny app.