Title | Subtitle | Project | Author | Affiliation | Date |
---|---|---|---|---|---|
CirComPara | a multi-method comparative bioinformatics pipeline to detect and study circRNAs from RNA-seq data | CirComPara | Enrico Gaffo | Compgen - University of Padova | December 21, 2016 |
CirComPara is a computational pipeline to detect, quantify, and correlate expression of linear and circular RNAs from RNA-seq data.
Execute the following commands to download and install locally the scripts and tools required to run CirComPara. If something goes wrong with the installation process, try to install the software manually as described below.
Download and extract the latest release of CirComPara, or clone the Git repository, then enter the CirComPara directory and run the automatic installer script:
git clone http://github.com/egaffo/CirComPara
cd CirComPara
./install_circompara
To test the installation, run the commands below. NB: in the sed commands, replace the /full/circompara/dir/path string with the path of your installation directory.
cd test_circompara
mkdir analysis
sed "s@\$CIRCOMPARA@/full/circompara/dir/path@g" vars.py > analysis/vars.py
sed "s@\$CIRCOMPARA@/full/circompara/dir/path@g" meta.csv > analysis/meta.csv
cd analysis
../../circompara
If you plan to use single-end reads, test with the meta_se.csv file instead of meta.csv.
If you receive error messages, try following the instructions in the Installation troubleshooting section.
Once the installation is complete, if you do not want to type the whole path to the CirComPara executable each time, you can update your PATH environment variable. From the terminal, type the following command (replace the /path/to/circompara/install/dir string with CirComPara's actual path):
export PATH=/path/to/circompara/install/dir:$PATH
Another way is to link CirComPara's main script in your local bin directory:
cd /home/user/bin
ln -s /path/to/circompara/install/dir/circompara
A Docker image of CirComPara is available from DockerHub.
To pull the image:
docker pull egaffo/circompara-docker
You'll find the instructions on how to use the Docker image at https://hub.docker.com/r/egaffo/circompara-docker.
This section shows how to set up your project directory and run the analysis. To run an analysis you usually need to specify your data (the sequenced reads in FASTQ format) and a reference genome in FASTA format.
You have to specify read files, sample names, and sample experimental conditions in a metadata table file. The file is a comma-separated text file with the following header:
file,sample,condition
Then, each row corresponds to a read file. If you have paired-end sequenced samples write one line per file with the same sample name and condition.
An example of the metadata table:
file | sample | condition |
---|---|---|
/home/user/reads_S1_1.fq | S1 | WT |
/home/user/reads_S1_2.fq | S1 | WT |
/home/user/reads_S2_1.fq | S2 | MU |
/home/user/reads_S2_2.fq | S2 | MU |
and the corresponding metadata file content:
file,sample,condition
/home/user/reads_S1_1.fq,S1,WT
/home/user/reads_S1_2.fq,S1,WT
/home/user/reads_S2_1.fq,S2,MU
/home/user/reads_S2_2.fq,S2,MU
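A metadata table is easy to get subtly wrong, for instance by listing the same mate file twice for a paired-end sample. The following is a minimal sketch of a sanity check, not part of CirComPara: the function name `check_metadata` and the three checks it performs (required columns present, no file listed twice, one condition per sample) are assumptions about what a well-formed table looks like. The example table deliberately repeats a mate file.

```python
import csv
import io
from collections import defaultdict

# Columns named in the metadata header described above.
REQUIRED_COLUMNS = ["file", "sample", "condition"]


def check_metadata(text):
    """Return a list of problems found in the text of a metadata table.

    Hypothetical helper: the checks are assumptions, not CirComPara's own.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    seen_files = set()
    conditions = defaultdict(set)
    for row in reader:
        # The same read file should not appear on two rows.
        if row["file"] in seen_files:
            problems.append("file listed twice: " + row["file"])
        seen_files.add(row["file"])
        conditions[row["sample"]].add(row["condition"])

    # All rows of one sample should carry the same condition.
    for sample in sorted(conditions):
        if len(conditions[sample]) > 1:
            problems.append("sample with several conditions: " + sample)
    return problems


# Example table with a deliberate mistake: the S2 mate file is repeated.
table = """file,sample,condition
/home/user/reads_S1_1.fq,S1,WT
/home/user/reads_S1_2.fq,S1,WT
/home/user/reads_S2_1.fq,S2,MU
/home/user/reads_S2_1.fq,S2,MU
"""
for problem in check_metadata(table):
    print(problem)
```

For the table above the check reports the repeated /home/user/reads_S2_1.fq entry; a correct table yields an empty list.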
In the metadata file you can also specify the adapter sequences used to preprocess the reads: just add an adapter column with the adapter file.
file | sample | condition | adapter |
---|---|---|---|
/home/user/reads_S1_1.fq | S1 | WT | /home/user/circompara/adapter.fa |
/home/user/reads_S1_2.fq | S1 | WT | /home/user/circompara/adapter.fa |
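When there are many samples, the metadata file can be generated rather than typed by hand. The sketch below is one possible way to do it with Python's standard csv module; the function names `metadata_rows` and `render_metadata` and the `(sample, condition, files)` input layout are hypothetical, chosen only for illustration. It writes one row per read file, so a paired-end sample produces two rows, and it adds the adapter column only when an adapter file is given.

```python
import csv
import io


def metadata_rows(samples, adapter=None):
    """Yield one row per read file; a paired-end sample gives two rows."""
    for sample, condition, files in samples:
        for path in files:
            row = {"file": path, "sample": sample, "condition": condition}
            if adapter is not None:
                row["adapter"] = adapter
            yield row


def render_metadata(samples, adapter=None):
    """Return the metadata table as comma-separated text."""
    columns = ["file", "sample", "condition"]
    if adapter is not None:
        columns.append("adapter")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(metadata_rows(samples, adapter))
    return out.getvalue()


# Paths taken from the examples in this document.
samples = [
    ("S1", "WT", ["/home/user/reads_S1_1.fq", "/home/user/reads_S1_2.fq"]),
]
print(render_metadata(samples, adapter="/home/user/circompara/adapter.fa"))
```

The returned text can then be saved as meta.csv (or under any name passed through the META parameter).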
A required parameter is the reference genome. You can either pass the reference genome from the command line
./circompara "GENOME_FASTA='/home/user/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa'"
or by setting the GENOME_FASTA parameter in the vars.py file; e.g.:
GENOME_FASTA = '/home/user/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
Although parameters can be set from the command line (surrounded by quotes), you can also set them in a local vars.py file, which must be placed in the analysis directory. Parameters not specified by the user take their default values.
The full list of parameters follows:
META: The metadata table file where you specify the project samples, etc.
default: meta.csv
ANNOTATION: Gene annotation file (like Ensembl GTF/GFF)
default:
GENOME_FASTA: The FASTA file with the reference genome
default:
CIRCRNA_METHODS: Comma separated list of circRNA detection methods to use. Repeated values will be collapsed into unique values. Currently supported: ciri, find_circ, circexplorer2_star, circexplorer2_bwa, circexplorer2_tophat, circexplorer2_segemehl, testrealign (unfiltered segemehl; use of circexplorer2_segemehl is recommended for a better filtering of segemehl predictions). Set an empty string to use all methods available (including deprecated methods).
default: ciri,find_circ,circexplorer2_star,circexplorer2_bwa,circexplorer2_segemehl
CPUS: Set number of CPUs
default: 4
GENEPRED: The genome annotation in GenePred format
default:
GENOME_INDEX: The index of the reference genome for HISAT2
default:
SEGEMEHL_INDEX: The .idx index for segemehl
default:
BWA_INDEX: The index of the reference genome for BWA
default:
BOWTIE2_INDEX: The index of the reference genome for BOWTIE2
default:
STAR_INDEX: The directory path where to find Star genome index
default:
BOWTIE_INDEX: The index of the reference genome for BOWTIE when using CIRCexplorer2_tophat
default:
HISAT2_EXTRA_PARAMS: Extra parameters to add to the HISAT2 aligner fixed parameters '--dta --dta-cufflinks --rg-id <SAMPLE> --no-discordant --no-mixed --no-overlap'. For instance, '--rna-strandness FR' if stranded reads are used.
default:
BWA_PARAMS: Extra parameters for BWA
default:
SEGEMEHL_PARAMS: SEGEMEHL extra parameters
default:
TOPHAT_PARAMS: Extra parameters to pass to TopHat
default:
STAR_PARAMS: Extra parameters to pass to STAR
default:
CUFFLINKS_PARAMS: Cufflinks extra parameters. For instance, '--library-type fr-firststrand' if a dUTP stranded library was used for the sequencing
default:
CUFFQUANT_EXTRA_PARAMS: Cuffquant parameter options to specify. E.g. --frag-bias-correct $GENOME_FASTA --multi-read-correct --max-bundle-frags 9999999
default:
CUFFDIFF_EXTRA_PARAMS: Cuffdiff parameter options to specify. E.g. --frag-bias-correct $GENOME_FASTA --multi-read-correct
default:
CUFFNORM_EXTRA_PARAMS: Extra parameters to use if using Cuffnorm
default: --output-format cuffdiff
CIRI_EXTRA_PARAMS: CIRI additional parameters
default:
PREPROCESSOR: The preprocessing method
default: trimmomatic
PREPROCESSOR_PARAMS: Read preprocessor extra parameters. For instance, with Trimmomatic an empty string defaults to MAXINFO:40:0.5 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:30 MINLEN:50 AVGQUAL:30
default:
TOGGLE_TRANSCRIPTOME_RECONSTRUCTION: Set True to enable transcriptome reconstruction. Default only quantifies genes and transcripts from the given annotation GTF file
default: False
DIFF_EXP: Set True to enable differential expression computation for linear genes/transcripts. Only available if more than one sample and more than one condition are given. N.B.: differential expression tests for circRNAs are not yet implemented
default: False
READSTAT_METHODS: Comma separated list of methods to use for read statistics. Currently supported: fastqc,fastx
default: fastqc
MIN_METHODS: Number of methods that must detect a circRNA in common for the circRNA to be considered reliable. If this value exceeds the number of methods specified, it will be set to the number of methods.
default: 2
MIN_READS: Number of reads to consider a circRNA as expressed
default: 2
BYPASS_LINEAR: Skip analysis of linear transcripts. This will also skip the analysis of linear-to-circular expression correlation
default: False
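The CIRCRNA_METHODS and MIN_METHODS descriptions in the list above state two normalization rules: repeated method names are collapsed into unique values, and MIN_METHODS is capped at the number of methods specified. The sketch below only illustrates those two rules; it is not CirComPara's code, the function names are hypothetical, and the special case of an empty CIRCRNA_METHODS string (which selects all available methods) is not modeled.

```python
# Default value of CIRCRNA_METHODS as given in the parameter list.
DEFAULT_METHODS = (
    "ciri,find_circ,circexplorer2_star,circexplorer2_bwa,circexplorer2_segemehl"
)


def unique_methods(value):
    """Split a comma separated list and drop repeated names, keeping order."""
    methods = []
    for name in value.split(","):
        name = name.strip()
        if name and name not in methods:
            methods.append(name)
    return methods


def effective_min_methods(min_methods, methods):
    """Cap MIN_METHODS at the number of methods actually requested."""
    return min(min_methods, len(methods))


methods = unique_methods("ciri,find_circ,ciri")
print(methods)
print(effective_min_methods(3, methods))
```

With the repeated `ciri` collapsed, two methods remain, so a requested MIN_METHODS of 3 is reduced to 2.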
To trigger the analyses, simply call the circompara script from the analysis directory. Remember that if you use the vars.py option file, it has to be in the analysis directory.
cd /home/user/circrna_analysis
/home/user/circompara/circompara
- Basic execution: run the analysis as a linear pipeline, i.e. no parallel task execution, and stop on errors
/path/to/circompara/dir/circompara
- Show parameters: to show the parameters set before actually running the analysis, use -h:
/path/to/circompara/dir/circompara -h
- Dryrun: to see which commands will be executed without actually executing them, use the -n option. NB: many commands will be listed, so you should redirect to a file or pipe to a reader like less
/path/to/circompara/dir/circompara -n | less -SR
- Multitasks: the -j option specifies how many tasks can be run in parallel. N.B.: "j x CPUS <= available cores", i.e. the j option value times the CPUS parameter value should not be greater than the number of CPU cores available, unless you want to overload your machine.
/path/to/circompara/dir/circompara -j4
- Ignore errors: keep executing the tasks even when some of them fail. Caveat: this can break downstream analyses
/path/to/circompara/dir/circompara -i
- Combine options: to set multiple options you must surround them with quotes
/path/to/circompara/dir/circompara "-j4 -i"
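The "j x CPUS <= available cores" rule given for the -j option can be turned into a small calculation. This is a minimal sketch, assuming the number of cores reported by Python's `os.cpu_count()` is the number you are allowed to use; the function name `max_parallel_jobs` is hypothetical.

```python
import os


def max_parallel_jobs(cpus, cores=None):
    """Largest -j value such that j * CPUS does not exceed the cores.

    cpus:  value of the CPUS parameter (threads given to each task)
    cores: available CPU cores; detected with os.cpu_count() if omitted
    """
    if cpus < 1:
        raise ValueError("CPUS must be at least 1")
    if cores is None:
        cores = os.cpu_count() or 1
    # At least one task always runs, even if CPUS exceeds the core count.
    return max(1, cores // cpus)


# With the default CPUS of 4 on a 16-core machine, up to 4 tasks fit.
print("-j%d" % max_parallel_jobs(cpus=4, cores=16))
```

On a shared server the detected core count may be larger than your fair share, so passing `cores` explicitly is usually the safer choice.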
- Statistics on read quality, read filtering steps, and alignments can be found in the read_stats_collect directory. A report is saved in the read_statistics.html file in the same directory.
- Results regarding circRNAs are reported in the circrna_analyze directory, with a summary in the circRNAs_analysis.html file.
- Gene expression tables (as output by Cufflinks/Cuffdiff), a gene expression table with FPKM values for each gene and sample (gene_expression_FPKM_table.csv), and the gene_expression_analysis.html report file are saved in the cuffdiff directory.
- Linear transcript sequences are saved as a multi-FASTA file in the transcript_sequences directory.
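After a run it can be handy to check quickly which of the outputs listed above were produced. The sketch below assumes that the directories and report files sit directly under the analysis directory with exactly the names given in this section; that layout is an assumption, and some outputs may legitimately be absent (for example the linear expression results when BYPASS_LINEAR is set). The function name `missing_outputs` is hypothetical.

```python
import os

# Output locations as named in this section, relative to the analysis
# directory (assumed layout).
EXPECTED_OUTPUTS = [
    os.path.join("read_stats_collect", "read_statistics.html"),
    os.path.join("circrna_analyze", "circRNAs_analysis.html"),
    os.path.join("cuffdiff", "gene_expression_FPKM_table.csv"),
    os.path.join("cuffdiff", "gene_expression_analysis.html"),
    "transcript_sequences",
]


def missing_outputs(analysis_dir):
    """Return the expected outputs that do not exist under analysis_dir."""
    return [
        path
        for path in EXPECTED_OUTPUTS
        if not os.path.exists(os.path.join(analysis_dir, path))
    ]


# Check the current directory, e.g. when run from the analysis directory.
for path in missing_outputs("."):
    print("not found:", path)
```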
Building the genome indexes for each mapper can take a lot of computing time. However, the same indexes can be reused in different CirComPara runs, saving time and disk space. The make_indexes script in CirComPara's package automatically builds the genome index (and gene annotation formats) for each of the supported read aligners and saves them into a directory. In addition, it reports the parameter values to set in order to use the shared index files.
Example commands using the test data follow:
cd test_circompara
mkdir genome_indexes
cd genome_indexes
../../make_indexes "-j2 GENOME=../annotation/CFLAR_HIPK3.fa ANNOTATION=../annotation/CFLAR_HIPK3.gtf"
The above commands will eventually generate an annotation_vars.py file that can be appended to the vars.py file of your project so that CirComPara will skip the building of genome indexes. Note that make_indexes accepts the same SCons options shown above: the -j2 option allows the script to build two indexes in parallel.
cd test_circompara
## clear CirComPara files in the test directory
cd analysis
../../circompara -c
cd ..
## overwrite the vars.py file omitting the genome and annotation parameters
grep -v "GENOME\|ANNOTATION" vars.py > analysis/vars.py
## append the parameters for the genome, the annotation and the genome indexes
## generated by the make_indexes utility
cat genome_indexes/annotation_vars.py >> analysis/vars.py
## run the test analysis
cd analysis
../../circompara
Some tools in CirComPara require special parameters to handle stranded reads properly, and CirComPara allows you to specify such parameters. Example: include the following parameters if you used the Illumina TruSeq Stranded Total RNA Library Prep Kit with Ribo-Zero Human/Mouse/Rat:
HISAT2_EXTRA_PARAMS = "--rna-strandness RF "
CUFFLINKS_PARAMS = "--library-type fr-firststrand "
In a freshly installed Ubuntu Server 16.04 LTS (x64) you need to install the dependency packages listed below (you need root privileges; otherwise ask your system administrator):
sudo apt-get install python2.7 python-pip python-numpy zlib1g-dev unzip pkg-config libncurses5-dev default-jre r-base-core libcurl4-openssl-dev libxml2-dev libssl-dev libcairo2-dev pandoc
(pandoc >= 1.12.3 is required; try installing the latest packages from http://github.com/jgm/pandoc/releases/)
You also need to upgrade pip (pip v8.1.1 has an issue with the --install-option parameter; tested working with pip v9.0.1):
pip install --upgrade pip
To run CirComPara you need several software tools to be available in your system. If the automatic installation does not work for some reason, try to install the required tools manually. The tools used in CirComPara are listed with the versions that we used during development. We do not list the dependencies of each tool; if you need support for a specific tool, please refer to that tool's own support.
If you used CirComPara for your analysis, please add the following citation to your references:
Gaffo, E., Bonizzato, A., Kronnie, G. te & Bortoluzzi, S. CirComPara: A Multi‐Method Comparative Bioinformatics Pipeline to Detect and Study circRNAs from RNA‐seq Data. Non-Coding RNA 3, 8 (2017). http://www.mdpi.com/2311-553X/3/1/8