Analysis toolkit for SEQuoia Express Stranded RNA-Seq kits
The toolkit runs inside a virtual environment, provided here as a docker container, so docker must be installed and running. Either build the container (via docker build) or retrieve it (docker pull) before continuing. Note that nextflow can call docker directly and will pull the container automatically.
Nextflow (Groovy / Java based) is the primary software that runs and coordinates the pipeline, so you will need Java 8 or higher and nextflow installed to run it.
wget -qO- https://get.nextflow.io | bash
#or
curl -s https://get.nextflow.io | bash
If you are more comfortable with conda, nextflow can also be installed that way.
conda install -c bioconda nextflow
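Before the first run it can help to confirm the Java 8 or higher requirement and that nextflow is on the PATH. The following is a minimal sketch, not part of the toolkit: the function names java_major and check_prereqs are hypothetical, and it assumes java -version prints the version in double quotes, as common JDK builds do.

```shell
# Extract the major Java version from a version string.
# Handles the legacy "1.8.0_292" scheme (major = 8) and the newer "17.0.1" scheme.
java_major() {
    ver="$1"
    case "$ver" in
        1.*) ver="${ver#1.}" ;;   # legacy scheme: drop the leading "1."
    esac
    printf '%s\n' "${ver%%[._-]*}"   # keep the text before the first ".", "_" or "-"
}

# Print one status line each for java and nextflow (hypothetical helper).
check_prereqs() {
    if command -v java >/dev/null 2>&1; then
        # java -version writes to stderr; the version is the first quoted field
        raw=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
        major=$(java_major "$raw")
        if [ "${major:-0}" -ge 8 ] 2>/dev/null; then
            echo "java $raw: OK"
        else
            echo "java $raw: need Java 8 or higher"
        fi
    else
        echo "java: not found on PATH"
    fi
    if command -v nextflow >/dev/null 2>&1; then
        echo "nextflow: found"
    else
        echo "nextflow: not found on PATH"
    fi
}

check_prereqs
```

The check only reports status and never exits, so it can be pasted into an interactive shell safely.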
When running the toolkit, nextflow produces intermediate files that are required to complete the processes. To remove them afterwards, follow the instructions from nextflow; there are options to keep the logs, or to clean up as part of a run once it completes. One example would be:
nextflow clean -f ./work
For the full list of options:
nextflow clean -h
We suggest downloading the tar archive of the reference you want to a local directory. The download and extraction commands will take a while to run. For the full list of available references see: Sequoia Genomes
When downloading from Dropbox, ?dl=0 is added to the end of each link. This suffix needs to be removed from the saved file name. This can be done manually; however, with the wget output option -O you can save the file under the expected name directly. For the hg38 genome, see the example below, where the Dropbox link ending in ?dl=0 is saved as the expected hg38.tar.gz with the -O option.
mkdir -p ./ref_data/genome-annotations
cd ./ref_data/genome-annotations
wget -O hg38.tar.gz 'https://www.dropbox.com/s/hm6kyp70dtbqovr/hg38.tar.gz?dl=0'
tar xvzf hg38.tar.gz
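When fetching several references, the expected file name can be derived from the link instead of being typed by hand. The sketch below strips the ?dl=0 suffix described above using shell parameter expansion; url_to_filename and fetch_reference are hypothetical helper names, not part of the toolkit.

```shell
# Derive the local file name from a download URL: drop any query string
# (such as "?dl=0") and keep the last path component.
url_to_filename() {
    url="${1%%[?]*}"             # strip everything from the first "?" onward
    printf '%s\n' "${url##*/}"   # keep the text after the last "/"
}

# Hypothetical helper: download one archive into the current directory
# under its expected name. The URL is quoted so the shell does not treat
# the "?" as a wildcard.
fetch_reference() {
    wget -O "$(url_to_filename "$1")" "$1"
}

url_to_filename 'https://www.dropbox.com/s/hm6kyp70dtbqovr/hg38.tar.gz?dl=0'
# prints: hg38.tar.gz
```

Calling fetch_reference with the same quoted link would then run the wget command shown above.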
For most users, only a few basic commands are needed to run the pipeline. For the full list of options, please see the nextflow.config file. Using nextflow run main.nf --help will only list the basic options.
This pipeline uses a docker container as a virtual environment to run the software. Apart from docker and nextflow, no other software is required. We recommend pulling the docker container from Docker Hub. However, should you choose to modify the container for a customized analysis, the Dockerfile is provided in this repository.
(Recommended) The container will be pulled automatically by nextflow, but it can also be pulled manually with:
docker pull bioraddbg/sequoia-express:latest
For the custom analysis described above, the following command will build a docker container for the analysis; the path argument is the directory that contains the Dockerfile:
docker build -t bioraddbg/sequoia-express [path to Dockerfile]
nextflow run Sequoia_express_toolkit/main.nf --outDir ./output/ --reads '~/read/express/' --genome hg38 --genomes_base ./genomes/
$ nextflow run main.nf --help
/-----------------------------------------------------------\
| __________.__ __________ .___ |
| \_____ \__|____ \______ \____ __| _/ |
| | | _/ |/ _ \ ______ | _/\__ \ / __ | |
| | | \ ( <_> ) /_____/ | | \ / __ \_/ /_/ | |
| |____ /__|\____/ |__|_ /(____ /\____ | |
| \/ \/ \/ \/ |
\___________________________________________________________/
Usage:
The typical command for running the pipeline is as follows:
nextflow run Sequoia_express_toolkit/main.nf --outDir ./output/ --reads '~/read/express/' --genome hg38 --genomes_base ./genomes/
Args:
REQUIRED:
genome (string ) Genome to align to and annotate against [hg38, mm10, rnor6]
genomes_base (string ) Bio-Rad formatted refence genomes and annotations
reads (string ) The path to the fastq files must be wrapped in single quotes.
OPTIONAL:
fivePrimeQualCutoff (integer) The read quality below which bases will be trimmed on the 5' end [0, 42]
max_cpus (integer) The max number of cpus the pipeline may use. Defaults provided by -profile.
max_memory (integer) The max memory in GB that the pipeline may use. Defaults provided by -profile.
minBp (intger ) 15 Reads with fewer base pairs will be rejected [0, 500]
minGeneCutoff (double ) Provide double value to cutt off how many reads are minimum [0, 9E+7]
minGeneType (string ) Provide metric to be used [none, reads, RPKM, TPM]
minMapqToCount (integer) 1 The minimum MapQ socre for an aligned read to count toward a feature count [0, 255]
noTrim (boolean) Indicates whether or not trimming skipped on the reads
outDir (string ) ./results Indicate the output directory to write to
reverseStrand (boolean) Indicate if your library is reverse stranded
seqType (string ) Provide sequencing method used, if SE provided deduplication will not occur [SE, PE]
skipUmi (boolean) Indicate no UMI processing is required
spikeType (string ) NONE The type of spike in used, if any [NONE, ercc]
threePrimeQualCutoff (integer) The read quality below which bases will be trimmed on the 3' end [0, 42]
validateInputs (boolean) true Ensure input meets standards and is below 500 million reads
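The required arguments listed above can be checked before launching a long run. This is a minimal sketch under stated assumptions: valid_genome and print_run_cmd are hypothetical names, the accepted genome values are taken from the help text, and the wrapper only prints the command rather than starting nextflow.

```shell
# Return 0 when the genome name is one of the values listed in the help text.
valid_genome() {
    case "$1" in
        hg38|mm10|rnor6) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: validate the genome, then print the nextflow command
# that would be run. It does not launch the pipeline itself.
# Usage: print_run_cmd <reads_dir> <genome> <genomes_base> [out_dir]
print_run_cmd() {
    reads="$1"; genome="$2"; base="$3"; out="${4:-./results}"
    if ! valid_genome "$genome"; then
        echo "unknown genome: $genome (expected hg38, mm10 or rnor6)" >&2
        return 1
    fi
    printf "nextflow run Sequoia_express_toolkit/main.nf --outDir %s --reads '%s' --genome %s --genomes_base %s\n" \
        "$out" "$reads" "$genome" "$base"
}

print_run_cmd ./reads/ hg38 ./genomes/ ./output/
```

When the output directory is omitted, the sketch falls back to ./results, the default shown for outDir in the help text.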
This pipeline has been set up with multiple-sample bulk runs in mind: whereas its predecessor, Sequoia Complete, took one file at a time, this pipeline takes a whole directory of files at once. As a result, your fastq files must have, at a minimum, R1 / R2 in the file name to specify that they are paired reads.
This pipeline creates output like that of Sequoia Complete: each individual sample will have a report in csv, html, and pdf formats. Additionally, each batch that is run will have its own high-level report for a side-by-side comparison of metrics.
If you encounter an error / bug / issue, please contact support@bio-rad.com or submit an issue to this repository so that we can address it.
If you get an error where nextflow cannot find your files, check your path; if needed, use an absolute path, or check the formatting of your relative path. Also check that your read file names contain R1 / R2 (in caps) and end with .fastq or .fastq.gz
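The file-name rules can be checked in bulk before a run. The sketch below lists files whose names lack R1 / R2 in caps or do not end in .fastq or .fastq.gz; check_read_names is a hypothetical helper, and the exact pattern the pipeline matches may be stricter than this approximation.

```shell
# List files in a reads directory that break the naming rules:
# the name must contain R1 or R2 (upper case) and end in .fastq or .fastq.gz.
# Prints one offending file name per line; prints nothing when all files pass.
check_read_names() {
    dir="$1"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and empty globs
        name="${f##*/}"
        case "$name" in
            *R1*.fastq|*R1*.fastq.gz|*R2*.fastq|*R2*.fastq.gz) ;;  # name looks fine
            *) printf '%s\n' "$name" ;;
        esac
    done
}

# Self-contained demonstration on a throwaway directory.
demo=$(mktemp -d)
touch "$demo/sampleA_R1.fastq.gz" "$demo/sampleA_R2.fastq.gz" \
      "$demo/sampleB_r1.fastq" "$demo/notes.txt"
check_read_names "$demo"   # reports notes.txt and sampleB_r1.fastq
rm -r "$demo"
```

Running it against your own reads directory and getting no output suggests the names are not the cause of the error.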
The pipeline runs in paired-end mode by default (it assumes you have both R1 and R2). If this is not the case, you can run with --seqType=SE to use just the R1 reads.
- V1.1.0 Changed the pipeline over to DSL2 to be compatible with new versions of nextflow. This changes the structure of the pipeline but makes no changes to any processing algorithms.