A pipeline to process Nanopore sequencing data. It performs the following steps:
- QC of the FASTQ files
- Removal of adapters and primers
- Alignment of the reads to the reference genome
- Optional: alignment to a secondary reference genome (e.g. spike-ins)
- Generation of BAM files
- Splice-site scoring on the reference genome
The pipeline is written in Nextflow, a workflow manager that allows the pipeline to run on a wide variety of systems. It is configured to run either on a SLURM-managed HPC cluster or on a local machine, though it can also run on a cloud instance or with other workload managers if you edit the configuration file according to the Nextflow documentation.
To run the pipeline locally, you will need a machine with:
- At least 32GB of RAM (ideally 64GB)
- A multi-core CPU (ideally i7 or better)
- At least 500GB of free disk space (an SSD is recommended)
- Docker or Singularity installed
- Nextflow installed OR EPI2ME installed
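The software items in the list above can be checked from a terminal before launching anything. The following is a minimal sketch, not part of the pipeline: `have_tool` and `check_prereqs` are hypothetical helper names, and the sketch assumes the tools are installed under their usual executable names (`docker`, `singularity`, `nextflow`). It does not check RAM or disk space, and it does not cover EPI2ME, which is a desktop application.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline): verify that a
# container engine and Nextflow can be found on the PATH.

have_tool() {
    # Succeeds if the named executable can be found on the PATH.
    command -v "$1" >/dev/null 2>&1
}

check_prereqs() {
    local status=0

    # Either container engine is enough.
    if have_tool docker; then
        echo "container engine: docker"
    elif have_tool singularity; then
        echo "container engine: singularity"
    else
        echo "missing: docker or singularity" >&2
        status=1
    fi

    if have_tool nextflow; then
        echo "nextflow: found"
    else
        echo "missing: nextflow" >&2
        status=1
    fi

    return "$status"
}

# Report the result without aborting the shell that runs this snippet.
if check_prereqs; then
    echo "All command-line prerequisites found."
else
    echo "Some prerequisites are missing; see the messages above."
fi
```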
On your local machine, there are two ways of running the pipeline:
- Using EPI2ME: the easiest way for those without bioinformatics experience. It is a graphical interface that lets you run the pipeline. Its installation is explained below.
- Using the command line: for those with bioinformatics experience. It follows the same procedure as running on a cluster, so it is explained in the "Installing in a cluster" section.
Install EPI2ME on your system and follow the instructions in the app to install all the dependencies (Java, Docker and Nextflow). To add the pipeline to your saved workflows, copy this repository's URL and paste it into the "Add workflow" section of the EPI2ME interface.
If using Nextflow/nf-core, clone the repository and install the basic dependencies (Nextflow). The easiest way to do so is using conda. The pipeline can be run on any system that supports Docker or Singularity.
git clone https://github.com/a-hr/vive-pipeline.git
The internal dependencies of the pipeline are managed by Nextflow, so you don't need to worry about them. If for some reason Nextflow fails to download them when using Singularity (they are provided as Docker containers), you can manually download them with the Makefile:
# make sure you have Singularity installed and available
make pull
The pipeline is especially tailored to run on an HPC cluster, though it can seamlessly run on a local machine and, with some configuration, on a cloud instance.
To run the pipeline with EPI2ME:
- Open EPI2ME and go to the "Workflows" tab.
- Select the workflow.
- Fill in the parameters.
- In the profile section, make sure it is set to standard (the default), which runs the backend on top of Docker containers. If you are using Singularity, set it to local_singularity.
- Run the pipeline.
To run the pipeline from the command line:
- Go to the directory where you cloned the repository.
- Fill in the parameters in the input_params.yaml file.
- Make sure your system has Docker/Singularity and Nextflow available.
- Run the pipeline on the cluster or on a local machine with the corresponding command:
sbatch launch_cluster.sh # for SLURM-managed HPC clusters
bash launch_local.sh # for local machines
The cluster launch script is configured to run the pipeline on a SLURM-managed HPC cluster. If you are using another workload manager, you will need to edit the script accordingly.
If you are running the pipeline on a local machine, you can launch it directly with one of the following commands:
# with Docker
nextflow run main.nf -profile standard -params-file input_params.yaml
# or with Singularity
nextflow run main.nf -profile local_singularity -params-file input_params.yaml
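The choice between the two profiles above depends only on which container engine is installed, so it can be scripted. The following is a minimal sketch under stated assumptions: `pick_profile` is a hypothetical helper that is not part of the repository, and the engines are assumed to be installed under their usual executable names (`docker`, `singularity`). When both are present it prefers Docker, matching the default profile.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Nextflow profile that matches the
# container engine found on the PATH.

pick_profile() {
    if command -v docker >/dev/null 2>&1; then
        echo "standard"            # Docker-backed profile (the default)
    elif command -v singularity >/dev/null 2>&1; then
        echo "local_singularity"   # Singularity-backed profile
    else
        echo "no container engine found (docker or singularity)" >&2
        return 1
    fi
}

# Print the command that would be run; remove the leading 'echo' to launch it.
if profile=$(pick_profile); then
    echo nextflow run main.nf -profile "$profile" -params-file input_params.yaml
fi
```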
Below is a description of the parameters, which can either be set in the input_params.yaml file or provided through the EPI2ME interface.
- experiment_name: all the output files will be prefixed with this name
- input_fastqs: folder containing the FASTQ files output by the basecaller. Note that they will all be processed together, so make sure they are from the same run.
- plasmid_ref_fa: reference genome of the sequenced target
- output_dir: folder where the output files will be saved (only available if running through the command line)
- is_rna: whether the input is RNA or DNA. If RNA, the FASTQ files will be processed accordingly.
- min_len: expected minimum length of the reads (all reads below this length will be discarded)
- max_len: expected maximum length of the reads (all reads above this length will be discarded)
- assess_secondary: whether to process a secondary target
- secondary_ref_fa: reference genome of the secondary target
- get_bams: whether to generate BAM files as output
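As an illustration of the parameters above, the following sketch writes an example parameters file from the shell. Every value is a placeholder. The value types (booleans for is_rna, assess_secondary and get_bams, integers for min_len and max_len) are assumptions inferred from the descriptions, not something this document specifies. The file name `input_params.example.yaml` is also hypothetical, chosen so that the repository's own input_params.yaml is not overwritten. Compare the result against the input_params.yaml shipped with the repository before using it.

```shell
#!/usr/bin/env bash
# Write an example parameters file. All values are placeholders.
params_file="input_params.example.yaml"

cat > "$params_file" <<'EOF'
experiment_name: "my_experiment"      # prefix for all output files
input_fastqs: "/path/to/fastq_dir"    # folder with FASTQ files from a single run
plasmid_ref_fa: "/path/to/target.fa"  # reference of the sequenced target
output_dir: "/path/to/output"         # command line only
is_rna: false                         # assumed boolean
min_len: 500                          # assumed integer; shorter reads are discarded
max_len: 5000                         # assumed integer; longer reads are discarded
assess_secondary: false               # assumed boolean
secondary_ref_fa: ""                  # assumed to be ignored when assess_secondary is false
get_bams: true                        # assumed boolean
EOF

echo "Wrote $params_file"
```

Once the placeholders are replaced with real paths, the file can be passed to Nextflow through the -params-file option in the same way as input_params.yaml.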