This distribution is a parallel wrapper around the Arrow consensus framework within the SMRT Analysis Software. It consists of bash scripts, an example input fofn that shows how to list your bax.h5 files (give each path without the .1.bax.h5 suffix), and instructions for launching the pipeline. The input can be either BAX.h5 or BAM files (P6-C4 chemistry or newer only), and the pipeline requires SMRTportal 3.1+. For P6-C4 chemistry data, it can also run the older Quiver algorithm if requested in the CONFIG file.
The pipeline is designed to run on the SGE or SLURM scheduling systems and has hard-coded grid resource request parameters, so you must edit arrow.sh to match your grid options. Running on other grid engines is possible in principle, but it requires editing every shell script to replace SGE_TASK_ID with the equivalent variable for your grid environment, and replacing the qsub commands in arrow.sh with the appropriate submission commands.
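As an illustration of the kind of edit this involves, here is a minimal sketch of resolving the array task index from whichever scheduler variable is set. SGE_TASK_ID, SLURM_ARRAY_TASK_ID, and PBS_ARRAYID are set by SGE, SLURM, and Torque respectively; the helper name resolve_task_id is hypothetical and is not part of the pipeline's scripts.

```shell
# Sketch only: resolve_task_id is a hypothetical helper, not part of arrow.sh.
# It prints the first usable array task index found in the environment.
resolve_task_id() {
  for id in "${SGE_TASK_ID:-}" "${SLURM_ARRAY_TASK_ID:-}" "${PBS_ARRAYID:-}"; do
    case "$id" in
      # Skip unset values and non-numeric ones; SGE sets SGE_TASK_ID
      # to the string "undefined" for jobs that are not array tasks.
      ''|*[!0-9]*) continue ;;
    esac
    echo "$id"
    return 0
  done
  echo "no array task index found in environment" >&2
  return 1
}
```

A script adapted this way could call `jobid=$(resolve_task_id) || exit 1` in place of reading SGE_TASK_ID directly. The submission commands in arrow.sh would still need to be changed separately.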
To run the pipeline you need to:
- Have a working SMRT Analysis Software installation, configured so that its tools are in your path.
- Create the input.fofn file listing the SMRTcells you want to use for Arrow (the full path excluding .[1-3].bax.h5 or subreads.bam). The pipeline treats each collection of bax.h5 files as a single SMRTcell and converts them to BAM prior to processing.
- Run the pipeline, specifying the input file, a prefix for the outputs, and the path to the reference fasta:
sh arrow.sh input.fofn trio3 trio3.contigs.fasta
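For the bax.h5 case, the input.fofn step can be sketched as follows. This is only an illustration under assumptions: make_fofn is a hypothetical helper, and it assumes all bax.h5 files for the SMRTcells of interest sit somewhere under one directory. It strips the .[1-3].bax.h5 suffix and keeps one line per SMRTcell.

```shell
# Sketch only: make_fofn is a hypothetical helper, not part of the pipeline.
# Usage: make_fofn <directory containing *.bax.h5 files> > input.fofn
make_fofn() {
  # Resolve to an absolute directory, since the fofn expects full paths.
  dir=$(cd "$1" && pwd) || return 1
  # Strip the .[1-3].bax.h5 suffix, then collapse the three files
  # of each SMRTcell into a single line.
  find "$dir" -type f -name '*.[1-3].bax.h5' \
    | sed 's/\.[1-3]\.bax\.h5$//' \
    | sort -u
}
```

The resulting file can then be passed as the first argument to arrow.sh, as in the command above.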
The pipeline is very rough and has undergone limited testing, so user beware.
If you find this pipeline useful, please cite the original Quiver paper:
Chin et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods, 2013
and the Canu paper:
Koren S et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 2017