/nanopore_LSK109_assembly

Assembly pipeline for Nanopore LSK109 chemistry on R9 flowcells

Primary LanguageNextflow

Nanopore assembly pipeline for LSK109 chemistry and r9 series flowcells

This is a Nanopore assembly pipeline aimed at LSK109 chemistry and flowcell r9 series generated reads.

It then does some basic QC by removing control DNA which is sometimes used during a run to debug potential problems, but which should not end up in the final assembly.

Assembly happens using two different assemblers, Flye and nextDenovo. Both are very fast and have different strenghts. I have found that the nextDenovo assembly overall is better with fewer contigs, but tends to trim telomeres and sometimes loses the mitogenome. Flye however is great at maintaining telomeres and the mitogenome tends to fall out as a single, non-concatenated contig (other assemblers create tandem copies as they don't expect a circular sequence). Canu is used to generate corredted reads which I use to manually check and curate the assemblies in Geneious.

Currently this pipeline is optimised to run on a Nimbus instance with 16 cores and 64 GB of RAM.

Input

The pipeline requires you to basecall your raw fast5 or pod5 files with dorado and then compress the reads wigh gzip. The files need to be names "SampleID.fastq.gz" where 'SampleID' is whatever you want to call your sample. This ID will be used throughout the pipeline to name files.

Basecall with Dorado or Guppy:

If you are recalling old fast5 or pod5 data generated with LSK109 chemistry on an r9 series flowcell:

  • First download the latest model, this one is last available model for LSK-109 chemistry on R9 flowcells:
dorado download --model dna_r9.4.1_e8_sup@v3.6
  • Then run basecalling:
dorado basecaller dna_r9.4.1_e8_sup@v3.6 pod5s/ --emit-fastq > sampleID.dorado.fastq && \

gzip -9 sampleID.dorado.fastq

If you have the compute resources available you can try to correct the reads with the new dorado correct module, but the reads cannot be gzipped for that:

dorado correct sampleID.dorado.fastq > sampleID.dorado.corrected.fasta

Running the pipeline

nextflow run jwdebler/nanopore_LSK109_assembly -resume -latest -profile docker,nimbus --reads "reads/"

Profiles

We have a few profiles available to customise how the pipeline will run.

  • nimbus sets the canu assembler to use 15 CPUs and 60GB RAM.
  • zeus sets the canu assembler to use 14 CPUs and 64GB RAM, and sets some cluster specific options to use the slurm based scheduler at Pawsey.
  • docker and docker_sudo sets it to use docker containers, docker_sudo is identical except that docker is run as root (required for some installations of docker).

Parameters


    --reads <glob>
        Required
        A folder containing 1 files per sample. 
        The basename of the file is used as the sample ID.
       
        Example of file names: `Sample1.fastq.gz`, `Sample2.fastq.gz`.
        (Default: a folder called `reads/`)

    --genomeSize <glob>
        not required
        Size of genome, for example "42m" (Default: 42m)

    --medakaModel <glob>
        not required
        Which basecaller model was used?
        r941_min_sup_g507 (kit109, sup)
        (Default: r941_min_sup_g507)

    --minlen
        Min read length to keep for assembly
        (Default: 1000)

    --quality
        Min read q-score to keep for read filtering
        (Default: 10)

    --outdir <path>
        The directory to store the results in.
        (Default: `assembly`)