/nf-genomeassembly

Nextflow pipeline to assemble genomes from long reads.

Primary LanguageNextflowMIT LicenseMIT

DOI

The goal of nf-genomeassembly and nf-annotate is to make to genome assembly and annotation workflows accessible for a broader community, particularily for plant-sciences. Long-read sequencing technologies are already cheap and will continue to drop in price, genome sequencing will soon be available to many researchers without a strong bioinformatic background. The assembly is naturally quite organisms agnostic, but the annotation pipeline contains some steps that may not make sense for other eukaryotes, unless there is a particular interest in NB-LRR genes.

I am currently preparing nf-genomeassembly to be added into nf-co.re/genomeassembler. Please see here for the latest version: genomeassembler

nf-genomeassembly

Assembly pipeline for genomes from long-read sequencing written in nextflow. The pipeline supports for assembly Oxford Nanopore, Pacbio HiFi, combinations of ONT and pacbio HiFi, and can take short-reads for quality control and / or polishing.

Procedure

Preprocessisng:

  • For nanopore:

    • Extract all fastq.gz files in the readpath folder into a single fastq file. By default this is skipped, enable with --collect.
    • Barcodes and adaptors will be removed using porechop. By default this is skipped, enable with --porechop.

      NB: flye claims to work well on raw, un-trimmed reads

    • Read QC is done via nanoq
  • For pacbio:

    • lima to remove primers.

Assembly

Polishing:

  • Polishing of ONT assemblies done using medaka
  • Optional short-read polishing can be done using pilon

Scaffolding:

Annotation:

  • Annotations are lifted from reference using liftoff.

QC:

  • Quality of each stage is assessed using QUAST and BUSCO (standalone),
  • k-mer spectra can be used for further QC with yak,
  • if short-reads are provided merqury is run to compare k-mer spectra between assemblies (or scaffolds) and short-reads.

Tubemap

Tubemap

Usage

Clone this repo:

git clone https://github.com/nschan/nf-genomeassembly/

Run via nextflow:

The samplesheet is a .csv file with a header. It must adhere to this format, including the header row. Please note the absence of spaces after the commas:

sample,ontreads,hifireads,ref_fasta,ref_gff
sampleName,path/to/reads,path/to/hifi.fastq.gz,path/to/reference.fasta,path/to/reference.gff

To run the default pipeline with a samplesheet on biohpc_gen using charliecloud:

nextflow run nf-genomeassembly --samplesheet 'path/to/sample_sheet.csv' \
                           -profile charliecloud,biohpc_gen

Parameters

See also schema.md

Parameter Effect
General parameters
--samplesheet Path to samplesheet
--use_ref Use a refence genome? (default: true)
--lift_annotations Lift annotations from reference using liftoff? Default: true
--out Results directory, default: './results'
--ont ONT reads are available? These should go into the ontreads column of the samplesheet. Default: false
--hifi Pacbio hifi reads are available? These should go into the hifireads column of the samplesheet. default: false
ONT Preprocessing
--collect Are the provided reads a folder (true) or a single fq files (default: false )
--porechop Run porechop on ONT reads? (default: false)
pacbio Preprocessing
--lima Run lima on pacbio reads? default: false
--pacbio_primers Primers to be used with lima (required if --lima is used)? default: null
Assembly
--assembler Assembler to use. Valid choices are: 'hifiasm', 'flye', or 'flye_on_hifiasm'. flye_on_hifiasm will scaffold flye assembly (ont) on hifiasm (hifi) assembly using ragtag. Defaul: 'flye'
Assembly flye specific arguments
--flye_args The mode to be used by flye; default: "--nano-hq", options are: "--pacbio-raw", "--pacbio-corr", "--pacbio-hifi", "--nano-raw", "--nano-corr", "--nano-hq"
--kmer_length kmer size for Jellyfish? (default: 21)
--read_length Read length for genomescope? If this is null (default), the median read length estimated by nanoq. will be used. If this is not null, the given value will be used for all samples.
--genome_size Expected genome size for flye. If this is null (default), the haploid genome size for each sample will be estimated via genomescope. If this is not null, the given value will be used for all samples.
--flye_args Arguments to be passed to flye, default: none. Example: --flye_args '--genome-size 130g --asm-coverage 50'
Assembly hifiasm specific arguments
--hifi_ont Use hifi and ONT reads with hifiasm --ul? default: false
--hifiasm_args Extra arguments passed to hifiasm. default: ''
Polishing
--polish_medaka Polish using medaka, default: false
--medaka_model Model used by medaka, default: 'r1041_e82_400bps_hac@v4.2.0:consesus'
--polish_pilon Polish with short reads (see below) using pilon? Sefault: false
Scaffolding  
--scaffold_ragtag Scaffolding with ragtag? Default: false
--scaffold_links Scaffolding with LINKS? Default: false
--scaffold_longstitch Scaffolding with longstitch? Default: false
QC  
--short_reads Short reads available? These should go into shortread_F and shortread_R columns and the paired column should be true if both are filled. If only single-end reads are available, shortread_R remains empty, and paired is false. If short-reads are supplied, k-mer spectra will be used to assess quality of the assembly(s). Default: false
--trim_short_reads Trim short reads with trimgalore? Default: true
--meryl_k Value of k for meryl k-mers. Default: 21 
--qc_reads Long reads that should be used for QC when both ONT and HiFi reads are provided. Options are 'ONT' or 'HIFI'. Default: 'ONT'
--busco Run BUSCO? Default: 'true'
--busco_db Path to local BUSCO db? Default: ""
--busco_lineage BUSCO lineage to use. Default: brassicales_odb10
--quast Run QUAST? Default: true
Skipping steps
--skip_assembly Skip assembly? Requires different samplesheet (!). Default: false
--skip_alignments Skip alignments with minimap2? Requires different samplesheet (!). Default: false

Included profiles

This pipelines comes with some profiles to modify run behaviour independent of infrastructure configs, which can be used via -profile.

Name Contents
ont_flye Assemble ONT reads with flye
hifi_flye Assemble pac-bio hifi reads with flye
hifi_hifiasm Assemble pac-bio hifi reads with hifiasm
hifi_ul Assemble ONT and HiFI reads via hifiasm
ont_on_hifi Assemble HiFi (via hifiasm) and ONT (via flye) and subsequent scaffolding of the ONT assembly onto HiFi assembly with ragtag

Short reads: QC with yak

If short reads are available, yak can be used to perform additional quality control based on kmer spectra. This can be enabled using --short_reads and a samplesheet that looks like this:

sample,ontreads,hifireads,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleName,ontreads.fa.gz,hifireads.fa.gz,assembly.fasta.gz,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true

If there are only single-end reads, shortread_R should remain empty, and paired should be false

Short reads: Polishing with pilon

The assemblies can be polished using available short-reads using pilon. --polish_pilon

This requires additional information in the samplesheet: shortread_F, shortread_R and paired:

sample,ontreads,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleName,reads,assembly.fasta.gz,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true

In a case where only single-reads are available, shortread_R should be empty, and paired should be false.

Scaffolding

LINKS, longstitch and / or ragtag can be used for scaffolding.

Using liftoff

If --lift_annotations is used (default), the annotations from the reference genome will be mapped to assemblies and scaffolds using liftoff. This will happen at each step of the pipeline where a new genome fasta is created, i.e. after assembly, after polishing and after scaffolding.

No refence genome

If there is no reference genome available use --use_ref false to disable the reference genome. Liftoff should not be used without a reference, QUAST will no longer compare to reference.

Skipping Assembly

In case you already have an assembly and would only like to check it with QUAST and polish use --skip_assembly true

This mode requires a different samplesheet:

sample,readpath,assembly,ref_fasta,ref_gff
sampleName,path/to/reads,assembly.fasta.gz,reference.fasta,reference.gff

When skipping flye the original reads will be mapped to the assembly and the reference genome.

Skipping Flye and mappings

In case you have an assembly and have already mapped your reads to the assembly and the reference genome you can use --skip_assembly true --skip_alignments true

This mode requires a different samplesheet:

sample,readpath,assembly,ref_fasta,ref_gff,assembly_bam,assembly_bai,ref_bam
sampleName,reads,assembly.fasta.gz,reference.fasta,reference.gff,reads_on_assembly.bam,reads_on_assembly.bai,reads_on_reference.bam

QUAST

QUAST will run with the following additional parameters:

        --eukaryote \\
        --glimmer \\
        --conserved-genes-finding \\

Acknowledgements

This pipeline builds on modules developed by nf-core.