marbl/canu

Using Canu w/Snakemake

Closed this issue · 5 comments

Hi,

I have successfully used Canu for all my de novo assemblies and I am embarking on creating a snakemake-env for running Canu+medaka+diamond for viral metagenomic sequencing. Has anyone successfully used Canu with snakemake? I am running into the issue where I need to add --latency-wait time for Canu to create the final assembly.fasta file before moving onto my Medaka rule. I am getting this error (see below). How can I figure out how much time it will take Canu to finish the assembly? When I ran Canu before I never really knew how long the assembly would take, is there a way to figure this out based on how many reads I have?

Below is my snakemake-env error:

(/lustre/project/taw/share/conda-envs/snakemake-env) [kvigil@cypress1 concatenate]$ snakemake -s /lustre/project/taw/share/conda-envs/snakemake-env/nanopore_viralv3 --cores 10 --rerun-incomplete --latency-wait 3600
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 10
Rules claiming more threads will be scaled down.
Job stats:
job                   count
------------------  -------
all                       1
canu_assembly             2
diamond_alignment         2
medaka_polishing          2
minimap2_alignment        2
total                     9

Select jobs to execute...

[Tue Oct 17 08:05:25 2023]
rule canu_assembly:
    input: /lustre/project/taw/kvigil/ONR/baratariabay/ONR_baratariabay_20_100623/20231006_1648_MN18851_FAW76720_acec0fdf/fastq_pass/concatenate/barcode03.fastq
    output: assembly/barcode03/contigs.fasta, assembly/barcode03/canu-report.html
    jobid: 1
    reason: Missing output files: assembly/barcode03/contigs.fasta
    wildcards: barcode=barcode03
    threads: 8
    resources: tmpdir=/tmp

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
-- canu 2.2
--
-- CITATIONS
--
-- For 'standard' assemblies of PacBio or Nanopore reads:
--   Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
--   Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
--   Genome Res. 2017 May;27(5):722-736.
--   http://doi.org/10.1101/gr.215087.116
--
-- Read and contig alignments during correction and consensus use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
--
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
--
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
--
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
--
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_382' (from '/lustre/project/taw/share/conda-envs/snakemake-env/bin/java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
--
-- Detected 8 CPUs and 63 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /cm/shared/apps/slurm/14.03.0/bin/sinfo.
-- Detected Slurm with task IDs up to 1000 allowed.
--
-- Slurm support detected.  Resources available:
--     30 hosts with  20 cores and   62 GB memory.
--     49 hosts with  20 cores and  124 GB memory.
--     41 hosts with  20 cores and  249 GB memory.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     12.000 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8.000 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    6.000 GB   10 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     4.000 GB    5 CPUs  (overlap detection)
-- Grid:  utgovl     4.000 GB    5 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       32.000 GB    4 CPUs  (read error detection)
-- Grid:  oea       32.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       32.000 GB    4 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    4 CPUs  (consensus)
--
-- Found untrimmed raw Nanopore reads in the input files.
--
-- Generating assembly 'barcode03' in '/lustre/project/taw/kvigil/ONR/baratariabay/ONR_baratariabay_20_100623/20231006_1648_MN18851_FAW76720_acec0fdf/fastq_pass/concatenate/assembly/barcode03/contigs.fasta':
--   genomeSize:
--     2000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.3200 ( 32.00%)
--     obtOvlErrorRate 0.2000 ( 20.00%)
--     utgOvlErrorRate 0.2000 ( 20.00%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.3000 ( 30.00%)
--     obtErrorRate    0.2000 ( 20.00%)
--     utgErrorRate    0.2000 ( 20.00%)
--     cnsErrorRate    0.2000 ( 20.00%)
--
--   Stages to run:
--     correct raw reads.
--     trim corrected reads.
--     assemble corrected and trimmed reads.
--
----------------------------------------
-- Starting command on Tue Oct 17 08:05:43 2023 with 153262.71 GB free disk space

    cd /lustre/project/taw/kvigil/ONR/baratariabay/ONR_baratariabay_20_100623/20231006_1648_MN18851_FAW76720_acec0fdf/fastq_pass/concatenate/assembly/barcode03/contigs.fasta
    sbatch \
      --cpus-per-task=1 \
      --mem-per-cpu=4g   \
      -D `pwd` \
      -J 'canu_barcode03' \
      -o canu-scripts/canu.01.out  canu-scripts/canu.01.sh
Submitted batch job 2436515

-- Finished on Tue Oct 17 08:05:43 2023 (fast as lightning) with 153262.71 GB free disk space
----------------------------------------
Waiting at most 3600 seconds for missing files.
MissingOutputException in rule canu_assembly in file /lustre/project/taw/share/conda-envs/snakemake-env/nanopore_viralv3, line 21:
Job 1  completed successfully, but some output files are missing. Missing files after 3600 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
assembly/barcode03/canu-report.html
Removing output files of failed job canu_assembly since they might be corrupted:
assembly/barcode03/contigs.fasta
Skipped removing non-empty directory assembly/barcode03/contigs.fasta
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-10-17T080518.997070.snakemake.log
skoren commented

There isn't an easy way to guess the runtime as it will be affected by the repetitiveness and genome size of the organism not just read counts. Canu is itself a pipeline which submits itself to the grid to run so the total runtime will also depend on the grid availability/scheduling time. It does have onSuccess/onFailure options which let you specify a script to run after Canu completes. You could then run snakemake up to Canu, stop snakemake, and let Canu resume the snakemake using the onSuccess command or have that write a checkpoint file that a process in snakemake can busy-wait on (see #471 for some discussion). Alternatively, if you're only running small assemblies, you could restrict Canu to a single node so it won't complete until the whole assembly is finished by adding useGrid=false to your command line.

Thank you! so I can add the onSuccess in the "params" of my Canu Rule in Snakemake? What do I set the onSuccess=?

Here are my Canu snakemake rules:

rule canu_assembly:
input:
fastq="/lustre/project/taw/kvigil/ONR/baratariabay/ONR_baratariabay_20_100623/20231006_1648_MN18851_FAW76720_acec0fdf/fastq_pass/concatenate/{barcode}.fastq"

output:
   assembly="assembly/{barcode}/contigs.fasta",
   report="assembly/{barcode}/canu-report.html"

params: 
     genomeSize="2m",
     minInputCoverage=0,
     maxInputCoverage=0,
     corOutCoverage=10000,
     stopOnLowCoverage=0,
     corMinCoverage=0,
     redMemory=32,
     oeaMemory=32,
     batMemory=32,
     correctedErrorRate=0.2,
     corMhapSensitivity="high"
    onSuccess=?????

threads: 8

shell:
    "/lustre/project/taw/share/conda-envs/ONRviral/bin/canu -p {wildcards.barcode} -d {output.assembly} onSuccess={params.onSuccess} genomeSize={params.genomeSize} minInputCoverage={params.minInputCoverage} maxInputCoverage={params.maxInputCoverage} corOutCoverage={params.corOutCoverage} stopOnLowCoverage={params.stopOnLowCoverage} corMinCoverage={params.corMinCoverage} redMemory={params.redMemory} batMemory={params.batMemory} oeaMemory={params.oeaMemory} correctedErrorRate={params.correctedErrorRate} corMhapSensitivity={params.corMhapSensitivity} -nanopore {input.fastq}"

https://canu.readthedocs.io/en/latest/parameter-reference.html#global-options (scroll down a bit)

Execute the command supplied when Canu successfully completes an assembly.
The command will execute in the <assembly-directory> (the -d option to canu)
and will be supplied with the name of the assembly (the -p option to canu)
as its first and only parameter.

You could use this to (a) launch snakemake to run the remaining processing steps on the now-complete assembly; (b) create a file that snakemake is actively waiting for (while file-doesn't-exist { sleep 10}); (c) some other fiendishly clever method of telling snakemake to continue. onFailure is similar, but it runs when the assembly fails.

Note that the danger with a busy-wait is that if canu does not finish successfully, snakemake will wait forever.

Note also that in option (a) it is better to submit a job to the grid to run snakemake than to run snakemake directly.

@brianwalenz thank you for this information. I am still working on it!

I know this is quite a late reply but I have been working on implementing a snakemake pipeline for the past two weeks and have it successfully working.


Waiting at most 3600 seconds for missing files.
MissingOutputException in rule canu_assembly in file /lustre/project/taw/share/conda-envs/snakemake-env/nanopore_viralv3, line 21:
Job 1 completed successfully, but some output files are missing. Missing files after 3600 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
assembly/barcode03/canu-report.html
Removing output files of failed job canu_assembly since they might be corrupted:
assembly/barcode03/contigs.fasta
Skipped removing non-empty directory assembly/barcode03/contigs.fasta
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-10-17T080518.997070.snakemake.log

so I am working on the same thing, parallel canu via snakemake, and have found that snakemake is particular about the outputs that you give it. So to chain outputs, you do need to have the file listed so it can build the DAG. However canu takes a directory as output. The way that I have made this work in my snakefiles is to include multiple outputs:

  • one for the directory that canu takes as an argument,
  • one to tell snakemake where the {barcode}.contigs.fasta is coming from.

I think that your error is from snakemake feeding the contigs.fasta as the output directory, canu accepting that as a valid output directory, and creating it. Once it's created, snakemake sees that the output has been generated and assumes that the rule has finished. If you're running on an HPC cluster and useGrid=true (or blank), then from snakemake's perspective, that initial rule has finished, the output directory "contigs.fasta/" exists, but the report is missing because canu has not actually finished running.

To add to what @skoren wrote about setting onsuccess/failure, you could also use checkpoints in snakemake to ensure completion before other rules are executed.

I've rewritten your rule below and added what should be written in rule all:

rule all:
    input:
        ........
        "assembly/{barcode}/{barcode}.contigs.fasta",

rule canu_assembly:
    input:
    fastq="/lustre/project/taw/kvigil/ONR/baratariabay/ONR_baratariabay_20_100623/20231006_1648_MN18851_FAW76720_acec0fdf/fastq_pass/concatenate/{barcode}.fastq"

    output:
        outdir = directory("assembly/{barcode}/"),
        assembly="assembly/{barcode}/{barcode}.contigs.fasta",
        checkpoint=touch("checkpoints/{barcode}.canu.finished")
    params: 
        genomeSize="2m",
        minInputCoverage=0,
        maxInputCoverage=0,
        corOutCoverage=10000,
        stopOnLowCoverage=0,
        corMinCoverage=0,
        redMemory=32,
        oeaMemory=32,
        batMemory=32,
        correctedErrorRate=0.2,
        corMhapSensitivity="high"
        useGrid="false"
    log: "logs/canu/{sample}.canu.log"
    message:
        "Running Canu for sample {wildcards.barcode}"
    threads: 8
    shell:
    "/lustre/project/taw/share/conda-envs/ONRviral/bin/canu -p {wildcards.barcode} -d {output.outdir} "
"genomeSize={params.genomeSize} minInputCoverage={params.minInputCoverage} "
"maxInputCoverage={params.maxInputCoverage} corOutCoverage={params.corOutCoverage} "
"stopOnLowCoverage={params.stopOnLowCoverage} corMinCoverage={params.corMinCoverage} "
"redMemory={params.redMemory} batMemory={params.batMemory} oeaMemory={params.oeaMemory} "
"correctedErrorRate={params.correctedErrorRate} corMhapSensitivity={params.corMhapSensitivity} -nanopore {input.fastq} 2>{log} "