marbl/canu

Excessive storage usage

gunjanpandey opened this issue · 2 comments

Could you please suggest the best way forward for assembling a presumably highly repetitive 3.5G-4G shrimp genome with ~35X (assuming 4G) ONT data? With the following command, canu generates over 300T of intermediate files for the cormhap step alone. As I am not able to provide any more storage, the run gets interrupted and the log file ends up empty, although there was not much log info in it anyway.

canu \
  -d Assembly -p Pa \
  genomeSize=4g \
  -nanopore -raw ${INPUT_fastq} \
  corMhapFilterThreshold=0.0000000002 \
  corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" \
  mhapMemory=60g mhapBlockSize=500 ovlMerDistinct=0.975

skoren commented

The options you're already running with should reduce correction space usage. You could try switching to minimap2 as the overlapper instead (corOverlapper=minimap). That should reduce space and speed up the run, but we've seen that it doesn't produce as good an assembly. Another option would be to run without correction, if this is relatively new ONT data and on the order of 97+% accurate: https://canu.readthedocs.io/en/latest/quick-start.html#assembling-with-multiple-technologies-and-multiple-files

-untrimmed correctedErrorRate=0.12 maxInputCoverage=100 'batOptions=-eg 0.10 -sb 0.01 -dg 2 -db 1 -dr 3' -pacbio-hifi nanopore_reads.fastq.gz

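For reference, the two suggestions in this comment could be written out as full command lines roughly as follows. This is only a sketch: the options are the ones quoted in this thread, while the `run` helper, the `DRY_RUN` switch, the function names, the separate output directories (`Assembly_minimap`, `Assembly_nocorr`) and the default input path are assumptions added for illustration. By default nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Placeholder input path (assumption); override by exporting INPUT_fastq.
INPUT_fastq="${INPUT_fastq:-nanopore_reads.fastq.gz}"

# Hypothetical helper: print the command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Option 1: keep correction, but use minimap2 as the correction overlapper.
canu_minimap() {
    run canu -d Assembly_minimap -p Pa genomeSize=4g \
        corOverlapper=minimap \
        -nanopore -raw "$INPUT_fastq"
}

# Option 2: skip correction and pass the ONT reads as -pacbio-hifi,
# using the option string quoted in the comment above.
canu_no_correction() {
    run canu -d Assembly_nocorr -p Pa genomeSize=4g \
        -untrimmed correctedErrorRate=0.12 maxInputCoverage=100 \
        'batOptions=-eg 0.10 -sb 0.01 -dg 2 -db 1 -dr 3' \
        -pacbio-hifi "$INPUT_fastq"
}

canu_minimap
canu_no_correction
```

Setting `DRY_RUN=0` would actually launch canu. Writing each option to its own directory is a choice made here so that the runs do not mix intermediate files; it is not something prescribed in the thread.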
skoren commented

Idle