marbl/canu

Speeding up correction / assembly of large plant genome

Closed this issue · 8 comments

Hi,

I have a large plant genome, estimated at 9.3 Gbp, to assemble. I have ~40X coverage of ONT reads, filtered to >10 kbp and >=Q15. My compute resource cannot run grid jobs (a non-negotiable rule). I have 104 CPUs, 500 GB memory, and 10 TB of fast 'scratch' storage, but only a 48-hour walltime limit. The stage directory is the job node's local storage of 400 GB; this could be changed to also use scratch, which would be slightly slower but larger. I have run canu on smaller genomes with no problems, but for this genome I cannot get past the correction step within 48 hours, and on restarting, canu picks up at a point where it still cannot finish (and so endlessly repeats the same correction steps).

I have tried changing the overlapper / correction resource options to a few concurrent processes with many CPUs each, and vice versa, with no success yet - I always hit the 48-hour limit. Currently I am trying 64x concurrency with 2 threads each.

My question is: do you think it is possible within 48 hours, and if so, what would be the best options? Here are my current settings and the stderr from the last attempt.

Thanks,

Theo


label=aust
saveLocation=/my/scratch/
reads=myreads.fasta
stageDirectory=$PBS_JOBFS
genomeSize=9.3g
correctCoverage=25
correctedErrorRate=0.03
corMaxEvidenceErate=0.05
minReadLength=10000
minOverlapLength=500


/my/bin/canu-2.2/bin/canu \
		-p $label \
		-d ${saveLocation}/canu_${label} \
		-nanopore $reads \
		-fast \
		genomeSize=${genomeSize} \
		stageDirectory=${stageDirectory} \
		corOutCoverage=${correctCoverage} \
		useGrid=false \
		correctedErrorRate=${correctedErrorRate} \
		corMaxEvidenceErate=${corMaxEvidenceErate} \
		minReadLength=${minReadLength} \
		minOverlapLength=${minOverlapLength} \
		cormhapMemory=8 \
		cormhapConcurrency=64 \
		cormhapThreads=2 \
		obtmhapMemory=8 \
		obtmhapConcurrency=64 \
		obtmhapThreads=2 \
		utgmhapMemory=8 \
		utgmhapConcurrency=64 \
		utgmhapThreads=2 \
		"batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \

stderr attached.
stderr.txt

Given a 9.3 Gbp genome, trying to make this run on 104 cores while fitting into 48 hours at a time is going to be a tall order, and it will likely still run for a very long time. I'm not sure what you mean by "My compute resource cannot run grid jobs". Do you mean compute hosts can't submit jobs? Even if that is the case, you can have canu print the submit commands for you to run yourself using useGrid=remote. That would hopefully let you access more than 104 CPUs; a grid is strongly recommended for larger genomes.
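A minimal sketch of that workflow, reusing the label and paths from your first script (the exact submit commands canu prints depend on your scheduler):

    # With useGrid=remote, canu writes the grid job scripts and prints
    # the submit commands instead of submitting them itself; you run
    # those commands by hand, then restart canu once the jobs finish.
    /my/bin/canu-2.2/bin/canu \
        -p aust \
        -d /my/scratch/canu_aust \
        -nanopore myreads.fasta \
        genomeSize=9.3g \
        useGrid=remote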

Without a grid, I'm not sure it's worth generating this assembly. If you really want to spend time trying, I'd suggest adding the faster/lower-space parameters from the FAQ. I would also increase the threads per job and decrease the concurrency; you want at least a few of the jobs to complete within the 48 hrs, and giving them more threads should help do that. If none of the mhap jobs are completing in 48 hrs, you can decrease the mhap memory to make more, smaller partitions so the jobs can finish. I expect that if you're running into trouble in correction, you'll have even more issues with the trimming/assembly overlap jobs. There again you could increase the cores, and you might have to play with ovlRefBlockLength, decreasing it from the default of 30000000000 (sketched below). Given that the data do look high quality based on the k-mer histogram, you could try an uncorrected assembly (see the quick start) with an error rate of 4-6%.
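For instance, a retuning relative to your current 64 x 2 layout might look like the following (the numbers are illustrative, not tested at this scale; splice them into the canu command without the comments):

    cormhapConcurrency=8             # fewer simultaneous mhap jobs...
    cormhapThreads=13                # ...with more threads each (8 x 13 = 104 CPUs)
    cormhapMemory=6                  # lower memory => more, smaller partitions per job
    ovlRefBlockLength=10000000000    # smaller trimming/assembly overlap jobs (default 30000000000)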

Ok, thanks for the advice. I am now trying it without correction, using -trimmed and -corrected as below. I haven't tried reducing ovlRefBlockLength yet, and I've left the utgmhap memory, concurrency, and threads at their defaults.


label=aust
saveLocation=/my/scratch/
reads=/my/reads-filt.fasta
stageDirectory=$PBS_JOBFS
genomeSize=9.3g
correctedErrorRate=0.05
minReadLength=10000
minOverlapLength=500


/my/bin/canu-2.2/bin/canu \
		-p $label \
		-d ${saveLocation}/canu_nocor_${label} \
		-fast \
		-trimmed \
		-corrected \
		-nanopore $reads \
		genomeSize=${genomeSize} \
		stageDirectory=${stageDirectory} \
		useGrid=false \
		correctedErrorRate=${correctedErrorRate} \
		minReadLength=${minReadLength} \
		minOverlapLength=${minOverlapLength} \
		"batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \

The stderr after the first 48 hours is below. I am not sure whether restarting will run the next set of mhap jobs or just start over (and never finish).

........
    ./precompute.sh 57 > ./precompute.000057.out 2>&1
    ./precompute.sh 58 > ./precompute.000058.out 2>&1
    ./precompute.sh 59 > ./precompute.000059.out 2>&1

-- Finished on Thu Dec  7 17:49:21 2023 (21153 seconds) with 4671033.048 GB free disk space
----------------------------------------
-- All 59 mhap precompute jobs finished successfully.
-- Finished stage 'utg-mhapPrecomputeCheck', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'utgmhap' concurrent execution on Thu Dec  7 17:49:22 2023 with 4671033.048 GB free disk space (143 processes; 8 concurrently)

    cd unitigging/1-overlapper
    ./mhap.sh 1 > ./mhap.000001.out 2>&1
    ./mhap.sh 2 > ./mhap.000002.out 2>&1
    ./mhap.sh 3 > ./mhap.000003.out 2>&1
    ./mhap.sh 4 > ./mhap.000004.out 2>&1
    ./mhap.sh 5 > ./mhap.000005.out 2>&1
    ./mhap.sh 6 > ./mhap.000006.out 2>&1
    ./mhap.sh 7 > ./mhap.000007.out 2>&1
    ./mhap.sh 8 > ./mhap.000008.out 2>&1
=>> PBS: job killed: walltime 172898 exceeded limit 172800


Those options don't look correct; I was referencing the FAQ here: https://canu.readthedocs.io/en/latest/quick-start.html#assembling-with-multiple-technologies-and-multiple-files. You still need trimming. I wouldn't use the -fast option either. I doubt it will finish any jobs unless you also adjusted the job size down (using ovlRefBlockLength), but I'm not sure it is possible to shrink the jobs enough to fit in 48 hrs given the size of your genome.

Ok, so you are saying to trim first (but with no correction), then use -trimmed -corrected in the assembly step? And use a reduced ovlRefBlockLength for both? I have stopped the assembly - you are probably right that it will never finish. I'm currently waiting for some HiFi reads so I can try a hybrid approach, which will hopefully speed things up a lot. Thanks.

You can run both in one command using the options: -untrimmed correctedErrorRate=0.12 maxInputCoverage=100 'batOptions=-eg 0.10 -sb 0.01 -dg 2 -db 1 -dr 3' -pacbio-hifi <your ont reads>. Also add a lower value for ovlRefBlockLength; start with 10000000000, which is 1/3 of the default.
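Spliced into your earlier script, that would look something like this (an untested sketch reusing your variables; the output directory name is arbitrary):

    /my/bin/canu-2.2/bin/canu \
    		-p $label \
    		-d ${saveLocation}/canu_hifi_${label} \
    		-untrimmed \
    		-pacbio-hifi $reads \
    		genomeSize=${genomeSize} \
    		stageDirectory=${stageDirectory} \
    		useGrid=false \
    		maxInputCoverage=100 \
    		correctedErrorRate=0.12 \
    		ovlRefBlockLength=10000000000 \
    		"batOptions=-eg 0.10 -sb 0.01 -dg 2 -db 1 -dr 3"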

Thanks, I'll try that.

After 48 hours with those settings, the overlapper got to the 108th job of 4698. So I guess that means it would take an unfeasibly long time to finish (90 days?). I will have to wait for additional HiFi data and/or a grid resource. Thanks for your help.
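(Back-of-envelope: 4698 jobs / 108 jobs per 48-hour window = 43.5 windows, i.e. roughly 87 days of walltime, assuming jobs keep completing at the same rate.)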

Yes, I think more cores would be your best option. Was this timing for overlapper with the uncorrected option or something else?