Gaius-Augustus/GALBA

Long runtime for 3 Gb genome

Hi Katharina,

I have some questions about running GALBA efficiently on large mammalian genomes (~3 Gb). Currently, GALBA downsamples to 8000 training genes if miniprot identifies more than that:

if($nLociGb3 > 8000){

I came across this (but not many other supporting claims) suggesting that training with more than 1000 genes has limited benefit. Do you know if that is still generally true, or could the downsampling limit perhaps be exposed as a parameter? I'd rather get a "pretty good" annotation done in reasonable time than have the "best possible" annotation killed after 24 hours on 8 cores.
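To make concrete what I mean by an exposed limit, here is a minimal Python sketch of capping a list of training loci at a configurable size. This is not GALBA's code (GALBA does this in Perl, and I have not checked how it picks the subset). The function name, the `limit` and `seed` parameters, and the uniform random selection are all my own assumptions. Only the default of 8000 comes from the line quoted above.

```python
import random


def downsample_loci(locus_ids, limit=8000, seed=0):
    """Return at most `limit` loci, chosen uniformly at random.

    Hypothetical helper for illustration only; the original input
    order is preserved among the loci that are kept.
    """
    loci = list(locus_ids)
    if limit < 0:
        raise ValueError("limit must be non-negative")
    if len(loci) <= limit:
        # Nothing to do: already at or below the cap.
        return loci
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    keep = set(rng.sample(range(len(loci)), limit))
    return [locus for i, locus in enumerate(loci) if i in keep]


if __name__ == "__main__":
    all_loci = [f"g{i}" for i in range(12000)]  # made-up locus IDs
    subset = downsample_loci(all_loci, limit=1000)
    print(len(all_loci), "->", len(subset))
```

With something like this, a user could pass 1000 instead of 8000 and accept a somewhat weaker training set in exchange for a shorter run.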

My other question: the manuscript discusses runtimes when using 72 cores, but the README seems to suggest that running with more than 8 cores has limited benefit because optimize_augustus.pl and other steps only use up to 8 cores. Which of these statements is more useful to follow?

@CEPHAS-01, did you get GALBA to finish on your assemblies (and if so, what was the CPU time)?

Best,
Alex

Thanks for the detailed response. I'll bump the threads up towards 40.

Also, regarding BRAKER2 Figure S3, I (naively) would be happy with ~54% accuracy from 1000 training genes compared to ~58% accuracy from 8000 training genes. Assuming runtime scales roughly linearly with the number of training genes, and given that the optimising/training stages seem to be the bottlenecks, I would happily trade 4 percentage points of accuracy for a 5x speed-up. I see annotation as a bonus on multiple new assemblies, rather than as annotating a reference to high quality.

Do you think subsetting to 1000 training genes would be appropriate for that goal? Hopefully I can get the normal version with 8k to finish and then compare with 1k.
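As a rough sanity check on that trade-off, here is a small Python sketch using an Amdahl-style model. It assumes that only the training/optimising stages get faster, that their runtime scales linearly with the number of training genes, and that everything else is unchanged. Those are my assumptions, not measurements, and the function names are made up for illustration.

```python
def overall_speedup(training_fraction, reduction_factor):
    """Whole-run speed-up if only the training share of runtime shrinks.

    training_fraction: share of total runtime spent in training/optimising (0..1)
    reduction_factor:  how many times fewer training genes are used (e.g. 8000/1000 = 8)
    """
    if not 0.0 <= training_fraction <= 1.0:
        raise ValueError("training_fraction must be between 0 and 1")
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return 1.0 / ((1.0 - training_fraction) + training_fraction / reduction_factor)


def required_training_fraction(target_speedup, reduction_factor):
    """Training share of runtime needed to reach `target_speedup` overall."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / reduction_factor)


if __name__ == "__main__":
    reduction = 8000 / 1000  # 8x fewer training genes
    # Share of runtime that training must account for to get a 5x speed-up.
    print(f"{required_training_fraction(5.0, reduction):.3f}")
    # Relative accuracy loss when going from ~58% to ~54%.
    print(f"{(58 - 54) / 58:.3f}")
```

Under this model, a 5x overall speed-up from 8x fewer training genes requires training to account for roughly 91% of the runtime. If training is only half of the runtime, the overall gain drops to under 2x.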

Hi Alex,

The run did not complete due to the error I described earlier in the thread here, so it is difficult to estimate the CPU time for the run. You may be able to successfully complete the run, and I would love to hear your feedback on this.

Warm regards,
Temitayo

In the end, with 32 threads, this took about 44 wall-clock hours (1086 CPU hours), peaking at 115 GB of RAM. So increasing the threads was quite useful. It looks like GALBA predicts too many genes, although some script (maybe from some BRAKER issue?) reports that about 1/3 of the genes have low/no evidence from the hints, so excluding those brings the gene count closer to expectations.
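For anyone wanting to judge how well the extra threads were used, here is a quick Python calculation from the numbers above. The wall time is approximate, so treat the result as a ballpark figure. The helper name is my own.

```python
def core_utilization(cpu_hours, wall_hours, threads):
    """Fraction of the available core-hours that were actually used."""
    if wall_hours <= 0 or threads <= 0:
        raise ValueError("wall_hours and threads must be positive")
    return cpu_hours / (wall_hours * threads)


if __name__ == "__main__":
    cpu_hours, wall_hours, threads = 1086, 44, 32  # figures reported in this comment
    print(f"average busy cores: {cpu_hours / wall_hours:.1f}")
    print(f"utilization: {core_utilization(cpu_hours, wall_hours, threads):.0%}")
```

That works out to roughly 25 of the 32 cores busy on average, so most of the run did make use of more than 8 cores.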