estimation of footprint to assemble a large plant genome

Question

estimation of footprint to assemble a large plant genome

Opened this issue 5 years ago · 4 comments

Hello,
we would like to use Ra to assemble a ~5 Gb plant genome (the genome is actually 2.5Gb in size, but it is highly-heterozygous, so we want to distinguish the two alleles in separate contigs). I have about 45x (of 5 Gb) of ONT data (N50 15 kb, QV >7) and we wonder if there is a way to predict the size of the minimap2 file and optimize the alignment step, since it is the most expensive.
For example, to simplify the output, adding/tweaking the minimap2 options: -X -p -N
and to be more sensitive: -A -B -O -E -z.
Do you think there is room in that part to increase specificity of alignments and decrease footprint and computation time?

Lastly, can you estimate the memory usage for an assembly where we have up to 36 CPUs (hyperthreading to 72) and max 500 GB RAM available? Will the polishing/graph construction steps be more memory demanding than the alignment step?
Thanks,
Dario

Answer 1 · 2019-09-05T16:43:01.000Z

Hello Dario,
I think that the memory consumption should be around the size of the FASTQ file you have plus some epsilon. It is really hard to predict the file sizes, what you can do is replace sequence names with identifiers in FASTQ file which will decrease the PAF file significantly. You can also edit scripts/ra.py so that the consensus is not run if you want to change alignment parameters of minimap2 and run the polishing manually (changes at the bottom). We are working on decreasing memory usage and removing disk usage completely.

Best regards,
Robert

Changes for ra.py:

remove stdout=layout_file at https://github.com/lbcb-sci/ra/blob/master/scripts/ra.py#L143
add return at https://github.com/lbcb-sci/ra/blob/master/scripts/ra.py#L150
run cmake before running (so that the "executable" in bin is updated)

Answer 2 · 2019-09-06T10:26:24.000Z

Thank you Robert,
I will apply your tips now.

Yesterday I run Ra on a small subset of my input file to test its behavior on a cluster.
The job died (probably for running out of memory), and the files in the working directory are gone:

[copettid@eu-login-11-ng ra_assemblies]$ tail lsf.o99583173
2b295004e000-2b2954000000 ---p 00000000 00:00 0
2b295a256000-2b2966563000 rw-p 00000000 00:00 0
2b2968000000-2b2968021000 rw-p 00000000 00:00 0
2b2968021000-2b296c000000 ---p 00000000 00:00 0
2b297e565000-2b299e566000 rw-p 00000000 00:00 0
2b2b5e569000-2b2d5e56a000 rw-p 00000000 00:00 0
7fff4286e000-7fff42890000 rw-p 00000000 00:00 0                          [stack]
7fff42903000-7fff42905000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
[Ra::__exit__] warning: unable to clean work directory!
[copettid@eu-login-11-ng ra_assemblies]$ ls ra_work_directory_1567688969.28/
[copettid@eu-login-11-ng ra_assemblies]$

is this indicative of too little memory? I am pretty sure the alignment step was already completed.
The stdout says at one point:

[M::worker_pipeline::1339.338*41.21] mapped 23468 sequences
[M::main] Version: 2.14-r892-dirty
[M::main] CMD: /cluster/home/copettid/bin/ra/vendor/minimap2/minimap2 -t 68 -x ava-ont subset.fa subset.fa
[M::main] Real time: 1341.867 sec; CPU: 55197.934 sec; Peak RSS: 33.331 GB
[Ra::run] preconstruction stage
[rala::Graph::initialize] loaded sequences
[rala::Graph::initialize] loaded overlaps
[rala::Graph::initialize] number of prefiltered sequences = 8476
[rala::Graph::initialize] elapsed time = 4356.13812 s
[rala::Graph::construct] loaded overlaps
*** Error in `/cluster/home/copettid/bin/ra/build/vendor/rala/bin/rala': free(): invalid next size (fast): 0x0000000034a72b10 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81499)[0x2b283175c499]
/cluster/home/copettid/bin/ra/build/vendor/rala/bin/rala(_ZN4rala5Graph10preprocessERSt6vectorISt10unique_ptrINS_7OverlapESt14default_deleteIS3_EESaIS6_EES9_+0x1248)[0x41bda8]
/cluster/home/copettid/bin/ra/build/vendor/rala/bin/rala(_ZN4rala5Graph9constructERKSs+0x53d)[0x41cc9d]
/cluster/home/copettid/bin/ra/build/vendor/rala/bin/rala(main+0x22c)[0x40d2ec]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b28316fd445]
/cluster/home/copettid/bin/ra/build/vendor/rala/bin/rala[0x40d59c]
======= Memory map: ========
00400000-0044c000 r-xp 00000000 00:33 86210240                           /cluster/home/copettid/bin/ra/build/vendor/rala/bin/rala
0064c000-0064d000 r--p 0004c000 00:33 86210240                           /cluster/home/copettid/bin/ra/build/vendor/rala/bin/rala
0

This test was run with 352,825 fastas (5% of what I want to use), asking for 36 CPUs (to be hyperthreaded to 72, in a single node) and 144 GB of RAM, the command was
ra -t 68 -x ont subset.fa
The max memory requested was 185 GB.
How do I get to have a complete run on a subset at first?

Also, would it be possible to restart the assembly from the step just after the minimap2 alignment? In this way I can submit two jobs, each with the resources optimized for the task (CPUs at first, memory later).
Thanks,

Dario

Answer 3 · 2019-09-06T10:45:36.000Z

I would not say that this is a memory issue, maybe some bug. Would you mind sharing the subset causing the error so I can see locally?

Unfortunately, you can not restart the assembly after minimap2 step with ra.py, but you could manually run all the steps as in the script.

Best regards,
Robert

Answer 4 · 2019-09-12T12:03:09.000Z

@dcopetti, you can also try the updated version (all in memory) here: https://github.com/lbcb-sci/raven.