HudsonAlpha/fmlrc2

Index generation

Akazhiel opened this issue · 3 comments

Hello!

I'm trying to generate the index, but it's taking way too much RAM; I tried running it on a 60 GB machine and it wasn't enough. I guess my fastq file may be too big, since it's 255 GB compressed. Is there a way that I could split the fastq, then generate multiple indexes and combine them?

Best regards,

Jonatan

Is there a way that I could split the fastq, then generate multiple indexes and combine them?

Theoretically, yes, but in practice I'm not aware of an existing implementation that would likely help with your memory issue.

Can you tell me the approach you're using for construction and where the main source of memory consumption is in that process?

Hello.

Yes, I'm following the exact same command that's explained in the wiki:

```shell
gunzip -c H2009_shortreads.fastq.gz | awk 'NR % 4 == 2' | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert H2009_msbwt.npy
```

I'm skipping the sort step, since the README mentions it can be skipped when the BWT is only used for correcting reads. The main source of memory consumption is the ropebwt2 step: I believe it keeps everything in memory and doesn't produce any output until it finishes, which is why memory usage keeps increasing the whole time.

Okay, that's what I figured, since most people use that method. The short answer is there probably isn't a great workaround other than getting a machine with more memory, or simply using less of the data (which will obviously impact fmlrc2's downstream performance).
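As a sketch of the "less data" route: randomly subsample whole FASTQ records (4 lines each) before the BWT construction, so memory scales with the retained fraction. The filenames and the 0.5 fraction here are illustrative, not a recommendation.

```shell
# Keep roughly 50% of reads. The decision is made once per record
# (on the @header line, NR%4==1) and applied to all 4 lines of it.
# Fixed srand seed makes the subsample reproducible.
gunzip -c H2009_shortreads.fastq.gz \
  | awk 'BEGIN{srand(42)} NR%4==1{keep=(rand()<0.5)} keep' \
  | gzip > H2009_subsampled.fastq.gz
```

The subsampled file can then be fed through the same ropebwt2 pipeline; halving the input should roughly halve ropebwt2's peak memory, at the cost of coverage for correction.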

If you're feeling experimental, you could try a tool I'm currently working on: msbwt-build. Anecdotally, it tends to use slightly less memory (probably only 5-15% less), but I haven't formally benchmarked the memory usage across enough datasets to know for sure.