HudsonAlpha/fmlrc2

Index generation

Akazhiel opened this issue · 3 comments

Hello!

I'm trying to generate the index, but it's taking way too much RAM; I tried running it on a 60 GB machine and it wasn't enough. I guess my fastq file may be too big, since it's 255 GB compressed. Is there a way that I could split the fastq, then generate multiple indexes and combine them?

Best regards,

Jonatan

Is there a way that I could split the fastq, then generate multiple indexes and combine them?

Theoretically, yes, but in practice I'm not aware of an existing implementation that would likely help with your memory issue.

Can you tell me the approach you're using for construction and where the main source of memory consumption is in that process?

Hello.

Yes, I'm following the exact same command that's explained in the wiki:

```shell
gunzip -c H2009_shortreads.fastq.gz | awk 'NR % 4 == 2' | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc2-convert H2009_msbwt.npy
```

I'm skipping the sort step, since the README mentions it can be skipped when the BWT is only used for correcting reads. The main source of memory consumption is the ropebwt2 step: I believe it keeps everything in memory and doesn't produce any output until it finishes, which is why memory usage keeps increasing the whole time.

Okay, that's what I figured, since most people use that method. The short answer is there probably isn't a great workaround other than getting a machine with more memory, or simply using less of the data (which will obviously impact fmlrc2's downstream performance).
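As a sketch of the "less data" route: randomly subsample whole FASTQ records (4 lines each) before the BWT construction, so memory scales with the retained fraction. The filenames and the 0.5 fraction here are illustrative, not a recommendation.

```shell
# Keep roughly 50% of reads. The decision is made once per record
# (on the @header line, NR%4==1) and applied to all 4 lines of it.
# Fixed srand seed makes the subsample reproducible.
gunzip -c H2009_shortreads.fastq.gz \
  | awk 'BEGIN{srand(42)} NR%4==1{keep=(rand()<0.5)} keep' \
  | gzip > H2009_subsampled.fastq.gz
```

The subsampled file can then be fed through the same ropebwt2 pipeline; halving the input should roughly halve ropebwt2's peak memory, at the cost of coverage for correction.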

If you're feeling experimental, you could try a tool I'm currently working on: msbwt-build. Anecdotally, it tends to use slightly less memory (probably only 5-15% less), but I haven't formally benchmarked the memory usage across enough datasets to know for sure.