batch_align.py loads the whole query FASTA into RAM
leoisl opened this issue · 2 comments
This clearly does not scale well when the query FASTA is massive (e.g. read sets). One quick and easy way to save some RAM is to load only the queries that map to the given batch. Should I implement this, @karel-brinda, as it is pretty quick to do? Of course, if the whole read set (or most of it) still maps to the batch, we will still load a lot. The only way around this, I think, is to build a FASTA index on the query FASTA and load only the sequence IDs into memory, with the sequences themselves fetched from disk on demand (rough sketch below)...
This does not matter much if the mof-search use case does not involve mapping read sets, which is what I assumed from the beginning, but I know you've been mapping ONT datasets with it...
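For reference, a minimal sketch of the index-plus-on-demand idea. This is not the current batch_align.py code; it assumes pysam is available (not a hard dependency right now, as far as I recall) and that the query FASTA is plain or bgzip-compressed so it can be faidx-indexed:

```python
# Rough sketch only, not the actual batch_align.py code; assumes pysam and a
# query FASTA that can be faidx-indexed (plain or bgzip-compressed).
import pysam

def iter_batch_queries(query_fasta_path, query_names_for_batch):
    """Yield (name, sequence) only for the queries hitting this batch,
    fetching each sequence from disk instead of keeping the FASTA in RAM."""
    # FastaFile uses (and creates, if missing) a .fai index, so only the
    # index and one sequence at a time are held in memory.
    with pysam.FastaFile(query_fasta_path) as fasta:
        for name in query_names_for_batch:
            yield name, fasta.fetch(reference=name)
```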
Completely agree that the current implementation is not great and that this will have to be addressed somehow if we want to support very large query files. However, how specifically would you implement this?
Imagine you have, e.g., a nanopore sequencing experiment and there is one batch where essentially all the reads go. Then every reference genome can match basically any subset of the reads. So how would having them in a distinct file help? Or do you mean that it would help the other batches use fewer resources?
In that case, what about having just the query names in these batch files? That way we don't store the same sequences many times across hundreds of files.
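Something like this (a hypothetical layout, names made up, just to illustrate):

```python
# Hypothetical sketch of the "names only" batch files: each batch file lists
# one query name per line; the sequences stay only in the shared query FASTA.
import os

def write_batch_name_files(batch_to_query_names, out_dir):
    """batch_to_query_names: dict mapping batch name -> list of query names."""
    os.makedirs(out_dir, exist_ok=True)
    for batch, names in batch_to_query_names.items():
        with open(os.path.join(out_dir, f"{batch}.query_names.txt"), "w") as f:
            f.write("\n".join(names) + "\n")
```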
I don't remember this issue/the code very well now, but there is definitely a way to avoid loading the whole FASTA file into RAM, which can easily be tens of GB (we are loading the uncompressed sequences into a Python dictionary, so it takes even more RAM than the uncompressed FASTA on disk). I think we can simply store an identifier for each read (e.g. <read file index, read index>, or the read header, etc.) and skip the sequence, which would make it much more scalable. I'd have to get back into this code to know what the feasible options are, but I think we need something more scalable than loading the whole query into RAM...
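Roughly what I mean by storing only identifiers (a sketch with made-up names, not the actual code):

```python
# Sketch only: keep a lightweight (file index, read index) pair per read header
# instead of the sequence strings themselves.
from typing import Dict, NamedTuple

class QueryLocation(NamedTuple):
    file_index: int  # which query FASTA the read comes from
    read_index: int  # position of the read within that file

def index_query_headers(query_fasta_paths) -> Dict[str, QueryLocation]:
    """Map each read header to its (file_index, read_index); only headers
    are kept in RAM, sequences stay on disk until actually needed."""
    locations: Dict[str, QueryLocation] = {}
    for file_index, path in enumerate(query_fasta_paths):
        read_index = 0
        with open(path) as f:
            for line in f:
                if line.startswith(">"):
                    header = line[1:].split()[0]
                    locations[header] = QueryLocation(file_index, read_index)
                    read_index += 1
    return locations
```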