ggonnella/gfapy

Optimization: add ability to ignore sequences when loading a GFA file

fedarko opened this issue · 6 comments

Thank you for your work on Gfapy!

I have a very large GFA file (over 3 GB), containing ~80k segments, that I'd like to load using Gfapy. Although this graph contains sequence information for each segment, I don't really care about the segments' sequences—I just want to know the topology of the graph and the length of each segment.

Gfapy takes a very long time to load this file (I'm not sure it would ever finish before running out of memory: I killed it after a few minutes on our cluster). It ended up being easier for me to write a simple line-by-line parser [albeit much less robust than Gfapy :) -- i.e. without GFA2 support] which completely ignores the segments' sequences.

This line-by-line parser takes only a few seconds to run on the aforementioned large GFA file, and I am fairly sure that the speed-up comes from completely ignoring the segments' sequences (instead treating each segment line as if it were written as `S\tsegname\t*\tLN:i:100` or something). I suspect this is also the reason why Gfapy's parser took so long in issue #23 in this repository; from going through the script that generated the GFA file discussed there, it seems like this script includes sequences in the GFA file, and thus causes Gfapy to spend a lot of time loading these (usually relatively large) sequences into memory.

I am writing some code now to work with arbitrary GFA files, so I'd really like to be able to use Gfapy; however, the long load time is a blocker for me. I would therefore like to propose that Gfapy's Gfa.from_file() function gain an optional flag parameter to ignore sequences in segment lines. I'm not sure how easy this would be to implement in Gfapy's code, but I think this could really increase the sizes of graphs that Gfapy could work with.

Thanks for your consideration, and thanks again for your work on this project.

Hi @fedarko

I implemented a new option 'ignore_sequences' in the 'from_file' method. See the branch 'opt_ignore_seqs' of this repository.

I don't know if this really solves the problem with such a huge file. Let me know!

Hi @ggonnella,

Thank you for the quick response! That's great news.

I downloaded the opt_ignore_seqs branch, but it doesn't look like the ignore_sequences option is available -- as far as I can tell, the opt_ignore_seqs branch is identical to the master branch. Maybe some code wasn't pushed?

Thank you! I downloaded the fixes and tried running Gfa.from_file() on the big > 3 GB GFA file (using ignore_sequences=True). After a few minutes, the from_file() operation is unfortunately still running.

If I can continue asking for things :), I have a suspicion that one reason this still takes a while is that the long sequences are still being read into memory and passed along—they are only explicitly overwritten after some time, and in the meantime references to line, gfa_line, data, etc. are kept around (and these strings necessarily include the sequence as a part of them).

It's a bit of a hack, but I wonder if adjusting Gfa.read_file() to detect whether line is a segment line and, if ignore_sequences is True, "modifying" the line to not include the sequence, and then keeping everything downstream the same, might be more efficient?
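To make the idea concrete, here is a rough sketch of what such a pre-processing step could look like. (The function name `strip_segment_sequence` is hypothetical, and this only handles the GFA1 segment layout; I haven't looked at how read_file() is actually structured.)

```python
def strip_segment_sequence(line):
    """If `line` is a GFA1 segment record, replace its sequence field
    with '*' and record the original length as an LN:i tag, so that
    nothing downstream ever holds the full sequence string."""
    if not line.startswith("S\t"):
        return line  # not a segment line: pass through unchanged
    fields = line.rstrip("\n").split("\t")
    seq = fields[2]
    if seq != "*":
        fields[2] = "*"
        # preserve the segment length unless an LN tag is already present
        if not any(f.startswith("LN:i:") for f in fields[3:]):
            fields.append("LN:i:{}".format(len(seq)))
    return "\t".join(fields)
```

Everything after this point would see a short line like `S\tsegname\t*\tLN:i:100`, so the sequence string becomes garbage-collectable as soon as the original line goes out of scope.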

In any case, thank you for implementing this as is—I really appreciate it!

I implemented a version doing that. Unfortunately I have to do something else now, but in preliminary tests it seems to work and could be faster than the previous version.

If you want to try it, grab the newest commit in the opt_ignore_seqs branch...