Question about GFA format
Hi @IlyaMinkin,
It's me again :). TwoPaCo has been working great, but I've run into a small issue regarding the GFA file, and I was hoping you could clear up my confusion. I built a cdBG using TwoPaCo with k=31. As the documentation states that k is the node size, I expected the cdBG to contain a list of segments (i.e., contigs) that overlap by k-1. However, in the resulting GFA file, all of the contigs instead seem to overlap by k (i.e., they show a 31M overlap). This causes some issues downstream, as we rely on the invariant that a k-mer (or its reverse complement) appears at most once in the cdBG. When the overlap is of size k, a given k-mer may instead appear once for every overlap it participates in.
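To make the invariant concrete, here is a minimal Python sketch of how one might check it on a GFA1 file. It is only an illustration, not TwoPaCo code: the function names `canonical`, `kmer_multiplicity`, and `link_overlaps` are hypothetical, and it assumes segment sequences contain only A, C, G, and T and that the overlap is the sixth tab-separated field of each `L` line.

```python
from collections import Counter

# Complement table; assumes sequences contain only A, C, G, T.
_COMP = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return min(kmer, rc)


def kmer_multiplicity(segments, k):
    """Count how often each canonical k-mer occurs across all segment sequences."""
    counts = Counter()
    for seq in segments:
        for i in range(len(seq) - k + 1):
            counts[canonical(seq[i:i + k])] += 1
    return counts


def link_overlaps(gfa_lines):
    """Tally the overlap CIGAR strings found on GFA1 link (L) lines."""
    tally = Counter()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "L" and len(fields) >= 6:
            tally[fields[5]] += 1
    return tally


if __name__ == "__main__":
    k = 3
    # Two toy segments overlapping by k bases ("ACG"): the shared k-mer is duplicated.
    print(kmer_multiplicity(["AACG", "ACGG"], k))
    # The same path with a k-1 overlap ("CG"): every k-mer occurs once.
    print(kmer_multiplicity(["AACG", "CGG"], k))
```

On a real file, the segment sequences would come from the third field of the `S` lines. Any canonical k-mer with a count above 1 violates the invariant described above.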
Have I misunderstood something about the expected format of this graph? Is there an easy way to obtain the cdBG GFA file such that the overlaps are retained as k-1 bases instead of k?
Thanks!
Rob
Hi @rob-p ,
I understand your confusion. The issue is that we initially adopted the edge-centric definition of the graph, i.e. sequences are spelled by edges, with nodes of size
Again, sorry for the confusion. I am aware that it comes up all the time (https://www.biostars.org/p/175058/). I plan to improve the documentation to clear things up (I even put it on the list for 0.9.3: https://github.com/medvedevgroup/TwoPaCo/blob/master/NEWS.md). I just didn't expect people to start using TwoPaCo right away :)
Hi @IlyaMinkin,
Yup, I understand the confusion here as well. We have often gone back and forth between preferring the node-centric and edge-centric views of the dBG.
I guess my concern with the proposed temporary solution (running with
Thanks for the quick responses!
Rob