Reverse-strand alignments look messy sometimes

Question

Closed this issue 4 years ago · 6 comments

I'm experimenting with using seqwish to convert hal alignments from cactus to pangenome graphs because the (very old) hal2vg uses too much memory.

I'm noticing that the resulting graphs from seqwish seem to have more sequence than from hal2vg though. Here's a small example from a simulated 600kb mouse/rat/ancestor alignment where the seqwish graph contains 7% more sequence.

vg stats -lz evolver-sw.pg
nodes	258042
edges	349693
length	845864

vg stats -lz evolver-h2vg.pg
nodes	234353
edges	317222
length	789076
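For reference, the size gap between the two graphs can be worked out directly from the two `vg stats` outputs above. This is only a small arithmetic sketch. The helper names `parse_vg_stats` and `percent_increase` are made up for illustration and are not part of vg.

```python
def parse_vg_stats(text):
    """Parse 'key<whitespace>value' lines like those shown above into a dict."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def percent_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Numbers copied from the two outputs above.
seqwish = parse_vg_stats("nodes\t258042\nedges\t349693\nlength\t845864")
hal2vg = parse_vg_stats("nodes\t234353\nedges\t317222\nlength\t789076")

for key in ("nodes", "edges", "length"):
    print(f"{key}: +{percent_increase(seqwish[key], hal2vg[key]):.1f}%")
```

With these numbers the length difference comes out at roughly 7.2%, and the node and edge counts are each about 10% higher in the seqwish graph.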

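Looking at the first difference I found in the deconstructed VCF gave me:

hal2vg: [image: evolver-635336-h2vg] <https://user-images.githubusercontent.com/901102/85785911-c0708b80-b6f7-11ea-80e7-010fdff187ba.png>

seqwish: [image: evolver-635336-sw] <https://user-images.githubusercontent.com/901102/85785947-c8c8c680-b6f7-11ea-9ba2-ebea31ffe2d3.png>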

As far as I can tell, there's nothing going terribly wrong: the graphs contain identical path sequences. But for reverse-strand matches, seqwish seems to be pulling apart some homologies and creating unnecessary nodes. I can't rule out a bug in my input PAF, but I'm not sure what kind of mistake would cause this.

Everything needed to reproduce:
example.tar.gz

Answer 1 · 2020-06-26T08:07:09.000Z

You may want to "groom" the graph to resolve this kind of thing. There is a tool in odgi for that. I'll take a look.


Answer 2 · 2020-06-26T08:08:27.000Z

This is caused by the nodes being added in the forward orientation of a path that is relatively reversed. Grooming should help. There might be a trick to add the sequence without producing this kind of pattern, but I suspect it would be tricky to implement correctly, so postprocessing might be safer.
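To illustrate why reorienting nodes after the fact is safe, here is a toy sketch. It is not odgi's implementation, and the data model (a dict of node sequences plus paths stored as `(node_id, is_reverse)` steps) and the function names `flip_node` and `groom_to_reference` are assumptions made up for this example. The point is only that storing a node on its other strand and toggling every step that visits it leaves all path sequences unchanged.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def path_sequence(nodes, path):
    """Spell out a path from node sequences and (node_id, is_reverse) steps."""
    return "".join(revcomp(nodes[n]) if rev else nodes[n] for n, rev in path)


def flip_node(nodes, paths, node_id):
    """Store node_id on its other strand and toggle every step that visits it."""
    new_nodes = dict(nodes)
    new_nodes[node_id] = revcomp(nodes[node_id])
    new_paths = {
        name: [(n, (not rev) if n == node_id else rev) for n, rev in steps]
        for name, steps in paths.items()
    }
    return new_nodes, new_paths


def groom_to_reference(nodes, paths, ref):
    """Toy heuristic: flip each node whose first visit by `ref` is reversed."""
    first_visit = {}
    for n, rev in paths[ref]:
        first_visit.setdefault(n, rev)
    for n, rev in first_visit.items():
        if rev:
            nodes, paths = flip_node(nodes, paths, n)
    return nodes, paths


# Hypothetical three-node graph; "alt" is the reverse complement of "ref".
nodes = {1: "ACG", 2: "TTC", 3: "GA"}
paths = {
    "ref": [(1, False), (2, True), (3, False)],
    "alt": [(3, True), (2, False), (1, True)],
}
groomed_nodes, groomed_paths = groom_to_reference(nodes, paths, "ref")
```

After this pass the hypothetical `ref` path walks every node in the forward orientation, while both paths still spell the same sequences as before.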


Answer 3 · 2020-06-26T13:24:25.000Z

Ok I see that this might be a problem. Thanks for the test case. I'll see what I can do.

Answer 4 · 2020-07-03T19:42:17.000Z

All signs point to this being an issue with hal2paf's cigar strings being wrong. Sorry!

Answer 5 · 2020-07-04T14:59:42.000Z

The reverse-strand alignments do look messy, but we checked @glennhickey's test case and found the problem was due to PAF format confusion. The CIGARs were reversed, and that led to a blowup in the size of the graph, because most of the "matched" sequence was not actually matching.
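A quick way to catch this kind of problem is to replay each CIGAR against the actual sequences and count how many "match" columns really agree. The sketch below assumes the PAF convention as used by minimap2, as I understand it: query coordinates are given on the original strand, and for `-` alignments the CIGAR walks the target forward against the reverse complement of the query interval. It also assumes that "reversed" here means the operations were emitted in the opposite order. The function names and the toy sequences are made up for illustration.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def matched_identity(query, target, qstart, qend, strand, tstart, cigar):
    """Fraction of M/=/X columns in which query and target bases agree."""
    q = query[qstart:qend]
    if strand == "-":
        q = revcomp(q)  # assumed convention: CIGAR follows the target strand
    qi, ti = 0, tstart
    same = total = 0
    for length, op in re.findall(r"(\d+)([MID=X])", cigar):
        n = int(length)
        if op in "M=X":
            for k in range(n):
                same += q[qi + k].upper() == target[ti + k].upper()
            total += n
            qi += n
            ti += n
        elif op == "I":  # bases present only in the query
            qi += n
        else:  # "D": bases present only in the target
            ti += n
    return same / total if total else 0.0


def reverse_cigar(cigar):
    """Emit the same operations in the opposite order."""
    return "".join(reversed(re.findall(r"\d+[MID=X]", cigar)))


# Toy reverse-strand record: the query is the reverse complement of the
# target with target bases 3..4 deleted.
target = "GATTACAGGCCT"
query = "AGGCCTGATC"
good = matched_identity(query, target, 0, 10, "-", 0, "3M2D7M")
bad = matched_identity(query, target, 0, 10, "-", 0, reverse_cigar("3M2D7M"))
```

On this toy record the correctly ordered CIGAR gives full identity, while the reversed one pairs up bases that do not agree. On real data, a low value from a check like this on `-` records only would point at the CIGAR orientation rather than at seqwish.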

Answer 6 · 2020-07-04T15:00:49.000Z

So the way to fix the messy reverse strand alignments is "grooming" as in odgi groom.