Reverse-strand alignments look messy sometimes
Closed this issue · 6 comments
I'm experimenting with using seqwish to convert hal alignments from cactus to pangenome graphs because the (very old) hal2vg uses too much memory.
I'm noticing, though, that the graphs produced by seqwish seem to contain more sequence than those from hal2vg. Here's a small example from a simulated 600kb mouse/rat/ancestor alignment where the seqwish graph contains 7% more sequence.
vg stats -lz evolver-sw.pg
nodes 258042
edges 349693
length 845864
vg stats -lz evolver-h2vg.pg
nodes 234353
edges 317222
length 789076
Looking at the first difference I found in the deconstructed VCF, I got:
hal2vg:
seqwish:
As far as I can tell, nothing is going terribly wrong: the graphs contain identical path sequences. But for reverse-strand matches, seqwish seems to be pulling apart some homologies and creating unnecessary nodes. I can't rule out a bug in my input PAF, but I'm not sure what kind of mistake would cause this.
Everything needed to reproduce:
example.tar.gz
Ok I see that this might be a problem. Thanks for the test case. I'll see what I can do.
All signs point to this being an issue with hal2paf's cigar strings being wrong. Sorry!
The reverse-strand alignments do look messy, but we checked @glennhickey's test case and found the problem was due to PAF format confusion. The cigars were reversed, and that led to a blowup in the size of the graph, because most of the "matched" sequence was not actually matching.
So the way to fix the messy reverse strand alignments is "grooming" as in odgi groom.
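For anyone hitting something similar, here is a minimal sketch of how one might check whether the cigars on reverse-strand PAF records are written in the wrong order. It is not code from seqwish or hal2paf. The helper names (`parse_cigar`, `reverse_cigar`, `revcomp`, `match_fraction`) are hypothetical, and it assumes the minimap2-style convention in which the cigar of a "-" record walks the target forward against the reverse complement of the query.

```python
import re

# One CIGAR operation: a length followed by an op character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) tuples."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reverse_cigar(cigar):
    """Return the CIGAR with its operations in reverse order."""
    return "".join(f"{n}{op}" for n, op in reversed(parse_cigar(cigar)))


def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def match_fraction(query, target, cigar, strand="+"):
    """Fraction of M/= columns whose bases are really identical.

    Assumption: for strand "-", the CIGAR is expressed against the
    reverse complement of the query, walking the target forward.
    """
    if strand == "-":
        query = revcomp(query)
    qi = ti = same = total = 0
    for n, op in parse_cigar(cigar):
        if op in "M=":
            pairs = zip(query[qi:qi + n], target[ti:ti + n])
            same += sum(q == t for q, t in pairs)
            total += n
            qi += n
            ti += n
        elif op == "X":   # declared mismatch: consumes both sequences
            qi += n
            ti += n
        elif op == "I":   # insertion: consumes query only
            qi += n
        elif op in "DN":  # deletion / skip: consumes target only
            ti += n
        # S, H and P are ignored in this sketch
    return same / total if total else 0.0


# Toy example: the query is the reverse complement of the target
# with two bases deleted after the first four.
target = "ACGTACGTAAGGCC"
query = revcomp("ACGTGTAAGGCC")
good = "4=2D8="
bad = reverse_cigar(good)
```

With the toy data above, `match_fraction(query, target, good, "-")` comes out at 1.0, while the same call with `bad` falls below 1.0. A low fraction on "-" records that recovers after `reverse_cigar` would point at this kind of ordering mix-up in the input PAF.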