ekg/seqwish

Reverse-strand alignments look messy sometimes

Closed this issue · 6 comments

I'm experimenting with using seqwish to convert hal alignments from cactus to pangenome graphs because the (very old) hal2vg uses too much memory.

I'm noticing that the graphs from seqwish seem to contain more sequence than those from hal2vg, though. Here's a small example from a simulated 600kb mouse/rat/ancestor alignment where the seqwish graph contains 7% more sequence.

vg stats -lz evolver-sw.pg
nodes	258042
edges	349693
length	845864

vg stats -lz evolver-h2vg.pg
nodes	234353
edges	317222
length	789076

The first difference I found in the deconstructed VCF looks like this:
hal2vg: [image: evolver-635336-h2vg]
seqwish: [image: evolver-635336-sw]

As far as I can tell, there's nothing going terribly wrong. The graphs contain identical path sequences. But in the case of reverse-strand matches, it seems that seqwish may be pulling apart some homologies to make unnecessary nodes. I can't rule out a bug in my input PAF (but am not sure what kind of mistake would cause this).

Everything needed to reproduce:
example.tar.gz

ekg commented

Ok I see that this might be a problem. Thanks for the test case. I'll see what I can do.

All signs point to this being an issue with hal2paf's cigar strings being wrong. Sorry!

ekg commented

The reverse strand alignments do look messy, but we checked @glennhickey's test case and found the problem was due to PAF format confusion. The cigars were reversed, and that led to a blowup in the size of the graph, as most of the "matched" sequence was not actually matching.
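
To illustrate what "reversed cigars" means here, below is a minimal Python sketch that flips the order of cigar operations on reverse-strand PAF records. It is not the actual change made to hal2paf. It assumes the cigar sits in a cg:Z: tag after the 12 mandatory PAF columns and that simply reversing the operation order is the correction needed; the function names (parse_cigar, reverse_cigar, fix_paf_line) are hypothetical.

```python
import re

# One cigar operation: a run length followed by an op code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a cigar string into (length, op) tuples; reject malformed input."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed cigar: {cigar!r}")
    return ops

def reverse_cigar(cigar):
    """Return the cigar with its operations in the opposite order."""
    return "".join(f"{n}{op}" for n, op in reversed(parse_cigar(cigar)))

def fix_paf_line(line):
    """Reverse the cg:Z: cigar of a PAF record only if it is on the '-' strand.

    Assumption: column 5 (index 4) is the strand and optional tags
    start at column 13 (index 12), as in the PAF layout.
    """
    fields = line.rstrip("\n").split("\t")
    if fields[4] != "-":
        return "\t".join(fields)
    tags = [
        "cg:Z:" + reverse_cigar(tag[5:]) if tag.startswith("cg:Z:") else tag
        for tag in fields[12:]
    ]
    return "\t".join(fields[:12] + tags)
```

Reversing the operations does not change how many query or target bases the cigar consumes, so a length check alone would not catch this kind of mistake. The only symptom is that the "matched" columns stop lining up with identical bases.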

ekg commented

So the way to fix the messy reverse strand alignments is "grooming" as in odgi groom.