ekg/seqwish

Seqwish does not transitively align through SNPs

glennhickey opened this issue · 4 comments

Here's a simple case inspired from real data from a cactus alignment:

# dummy.fa
>Anc
AAATAAA
>1
AAATAAA
>2
AAATAAA
>3
AAAGAAA
>4
AAAGAAA

# dummy.paf
1	7	0	7	+	Anc	7	0	7	7	7	255	cg:Z:7M
2	7	0	7	+	Anc	7	0	7	7	7	255	cg:Z:7M
3	7	0	7	+	Anc	7	0	7	7	7	255	cg:Z:7M
4	7	0	7	+	Anc	7	0	7	7	7	255	cg:Z:7M

seqwish -p dummy.paf -s dummy.fa -g dummy.gfa -P

Those Gs are transitively aligned in the PAF by way of the aligning to T in Anc but they don't come out in the graph:
dummy

This is the exact same issue as ComparativeGenomicsToolkit/hal2vg#26, funnily enough.

ekg commented
ekg commented

OK, these aren't transitively aligned so the result you're getting is not a bug but an exact representation of the input alignments.

To try to write this out:

Anc AAATAAA
1   AAATAAA
2   AAATAAA
3   AAAgAAA
4   AAAgAAA

The upper-case characters match Anc. The lower-case gs aren't matched to anything, so seqwish does not close through them, nor map them to each other. It's not given any information to indicate that they match.

An all-to-all alignment would provide this, or adding in the alignments within each SNP or variant (but that seems tedious). Alternatively, you could push the output through smoothxg:

seqwish -p dummy.paf -s dummy.fa -g dummy.gfa
smoothxg -g dummy.gfa >dummy.smooth.gfa

This result matches your expectation:

H       VN:Z:1.0
S       1       AAA     DP:i:5  RC:i:15
L       1       +       3       +       0M
L       1       +       2       +       0M
S       2       T       DP:i:3  RC:i:3
L       2       +       4       +       0M
S       3       G       DP:i:2  RC:i:2
L       3       +       4       +       0M
S       4       AAA     DP:i:5  RC:i:15
P       Anc     1+,2+,4+        *
P       1       1+,2+,4+        *
P       2       1+,2+,4+        *
P       3       1+,3+,4+        *
P       4       1+,3+,4+        *

dummy smooth

Cool thanks. Kind of figured it was by design but thought I'd bring it up anyway, mostly to make me feel slightly less bad about forgetting to handle this in hal2vg

ekg commented

I don't think it's trivial to handle! Yeah, seqwish is meant to be a pretty "dumb" algorithm. It just takes the input that's given, with some optional filters (-k, -l, and -r) to remove or ignore exact matches of given lengths or implied local/global self-repeat count.