algbio/themisto

Sequence and reverse complement generating different unitigs

krobison13 opened this issue · 1 comment

If I give themisto build a file containing just a sequence and its reverse complement, extract-unitigs generates two different unitigs. Is this the expected behavior?

e.g. if 80.fna contains

>k80
ATCAGCAGCGACATGGCGGTCATCACCGTAGTCGAGGCAAGCAATAATGGACGGCGCCCG
ACGTGGTCGATGATCGCAGA
>rc.k80
TCTGCGATCATCGACCACGTCGGGCGCCGTCCATTATTGCTTGCCTCGACTACGGTGATG
ACCGCCATGTCGCTGCTGAT

and then run
themisto build -k 31 -i 80.fna -o 80.k31 --temp-dir .
themisto extract-unitigs -i 80.k31 --colors-out 80.k31.colors --gfa-out 80.k31.gfa

I get two lines in the colors file and two segments in the GFA file:

H VN:Z:1.0
S 86 ATCAGCAGCGACATGGCGGTCATCACCGTAGTCGAGGCAAGCAATAATGGACGGCGCCCGACGTGGTCGATGATCGCAGA
S 77 TCTGCGATCATCGACCACGTCGGGCGCCGTCCATTATTGCTTGCCTCGACTACGGTGATGACCGCCATGTCGCTGCTGAT

Yes, this is expected. Our index structure is not aware of reverse complements.

We could add a flag to extract-unitigs to compute the bidirected de Bruijn graph for better interoperability with other tools. Meanwhile, you can work around this by concatenating the input with its reverse complement before building the index. This creates two copies of each unitig, one for the forward strand and one for its reverse complement (except for unitigs that are their own reverse complement). You can then extract the bidirected de Bruijn graph from this with some post-processing.
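
A minimal sketch of that workaround in Python, for illustration only. None of these helpers are part of Themisto: `reverse_complement`, `read_fasta`, `with_reverse_complements`, and `collapse_unitigs` are hypothetical names, and the `.rc` header suffix is an arbitrary choice. The sketch assumes plain A/C/G/T input and only deduplicates unitig sequences. It does not rebuild the edges of the bidirected graph or merge color sets.

```python
# Hypothetical helper functions, not part of Themisto.
# Assumes sequences contain only A, C, G, T (upper or lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def with_reverse_complements(records):
    """Pre-processing: yield each record followed by its reverse complement."""
    for header, seq in records:
        yield header, seq
        yield header + ".rc", reverse_complement(seq)


def collapse_unitigs(unitigs):
    """Post-processing: keep one representative per forward/reverse pair.

    The representative is the lexicographically smaller of the unitig and
    its reverse complement, so a self-complementary unitig maps to itself.
    """
    seen = set()
    kept = []
    for unitig in unitigs:
        canonical = min(unitig, reverse_complement(unitig))
        if canonical not in seen:
            seen.add(canonical)
            kept.append(canonical)
    return kept
```

You would write the records from `with_reverse_complements(read_fasta(open("80.fna")))` to a new FASTA file and build the index from that, then pass the sequences from the `S` lines of the GFA output to `collapse_unitigs`. Applied to the two segments above, this should leave a single unitig. How the added records are assigned colors depends on your coloring setup, which this sketch does not address.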