cosmo-team/cosmo

KMC miscounting kmers?

Closed · 1 comment

I just raised this issue in the KMC repo: refresh-bio/KMC#46

It seems like when I try to count the 3-mers in this dummy sequence

>dummy
AATGGGTCCCTGTTTCGCGATAAAATGCCAATCGCTCTAAATATCGCGCTAGC

KMC reports 25 unique 3-mers instead of the 34 I get by sliding a window over the sequence and storing each length-3 substring in a Python set. I also tried setting k to 2 and 4 to check for an off-by-one error, and that does not seem to be the cause.
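For reference, the sliding-window check described above can be sketched roughly as follows. The function name `unique_kmers` is just an illustrative choice, not something from KMC or cosmo.

```python
# Dummy sequence from the FASTA record above.
seq = "AATGGGTCCCTGTTTCGCGATAAAATGCCAATCGCTCTAAATATCGCGCTAGC"


def unique_kmers(sequence, k):
    """Return the set of distinct length-k substrings of `sequence`."""
    # Slide a window of width k across the sequence, one base at a time.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


print(len(unique_kmers(seq, 3)))  # 34
```

This treats every k-mer exactly as it is written on the forward strand, with no reverse-complement handling.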

After looking through the code, it looks like the discrepancy comes from KMC counting only the canonical form of each k-mer, whereas you include both forms (the k-mer and its reverse complement) when constructing graphs. Everything makes sense now.
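A minimal sketch that reproduces the lower count, assuming "canonical" means the lexicographically smaller of a k-mer and its reverse complement (the usual convention; I have not checked KMC's exact tie-breaking). The helper names here are illustrative only.

```python
# Dummy sequence from the FASTA record above.
seq = "AATGGGTCCCTGTTTCGCGATAAAATGCCAATCGCTCTAAATATCGCGCTAGC"

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Assumed convention: the smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def unique_canonical_kmers(sequence, k):
    """Return the set of distinct canonical k-mers in `sequence`."""
    return {canonical(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}


print(len(unique_canonical_kmers(seq, 3)))  # 25
```

Under that assumption, pairs such as `TGG`/`CCA` or `TAG`/`CTA` collapse into a single entry, which accounts for the drop from 34 to 25.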