What does the proportion of matching k-mers actually mean?
karel-brinda opened this issue · 3 comments
I'm currently not sure what this number implies. See e.g. the following figure, from an experiment with the k-mer threshold 0.4: 94 matching k-mers out of (2185-31+1).
Does this proportion refer to the mini-k-mers to which the "big" k-mers are decomposed inside COBS? How can we relate the two proportions? It should be somehow explained in the documentation as it's a critical parameter for the search, and even I don't have any good intuition for this.
Hey, sorry for disappearing, I am back at mof-search now. I think you mean the cobs_kmer_thres parameter in config.yaml? That is supposed to be the proportion of kmer presence between the query and the reference for matching, i.e. for your case COBS should have output only samples with a number of kmer matches >= 0.4*(2185-31+1), which is clearly not the case here. There could be some subtle reasons for this though: non-ACGT bases in the sequence, repeated kmers, etc. I'd need the query file to properly debug this.
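To make the arithmetic concrete, here is a minimal sketch of the expected cutoff. The helper names are hypothetical, and the exact comparison COBS performs internally (rounding, handling of repeated or non-ACGT kmers) is an assumption, not something taken from its code.

```python
def num_kmers(read_length, k):
    """Number of k-mers in a read of the given length (0 if the read is shorter than k)."""
    return max(read_length - k + 1, 0)


def passes_threshold(matched, read_length, k, threshold):
    """True if the matched k-mer count reaches threshold * (number of k-mers in the read).

    Assumption: a plain >= comparison without rounding; COBS may differ in detail.
    """
    return matched >= threshold * num_kmers(read_length, k)


if __name__ == "__main__":
    # Values from this issue: read length 2185, k = 31, threshold 0.4, 94 matches.
    total = num_kmers(2185, 31)
    print(total, 0.4 * total, passes_threshold(94, 2185, 31, 0.4))
```

Under this reading, a read of length 2185 has 2155 kmers, so roughly 862 matches would be needed, and 94 is far below that.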
Update: this is working as intended.
The number next to the read is the number of samples the read matched to (see https://github.com/iqbal-lab-org/cobs/blob/9601c8ebd4a08c3a58f9e1391a052faeceeaefb0/cobs/query/search.hpp#L105).
The number next to the sample is the number of matched kmers between the read and the sample (see https://github.com/iqbal-lab-org/cobs/blob/9601c8ebd4a08c3a58f9e1391a052faeceeaefb0/cobs/query/search.hpp#L108 and https://github.com/iqbal-lab-org/cobs/blob/9601c8ebd4a08c3a58f9e1391a052faeceeaefb0/cobs/query/search.hpp#L30-L31).
In the output, COBS does not show the number of kmers in the read.
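Since the number of kmers in the read is not part of the output, the proportion has to be recomputed from the read length. The sketch below assumes a layout in which each read line starts with `*` and is followed by one line per matched sample, with name and number separated by a tab. That layout, the function names, and the sample data are illustrative assumptions and should be checked against real COBS output.

```python
def parse_cobs_like_output(text):
    """Parse '*read<TAB>n_samples' lines followed by 'sample<TAB>matched_kmers' lines.

    Assumption: this line layout is hypothetical and may not match COBS exactly.
    """
    results = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.rsplit("\t", 1)
        if name.startswith("*"):
            # Read line: the number is how many samples the read matched.
            current = name[1:]
            results[current] = {"n_samples": int(value), "matches": {}}
        else:
            # Sample line: the number is the matched k-mer count for that sample.
            results[current]["matches"][name] = int(value)
    return results


def matched_proportions(matches, read_length, k=31):
    """Convert matched k-mer counts to proportions of the read's k-mers."""
    n_kmers = read_length - k + 1
    return {sample: count / n_kmers for sample, count in matches.items()}


if __name__ == "__main__":
    # Made-up example data, not real COBS output.
    example = "*read1\t2\nsampleA\t94\nsampleB\t1200\n"
    parsed = parse_cobs_like_output(example)
    print(matched_proportions(parsed["read1"]["matches"], read_length=2185))
```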
@leoisl Thank you very much for looking into this! Would it be possible to add a quick note in the documentation about the intermediate formats saying exactly this? These formats are important to understand for all debugging of mof-search.