karel-brinda/Phylign

What's in reality the proportion of matching k-mers?

karel-brinda opened this issue · 3 comments

I'm not sure what this number implies. See, e.g., the following figure, from an experiment with the k-mer threshold set to 0.4: 94 matching k-mers out of (2185-31+1):

image

Does this proportion refer to the mini-k-mers into which the "big" k-mers are decomposed inside COBS? How can we relate the two proportions? This should be explained in the documentation, as it's a critical parameter for the search, and even I don't have a good intuition for it.

Hey, sorry for disappearing, I am back at mof-search now. I think you mean the cobs_kmer_thres parameter in config.yaml? It is supposed to be the minimum proportion of the query's kmers that must be present in a reference for it to count as a match, i.e. in your case COBS should have output only samples with at least 0.4*(2185-31+1) matching kmers, which is clearly not the case here. There could be some subtle reasons for this, though: non-ACGT bases in the sequence, repeated kmers, etc. I'd need the query file to properly debug this.
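To make the expectation concrete, here is a minimal sketch of the arithmetic, assuming k = 31 and that the threshold is applied to the total number of kmers in the query (sequence length - k + 1). The function names are hypothetical, and how COBS rounds the cutoff internally is not covered here.

```python
def num_kmers(seq_len: int, k: int = 31) -> int:
    """Number of k-mers in a sequence of length seq_len (0 if shorter than k)."""
    return max(seq_len - k + 1, 0)


def expected_min_matches(seq_len: int, threshold: float, k: int = 31) -> float:
    """Minimum number of matching k-mers implied by a proportion threshold.

    Assumption: the threshold is a fraction of all k-mers in the query.
    """
    return threshold * num_kmers(seq_len, k)


total = num_kmers(2185)                      # 2185 - 31 + 1 = 2155
required = expected_min_matches(2185, 0.4)   # about 862
observed = 94                                # value read off the figure
observed_proportion = observed / total       # about 0.044

print(f"total k-mers: {total}")
print(f"required matches at 0.4: {required:.1f}")
print(f"observed proportion: {observed_proportion:.3f}")
```

Under these assumptions, 94 would be far below the cutoff if it were a kmer count for a reported sample.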

Update: this is working as intended.

The number next to the read is the number of samples the read matched to (see https://github.com/iqbal-lab-org/cobs/blob/9601c8ebd4a08c3a58f9e1391a052faeceeaefb0/cobs/query/search.hpp#L105).

The number next to the sample is the number of matched kmers between the read and the sample (see https://github.com/iqbal-lab-org/cobs/blob/9601c8ebd4a08c3a58f9e1391a052faeceeaefb0/cobs/query/search.hpp#L108 and https://github.com/iqbal-lab-org/cobs/blob/9601c8ebd4a08c3a58f9e1391a052faeceeaefb0/cobs/query/search.hpp#L30-L31).

In the output, COBS does not show the number of kmers in the read.
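For debugging, the two kinds of numbers can be separated with a small parser. This is only a sketch: it assumes each read header line starts with `*` and that the name and the number are tab-separated, which should be checked against real output. The function name and the example data are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_cobs_output(text: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each read name to a list of (sample, matched_kmers) pairs.

    Assumed layout (not confirmed here):
        *<read>\t<number of matched samples>
        <sample>\t<number of matched kmers>
    """
    results: Dict[str, List[Tuple[str, int]]] = {}
    expected: Dict[str, int] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, number = line.rpartition("\t")
        if line.startswith("*"):
            # Header line: the number is a count of samples, not of kmers.
            current = name[1:]
            results[current] = []
            expected[current] = int(number)
        else:
            if current is None:
                raise ValueError("sample line found before any read header")
            # Sample line: the number is the count of matched kmers.
            results[current].append((name, int(number)))
    # Sanity check: header count should equal the number of sample lines.
    for read, samples in results.items():
        if len(samples) != expected[read]:
            raise ValueError(f"{read}: expected {expected[read]} samples, got {len(samples)}")
    return results


# Hypothetical example data, not taken from a real run.
example = "*read1\t2\nsampleA\t94\nsampleB\t40\n"
parsed = parse_cobs_output(example)
print(parsed)
```

Because the read's own kmer count is not in the output, turning a matched-kmer count into a proportion requires the query length from the original query file.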

@leoisl Thank you very much for looking into this! Would it be possible to add a quick note in the documentation about the intermediate formats saying exactly this? These formats are important to understand for all debugging of mof-search.