refresh-bio/KMC

Dumping minimizers/superKmers/signatures

mr-eyes opened this issue · 3 comments

Hi,

Is there a way to dump the canonical minimizers or superKmers without having to reach the final stage? I have read the API docs but didn't find an exposed class to achieve it. I would appreciate any leads on that!

Thank you,

Hi,

I am not sure I understand exactly what you want. Do you want, for example, to process 10% of the input data and get all the minimizers found with their counts, or something else? Some clarification would be great. Currently there is no simple way to do this, I mean through the C++ API or the CLI, although you could try modifying the code (which may not be easy and could be a little time-consuming). Anyway, we are planning to refactor some parts of KMC, and maybe also extend its API, so we are looking for suggestions.

Best,
Marek

Hi @marekkokot, thanks for the prompt reply!

Well, I can dig into the code, but yes, an extended API on top of the refactored code would make things much easier.
What I wanted to do is simply use KMC as a canonical minimizer extractor, without going through the k-mer counting step. I am only interested in the minimizers/superKmers, to use them in other processing. I know that KMC isn't being developed for that purpose, but I believe the engineering effort put into KMC would make it a very fast k-mer/minimizer extraction tool.

I hope I made it clearer this time :)

Ok,

extracting only super-k-mers should be quite easy, as they are stored in temporary files that are all available after running stage 1. As far as I remember, they are deleted after being read in stage 2 (but maybe some additional parameter must be passed; I'm not sure without looking into the code. If you need help with that, let me know).

They are in a binary format (which is currently quite simple, though it may change in the future).
The format is (roughly):
[a_1][super-k-mer_1][a_2][super-k-mer_2]...[a_n][super-k-mer_n]
for a file containing n super-k-mers.
Each a_i is a number stored in 1 byte. k + a_i is the length of the i-th super-k-mer in the file (so a_i is the number of additional symbols beyond k).
Internally we also keep some additional info in memory for accessing the file in parallel, but for now you can probably just read the file sequentially.
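
A minimal sketch of reading such a file sequentially, given the rough layout above. This is not KMC code. The thread does not say how the symbols of a super-k-mer are encoded, so the sketch assumes 2 bits per symbol, four symbols per byte, most significant bits first, with A=0, C=1, G=2, T=3, and each record padded to a whole byte. The function name `read_super_kmers` is hypothetical, and k has to be supplied by the caller because it is not stored in the record itself.

```python
import io
from typing import BinaryIO, Iterator

# ASSUMPTION: 2-bit symbol codes in this order; not confirmed by the thread.
SYMBOLS = "ACGT"


def read_super_kmers(stream: BinaryIO, k: int) -> Iterator[str]:
    """Yield super-k-mers from a stream laid out as [a_i][super-k-mer_i]..."""
    while True:
        header = stream.read(1)
        if not header:
            return  # clean end of file
        length = k + header[0]  # a_i = number of symbols beyond k
        # ASSUMPTION: four symbols per byte, record padded to a full byte.
        n_bytes = (length + 3) // 4
        packed = stream.read(n_bytes)
        if len(packed) != n_bytes:
            raise ValueError("truncated super-k-mer record")
        symbols = []
        for i in range(length):
            shift = 6 - 2 * (i % 4)  # most significant pair first (assumed)
            symbols.append(SYMBOLS[(packed[i // 4] >> shift) & 3])
        yield "".join(symbols)


# Two hand-built records for k = 3: a=2 -> 5 symbols, a=0 -> 3 symbols.
demo = io.BytesIO(b"\x02\x1b\x00" + b"\x00\xfc")
for super_kmer in read_super_kmers(demo, k=3):
    print(super_kmer)
```

A super-k-mer of length k + a_i contains a_i + 1 consecutive k-mers, so the k-mers can be recovered by sliding a window of width k over each decoded string.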

Important details:

  • The number of files is 512 (in the default mode).
  • KMC uses signatures instead of minimizers (this may also change in the future, or rather the way signatures are defined may change), so the super-k-mers you read are based on KMC signatures, not on minimizers
  • We store only the super-k-mers, so there is no info about what the signature of a given super-k-mer is or where it is located within the super-k-mer
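
Since neither the signature nor its position is stored, anything minimizer-like has to be recomputed from the decoded sequence. The sketch below shows one way to do that for a plain canonical minimizer: the lexicographically smallest m-mer, taking the smaller of each m-mer and its reverse complement. This is explicitly not KMC's signature definition, which the thread does not describe, so the result will not necessarily match the signature KMC used to bin the super-k-mer. The names `reverse_complement` and `canonical_minimizer` and the parameter `m` are illustrative.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_minimizer(seq: str, m: int) -> Tuple[str, int]:
    """Return (minimizer, position) for the smallest canonical m-mer in seq.

    NOTE: plain lexicographic ordering, NOT KMC's signature rules.
    Ties are resolved in favour of the leftmost position.
    """
    if m <= 0 or m > len(seq):
        raise ValueError("m must be between 1 and len(seq)")
    best_mmer, best_pos = None, -1
    for pos in range(len(seq) - m + 1):
        mmer = seq[pos:pos + m]
        canon = min(mmer, reverse_complement(mmer))
        if best_mmer is None or canon < best_mmer:
            best_mmer, best_pos = canon, pos
    return best_mmer, best_pos


print(canonical_minimizer("GATTACA", 3))
```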

Out of curiosity, why do you need super-k-mers/minimizers?