dib-lab/kmerDecoder

Chunk size, kmers or seqs?

Opened this issue · 4 comments

Rethinking the chunk size: should we define it as the number of sequences or the number of kmers?

Chunk size as the number of sequences works well when the sequence lengths are relatively small. For genomes, for example, setting the chunk size to 10k sequences would consume a lot of memory per chunk. On the other hand, it works smoothly when processing transcripts because their average length is small.

Chunk size as the number of kmers would work fine in both of the previous examples, and we can set it to a fixed multiple of thousands or millions.

@drtamermansour what do you think?

I like that

Here is a proposed design.

The user will set a maximum memory budget, say 1 GB. The chunk size will then be auto-calculated as chunk_size ≈ 1e9 bytes ÷ (kSize + 8), which means 1 GB can store ~25 million kmers (e.g., for kSize = 31).

Kmers are stored in the following data structure.

flat_hash_map<std::string, std::vector<kmer_row>> kmers;

When implementing this design, we will need to attach the sequence name to every kmer, which increases memory usage. Why? Because a sequence's kmers will most likely be split across two chunks. Alternatively, we could use another data structure to hold this information without redundancy.

This design will significantly change the kmerDecoder API, which means it would need to be updated everywhere it is used in kProcessor.

So, I think we are going to defer this for now.

The memory based approach is nice.
I think we can implement this without changing the design.
Here is a suggestion: kmerDecoder will create a temp file and read sequences from the input file as usual. Once it reaches the maximum number of kmers, it will write the remaining part of the sequence (after adjusting for the kmer overlap, of course) to the temp file under the same sequence name. In the next chunk, it will check the temp file and read from it before reading from the input stream. What do you think?

If the chunk size is set to be small and the sequence file is large, this will require many writes to the temp file. I will think through an alternative design and post it here after implementing the aa-encoding #14 .