mars-research/DRAMHiT

kseq parser

Opened this issue · 3 comments

  • clean up kseqparser, add comments on issues below in code itself
  • what happens when individual sequences are longer than 4096?
  • when to encode, along with parsing?
  • encoding needs to handle N character
  • sliding window logic: character N logic
  • kseq.h expects a whole file (offset = beginning of file), but we are giving it offsets into middle of the file. So when one thread starts parsing from the middle of the file, kseq.h skips characters until it encounters the next "record" (the next ">" character). Need to handle this.
  • clean up kseqparser, add comments on issues below in code itself
  • what happens when individual sequences are longer than 4096?
    kseq_read is able to read long sequences into its buffer.
  • when to encode, along with parsing?
    As of now, each individual sequence is read into the kseq buffer and then fully encoded into a DnaBitset object (3-bit encoding).
  • encoding needs to handle N character
  • sliding window logic: character N logic
    Enqueues kmers of length k; if an N character (encoded as 100) is found, the pointer is shifted.
  • kseq.h expects a whole file (offset = beginning of file), but we are giving it offsets into middle of the file. So when one thread starts parsing from the middle of the file, kseq.h skips characters until it encounters the next "record" (the next ">" character). Need to handle this.

kseq_read handles this.
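
For reference, a minimal sketch of the encode-then-slide logic discussed above, not the actual DnaBitset code. Only the N code (100) comes from this thread; the codes for A/C/G/T, the function names, and the one-code-per-byte layout are assumptions made for illustration. Instead of shifting a pointer, the sketch resets the window when it sees N, which has the same effect: no enqueued kmer ever contains an N.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 3-bit codes; only N = 0b100 is taken from the discussion.
constexpr std::uint8_t kCodeA = 0b000;
constexpr std::uint8_t kCodeC = 0b001;
constexpr std::uint8_t kCodeG = 0b010;
constexpr std::uint8_t kCodeT = 0b011;
constexpr std::uint8_t kCodeN = 0b100;

// Map one base to its 3-bit code. Anything that is not A/C/G/T is
// treated as N (an assumption; the real encoder may reject it instead).
inline std::uint8_t encode_base(char c) {
  switch (c) {
    case 'A': case 'a': return kCodeA;
    case 'C': case 'c': return kCodeC;
    case 'G': case 'g': return kCodeG;
    case 'T': case 't': return kCodeT;
    default:            return kCodeN;
  }
}

// Encode a whole sequence, one code per byte for readability
// (a real bitset would pack the codes tightly).
inline std::vector<std::uint8_t> encode_sequence(const std::string& seq) {
  std::vector<std::uint8_t> codes;
  codes.reserve(seq.size());
  for (char c : seq) codes.push_back(encode_base(c));
  return codes;
}

// Slide a window of length k over the encoded sequence and return every
// N-free kmer packed into a uint64_t (3 bits per base, so k <= 21).
inline std::vector<std::uint64_t> extract_kmers(
    const std::vector<std::uint8_t>& codes, std::size_t k) {
  std::vector<std::uint64_t> kmers;
  if (k == 0 || k > 21) return kmers;

  const std::uint64_t mask = (std::uint64_t{1} << (3 * k)) - 1;
  std::uint64_t window = 0;
  std::size_t valid = 0;  // consecutive non-N bases ending at this position

  for (std::uint8_t code : codes) {
    if (code == kCodeN) {
      // Restart after the N: no kmer may span it.
      window = 0;
      valid = 0;
      continue;
    }
    window = ((window << 3) | code) & mask;
    if (++valid >= k) kmers.push_back(window);
  }
  return kmers;
}
```

With k = 3, the sequence ACGTNACG yields three kmers (ACG, CGT, ACG); the windows GTN, TNA and NAC are skipped because they contain N.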