kseq parser
Opened this issue · 3 comments
harishankarv commented
- clean up kseqparser, add comments on issues below in code itself
- what happens when individual sequences greater than 4096
- when to encode, along with parsing?
- encoding needs to handle N character
- sliding window logic: character N logic
harishankarv commented
- kseq.h expects a whole file (offset = beginning of file), but we are giving it offsets into middle of the file. So when one thread starts parsing from the middle of the file, kseq.h skips characters until it encounters the next "record" (the next ">" character). Need to handle this.
utsavjainb commented
- clean up kseqparser, add comments on issues below in code itself
- what happens when individual sequences greater than 4096
kseq_read is able to read long sequences into its buffer.
- when to encode, along with parsing?
As of now, each individual sequence is read into the kseq buffer and then fully encoded into a DnaBitset object (3-bit encoding).
- encoding needs to handle N character
- sliding window logic: character N logic
Enqueues k-mers of length k; if an N character (encoded as 100) is found, the pointer is shifted.
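
To make the N handling concrete, here is a minimal sketch of a 3-bit encoding plus a sliding window that restarts after every N. It is not the DnaBitset code: only the N = 100 code comes from this thread. The A/C/G/T codes, the function names, the choice to treat any non-ACGT symbol as N, and returning start offsets instead of enqueuing k-mers are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Only N = 0b100 is taken from the thread. The other codes are assumed.
constexpr std::uint8_t kCodeN = 0b100;

// Map one base to a 3-bit code (hypothetical helper).
std::uint8_t encode_base(char c) {
    switch (c) {
        case 'A': case 'a': return 0b000;
        case 'C': case 'c': return 0b001;
        case 'G': case 'g': return 0b010;
        case 'T': case 't': return 0b011;
        default:            return kCodeN;  // N, and (by assumption) anything else
    }
}

// Encode a whole sequence, one code per element. A real bitset would pack
// three bits per base; this stays unpacked to keep the sketch readable.
std::vector<std::uint8_t> encode_sequence(const std::string& seq) {
    std::vector<std::uint8_t> codes;
    codes.reserve(seq.size());
    for (char c : seq) codes.push_back(encode_base(c));
    return codes;
}

// Return the start offset of every k-mer that contains no N.
std::vector<std::size_t> kmer_starts(const std::vector<std::uint8_t>& codes,
                                     std::size_t k) {
    std::vector<std::size_t> starts;
    if (k == 0) return starts;
    std::size_t run = 0;  // length of the N-free run ending at position i
    for (std::size_t i = 0; i < codes.size(); ++i) {
        if (codes[i] == kCodeN) {
            run = 0;  // the window start shifts to just past this N
            continue;
        }
        ++run;
        if (run >= k) starts.push_back(i + 1 - k);
    }
    return starts;
}
```

With this sketch, `kmer_starts(encode_sequence("ACGTNACGT"), 3)` gives offsets 0, 1, 5 and 6: the k-mers that would overlap the N at offset 4 are never produced.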
utsavjainb commented
- kseq.h expects a whole file (offset = beginning of file), but we are giving it offsets into middle of the file. So when one thread starts parsing from the middle of the file, kseq.h skips characters until it encounters the next "record" (the next ">" character). Need to handle this.
kseq_read handles this.
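
For reference, a rough sketch of what record alignment at a mid-file offset can look like, and one chunk ownership rule that is consistent with that skipping behavior. This works on an in-memory buffer and is not how kseq.h is implemented. `next_record_start`, `Range` and `records_for_chunk` are hypothetical names, and the rule "a thread owns every record whose header starts inside its chunk" is an assumption, not something confirmed in this thread.

```cpp
#include <cstddef>
#include <string>

// Offset of the first FASTA header at or after `offset`, or data.size()
// if there is none. A header is taken to be a '>' at the start of a line.
std::size_t next_record_start(const std::string& data, std::size_t offset) {
    for (std::size_t i = offset; i < data.size(); ++i) {
        if (data[i] == '>' && (i == 0 || data[i - 1] == '\n')) return i;
    }
    return data.size();
}

// Half-open byte range [first, last) of the records one thread parses.
struct Range {
    std::size_t first;
    std::size_t last;
};

// Assumed ownership rule: a thread given the chunk [begin, end) parses every
// record whose header starts inside the chunk. It skips the partial record
// at the front and reads past `end` to finish its last record.
Range records_for_chunk(const std::string& data, std::size_t begin,
                        std::size_t end) {
    return Range{next_record_start(data, begin), next_record_start(data, end)};
}
```

Under this rule, adjacent chunks produce adjacent ranges, so each record is parsed by exactly one thread even when a chunk boundary falls in the middle of a sequence.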