gmarcais/Jellyfish

Null characters in text output and missing k-mers

fluhus opened this issue · 0 comments

Hello,

I was going over the text output of Jellyfish and got some unexpected results. After some debugging, I found that the text output includes null characters (\x00), which broke my downstream parsing, and that some k-mers are missing. I am using Jellyfish 2.3.0 on Linux.

Input FASTQ, shown as an escaped string to make sure there are no strange characters lurking:

"@A00806:9:HJHTVDMXX:1:1138:32678:16000 1:N:0:ATGCGCAG+ACTGCATA\nCAAGGAGGAGCTTGCAGACCCCGAGGGACGGGAGTTTCAGGCTGTACGTGACGAACTTAACAAGCACTATGACCGCCTTTCGTTGAAAGACAATTATTCA\n+\n:FFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n@A00806:9:HJHTVDMXX:1:1139:31096:26725 1:N:0:ATGCGCAG+ACTGCATA\nGAATAATTGTCTTTCAACGAAAGGCGGTCATAGTGCTTGTTAAGTTCGTCACGTACAGCCTGAAACTCCCGTCCCTCGGGGTCTGCAAGCTCCTCCTTGT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n"

Command:

jellyfish-linux count -m 32 -s 20 -Q "!" --text -o myfile.jf myfile.fastq

First line of the output, shown as an escaped string:

"000000735{\"alignment\":8,\"canonical\":false,\"cmdline\":[\"count\",\"-m\",\"32\",\"-s\",\"20\",\"-Q\",\"!\",\"--text\",\"-o\",\"/tmp/amitmit/stupid.jf2\",\"/tmp/amitmit/stupid.fastq\"],\"exe_path\":\"/net/mraid08/export/genie/LabData/Analyses/amitmit/jellyfish-linux\",\"format\":\"text/sorted\",\"hostname\":\"genie40.mcl2.weizmann.ac.il\",\"key_len\":64,\"matrix1\":{\"c\":64,\"columns\":[188,176,78,231,155,110,47,48,156,86,53,120,58,201,42,78,210,10,145,157,2,109,236,226,164,77,165,4,188,141,251,211,37,7,70,89,35,106,226,165,225,40,16,101,68,58,127,36,33,152,179,74,154,132,216,36,146,99,10,5,202,167,224,80],\"identity\":false,\"r\":8},\"max_reprobe\":7,\"pwd\":\"/home/amitmit/Desktop/kmers/queue\",\"reprobes\":[1,1,3,6,10,15,21,28],\"size\":256,\"time\":\"Tue Feb 16 14:59:11 2021\",\"val_len\":7}\x00\x00\x00AGCACTATGACCGCCTTTCGTTGAAAGACAAT 1\n"

Notice the null characters following the {...} part.
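For anyone hitting the same parsing problem, here is a minimal Python sketch of a workaround on the reader's side. It is not part of Jellyfish; the function name `parse_kmer_counts` is made up for illustration. It assumes what the line quoted above suggests: the JSON header ends at the last `}` on its line, the first k-mer record can follow it on the same line, and every record has the form `KMER COUNT`.

```python
def parse_kmer_counts(lines):
    """Yield (kmer, count) pairs from Jellyfish-style text output.

    Hypothetical helper: skips the JSON header and ignores NUL bytes.
    """
    for line in lines:
        # Remove NUL characters wherever they appear in the line.
        line = line.replace("\x00", "")
        # Assumption: a '}' only occurs in the header, so anything after
        # the last '}' is the first k-mer record.
        if "}" in line:
            line = line.rsplit("}", 1)[1]
        line = line.strip()
        if not line:
            continue
        kmer, count = line.split()
        yield kmer, int(count)
```

Usage would look like `dict(parse_kmer_counts(open("myfile.jf")))`. This only hides the null characters from the parser; it does not explain why they are written.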

K-mers missing from my result:

  • AGCACTATGACCGCCTTTCGTTGAAAGACAAT
  • TGAATAATTGTCTTTCAACGAAAGGCGGTCAT (reverse complement is in the input)
  • ACAAGGAGGAGCTTGCAGACCCCGAGGGACGG (reverse complement is in the input)
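The check behind this list can be sketched in a few lines of Python. The helper names (`reverse_complement`, `kmers`, `classify`) are illustrative only, and the code does not reproduce how Jellyfish counts; it just reports whether a k-mer occurs in the reads as written, as a reverse complement, or both.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(kmer, reads):
    """Return (forward, reverse): whether kmer, or its reverse
    complement, occurs in any of the reads."""
    k = len(kmer)
    rc = reverse_complement(kmer)
    forward = any(kmer in kmers(read, k) for read in reads)
    reverse = any(rc in kmers(read, k) for read in reads)
    return forward, reverse
```

Calling `classify(kmer, reads)` with the two read sequences from the FASTQ above and each k-mer in the list shows which strand each one was found on.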

The null characters appear consistently across different runs on different inputs, but their number varies from run to run.

As for the missing k-mers, could they be getting filtered out? I am not sure whether I am missing a parameter or whether this is a bug on Jellyfish's side.