Some random reads have the same read ID
JustinChu opened this issue · 4 comments
It looks like random sequences are given a random number between 0 and 255, resulting in duplicate IDs. Some programs break on these sequences. I can parse these sequences to rename them, but wouldn't it make sense for the sequences to be given a larger random ID based on a count?
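For anyone needing the rename workaround in the meantime, here is a minimal sketch. It is not part of the simulator. It assumes plain 4-line FASTQ records (no wrapped sequences) and a trailing `/1` or `/2` mate suffix. The function name `uniquify_fastq_ids` and the `_dupN` suffix are made up for illustration.

```python
from collections import defaultdict


def uniquify_fastq_ids(lines):
    """Yield FASTQ lines, adding a suffix to read names that were already seen.

    Assumes 4-line records. The first occurrence of a name is left unchanged;
    later occurrences get a hypothetical `_dupN` tag before the mate suffix.
    """
    seen = defaultdict(int)
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 != 0:
            # Sequence, separator, and quality lines pass through untouched.
            yield line
            continue
        # Split off the mate suffix ("/1" or "/2") if there is one.
        base, sep, mate = line.rpartition("/")
        if not sep:
            base, mate = line, ""
        n = seen[base]
        seen[base] += 1
        yield (base if n == 0 else f"{base}_dup{n}") + sep + mate
```

Because the suffix depends only on the order in which names appear, running the same function over the read1 and read2 files should keep mates paired, provided both files list the reads in the same order.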
I am not seeing it in the code. Could you print a couple of examples?
Actually, the numbers may exceed 255, but I'm still seeing duplicates.
grep rand test.bwa.read1.fastq
@rand_0_0_0_0_1_1_0:0:0_0:0:0_c/1
@rand_0_0_0_0_1_1_0:0:0_0:0:0_46/1
@rand_0_0_0_0_1_1_0:0:0_0:0:0_47/1
@rand_0_0_0_0_1_1_0:0:0_0:0:0_4a/1
...
grep @rand_0_0_0_0_1_1_0:0:0_0:0:0_c/1 test.bwa.read1.fastq | wc -l
180
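To check every read name at once rather than grepping for one ID, something like the following sketch could be used. It assumes plain 4-line FASTQ records, and `duplicated_read_ids` is a hypothetical helper name.

```python
from collections import Counter


def duplicated_read_ids(lines):
    """Return {header: count} for every FASTQ header seen more than once.

    Assumes 4-line records, so headers are lines 0, 4, 8, ...
    """
    headers = (line.rstrip("\n") for i, line in enumerate(lines) if i % 4 == 0)
    counts = Counter(headers)
    return {name: n for name, n in counts.items() if n > 1}
```

Passing an open handle on `test.bwa.read1.fastq` to `duplicated_read_ids` would return a dictionary listing each repeated ID with its number of occurrences.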
@JustinChu I think I see it. It starts its counter over again for reads within the same chromosome, even random ones. Let me see if I can get to a fix after ASHG, so probably late next week.
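A toy illustration of the behaviour described above, not the simulator's actual code: the ID layout, the hex formatting of the counter, and both function names are assumptions made for the sketch. Since random reads share all their other name fields, a counter that restarts for each chromosome produces repeated names, while a single counter over the whole run does not.

```python
from itertools import count

# Fixed fields shared by the random reads in the example output above.
PREFIX = "rand_0_0_0_0_1_1_0:0:0_0:0:0"


def ids_with_per_contig_counter(reads_per_contig):
    """Name random reads with a counter that restarts for every contig."""
    ids = []
    for n_reads in reads_per_contig:
        for i in range(n_reads):  # i starts again at 0 for each contig
            ids.append(f"@{PREFIX}_{i:x}/1")
    return ids


def ids_with_global_counter(reads_per_contig):
    """Name random reads with one counter shared across all contigs."""
    counter = count()
    ids = []
    for n_reads in reads_per_contig:
        for _ in range(n_reads):
            ids.append(f"@{PREFIX}_{next(counter):x}/1")
    return ids
```

With two contigs of three random reads each, the first function yields only three distinct names among six reads, while the second yields six.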
@JustinChu better late than never: #30